1. Westrin KJ, Kretzschmar WW, Emanuelsson O. ClusTrast: a short read de novo transcript isoform assembler guided by clustered contigs. BMC Bioinformatics 2024; 25:54. [PMID: 38302873 PMCID: PMC10836024 DOI: 10.1186/s12859-024-05663-3] [Received: 11/21/2022] [Accepted: 01/18/2024] [Indexed: 02/03/2024]
Abstract
BACKGROUND: Transcriptome assembly from RNA-sequencing data in species without a reliable reference genome has to be performed de novo, but studies have shown that de novo methods often have inadequate ability to reconstruct transcript isoforms. We address this issue by constructing an assembly pipeline whose main purpose is to produce a comprehensive set of transcript isoforms. RESULTS: We present the de novo transcript isoform assembler ClusTrast, which takes short read RNA-seq data as input, assembles a primary assembly, clusters a set of guiding contigs, aligns the short reads to the guiding contigs, assembles each clustered set of short reads individually, and merges the primary and clusterwise assemblies into the final assembly. We tested ClusTrast on real datasets from six eukaryotic species, and showed that ClusTrast reconstructed more expressed known isoforms than any of the other tested de novo assemblers, at a moderate reduction in precision. For recall, ClusTrast was on top in the lower end of expression levels (<15th percentile) for all tested datasets, and over the entire range for almost all datasets. Reference transcripts were often (35-69% for the six datasets) reconstructed to at least 95% of their length by ClusTrast, and more than half of reference transcripts (58-81%) were reconstructed with contigs that exhibited polymorphism, as measured on a subset of reliably predicted contigs. ClusTrast recall increased when using a union of assembled transcripts from more than one assembly tool as primary assembly. CONCLUSION: We suggest that ClusTrast can be a useful tool for studying isoforms in species without a reliable reference genome, in particular when the goal is to produce a comprehensive transcriptome set with polymorphic variants.
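The recall and precision trade-off mentioned in the abstract can be illustrated with a minimal sketch. The definitions used here (recall as the fraction of expressed reference isoforms matched by at least one contig, precision as the fraction of contigs matching an expressed reference isoform) are simplifying assumptions for illustration and may differ from the exact criteria used in the paper; the function name and the input layout are hypothetical.

```python
from typing import Dict, Optional, Set, Tuple


def recall_precision(
    expressed_refs: Set[str],
    contig_hits: Dict[str, Optional[str]],
) -> Tuple[float, float]:
    """Return (recall, precision) for an assembly.

    expressed_refs: IDs of reference isoforms considered expressed.
    contig_hits: maps each assembled contig ID to the reference isoform
        it matches, or None when it matches no reference (hypothetical input).
    """
    # Reference isoforms recovered by at least one contig.
    matched_refs = {ref for ref in contig_hits.values() if ref in expressed_refs}
    # Contigs that match some expressed reference isoform.
    matching_contigs = sum(1 for ref in contig_hits.values() if ref in expressed_refs)

    recall = len(matched_refs) / len(expressed_refs) if expressed_refs else 0.0
    precision = matching_contigs / len(contig_hits) if contig_hits else 0.0
    return recall, precision


if __name__ == "__main__":
    refs = {"t1", "t2", "t3", "t4"}
    hits = {"c1": "t1", "c2": "t1", "c3": "t2", "c4": None, "c5": None}
    print(recall_precision(refs, hits))
```

Under these assumed definitions, an assembler that emits many additional contigs can raise recall while lowering precision, which is the kind of trade-off the abstract describes.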
Affiliation(s)
- Karl Johan Westrin
  - Science for Life Laboratory, Department of Gene Technology, KTH Royal Institute of Technology, 171 65, Solna, Sweden
- Warren W Kretzschmar
  - Science for Life Laboratory, Department of Gene Technology, KTH Royal Institute of Technology, 171 65, Solna, Sweden
  - Department of Medicine Huddinge, Center for Hematology and Regenerative Medicine (HERM), Karolinska Institute, 141 52, Flemingsberg, Sweden
- Olof Emanuelsson
  - Science for Life Laboratory, Department of Gene Technology, KTH Royal Institute of Technology, 171 65, Solna, Sweden
2. Correia JC, Jannig PR, Gosztyla ML, Cervenka I, Ducommun S, Præstholm SM, Dumont K, Liu Z, Liang Q, Edsgärd D, Emanuelsson O, Gregorevic P, Westerblad H, Venckunas T, Brazaitis M, Kamandulis S, Lanner JT, Yeo GW, Ruas JL. Zfp697 is an RNA-binding protein that regulates skeletal muscle inflammation and regeneration. bioRxiv 2023:2023.06.12.544338. [PMID: 37398033 PMCID: PMC10312635 DOI: 10.1101/2023.06.12.544338] [Indexed: 07/04/2023]
Abstract
Muscular atrophy is a mortality risk factor that happens with disuse, chronic disease, and aging. Recovery from atrophy requires changes in several cell types including muscle fibers, and satellite and immune cells. Here we show that Zfp697/ZNF697 is a damage-induced regulator of muscle regeneration, during which its expression is transiently elevated. Conversely, sustained Zfp697 expression in mouse muscle leads to a gene expression signature of chemokine secretion, immune cell recruitment, and extracellular matrix remodeling. Myofiber-specific Zfp697 ablation hinders the inflammatory and regenerative response to muscle injury, compromising functional recovery. We uncover Zfp697 as an essential interferon gamma mediator in muscle cells, interacting primarily with ncRNAs such as the pro-regenerative miR-206. In sum, we identify Zfp697 as an integrator of cell-cell communication necessary for tissue regeneration.
Affiliation(s)
- Jorge C. Correia
  - Molecular and Cellular Exercise Physiology, Department of Physiology and Pharmacology, Biomedicum. Karolinska. SE-171 77, Stockholm, Sweden
- Paulo R. Jannig
  - Molecular and Cellular Exercise Physiology, Department of Physiology and Pharmacology, Biomedicum. Karolinska. SE-171 77, Stockholm, Sweden
- Maya L. Gosztyla
  - Department of Cellular and Molecular Medicine, University of California, San Diego, La Jolla, CA, USA
  - Biomedical Sciences Graduate Program, University of California, San Diego, La Jolla, CA, USA
  - Institute for Genomic Medicine, University of California, San Diego, La Jolla, CA, USA
- Igor Cervenka
  - Molecular and Cellular Exercise Physiology, Department of Physiology and Pharmacology, Biomedicum. Karolinska. SE-171 77, Stockholm, Sweden
- Serge Ducommun
  - Molecular and Cellular Exercise Physiology, Department of Physiology and Pharmacology, Biomedicum. Karolinska. SE-171 77, Stockholm, Sweden
- Stine M. Præstholm
  - Molecular and Cellular Exercise Physiology, Department of Physiology and Pharmacology, Biomedicum. Karolinska. SE-171 77, Stockholm, Sweden
- Kyle Dumont
  - Molecular and Cellular Exercise Physiology, Department of Physiology and Pharmacology, Biomedicum. Karolinska. SE-171 77, Stockholm, Sweden
- Zhengye Liu
  - Molecular Muscle Physiology and Pathophysiology, Department of Physiology and Pharmacology, Biomedicum, Karolinska Institutet, SE-171 77, Stockholm, Sweden
- Qishan Liang
  - Department of Cellular and Molecular Medicine, University of California, San Diego, La Jolla, CA, USA
  - Biomedical Sciences Graduate Program, University of California, San Diego, La Jolla, CA, USA
  - Institute for Genomic Medicine, University of California, San Diego, La Jolla, CA, USA
- Daniel Edsgärd
  - Science for Life Laboratory, Department of Gene Technology, School of Engineering Sciences in Biotechnology, Chemistry and Health, KTH Royal Institute of Technology, Stockholm, Sweden
- Olof Emanuelsson
  - Science for Life Laboratory, Department of Gene Technology, School of Engineering Sciences in Biotechnology, Chemistry and Health, KTH Royal Institute of Technology, Stockholm, Sweden
- Paul Gregorevic
  - Centre for Muscle Research, Department of Anatomy and Physiology, School of Biomedical Sciences, The University of Melbourne, Melbourne, VIC, Australia
- Håkan Westerblad
  - Muscle Physiology, Department of Physiology and Pharmacology, Biomedicum. Karolinska. SE-171 77, Stockholm, Sweden
- Tomas Venckunas
  - Institute of Sports Science and Innovations, Lithuanian Sports University, 44221 Kaunas, Lithuania
- Marius Brazaitis
  - Institute of Sports Science and Innovations, Lithuanian Sports University, 44221 Kaunas, Lithuania
- Sigitas Kamandulis
  - Institute of Sports Science and Innovations, Lithuanian Sports University, 44221 Kaunas, Lithuania
- Johanna T. Lanner
  - Molecular Muscle Physiology and Pathophysiology, Department of Physiology and Pharmacology, Biomedicum, Karolinska Institutet, SE-171 77, Stockholm, Sweden
- Gene W. Yeo
  - Department of Cellular and Molecular Medicine, University of California, San Diego, La Jolla, CA, USA
  - Biomedical Sciences Graduate Program, University of California, San Diego, La Jolla, CA, USA
  - Institute for Genomic Medicine, University of California, San Diego, La Jolla, CA, USA
- Jorge L. Ruas
  - Molecular and Cellular Exercise Physiology, Department of Physiology and Pharmacology, Biomedicum. Karolinska. SE-171 77, Stockholm, Sweden
3. Akhter S, Westrin KJ, Zivi N, Nordal V, Kretzschmar WW, Delhomme N, Street NR, Nilsson O, Emanuelsson O, Sundström JF. Cone-setting in spruce is regulated by conserved elements of the age-dependent flowering pathway. New Phytol 2022; 236:1951-1963. [PMID: 36076311 PMCID: PMC9825996 DOI: 10.1111/nph.18449] [Received: 05/28/2022] [Accepted: 08/23/2022] [Indexed: 06/15/2023]
Abstract
Reproductive phase change is well characterized in angiosperm model species, but less studied in gymnosperms. We utilize the early cone-setting acrocona mutant to study reproductive phase change in the conifer Picea abies (Norway spruce), a gymnosperm. The acrocona mutant frequently initiates cone-like structures, called transition shoots, in positions where wild-type P. abies always produces vegetative shoots. We collect acrocona and wild-type samples, and RNA-sequence their messenger RNA (mRNA) and microRNA (miRNA) fractions. We establish gene expression patterns and then use allele-specific transcript assembly to identify mutations in acrocona. We genotype a segregating population of inbred acrocona trees. A member of the SQUAMOSA BINDING PROTEIN-LIKE (SPL) gene family, PaSPL1, is active in reproductive meristems, whereas two putative negative regulators of PaSPL1, miRNA156 and the conifer specific miRNA529, are upregulated in vegetative and transition shoot meristems. We identify a mutation in a putative miRNA156/529 binding site of the acrocona PaSPL1 allele and show that the mutation renders the acrocona allele tolerant to these miRNAs. We show co-segregation between the early cone-setting phenotype and trees homozygous for the acrocona mutation. In conclusion, we demonstrate evolutionary conservation of the age-dependent flowering pathway and involvement of this pathway in regulating reproductive phase change in the conifer P. abies.
Affiliation(s)
- Shirin Akhter
  - Department of Plant Biology, Linnean Center for Plant Biology, Uppsala BioCentre, Swedish University of Agricultural Sciences (SLU), SE-750 07 Uppsala, Sweden
- Karl Johan Westrin
  - Science for Life Laboratory, Department of Gene Technology, KTH Royal Institute of Technology, SE-171 65 Solna, Sweden
- Nathan Zivi
  - Department of Plant Biology, Linnean Center for Plant Biology, Uppsala BioCentre, Swedish University of Agricultural Sciences (SLU), SE-750 07 Uppsala, Sweden
  - Skogforsk, Uppsala Science Park, SE-751 83 Uppsala, Sweden
- Veronika Nordal
  - Department of Plant Biology, Linnean Center for Plant Biology, Uppsala BioCentre, Swedish University of Agricultural Sciences (SLU), SE-750 07 Uppsala, Sweden
- Warren W. Kretzschmar
  - Science for Life Laboratory, Department of Gene Technology, KTH Royal Institute of Technology, SE-171 65 Solna, Sweden
- Nicolas Delhomme
  - Department of Forest Genetics and Plant Physiology, Umeå Plant Science Centre, Swedish University of Agricultural Sciences (SLU), SE-901 83 Umeå, Sweden
- Nathaniel R. Street
  - Department of Plant Physiology, Umeå Plant Science Centre, Umeå University, SE-901 87 Umeå, Sweden
- Ove Nilsson
  - Department of Forest Genetics and Plant Physiology, Umeå Plant Science Centre, Swedish University of Agricultural Sciences (SLU), SE-901 83 Umeå, Sweden
- Olof Emanuelsson
  - Science for Life Laboratory, Department of Gene Technology, KTH Royal Institute of Technology, SE-171 65 Solna, Sweden
- Jens F. Sundström
  - Department of Plant Biology, Linnean Center for Plant Biology, Uppsala BioCentre, Swedish University of Agricultural Sciences (SLU), SE-750 07 Uppsala, Sweden
4. Almagro Armenteros JJ, Salvatore M, Emanuelsson O, Winther O, von Heijne G, Elofsson A, Nielsen H. Detecting sequence signals in targeting peptides using deep learning. Life Sci Alliance 2019; 2:2/5/e201900429. [PMID: 31570514 DOI: 10.1101/639203] [Received: 05/15/2019] [Revised: 09/18/2019] [Accepted: 09/18/2019] [Indexed: 05/25/2023]
Abstract
In bioinformatics, machine learning methods have been used to predict features embedded in the sequences. In contrast to what is generally assumed, machine learning approaches can also provide new insights into the underlying biology. Here, we demonstrate this by presenting TargetP 2.0, a novel state-of-the-art method to identify N-terminal sorting signals, which direct proteins to the secretory pathway, mitochondria, and chloroplasts or other plastids. By examining the strongest signals from the attention layer in the network, we find that the second residue in the protein, that is, the one following the initial methionine, has a strong influence on the classification. We observe that two-thirds of chloroplast and thylakoid transit peptides have an alanine in position 2, compared with 20% in other plant proteins. We also note that in fungi and single-celled eukaryotes, less than 30% of the targeting peptides have an amino acid that allows the removal of the N-terminal methionine compared with 60% for the proteins without targeting peptide. The importance of this feature for predictions has not been highlighted before.
Affiliation(s)
- Jose Juan Almagro Armenteros
  - Department of Health Technology, Section for Bioinformatics, Technical University of Denmark, Kongens Lyngby, Denmark
- Marco Salvatore
  - Science for Life Laboratory, Solna, Sweden
  - Department of Biochemistry and Biophysics, Stockholm University, Stockholm, Sweden
- Olof Emanuelsson
  - Science for Life Laboratory, Solna, Sweden
  - Department of Gene Technology, School of Engineering Sciences in Biotechnology, Chemistry and Health, KTH-Royal Institute of Technology, Stockholm, Sweden
- Ole Winther
  - DTU Compute, Technical University of Denmark, Kongens Lyngby, Denmark
  - Computational and RNA Biology, University of Copenhagen, Copenhagen, Denmark
  - Centre for Genomic Medicine, Rigshospitalet, Copenhagen University Hospital, Copenhagen, Denmark
- Gunnar von Heijne
  - Science for Life Laboratory, Solna, Sweden
  - Department of Biochemistry and Biophysics, Stockholm University, Stockholm, Sweden
- Arne Elofsson
  - Science for Life Laboratory, Solna, Sweden
  - Department of Biochemistry and Biophysics, Stockholm University, Stockholm, Sweden
- Henrik Nielsen
  - Department of Health Technology, Section for Bioinformatics, Technical University of Denmark, Kongens Lyngby, Denmark
5. Almagro Armenteros JJ, Salvatore M, Emanuelsson O, Winther O, von Heijne G, Elofsson A, Nielsen H. Detecting sequence signals in targeting peptides using deep learning. Life Sci Alliance 2019; 2:2/5/e201900429. [PMID: 31570514 PMCID: PMC6769257 DOI: 10.26508/lsa.201900429] [Received: 05/15/2019] [Revised: 09/18/2019] [Accepted: 09/18/2019] [Indexed: 11/24/2022]
Affiliation(s)
- Jose Juan Almagro Armenteros
  - Department of Health Technology, Section for Bioinformatics, Technical University of Denmark, Kongens Lyngby, Denmark
- Marco Salvatore
  - Science for Life Laboratory, Solna, Sweden
  - Department of Biochemistry and Biophysics, Stockholm University, Stockholm, Sweden
- Olof Emanuelsson
  - Science for Life Laboratory, Solna, Sweden
  - Department of Gene Technology, School of Engineering Sciences in Biotechnology, Chemistry and Health, KTH-Royal Institute of Technology, Stockholm, Sweden
- Ole Winther
  - DTU Compute, Technical University of Denmark, Kongens Lyngby, Denmark
  - Computational and RNA Biology, University of Copenhagen, Copenhagen, Denmark
  - Centre for Genomic Medicine, Rigshospitalet, Copenhagen University Hospital, Copenhagen, Denmark
- Gunnar von Heijne
  - Science for Life Laboratory, Solna, Sweden
  - Department of Biochemistry and Biophysics, Stockholm University, Stockholm, Sweden
- Arne Elofsson
  - Science for Life Laboratory, Solna, Sweden
  - Department of Biochemistry and Biophysics, Stockholm University, Stockholm, Sweden
- Henrik Nielsen
  - Department of Health Technology, Section for Bioinformatics, Technical University of Denmark, Kongens Lyngby, Denmark
6. Akhter S, Kretzschmar WW, Nordal V, Delhomme N, Street NR, Nilsson O, Emanuelsson O, Sundström JF. Integrative Analysis of Three RNA Sequencing Methods Identifies Mutually Exclusive Exons of MADS-Box Isoforms During Early Bud Development in Picea abies. Front Plant Sci 2018; 9:1625. [PMID: 30483285 PMCID: PMC6243048 DOI: 10.3389/fpls.2018.01625] [Received: 06/15/2018] [Accepted: 10/18/2018] [Indexed: 05/06/2023]
Abstract
Recent efforts to sequence the genomes and transcriptomes of several gymnosperm species have revealed an increased complexity in certain gene families in gymnosperms as compared to angiosperms. One example of this is the gymnosperm sister clade to angiosperm TM3-like MADS-box genes, which at least in the conifer lineage has expanded in number of genes. We have previously identified a member of this sub-clade, the conifer gene DEFICIENS AGAMOUS LIKE 19 (DAL19), as being specifically upregulated in cone-setting shoots. Here, we show through Sanger sequencing of mRNA-derived cDNA and mapping to assembled conifer genomic sequences that DAL19 produces six mature mRNA splice variants in Picea abies. These splice variants use alternate first and last exons, while their four central exons constitute a core region present in all six transcripts. Thus, they are likely to be transcript isoforms. Quantitative Real-Time PCR revealed that two mutually exclusive first DAL19 exons are differentially expressed across meristems that will form either male or female cones, or vegetative shoots. Furthermore, mRNA in situ hybridization revealed that two mutually exclusive last DAL19 exons were expressed in a cell-specific pattern within bud meristems. Based on these findings in DAL19, we developed a sensitive approach to transcript isoform assembly from short-read sequencing of mRNA. We applied this method to 42 putative MADS-box core regions in P. abies, from which we assembled 1084 putative transcripts. We manually curated these transcripts to arrive at 933 assembled transcript isoforms of 38 putative MADS-box genes. 152 of these isoforms, which we assign to 28 putative MADS-box genes, were differentially expressed across eight female, male, and vegetative buds. We further provide evidence of the expression of 16 out of the 38 putative MADS-box genes by mapping PacBio Iso-Seq circular consensus reads derived from pooled sample sequencing to assembled transcripts. 
In summary, our analyses reveal the use of mutually exclusive exons of MADS-box gene isoforms during early bud development in P. abies, and we find that the large number of identified MADS-box transcripts in P. abies results not only from expansion of the gene family through gene duplication events but also from the generation of numerous splice variants.
Affiliation(s)
- Shirin Akhter
  - Linnean Center for Plant Biology, Uppsala BioCenter, Department of Plant Biology, Swedish University of Agricultural Sciences, Uppsala, Sweden
- Warren W. Kretzschmar
  - Science for Life Laboratory, Department of Gene Technology, School of Engineering Sciences in Biotechnology, Chemistry and Health, KTH Royal Institute of Technology, Solna, Sweden
- Veronika Nordal
  - Linnean Center for Plant Biology, Uppsala BioCenter, Department of Plant Biology, Swedish University of Agricultural Sciences, Uppsala, Sweden
- Nicolas Delhomme
  - Umeå Plant Science Centre, Department of Forest Genetics and Plant Physiology, Swedish University of Agricultural Sciences, Umeå, Sweden
- Nathaniel R. Street
  - Umeå Plant Science Centre, Department of Plant Physiology, Umeå University, Umeå, Sweden
- Ove Nilsson
  - Umeå Plant Science Centre, Department of Forest Genetics and Plant Physiology, Swedish University of Agricultural Sciences, Umeå, Sweden
- Olof Emanuelsson
  - Science for Life Laboratory, Department of Gene Technology, School of Engineering Sciences in Biotechnology, Chemistry and Health, KTH Royal Institute of Technology, Solna, Sweden
- Jens F. Sundström
  - Linnean Center for Plant Biology, Uppsala BioCenter, Department of Plant Biology, Swedish University of Agricultural Sciences, Uppsala, Sweden
  - Correspondence: Jens F. Sundström
7. Reimegård J, Kundu S, Pendle A, Irish VF, Shaw P, Nakayama N, Sundström JF, Emanuelsson O. Genome-wide identification of physically clustered genes suggests chromatin-level co-regulation in male reproductive development in Arabidopsis thaliana. Nucleic Acids Res 2017; 45:3253-3265. [PMID: 28175342 PMCID: PMC5389543 DOI: 10.1093/nar/gkx087] [Received: 01/11/2017] [Accepted: 01/31/2017] [Indexed: 12/02/2022]
Abstract
Co-expression of physically linked genes occurs surprisingly frequently in eukaryotes. Such chromosomal clustering may confer a selective advantage as it enables coordinated gene regulation at the chromatin level. We studied the chromosomal organization of genes involved in male reproductive development in Arabidopsis thaliana. We developed an in-silico tool to identify physical clusters of co-regulated genes from gene expression data. We identified 17 clusters (96 genes) involved in stamen development and acting downstream of the transcriptional activator MS1 (MALE STERILITY 1), which contains a PHD domain associated with chromatin re-organization. The clusters exhibited little gene homology or promoter element similarity, and largely overlapped with reported repressive histone marks. Experiments on a subset of the clusters suggested a link between expression activation and chromatin conformation: qRT-PCR and mRNA in situ hybridization showed that the clustered genes were up-regulated within 48 h after MS1 induction; out of 14 chromatin-remodeling mutants studied, expression of clustered genes was consistently down-regulated only in hta9/hta11, previously associated with metabolic cluster activation; DNA fluorescence in situ hybridization confirmed that transcriptional activation of the clustered genes was correlated with open chromatin conformation. Stamen development thus appears to involve transcriptional activation of physically clustered genes through chromatin de-condensation.
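To make the notion of a physical cluster of co-regulated genes concrete, the following is a minimal sketch that reports runs of adjacent genes along a chromosome that are all flagged as co-regulated. It is not the in-silico tool described in the abstract; the run-based definition, the `min_size` threshold, and the input layout are assumptions chosen for illustration.

```python
from typing import List, Tuple


def find_physical_clusters(
    genes_in_order: List[Tuple[str, bool]],
    min_size: int = 3,
) -> List[List[str]]:
    """Find runs of neighbouring genes that are all flagged as co-regulated.

    genes_in_order: (gene_id, is_coregulated) pairs sorted by chromosomal
        position on a single chromosome (hypothetical input format).
    min_size: smallest run length reported as a cluster (assumed threshold).
    """
    clusters: List[List[str]] = []
    run: List[str] = []
    for gene_id, flagged in genes_in_order:
        if flagged:
            run.append(gene_id)
        else:
            # A non-flagged gene ends the current run.
            if len(run) >= min_size:
                clusters.append(run)
            run = []
    # Close a run that extends to the end of the chromosome.
    if len(run) >= min_size:
        clusters.append(run)
    return clusters


if __name__ == "__main__":
    chromosome = [
        ("g1", True), ("g2", True), ("g3", True),
        ("g4", False), ("g5", True), ("g6", True),
    ]
    print(find_physical_clusters(chromosome))
```

A realistic tool would also need to assess whether such runs occur more often than expected by chance, which this sketch does not attempt.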
Affiliation(s)
- Johan Reimegård
  - Science for Life Laboratory, School of Biotechnology, Division of Gene Technology, KTH Royal Institute of Technology, Solna SE-171 65, Sweden
- Snehangshu Kundu
  - Department of Plant Biology, Uppsala BioCenter, Linnean Center for Plant Biology, Swedish University of Agricultural Sciences, Uppsala SE-750 07, Sweden
- Ali Pendle
  - Department of Cell and Developmental Biology, John Innes Centre, Norwich Research Park, Norwich NR4 7UH, UK
- Vivian F Irish
  - Department of Molecular, Cellular and Developmental Biology, Yale University, New Haven, CT 06520, USA
- Peter Shaw
  - Department of Cell and Developmental Biology, John Innes Centre, Norwich Research Park, Norwich NR4 7UH, UK
- Naomi Nakayama
  - Institute of Molecular Plant Science, SynthSys Centre for Synthetic and Systems Biology, and Centre for Science at Extreme Conditions, University of Edinburgh, King's Buildings, Edinburgh, UK
- Jens F Sundström
  - Department of Plant Biology, Uppsala BioCenter, Linnean Center for Plant Biology, Swedish University of Agricultural Sciences, Uppsala SE-750 07, Sweden
- Olof Emanuelsson
  - Science for Life Laboratory, School of Biotechnology, Division of Gene Technology, KTH Royal Institute of Technology, Solna SE-171 65, Sweden
8. Edsgärd D, Iglesias MJ, Reilly SJ, Hamsten A, Tornvall P, Odeberg J, Emanuelsson O. GeneiASE: Detection of condition-dependent and static allele-specific expression from RNA-seq data without haplotype information. Sci Rep 2016; 6:21134. [PMID: 26887787 PMCID: PMC4758070 DOI: 10.1038/srep21134] [Received: 10/08/2015] [Accepted: 01/18/2016] [Indexed: 12/20/2022]
Abstract
Allele-specific expression (ASE) is the imbalance in transcription between maternal and paternal alleles at a locus and can be probed in single individuals using massively parallel DNA sequencing technology. Assessing ASE within a single sample provides a static picture of the ASE, but the magnitude of ASE for a given transcript may vary between different biological conditions in an individual. Such condition-dependent ASE could indicate a genetic variation with a functional role in the phenotypic difference. We investigated ASE through RNA-sequencing of primary white blood cells from eight human individuals before and after the controlled induction of an inflammatory response, and detected condition-dependent and static ASE at 211 and 13021 variants, respectively. We developed a method, GeneiASE, to detect genes exhibiting static or condition-dependent ASE in single individuals. GeneiASE performed consistently over a range of read depths and ASE effect sizes, and did not require phasing of variants to estimate haplotypes. We observed condition-dependent ASE related to the inflammatory response in 19 genes, and static ASE in 1389 genes. Allele-specific expression was confirmed by validation of variants through real-time quantitative RT-PCR, with RNA-seq and RT-PCR ASE effect-size correlations r = 0.67 and r = 0.94 for static and condition-dependent ASE, respectively.
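As a simple illustration of what testing for allelic imbalance can look like, the sketch below applies an exact two-sided binomial test to the reference and alternative read counts at a single heterozygous variant, with a null hypothesis of equal expression from both alleles. This is a generic textbook test, not the GeneiASE method, which works at the gene level; the function name and the example counts are hypothetical.

```python
from math import comb


def allelic_imbalance_p_value(ref_count: int, alt_count: int, p_null: float = 0.5) -> float:
    """Exact two-sided binomial p-value for allelic imbalance at one variant.

    Sums the probabilities of all outcomes that are no more likely than the
    observed reference-allele count under the null proportion p_null.
    """
    n = ref_count + alt_count
    if n == 0:
        return 1.0  # No reads, so there is no evidence of imbalance.
    probs = [comb(n, k) * p_null**k * (1 - p_null) ** (n - k) for k in range(n + 1)]
    observed = probs[ref_count]
    # Small relative tolerance guards against floating-point ties.
    p_value = sum(q for q in probs if q <= observed * (1 + 1e-9))
    return min(1.0, p_value)


if __name__ == "__main__":
    print(allelic_imbalance_p_value(5, 5))   # balanced counts
    print(allelic_imbalance_p_value(10, 0))  # strongly imbalanced counts
```

A single-variant test like this gives only the static picture for one sample; comparing conditions, as the abstract describes, requires modelling how the imbalance changes between them.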
Affiliation(s)
- Daniel Edsgärd
  - KTH Royal Institute of Technology, Science for Life Laboratory, School of Biotechnology, Division of Gene Technology, SE-171 65, Solna, Sweden
- Maria Jesus Iglesias
  - Atherosclerosis Research Unit, Department of Medicine Solna, Karolinska Institutet, Center for Molecular Medicine, and Department of Cardiology, Karolinska University Hospital, Stockholm, Sweden
  - KTH Royal Institute of Technology, Science for Life Laboratory, School of Biotechnology, Division of Proteomics, SE-171 65, Solna, Sweden
- Sarah-Jayne Reilly
  - Atherosclerosis Research Unit, Department of Medicine Solna, Karolinska Institutet, Center for Molecular Medicine, and Department of Cardiology, Karolinska University Hospital, Stockholm, Sweden
- Anders Hamsten
  - Atherosclerosis Research Unit, Department of Medicine Solna, Karolinska Institutet, Center for Molecular Medicine, and Department of Cardiology, Karolinska University Hospital, Stockholm, Sweden
- Per Tornvall
  - Department of Clinical Science and Education, Södersjukhuset, Karolinska Institutet, Stockholm, Sweden
- Jacob Odeberg
  - Atherosclerosis Research Unit, Department of Medicine Solna, Karolinska Institutet, Center for Molecular Medicine, and Department of Cardiology, Karolinska University Hospital, Stockholm, Sweden
  - KTH Royal Institute of Technology, Science for Life Laboratory, School of Biotechnology, Division of Proteomics, SE-171 65, Solna, Sweden
  - Department of Medicine, Centre for Hematology, Karolinska University Hospital and Karolinska Institutet, Solna, Sweden
- Olof Emanuelsson
  - KTH Royal Institute of Technology, Science for Life Laboratory, School of Biotechnology, Division of Gene Technology, SE-171 65, Solna, Sweden
9. Song Y, Giske CG, Gille-Johnson P, Emanuelsson O, Lundeberg J, Gyarmati P. Nuclease-assisted suppression of human DNA background in sepsis. PLoS One 2014; 9:e103610. [PMID: 25076135 PMCID: PMC4116218 DOI: 10.1371/journal.pone.0103610] [Received: 04/30/2014] [Accepted: 06/29/2014] [Indexed: 11/18/2022]
Abstract
Sepsis is a severe medical condition characterized by a systemic inflammatory response of the body caused by pathogenic microorganisms in the bloodstream. Blood or plasma is typically used for diagnosis; both contain large amounts of human DNA, greatly exceeding the DNA of microbial origin. In order to enrich bacterial DNA, we applied the C0t effect to reduce human DNA background: a model system was set up with human and Escherichia coli (E. coli) DNA to mimic the conditions of bloodstream infections, and this system was adapted to plasma and blood samples from septic patients. As a consequence of the C0t effect, abundant DNA hybridizes faster than rare DNA. Following denaturation and re-hybridization, the amount of abundant DNA can be decreased with the application of double strand specific nucleases, leaving the non-hybridized rare DNA intact. Our experiments show that human DNA concentration can be reduced approximately 100,000-fold without affecting the E. coli DNA concentration in a model system with similarly sized amplicons. With clinical samples, the human DNA background was decreased 100-fold, as bacterial genomes are approximately 1,000-fold smaller compared to the human genome. According to our results, background suppression can be a valuable tool to enrich rare DNA in clinical samples where a high amount of background DNA can be found.
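The statement that abundant DNA hybridizes faster than rare DNA follows from ideal second-order reassociation kinetics, where the fraction of a sequence still single-stranded after time t is 1 / (1 + k * C0 * t), with C0 the initial concentration of that sequence and k a rate constant. The sketch below evaluates this expression for an abundant and a rare species; the rate constant, concentrations, and time are arbitrary illustrative values, not measurements from the study.

```python
def fraction_single_stranded(c0: float, k: float, t: float) -> float:
    """Fraction of a DNA species still single-stranded after time t.

    Ideal second-order reassociation: C / C0 = 1 / (1 + k * C0 * t).
    c0: initial concentration of the species, k: rate constant, t: time
    (units are arbitrary here, as long as k * c0 * t is dimensionless).
    """
    return 1.0 / (1.0 + k * c0 * t)


if __name__ == "__main__":
    k = 1.0            # illustrative rate constant (assumed)
    t = 10.0           # illustrative re-hybridization time (assumed)
    abundant_c0 = 1e2  # stands in for abundant background DNA (assumed)
    rare_c0 = 1e-3     # stands in for rare target DNA (assumed)

    # The abundant species is almost fully re-hybridized (double-stranded),
    # so a double strand specific nuclease can remove most of it, while the
    # rare species remains largely single-stranded and is spared.
    print(fraction_single_stranded(abundant_c0, k, t))
    print(fraction_single_stranded(rare_c0, k, t))
```

With these illustrative values the abundant species is almost entirely double-stranded while the rare species is still about 99% single-stranded, which is the asymmetry the nuclease treatment exploits.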
Affiliation(s)
- Yajing Song
  - Royal Institute of Technology, Science for Life Laboratory, Stockholm, Sweden
- Christian G. Giske
  - Karolinska Institutet, Department of Microbiology, Tumor and Cell Biology, Stockholm, Sweden
  - Karolinska University Hospital, Department of Clinical Microbiology, Stockholm, Sweden
- Patrik Gille-Johnson
  - Division of Infectious Diseases, Department of Medicine, Karolinska Institutet, Stockholm, Sweden
- Olof Emanuelsson
  - Royal Institute of Technology, Science for Life Laboratory, Stockholm, Sweden
- Joakim Lundeberg
  - Royal Institute of Technology, Science for Life Laboratory, Stockholm, Sweden
- Peter Gyarmati
  - Royal Institute of Technology, Science for Life Laboratory, Stockholm, Sweden
  - Karolinska Institutet, Department of Microbiology, Tumor and Cell Biology, Stockholm, Sweden
  - Karolinska University Hospital, Department of Clinical Microbiology, Stockholm, Sweden
10
|
Sigurgeirsson B, Emanuelsson O, Lundeberg J. Analysis of stranded information using an automated procedure for strand specific RNA sequencing. BMC Genomics 2014; 15:631. [PMID: 25070246 PMCID: PMC4247151 DOI: 10.1186/1471-2164-15-631] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.5] [Received: 02/25/2014] [Accepted: 07/10/2014] [Indexed: 01/19/2023] Open
Abstract
Background: Strand specific RNA sequencing is rapidly replacing conventional cDNA sequencing as an approach for assessing information about the transcriptome. Alongside improved laboratory protocols, the development of bioinformatical tools is steadily progressing. In the current procedure the Illumina TruSeq library preparation kit is used, along with additional reagents, to make stranded libraries in an automated fashion, which are then sequenced on Illumina HiSeq 2000. By the use of freely available bioinformatical tools we show, through quality metrics, that the protocol is robust and reproducible. We further highlight the practicality of strand specific libraries by comparing expression of strand specific libraries to non-stranded libraries, by looking at known antisense transcription of pseudogenes and by identifying novel transcription. Furthermore, two ribosomal depletion kits, RiboMinus and RiboZero, are compared, as are two sequence aligners, Tophat2 and STAR. Results: The non-stranded Illumina TruSeq kit can be adapted to generate strand specific libraries and can be used to access detailed information on the transcriptome. The RiboZero kit is very effective in removing ribosomal RNA from total RNA, and the STAR aligner produces high mapping yield in a short time. Strand specific data gives more detailed and correct results than non-stranded data does, as we show when estimating expression values and in assembling transcripts. Even well annotated genomes need improvements and corrections, which can be achieved using strand specific data. Conclusions: Researchers in the field should strive to use strand specific data; it allows for more confidence in the data analysis and is less likely to lead to false conclusions. If faced with analysing non-stranded data, researchers should be well aware of the caveats of that approach.
Electronic supplementary material: The online version of this article (doi:10.1186/1471-2164-15-631) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Joakim Lundeberg
  - Science for Life Laboratory, School of Biotechnology, Royal Institute of Technology (KTH), Tomtebodavägen 23A, 17165 Solna, Stockholm, Sweden.
|
11
|
Abstract
RNA sequencing has become widely used in gene expression profiling experiments. Prior to any RNA sequencing experiment the quality of the RNA must be measured to assess whether or not it can be used for further downstream analysis. The RNA integrity number (RIN) is a scale used to measure the quality of RNA that runs from 1 (completely degraded) to 10 (intact). Ideally, samples with high RIN (8) are used in RNA sequencing experiments. RNA, however, is a fragile molecule which is susceptible to degradation and obtaining high quality RNA is often hard, or even impossible when extracting RNA from certain clinical tissues. Thus, occasionally, working with low quality RNA is the only option the researcher has. Here we investigate the effects of RIN on RNA sequencing and suggest a computational method to handle data from samples with low quality RNA which also enables reanalysis of published datasets. Using RNA from a human cell line we generated and sequenced samples with varying RINs and illustrate what effect the RIN has on the basic procedure of RNA sequencing; both quality aspects and differential expression. We show that the RIN has systematic effects on gene coverage, false positives in differential expression and the quantification of duplicate reads. We introduce 3' tag counting (3TC) as a computational approach to reliably estimate differential expression for samples with low RIN. We show that using the 3TC method in differential expression analysis significantly reduces false positives when comparing samples with different RIN, while retaining reasonable sensitivity.
Affiliation(s)
- Benjamín Sigurgeirsson
  - Science for Life Laboratory, School of Biotechnology, Royal Institute of Technology (KTH), Stockholm, Solna, Sweden
- Olof Emanuelsson
  - Science for Life Laboratory, School of Biotechnology, Royal Institute of Technology (KTH), Stockholm, Solna, Sweden
- Joakim Lundeberg
  - Science for Life Laboratory, School of Biotechnology, Royal Institute of Technology (KTH), Stockholm, Solna, Sweden
  - * E-mail:
|
12
|
Wiegand S, Meier D, Seehafer C, Malicki M, Hofmann P, Schmith A, Winckler T, Földesi B, Boesler B, Nellen W, Reimegård J, Käller M, Hällman J, Emanuelsson O, Avesson L, Söderbom F, Hammann C. The Dictyostelium discoideum RNA-dependent RNA polymerase RrpC silences the centromeric retrotransposon DIRS-1 post-transcriptionally and is required for the spreading of RNA silencing signals. Nucleic Acids Res 2013; 42:3330-45. [PMID: 24369430 PMCID: PMC3950715 DOI: 10.1093/nar/gkt1337] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.9] [Indexed: 12/16/2022] Open
Abstract
Dictyostelium intermediate repeat sequence 1 (DIRS-1) is the founding member of a poorly characterized class of retrotransposable elements that contain inverse long terminal repeats and tyrosine recombinase instead of DDE-type integrase enzymes. In Dictyostelium discoideum, DIRS-1 forms clusters that adopt the function of centromeres, rendering tight retrotransposition control critical to maintaining chromosome integrity. We report that in deletion strains of the RNA-dependent RNA polymerase RrpC, full-length and shorter DIRS-1 messenger RNAs are strongly enriched. Shorter versions of a hitherto unknown long non-coding RNA in DIRS-1 antisense orientation are also enriched in rrpC– strains. Concurrent with the accumulation of long transcripts, the vast majority of small (21 mer) DIRS-1 RNAs vanish in rrpC– strains. RNASeq reveals an asymmetric distribution of the DIRS-1 small RNAs, both along DIRS-1 and with respect to sense and antisense orientation. We show that RrpC is required for post-transcriptional DIRS-1 silencing and also for spreading of RNA silencing signals. Finally, DIRS-1 mis-regulation in the absence of RrpC leads to retrotransposon mobilization. In summary, our data reveal RrpC as a key player in the silencing of centromeric retrotransposon DIRS-1. RrpC acts at the post-transcriptional level and is involved in spreading of RNA silencing signals, both in the 5′ and 3′ directions.
Affiliation(s)
- Stephan Wiegand
  - Ribogenetics@Biochemistry Lab, School of Engineering and Science, Molecular Life Sciences Research Center, Jacobs University Bremen, Campus Ring 1, DE-28759 Bremen, Germany, Abteilung Genetik, Universität Kassel, Heinrich-Plett-Strasse 40, DE-34132 Kassel, Germany, Friedrich-Schiller-Universität Jena, Institut für Pharmazie, Lehrstuhl für Pharmazeutische Biologie, Semmelweisstraße 10, DE-07743 Jena, Germany, Division of Gene Technology, KTH Royal Institute of Technology, Science for Life Laboratory (SciLifeLab Stockholm), School of Biotechnology, SE-171 65 Solna, Sweden, Garvan Institute of Medical Research, 384 Victoria St Darlinghurst, NSW 2010, Australia, Department of Cell and Molecular Biology, Biomedical Center, Uppsala University, PO Box 596, S-75124 Uppsala, Sweden and Science for Life Laboratory, SE-75124 Uppsala, Sweden
|
13
|
Uddenberg D, Reimegård J, Clapham D, Almqvist C, von Arnold S, Emanuelsson O, Sundström JF. Early cone setting in Picea abies acrocona is associated with increased transcriptional activity of a MADS box transcription factor. Plant Physiol 2013; 161:813-23. [PMID: 23221834 PMCID: PMC3561021 DOI: 10.1104/pp.112.207746] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.2] [Received: 09/20/2012] [Accepted: 12/06/2012] [Indexed: 05/07/2023]
Abstract
Conifers normally go through a long juvenile period, for Norway spruce (Picea abies) around 20 to 25 years, before developing male and female cones. We have grown plants from inbred crosses of a naturally occurring spruce mutant (acrocona). One-fourth of the segregating acrocona plants initiate cones already in their second growth cycle, suggesting control by a single locus. The early cone-setting properties of the acrocona mutant were utilized to identify candidate genes involved in vegetative-to-reproductive phase change in Norway spruce. Poly(A)+ RNA samples from apical and basal shoots of cone-setting and non-cone-setting plants were subjected to high-throughput sequencing (RNA-seq). We assembled and investigated 33,383 expressed putative protein-coding acrocona transcripts. Eight transcripts were differentially expressed between selected sample pairs. One of these (Acr42124_1) was significantly up-regulated in apical shoot samples from cone-setting acrocona plants, and the encoded protein belongs to the MADS box gene family of transcription factors. Using quantitative real-time polymerase chain reaction with independently derived plant material, we confirmed that the MADS box gene is up-regulated in both needles and buds of cone-inducing shoots when reproductive identity is determined. Our results constitute important steps for the development of a rapid cycling model system that can be used to study gene function in conifers. In addition, our data suggest the involvement of a MADS box transcription factor in the vegetative-to-reproductive phase change in Norway spruce.
Affiliation(s)
- David Clapham
  - Department of Plant Biology and Forest Genetics, Uppsala BioCenter, Swedish University of Agricultural Sciences and Linnean Center for Plant Biology, SE–750 07 Uppsala, Sweden (D.U., D.C., S.v.A., J.F.S.)
  - KTH-Royal Institute of Technology, Science for Life Laboratory, School of Biotechnology, Division of Gene Technology, SE–171 65 Solna, Sweden (J.R., O.E.)
  - Forestry Research Institute of Sweden, Uppsala Science Park, SE–751 83 Uppsala, Sweden (C.A.)
- Curt Almqvist
  - Department of Plant Biology and Forest Genetics, Uppsala BioCenter, Swedish University of Agricultural Sciences and Linnean Center for Plant Biology, SE–750 07 Uppsala, Sweden (D.U., D.C., S.v.A., J.F.S.)
  - KTH-Royal Institute of Technology, Science for Life Laboratory, School of Biotechnology, Division of Gene Technology, SE–171 65 Solna, Sweden (J.R., O.E.)
  - Forestry Research Institute of Sweden, Uppsala Science Park, SE–751 83 Uppsala, Sweden (C.A.)
- Sara von Arnold
  - Department of Plant Biology and Forest Genetics, Uppsala BioCenter, Swedish University of Agricultural Sciences and Linnean Center for Plant Biology, SE–750 07 Uppsala, Sweden (D.U., D.C., S.v.A., J.F.S.)
  - KTH-Royal Institute of Technology, Science for Life Laboratory, School of Biotechnology, Division of Gene Technology, SE–171 65 Solna, Sweden (J.R., O.E.)
  - Forestry Research Institute of Sweden, Uppsala Science Park, SE–751 83 Uppsala, Sweden (C.A.)
|
14
|
Höiom V, Edsgärd D, Helgadottir H, Eriksson H, All-Ericsson C, Tuominen R, Ivanova I, Lundeberg J, Emanuelsson O, Hansson J. Hereditary uveal melanoma: a report of a germline mutation in BAP1. Genes Chromosomes Cancer 2013; 52:378-84. [PMID: 23341325 DOI: 10.1002/gcc.22035] [Citation(s) in RCA: 49] [Impact Index Per Article: 4.5] [Received: 06/20/2012] [Accepted: 10/29/2012] [Indexed: 11/07/2022] Open
Abstract
Melanoma of the eye is a rare and distinct subtype of melanoma, which is only rarely familial. However, cases of uveal melanoma (UM) have been found in families with mixed cancer syndromes. Here, we describe a comprehensive search for inherited genetic variation in a family with multiple cases of UM but no aggregation of other cancer diagnoses. The proband is a woman diagnosed with UM at 16 years who within 6 months developed liver metastases. We also identified two older paternal relatives of the proband who had died from UM. We performed exome sequencing of germline DNA from members of the affected family. Exome-wide analysis identified a novel loss-of-function mutation in the BAP1 gene, previously suggested as a tumor suppressor. The mutation segregated with the UM phenotype in this family, and we detected a loss of the wild-type allele in the UM tumor of the proband, strongly supporting a causative association with UM. Screening of BAP1 germline mutations in families predisposed for UM may be used to identify individuals at increased risk of disease. Such individuals may then be enrolled in preventive programs and regular screenings to facilitate early detection and thereby improve prognosis.
Affiliation(s)
- Veronica Höiom
  - Department of Oncology and Pathology, Karolinska Institutet, Karolinska University Hospital, Solna, Stockholm, Sweden.
|
15
|
Hansson J, All-Eriksson C, Helgadottir H, Edsgard D, Tuominen R, Lundeberg J, Ivanova I, Emanuelsson O, Hoiom V. BAP1: The first mutated gene causing familial uveal melanoma. J Clin Oncol 2012. [DOI: 10.1200/jco.2012.30.15_suppl.10521] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/20/2022] Open
Abstract
10521 Background: Uveal melanoma (UM) is a rare malignancy with a poor prognosis. Familial predisposition to UM is rare and accounts for only a few percent of all cases. The genetic background of hereditary UM is unknown and the aim of our project was to identify susceptibility gene(s) for UM. Methods: We identified a family with hereditary predisposition for UM – the proband of which is a young female diagnosed with UM at age 16 who within 6 months developed liver metastases. We also identified two older paternal relatives who were diagnosed with UM at 39 and 44 years of age, respectively. We performed massively parallel sequencing using the Illumina HiSeq 2000 technology on germline DNA from the proband, her parents and a healthy sibling. After QC and mapping against the human reference genome the average coverage across the exome was between 35 and 86 for the four sequenced samples. Results: Out of more than 260,000 single nucleotide variants (SNVs) and small insertion/deletion variants (indels), 51 gene variants remained after filtering for variants that were novel, shared by the affected proband and her father (considered an obligate mutation carrier) but not by the healthy mother, and of predicted functional importance and/or located within strongly conserved regions. The strongest candidate among these was a loss-of-function variant in the BAP1 gene, since BAP1 has been suggested as a tumor suppressor in several cancer-related syndromes, including cases of UM. The sequence data indicated an insertion of one base pair in exon 3 of the BAP1 gene, causing a frame-shift and subsequently a truncated protein lacking all its functional domains. The mutation was also present in UM tumor tissue from the two deceased paternal relatives and was found to segregate with the UM phenotype in the family. We also detected loss of heterozygosity in the tumor of the proband, supporting BAP1 as the causative gene in this family.
Conclusions: The identification of BAP1 as the gene responsible for this syndrome is the first demonstration of a germline mutation causing UM. This enables us to identify and monitor risk individuals belonging to mutation positive families with predisposition to UM, and possibly other cancer syndromes. We are continuously screening other cases of familial UM for mutations in BAP1.
|
16
|
Iglesias MJ, Reilly SJ, Emanuelsson O, Sennblad B, Pirmoradian Najafabadi M, Folkersen L, Mälarstig A, Lagergren J, Eriksson P, Hamsten A, Odeberg J. Combined chromatin and expression analysis reveals specific regulatory mechanisms within cytokine genes in the macrophage early immune response. PLoS One 2012; 7:e32306. [PMID: 22384210 PMCID: PMC3288078 DOI: 10.1371/journal.pone.0032306] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.3] [Received: 09/23/2011] [Accepted: 01/26/2012] [Indexed: 11/19/2022] Open
Abstract
Macrophages play a critical role in innate immunity, and the expression of early response genes orchestrates much of the initial response of the immune system. Macrophages undergo extensive transcriptional reprogramming in response to inflammatory stimuli such as lipopolysaccharide (LPS). To identify gene transcription regulation patterns involved in early innate immune responses, we used two genome-wide approaches: gene expression profiling and chromatin immunoprecipitation-sequencing (ChIP-seq) analysis. We examined the effect of 2 hrs of LPS stimulation on early gene expression and its relation to chromatin remodeling (H3 acetylation; H3Ac) and promoter binding of Sp1 and RNA polymerase II phosphorylated at serine 5 (S5P RNAPII), which is a marker for transcriptional initiation. Our results indicate novel and alternative gene regulatory mechanisms for certain proinflammatory genes. We identified two groups of up-regulated inflammatory genes with respect to chromatin modification and promoter features. One group, including highly up-regulated genes such as tumor necrosis factor (TNF), was characterized by H3Ac, high CpG content and lack of TATA boxes. The second group, containing inflammatory mediators (interleukins and CCL chemokines), was up-regulated upon LPS stimulation despite lacking H3Ac in their annotated promoters, which were low in CpG content but did contain TATA boxes. Genome-wide analysis showed that few H3Ac peaks were unique to either +/-LPS condition. However, within these, an unpacking/expansion of already existing H3Ac peaks was observed upon LPS stimulation. In contrast, a significant proportion of S5P RNAPII peaks (approximately 40%) was unique to either condition. Furthermore, data indicated a large portion of previously unannotated TSSs, particularly in LPS-stimulated macrophages, where only 28% of unique S5P RNAPII peaks overlap annotated promoters.
The regulation of the inflammatory response appears to occur in a very specific manner at the chromatin level for specific genes and this study highlights the level of fine-tuning that occurs in the immune response.
Affiliation(s)
- Maria Jesus Iglesias
  - Atherosclerosis Research Unit, Department of Medicine, Centre for Molecular Medicine, Karolinska Institute, Stockholm, Sweden.
|
17
|
Klevebring D, Fagerberg L, Lundberg E, Emanuelsson O, Uhlén M, Lundeberg J. Analysis of transcript and protein overlap in a human osteosarcoma cell line. BMC Genomics 2010; 11:684. [PMID: 21126332 PMCID: PMC3014981 DOI: 10.1186/1471-2164-11-684] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.9] [Received: 03/16/2010] [Accepted: 12/02/2010] [Indexed: 11/23/2022] Open
Abstract
Background: An interesting field of research in genomics and proteomics is to compare the overlap between the transcriptome and the proteome. Recently, the tools to analyse gene and protein expression on a whole-genome scale have been improved, including the availability of the new generation sequencing instruments and high-throughput antibody-based methods to analyze the presence and localization of proteins. In this study, we used massive transcriptome sequencing (RNA-seq) to investigate the transcriptome of a human osteosarcoma cell line and compared the expression levels with protein data obtained in situ from antibody-based immunohistochemistry (IHC) and immunofluorescence microscopy (IF). Results: A large-scale analysis based on 2749 genes was performed, corresponding to approximately 13% of the protein coding genes in the human genome. We found the presence of both RNA and proteins for a large fraction of the analyzed genes, with 60% of the analyzed human genes detected by all three methods. Only 34 genes (1.2%) were not detected on the transcriptional or protein level with any method. Our data suggest that the majority of the human genes are expressed at detectable transcript or protein levels in this cell line. Since the reliability of antibodies depends on possible cross-reactivity, we compared the RNA and protein data using antibodies with different reliability scores based on various criteria, including Western blot analysis. Gene products detected in all three platforms generally have good antibody validation scores, while those detected only by antibodies, but not by RNA sequencing, generally consist of more low-scoring antibodies. Conclusion: This suggests that some antibodies are staining the cells in an unspecific manner, and that assessment of transcript presence by RNA-seq can provide guidance for validation of the corresponding antibodies.
Affiliation(s)
- Daniel Klevebring
  - Science for Life Laboratory, School of Biotechnology, Division of Gene Technology, Royal Institute of Technology, SE-171 65 Solna, Sweden
|
18
|
Klevebring D, Bjursell M, Emanuelsson O, Lundeberg J. In-depth transcriptome analysis reveals novel TARs and prevalent antisense transcription in human cell lines. PLoS One 2010; 5:e9762. [PMID: 20360838 PMCID: PMC2845605 DOI: 10.1371/journal.pone.0009762] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.9] [Received: 12/03/2009] [Accepted: 02/22/2010] [Indexed: 01/05/2023] Open
Abstract
Several recent studies have indicated that transcription is pervasive in regions outside of protein coding genes and that short antisense transcripts can originate from the promoter and terminator regions of genes. Here we investigate transcription of fragments longer than 200 nucleotides, focusing on antisense transcription for known protein coding genes and intergenic transcription. We find that roughly 12% to 16% of all reads that originate from promoter and terminator regions, respectively, map antisense to the gene in question. Furthermore, we detect a high number of novel transcriptionally active regions (TARs) that are generally expressed at a lower level than protein coding genes. We find that the correlation between RNA-seq data and microarray data is dependent on the gene length, with longer genes showing a better correlation. We detect high antisense transcriptional activity from promoter, terminator and intron regions of protein-coding genes and identify a vast number of previously unidentified TARs, including putative novel EGFR transcripts. This shows that in-depth analysis of the transcriptome using RNA-seq is a valuable tool for understanding complex transcriptional events. Furthermore, the development of new algorithms for estimation of gene expression from RNA-seq data is necessary to minimize length bias.
Affiliation(s)
- Daniel Klevebring
  - Division of Gene Technology, School of Biotechnology, AlbaNova University Center, Royal Institute of Technology, Stockholm, Sweden
- Magnus Bjursell
  - Division of Gene Technology, School of Biotechnology, AlbaNova University Center, Royal Institute of Technology, Stockholm, Sweden
- Olof Emanuelsson
  - Division of Gene Technology, School of Biotechnology, AlbaNova University Center, Royal Institute of Technology, Stockholm, Sweden
- Joakim Lundeberg
  - Division of Gene Technology, School of Biotechnology, AlbaNova University Center, Royal Institute of Technology, Stockholm, Sweden
  - * E-mail:
|
19
|
Zybailov B, Rutschow H, Friso G, Rudella A, Emanuelsson O, Sun Q, van Wijk KJ. Sorting signals, N-terminal modifications and abundance of the chloroplast proteome. PLoS One 2008; 3:e1994. [PMID: 18431481 PMCID: PMC2291561 DOI: 10.1371/journal.pone.0001994] [Citation(s) in RCA: 525] [Impact Index Per Article: 32.8] [Received: 01/04/2008] [Accepted: 03/06/2008] [Indexed: 01/24/2023] Open
Abstract
Characterization of the chloroplast proteome is needed to understand the essential contribution of the chloroplast to plant growth and development. Here we present a large scale analysis by nanoLC-Q-TOF and nanoLC-LTQ-Orbitrap mass spectrometry (MS) of ten independent chloroplast preparations from Arabidopsis thaliana which unambiguously identified 1325 proteins. Novel proteins include various kinases and putative nucleotide binding proteins. Based on repeated and independent MS based protein identifications requiring multiple matched peptide sequences, as well as literature, 916 nuclear-encoded proteins were assigned with high confidence to the plastid, of which 86% had a predicted chloroplast transit peptide (cTP). The protein abundance of soluble stromal proteins was calculated from normalized spectral counts from LTQ-Orbitrap analysis and was found to cover four orders of magnitude. Comparison to gel-based quantification demonstrates that ‘spectral counting’ can provide large scale protein quantification for Arabidopsis. This quantitative information was used to determine possible biases for protein targeting prediction by TargetP and also to understand the significance of protein contaminants. The abundance data for 550 stromal proteins was used to understand abundance of metabolic pathways and chloroplast processes. We highlight the abundance of 48 stromal proteins involved in post-translational proteome homeostasis (including aminopeptidases, proteases, deformylases, chaperones, protein sorting components) and discuss the biological implications. N-terminal modifications were identified for a subset of nuclear- and chloroplast-encoded proteins and a novel N-terminal acetylation motif was discovered. Analysis of cTPs and their cleavage sites of Arabidopsis chloroplast proteins, as well as their predicted rice homologues, identified new species-dependent features, which will facilitate improved subcellular localization prediction.
No evidence was found for suggested targeting via the secretory system. This study provides the most comprehensive chloroplast proteome analysis to date and an expanded Plant Proteome Database (PPDB) in which all MS data are projected on identified gene models.
Affiliation(s)
- Boris Zybailov
  - Department of Plant Biology, Cornell University, Ithaca, New York, United States of America
- Heidi Rutschow
  - Department of Plant Biology, Cornell University, Ithaca, New York, United States of America
- Giulia Friso
  - Department of Plant Biology, Cornell University, Ithaca, New York, United States of America
- Andrea Rudella
  - Department of Plant Biology, Cornell University, Ithaca, New York, United States of America
- Olof Emanuelsson
  - Stockholm Bioinformatics Center, AlbaNova, Stockholm University, Stockholm, Sweden
- Qi Sun
  - Computation Biology Service Unit, Cornell Theory Center, Cornell University, Ithaca, New York, United States of America
- Klaas J. van Wijk
  - Department of Plant Biology, Cornell University, Ithaca, New York, United States of America
  - * To whom correspondence should be addressed. E-mail:
|
20
|
Euskirchen GM, Rozowsky JS, Wei CL, Lee WH, Zhang ZD, Hartman S, Emanuelsson O, Stolc V, Weissman S, Gerstein MB, Ruan Y, Snyder M. Mapping of transcription factor binding regions in mammalian cells by ChIP: comparison of array- and sequencing-based technologies. Genome Res 2007; 17:898-909. [PMID: 17568005 PMCID: PMC1891348 DOI: 10.1101/gr.5583007] [Citation(s) in RCA: 170] [Impact Index Per Article: 10.0] [Indexed: 11/24/2022]
Abstract
Recent progress in mapping transcription factor (TF) binding regions can largely be credited to chromatin immunoprecipitation (ChIP) technologies. We compared strategies for mapping TF binding regions in mammalian cells using two different ChIP schemes: ChIP with DNA microarray analysis (ChIP-chip) and ChIP with DNA sequencing (ChIP-PET). We first investigated parameters central to obtaining robust ChIP-chip data sets by analyzing STAT1 targets in the ENCODE regions of the human genome, and then compared ChIP-chip to ChIP-PET. We devised methods for scoring and comparing results among various tiling arrays and examined parameters such as DNA microarray format, oligonucleotide length, hybridization conditions, and the use of competitor Cot-1 DNA. The best performance was achieved with high-density oligonucleotide arrays, oligonucleotides ≥50 bases (b), the presence of competitor Cot-1 DNA and hybridizations conducted in microfluidics stations. When target identification was evaluated as a function of array number, 80%-86% of targets were identified with three or more arrays. Comparison of ChIP-chip with ChIP-PET revealed strong agreement for the highest ranked targets with less overlap for the low ranked targets. With advantages and disadvantages unique to each approach, we found that ChIP-chip and ChIP-PET are frequently complementary in their relative abilities to detect STAT1 targets for the lower ranked targets; each method detected validated targets that were missed by the other method. The most comprehensive list of STAT1 binding regions is obtained by merging results from ChIP-chip and ChIP-sequencing. Overall, this study provides information for robust identification, scoring, and validation of TF targets using ChIP-based technologies.
Collapse
Affiliation(s)
- Ghia M. Euskirchen
- Department of Molecular, Cellular and Developmental Biology, Yale University, New Haven, Connecticut 06520-8103, USA
- Joel S. Rozowsky
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520-8114, USA
- Zhengdong D. Zhang
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520-8114, USA
- Stephen Hartman
- Department of Molecular, Cellular and Developmental Biology, Yale University, New Haven, Connecticut 06520-8103, USA
- Olof Emanuelsson
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520-8114, USA
- Viktor Stolc
- Center for Nanotechnology, NASA Ames Research Center, Moffett Field, California 94035, USA
- Sherman Weissman
- Department of Genetics, Yale University School of Medicine, New Haven, Connecticut 06520-8005, USA
- Mark B. Gerstein
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520-8114, USA
- Yijun Ruan
- Genome Institute of Singapore, Singapore 138672
- Michael Snyder
- Department of Molecular, Cellular and Developmental Biology, Yale University, New Haven, Connecticut 06520-8103, USA
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520-8114, USA
- Corresponding author. Fax: (203) 432-6161
21
Abstract
Determining the subcellular localization of a protein is an important first step toward understanding its function. Here, we describe the properties of three well-known N-terminal sequence motifs directing proteins to the secretory pathway, mitochondria and chloroplasts, and sketch a brief history of methods to predict subcellular localization based on these sorting signals and other sequence properties. We then outline how to use a number of internet-accessible tools to arrive at a reliable subcellular localization prediction for eukaryotic and prokaryotic proteins. In particular, we provide detailed step-by-step instructions for the coupled use of the amino-acid sequence-based predictors TargetP, SignalP, ChloroP and TMHMM, which are all hosted at the Center for Biological Sequence Analysis, Technical University of Denmark. In addition, we describe and provide web references to other useful subcellular localization predictors. Finally, we discuss predictive performance measures in general and the performance of TargetP and SignalP in particular.
Affiliation(s)
- Olof Emanuelsson
- Stockholm Bioinformatics Center, Albanova, Stockholm University, SE-10691 Stockholm, Sweden
22
Birney E, Stamatoyannopoulos JA, Dutta A, Guigó R, Gingeras TR, Margulies EH, Weng Z, Snyder M, Dermitzakis ET, Thurman RE, Kuehn MS, Taylor CM, Neph S, Koch CM, Asthana S, Malhotra A, Adzhubei I, Greenbaum JA, Andrews RM, Flicek P, Boyle PJ, Cao H, Carter NP, Clelland GK, Davis S, Day N, Dhami P, Dillon SC, Dorschner MO, Fiegler H, Giresi PG, Goldy J, Hawrylycz M, Haydock A, Humbert R, James KD, Johnson BE, Johnson EM, Frum TT, Rosenzweig ER, Karnani N, Lee K, Lefebvre GC, Navas PA, Neri F, Parker SCJ, Sabo PJ, Sandstrom R, Shafer A, Vetrie D, Weaver M, Wilcox S, Yu M, Collins FS, Dekker J, Lieb JD, Tullius TD, Crawford GE, Sunyaev S, Noble WS, Dunham I, Denoeud F, Reymond A, Kapranov P, Rozowsky J, Zheng D, Castelo R, Frankish A, Harrow J, Ghosh S, Sandelin A, Hofacker IL, Baertsch R, Keefe D, Dike S, Cheng J, Hirsch HA, Sekinger EA, Lagarde J, Abril JF, Shahab A, Flamm C, Fried C, Hackermüller J, Hertel J, Lindemeyer M, Missal K, Tanzer A, Washietl S, Korbel J, Emanuelsson O, Pedersen JS, Holroyd N, Taylor R, Swarbreck D, Matthews N, Dickson MC, Thomas DJ, Weirauch MT, Gilbert J, Drenkow J, Bell I, Zhao X, Srinivasan KG, Sung WK, Ooi HS, Chiu KP, Foissac S, Alioto T, Brent M, Pachter L, Tress ML, Valencia A, Choo SW, Choo CY, Ucla C, Manzano C, Wyss C, Cheung E, Clark TG, Brown JB, Ganesh M, Patel S, Tammana H, Chrast J, Henrichsen CN, Kai C, Kawai J, Nagalakshmi U, Wu J, Lian Z, Lian J, Newburger P, Zhang X, Bickel P, Mattick JS, Carninci P, Hayashizaki Y, Weissman S, Hubbard T, Myers RM, Rogers J, Stadler PF, Lowe TM, Wei CL, Ruan Y, Struhl K, Gerstein M, Antonarakis SE, Fu Y, Green ED, Karaöz U, Siepel A, Taylor J, Liefer LA, Wetterstrand KA, Good PJ, Feingold EA, Guyer MS, Cooper GM, Asimenos G, Dewey CN, Hou M, Nikolaev S, Montoya-Burgos JI, Löytynoja A, Whelan S, Pardi F, Massingham T, Huang H, Zhang NR, Holmes I, Mullikin JC, Ureta-Vidal A, Paten B, Seringhaus M, Church D, Rosenbloom K, Kent WJ, Stone EA, Batzoglou S, Goldman N, Hardison RC, Haussler D, 
Miller W, Sidow A, Trinklein ND, Zhang ZD, Barrera L, Stuart R, King DC, Ameur A, Enroth S, Bieda MC, Kim J, Bhinge AA, Jiang N, Liu J, Yao F, Vega VB, Lee CWH, Ng P, Shahab A, Yang A, Moqtaderi Z, Zhu Z, Xu X, Squazzo S, Oberley MJ, Inman D, Singer MA, Richmond TA, Munn KJ, Rada-Iglesias A, Wallerman O, Komorowski J, Fowler JC, Couttet P, Bruce AW, Dovey OM, Ellis PD, Langford CF, Nix DA, Euskirchen G, Hartman S, Urban AE, Kraus P, Van Calcar S, Heintzman N, Kim TH, Wang K, Qu C, Hon G, Luna R, Glass CK, Rosenfeld MG, Aldred SF, Cooper SJ, Halees A, Lin JM, Shulha HP, Zhang X, Xu M, Haidar JNS, Yu Y, Ruan Y, Iyer VR, Green RD, Wadelius C, Farnham PJ, Ren B, Harte RA, Hinrichs AS, Trumbower H, Clawson H, Hillman-Jackson J, Zweig AS, Smith K, Thakkapallayil A, Barber G, Kuhn RM, Karolchik D, Armengol L, Bird CP, de Bakker PIW, Kern AD, Lopez-Bigas N, Martin JD, Stranger BE, Woodroffe A, Davydov E, Dimas A, Eyras E, Hallgrímsdóttir IB, Huppert J, Zody MC, Abecasis GR, Estivill X, Bouffard GG, Guan X, Hansen NF, Idol JR, Maduro VVB, Maskeri B, McDowell JC, Park M, Thomas PJ, Young AC, Blakesley RW, Muzny DM, Sodergren E, Wheeler DA, Worley KC, Jiang H, Weinstock GM, Gibbs RA, Graves T, Fulton R, Mardis ER, Wilson RK, Clamp M, Cuff J, Gnerre S, Jaffe DB, Chang JL, Lindblad-Toh K, Lander ES, Koriabine M, Nefedov M, Osoegawa K, Yoshinaga Y, Zhu B, de Jong PJ. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature 2007; 447:799-816. [PMID: 17571346 PMCID: PMC2212820 DOI: 10.1038/nature05874] [Citation(s) in RCA: 3782] [Impact Index Per Article: 222.5] [Indexed: 01/17/2023]
Abstract
We report the generation and analysis of functional data from multiple, diverse experiments performed on a targeted 1% of the human genome as part of the pilot phase of the ENCODE Project. These data have been further integrated and augmented by a number of evolutionary and computational analyses. Together, our results advance the collective knowledge about human genome function in several major areas. First, our studies provide convincing evidence that the genome is pervasively transcribed, such that the majority of its bases can be found in primary transcripts, including non-protein-coding transcripts, and those that extensively overlap one another. Second, systematic examination of transcriptional regulation has yielded new understanding about transcription start sites, including their relationship to specific regulatory sequences and features of chromatin accessibility and histone modification. Third, a more sophisticated view of chromatin structure has emerged, including its inter-relationship with DNA replication and transcriptional regulation. Finally, integration of these new sources of information, in particular with respect to mammalian evolution based on inter- and intra-species sequence comparisons, has yielded new mechanistic and evolutionary insights concerning the functional landscape of the human genome. Together, these studies are defining a path for pursuit of a more comprehensive characterization of human genome function.
23
Gerstein MB, Bruce C, Rozowsky JS, Zheng D, Du J, Korbel JO, Emanuelsson O, Zhang ZD, Weissman S, Snyder M. What is a gene, post-ENCODE? History and updated definition. Genome Res 2007; 17:669-81. [PMID: 17567988 DOI: 10.1101/gr.6339607] [Citation(s) in RCA: 457] [Impact Index Per Article: 26.9] [Indexed: 11/24/2022]
Abstract
While sequencing of the human genome surprised us with how many protein-coding genes there are, it did not fundamentally change our perspective on what a gene is. In contrast, the complex patterns of dispersed regulation and pervasive transcription uncovered by the ENCODE project, together with non-genic conservation and the abundance of noncoding RNA genes, have challenged the notion of the gene. To illustrate this, we review the evolution of operational definitions of a gene over the past century--from the abstract elements of heredity of Mendel and Morgan to the present-day ORFs enumerated in the sequence databanks. We then summarize the current ENCODE findings and provide a computational metaphor for the complexity. Finally, we propose a tentative update to the definition of a gene: A gene is a union of genomic sequences encoding a coherent set of potentially overlapping functional products. Our definition side-steps the complexities of regulation and transcription by removing the former altogether from the definition and arguing that final, functional gene products (rather than intermediate transcripts) should be used to group together entities associated with a single gene. It also manifests how integral the concept of biological function is in defining genes.
Affiliation(s)
- Mark B Gerstein
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut 06511, USA.
24
Emanuelsson O, Nagalakshmi U, Zheng D, Rozowsky JS, Urban AE, Du J, Lian Z, Stolc V, Weissman S, Snyder M, Gerstein MB. Assessing the performance of different high-density tiling microarray strategies for mapping transcribed regions of the human genome. Genome Res 2006; 17:886-97. [PMID: 17119069 PMCID: PMC1891347 DOI: 10.1101/gr.5014606] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.4] [Indexed: 12/13/2022]
Abstract
Genomic tiling microarrays have become a popular tool for interrogating the transcriptional activity of large regions of the genome in an unbiased fashion. There are several key parameters associated with each tiling experiment (e.g., experimental protocols and genomic tiling density). Here, we assess the role of these parameters as they are manifest in different tiling-array platforms used for transcription mapping. First, we analyze how a number of published tiling-array experiments agree with established gene annotation on human chromosome 22. We observe that the transcription detected from high-density arrays correlates substantially better with annotation than that from other array types. Next, we analyze the transcription-mapping performance of the two main high-density oligonucleotide array platforms in the ENCODE regions of the human genome. We hybridize identical biological samples and develop several ways of scoring the arrays and segmenting the genome into transcribed and nontranscribed regions, with the aim of making the platforms most comparable to each other. Finally, we develop a platform comparison approach based on agreement with known annotation. Overall, we find that the performance improves with more data points per locus, coupled with statistical scoring approaches that properly take advantage of this, where this larger number of data points arises from higher genomic tiling density and the use of replicate arrays and mismatches. While we do find significant differences in the performance of the two high-density platforms, we also find that they complement each other to some extent. Finally, our experiments reveal a significant amount of novel transcription outside of known genes, and an appreciable sample of this was validated by independent experiments.
Affiliation(s)
- Olof Emanuelsson
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520-8114, USA
- Ugrappa Nagalakshmi
- Department of Molecular, Cellular and Developmental Biology, Yale University, New Haven, Connecticut 06520-8103, USA
- Deyou Zheng
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520-8114, USA
- Joel S. Rozowsky
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520-8114, USA
- Alexander E. Urban
- Department of Molecular, Cellular and Developmental Biology, Yale University, New Haven, Connecticut 06520-8103, USA
- Department of Genetics, Yale University School of Medicine, New Haven, Connecticut 06520-8005, USA
- Jiang Du
- Department of Computer Science, Yale University, New Haven, Connecticut 06520-8285, USA
- Zheng Lian
- Department of Genetics, Yale University School of Medicine, New Haven, Connecticut 06520-8005, USA
- Viktor Stolc
- Center for Nanotechnology, NASA Ames Research Center, Moffett Field, California 94035, USA
- Sherman Weissman
- Department of Genetics, Yale University School of Medicine, New Haven, Connecticut 06520-8005, USA
- Michael Snyder
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520-8114, USA
- Department of Molecular, Cellular and Developmental Biology, Yale University, New Haven, Connecticut 06520-8103, USA
- Corresponding author. Fax: (360) 838-7861
- Mark B. Gerstein
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520-8114, USA
- Department of Computer Science, Yale University, New Haven, Connecticut 06520-8285, USA
- Corresponding author. Fax: (360) 838-7861
25
Royce TE, Rozowsky JS, Luscombe NM, Emanuelsson O, Yu H, Zhu X, Snyder M, Gerstein MB. [15] Extrapolating Traditional DNA Microarray Statistics to Tiling and Protein Microarray Technologies. Methods Enzymol 2006; 411:282-311. [PMID: 16939796 DOI: 10.1016/s0076-6879(06)11015-0] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.7] [Indexed: 12/27/2022]
Abstract
A credit to microarray technology is its broad application. Two experiments--the tiling microarray experiment and the protein microarray experiment--are exemplars of the versatility of the microarrays. With the technology's expanding list of uses, the corresponding bioinformatics must evolve in step. There currently exists a rich literature developing statistical techniques for analyzing traditional gene-centric DNA microarrays, so the first challenge in analyzing the advanced technologies is to identify which of the existing statistical protocols are relevant and where and when revised methods are needed. A second challenge is making these often very technical ideas accessible to the broader microarray community. The aim of this chapter is to present some of the most widely used statistical techniques for normalizing and scoring traditional microarray data and indicate their potential utility for analyzing the newer protein and tiling microarray experiments. In so doing, we will assume little or no prior training in statistics of the reader. Areas covered include background correction, intensity normalization, spatial normalization, and the testing of statistical significance.
Affiliation(s)
- Thomas E Royce
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA
26
Bertone P, Trifonov V, Rozowsky JS, Schubert F, Emanuelsson O, Karro J, Kao MY, Snyder M, Gerstein M. Design optimization methods for genomic DNA tiling arrays. Genome Res 2005; 16:271-81. [PMID: 16365382 PMCID: PMC1361723 DOI: 10.1101/gr.4452906] [Citation(s) in RCA: 41] [Impact Index Per Article: 2.2] [Indexed: 02/03/2023]
Abstract
A recent development in microarray research entails the unbiased coverage, or tiling, of genomic DNA for the large-scale identification of transcribed sequences and regulatory elements. A central issue in designing tiling arrays is that of arriving at a single-copy tile path, as significant sequence cross-hybridization can result from the presence of non-unique probes on the array. Due to the fragmentation of genomic DNA caused by the widespread distribution of repetitive elements, the problem of obtaining adequate sequence coverage increases with the sizes of subsequence tiles that are to be included in the design. This becomes increasingly problematic when considering complex eukaryotic genomes that contain many thousands of interspersed repeats. The general problem of sequence tiling can be framed as finding an optimal partitioning of non-repetitive subsequences over a prescribed range of tile sizes, on a DNA sequence comprising repetitive and non-repetitive regions. Exact solutions to the tiling problem become computationally infeasible when applied to large genomes, but successive optimizations are developed that allow their practical implementation. These include an efficient method for determining the degree of similarity of many oligonucleotide sequences over large genomes, and two algorithms for finding an optimal tile path composed of longer sequence tiles. The first algorithm, a dynamic programming approach, finds an optimal tiling in linear time and space; the second applies a heuristic search to reduce the space complexity to a constant requirement. A Web resource has also been developed, accessible at http://tiling.gersteinlab.org, to generate optimal tile paths from user-provided DNA sequences.
Affiliation(s)
- Paul Bertone
- Department of Molecular, Cellular, and Developmental Biology, Yale University, New Haven, CT 06520, USA. P50 HG02357
27
White EJ, Emanuelsson O, Scalzo D, Royce T, Kosak S, Oakeley EJ, Weissman S, Gerstein M, Groudine M, Snyder M, Schübeler D. DNA replication-timing analysis of human chromosome 22 at high resolution and different developmental states. Proc Natl Acad Sci U S A 2004; 101:17771-6. [PMID: 15591350 PMCID: PMC539744 DOI: 10.1073/pnas.0408170101] [Citation(s) in RCA: 97] [Impact Index Per Article: 4.9] [Indexed: 11/18/2022] Open
Abstract
Duplication of the genome during the S phase of the cell cycle does not occur simultaneously; rather, different sequences are replicated at different times. The replication timing of specific sequences can change during development; however, the determinants of this dynamic process are poorly understood. To gain insights into the contribution of developmental state, genomic sequence, and transcriptional activity to replication timing, we investigated the timing of DNA replication at high resolution along an entire human chromosome (chromosome 22) in two different cell types. The pattern of replication timing was correlated with respect to annotated genes, gene expression, novel transcribed regions of unknown function, sequence composition, and cytological features. We observed that chromosome 22 contains regions of early- and late-replicating domains of 100 kb to 2 Mb, many (but not all) of which are associated with previously described chromosomal bands. In both cell types, expressed sequences are replicated earlier than nontranscribed regions. However, several highly transcribed regions replicate late. Overall, the DNA replication-timing profiles of the two different cell types are remarkably similar, with only nine regions of difference observed. In one case, this difference reflects the differential expression of an annotated gene that resides in this region. Novel transcribed regions with low coding potential exhibit a strong propensity for early DNA replication. Although the cellular function of such transcripts is poorly understood, our results suggest that their activity is linked to the replication-timing program.
Affiliation(s)
- Eric J White
- Department of Molecular, Cellular, and Developmental Biology, Yale University, New Haven, CT 06520-8103, USA
28
Sun Q, Emanuelsson O, van Wijk KJ. Analysis of curated and predicted plastid subproteomes of Arabidopsis. Subcellular compartmentalization leads to distinctive proteome properties. Plant Physiol 2004; 135:723-34. [PMID: 15208420 PMCID: PMC514110 DOI: 10.1104/pp.104.040717] [Citation(s) in RCA: 45] [Impact Index Per Article: 2.3] [Received: 02/13/2004] [Revised: 03/25/2004] [Accepted: 04/14/2004] [Indexed: 05/17/2023]
Abstract
Carefully curated proteomes of the inner envelope membrane, the thylakoid membrane, and the thylakoid lumen of chloroplasts from Arabidopsis were assembled based on published, well-documented localizations. These curated proteomes were evaluated for distribution of physical-chemical parameters, with the goal of extracting parameters for improved subcellular prediction and subsequent identification of additional (low abundant) components of each membrane system. The assembly of rigorously curated subcellular proteomes is in itself also important as a parts list for plant and systems biology. Transmembrane and subcellular prediction strategies were evaluated using the curated data sets. The three curated proteomes differ strongly in average isoelectric point and protein size, as well as transmembrane distribution. Removal of the cleavable, N-terminal transit peptide sequences greatly affected isoelectric point and size distribution. Unexpectedly, the Cys content was much lower for the thylakoid proteomes than for the inner envelope. This likely relates to the role of the thylakoid membrane in light-driven electron transport and helps to avoid unwanted oxidation-reduction reactions. A rule of thumb for discriminating between the predicted integral inner envelope membrane and integral thylakoid membrane proteins is suggested. Using a combination of predictors and experimentally derived parameters, four plastid subproteomes were predicted from the fully annotated Arabidopsis genome. These predicted subproteomes were analyzed for their properties and compared to the curated proteomes. The sensitivity and accuracy of the prediction strategies are discussed. Data can be extracted from the new plastid proteome database (http://ppdb.tc.cornell.edu).
Affiliation(s)
- Qi Sun
- Computational Biology Service Unit, Cornell Theory Center, Cornell University, Ithaca, New York, USA
29
Westerlund I, Von Heijne G, Emanuelsson O. LumenP--a neural network predictor for protein localization in the thylakoid lumen. Protein Sci 2003; 12:2360-6. [PMID: 14500894 PMCID: PMC2366911 DOI: 10.1110/ps.0306003] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.1] [Received: 02/13/2003] [Revised: 06/19/2003] [Accepted: 07/11/2003] [Indexed: 10/27/2022]
Abstract
We report the development of LumenP, a new neural network-based predictor for the identification of proteins targeted to the thylakoid lumen of plant chloroplasts and prediction of their cleavage sites. When used together with the previously developed TargetP predictor, LumenP reaches a significantly better performance than what has been recorded for previous attempts at predicting thylakoid lumen location, mostly due to a lower false positive rate. The combination of TargetP and LumenP predicts around 1.5%-3% of all proteins encoded in the genomes of Arabidopsis thaliana and Oryza sativa to be located in the lumen of the thylakoid.
Affiliation(s)
- Isabelle Westerlund
- Stockholm Bioinformatics Center, and Department of Biochemistry and Biophysics, Stockholm University, SE-106 91 Stockholm, Sweden
30
Abstract
In an attempt to improve our abilities to predict peroxisomal proteins, we have combined machine-learning techniques for analyzing peroxisomal targeting signals (PTS1) with domain-based cross-species comparisons between eight eukaryotic genomes. Our results indicate that this combined approach has a significantly higher specificity than earlier attempts to predict peroxisomal localization, without a loss in sensitivity. This allowed us to predict 430 peroxisomal proteins that almost completely lack a localization annotation. These proteins can be grouped into 29 families covering most of the known steps in all known peroxisomal pathways. In general, plants have the highest number of predicted peroxisomal proteins, and fungi the smallest number.
Affiliation(s)
- Olof Emanuelsson
- Stockholm Bioinformatics Center, AlbaNova University Center, Department of Biochemistry and Biophysics, Stockholm University, S-106 91, Stockholm, Sweden
31
Abstract
Predicting the subcellular localisation of proteins is an important part of the elucidation of their functions and interactions. Here, the amino acid sequence motifs that direct proteins to their proper subcellular compartment are surveyed, different methods for localisation prediction are discussed, and some benchmarks for the more commonly used predictors are presented.
Affiliation(s)
- Olof Emanuelsson
- Stockholm Bioinformatics Center, Stockholm University, Stockholm, Sweden.
32
Affiliation(s)
- O Emanuelsson
- Stockholm Bioinformatics Center, Stockholm University, S-10691 Stockholm, Sweden
33
Peltier JB, Emanuelsson O, Kalume DE, Ytterberg J, Friso G, Rudella A, Liberles DA, Söderberg L, Roepstorff P, von Heijne G, van Wijk KJ. Central functions of the lumenal and peripheral thylakoid proteome of Arabidopsis determined by experimentation and genome-wide prediction. Plant Cell 2002; 14:211-36. [PMID: 11826309 PMCID: PMC150561 DOI: 10.1105/tpc.010304] [Citation(s) in RCA: 311] [Impact Index Per Article: 14.1] [Received: 07/24/2001] [Accepted: 10/12/2001] [Indexed: 05/17/2023]
Abstract
Experimental proteome analysis was combined with a genome-wide prediction screen to characterize the protein content of the thylakoid lumen of Arabidopsis chloroplasts. Soluble thylakoid proteins were separated by two-dimensional electrophoresis and identified by mass spectrometry. The identities of 81 proteins were established, and N termini were sequenced to validate localization prediction. Gene annotation of the identified proteins was corrected by experimental data, and an interesting case of alternative splicing was discovered. Expression of a surprising number of paralogs was detected. Expression of five isomerases of different classes suggests strong (un)folding activity in the thylakoid lumen. These isomerases possibly are connected to a network of peripheral and lumenal proteins involved in antioxidative response, including peroxiredoxins, m-type thioredoxins, and a lumenal ascorbate peroxidase. Characteristics of the experimentally identified lumenal proteins and their orthologs were used for a genome-wide prediction of the lumenal proteome. Lumenal proteins with a typical twin-arginine translocation motif were predicted with good accuracy and sensitivity and included additional isomerases and proteases. Thus, prime functions of the lumenal proteome include assistance in the folding and proteolysis of thylakoid proteins as well as protection against oxidative stress. Many of the predicted lumenal proteins must be present at concentrations at least 10,000-fold lower than proteins of the photosynthetic apparatus.
Affiliation(s)
- Jean-Benoît Peltier
- Department of Plant Biology, Cornell University, Ithaca, New York 14853, USA
34
Abstract
The subcellular location of a protein is an important characteristic with functional implications, and hence the problem of predicting subcellular localization from the amino acid sequence has received a fair amount of attention from the bioinformatics community. This review attempts to summarize the present state of the art in the field.
Affiliation(s)
- O Emanuelsson
- Stockholm Bioinformatics Center, Stockholm University, S-10691, Stockholm, Sweden
35
Abstract
A neural network-based tool, TargetP, for large-scale subcellular location prediction of newly identified proteins has been developed. Using N-terminal sequence information only, it discriminates between proteins destined for the mitochondrion, the chloroplast, the secretory pathway, and "other" localizations with a success rate of 85% (plant) or 90% (non-plant) on redundancy-reduced test sets. From a TargetP analysis of the recently sequenced Arabidopsis thaliana chromosomes 2 and 4 and the Ensembl Homo sapiens protein set, we estimate that 10% of all plant proteins are mitochondrial and 14% chloroplastic, and that the abundance of secretory proteins, in both Arabidopsis and Homo, is around 10%. TargetP also predicts cleavage sites with levels of correctly predicted sites ranging from approximately 40% to 50% (chloroplastic and mitochondrial presequences) to above 70% (secretory signal peptides). TargetP is available as a web-server at http://www.cbs.dtu.dk/services/TargetP/.
Affiliation(s)
- O Emanuelsson
- Stockholm Bioinformatics Center, Department of Biochemistry, Stockholm University, Stockholm, S-106 91, Sweden
36
Abstract
We present a neural network based method (ChloroP) for identifying chloroplast transit peptides and their cleavage sites. Using cross-validation, 88% of the sequences in our homology reduced training set were correctly classified as transit peptides or nontransit peptides. This performance level is well above that of the publicly available chloroplast localization predictor PSORT. Cleavage sites are predicted using a scoring matrix derived by an automatic motif-finding algorithm. Approximately 60% of the known cleavage sites in our sequence collection were predicted to within ±2 residues from the cleavage sites given in SWISS-PROT. An analysis of 715 Arabidopsis thaliana sequences from SWISS-PROT suggests that the ChloroP method should be useful for the identification of putative transit peptides in genome-wide sequence data. The ChloroP predictor is available as a web-server at http://www.cbs.dtu.dk/services/ChloroP/.
Affiliation(s)
- O Emanuelsson
- Department of Biochemistry, Stockholm University, Sweden