1
Carlyle BC, Kitchen RR, Zhang J, Wilson R, Lam TT, Rozowsky JS, Williams KR, Sestan N, Gerstein M, Nairn AC. Isoform-Level Interpretation of High-Throughput Proteomics Data Enabled by Deep Integration with RNA-seq. J Proteome Res 2018; 17:3431-3444. [PMID: 30125121 PMCID: PMC6392456 DOI: 10.1021/acs.jproteome.8b00310] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.0] [Indexed: 12/18/2022]
Abstract
Cellular control of gene expression is a complex process that is subject to multiple levels of regulation, but ultimately it is the protein produced that determines the biosynthetic state of the cell. One way that a cell can regulate the protein output from each gene is by expressing alternate isoforms with distinct amino acid sequences. These isoforms may exhibit differences in localization and binding interactions that can have profound functional implications. High-throughput liquid chromatography tandem mass spectrometry proteomics (LC-MS/MS) relies on enzymatic digestion and has lower coverage and sensitivity than transcriptomic profiling methods such as RNA-seq. Digestion results in predictable fragmentation of a protein, which can limit the generation of peptides capable of distinguishing between isoforms. Here we exploit transcript-level expression from RNA-seq to set prior likelihoods and enable protein isoform abundances to be directly estimated from LC-MS/MS, an approach derived from the principle that most genes appear to be expressed as a single dominant isoform in a given cell type or tissue. Through this deep integration of RNA-seq and LC-MS/MS data from the same sample, we show that a principal isoform can be identified in >80% of gene products in homogeneous HEK293 cell culture and >70% of proteins detected in complex human brain tissue. We demonstrate that the incorporation of translatome data from ribosome profiling further refines this process. Defining isoforms in experiments with matched RNA-seq/translatome and proteomic data increases the functional relevance of such data sets and will further broaden our understanding of multilevel control of gene expression.
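The idea of using transcript-level expression as a prior when assigning peptide evidence to isoforms can be illustrated with a small Python sketch. This is not the authors' model: the single-pass proportional allocation, the function names, the `min_fraction` threshold, and the toy numbers are all assumptions made for illustration. Each peptide's spectral count is split among the isoforms that contain it, in proportion to their RNA-seq abundance.

```python
def estimate_isoform_abundance(peptide_counts, peptide_to_isoforms, rna_tpm):
    """Split each peptide's count among compatible isoforms, weighted by RNA-seq prior.

    All argument names are hypothetical:
      peptide_counts      -- {peptide: spectral count}
      peptide_to_isoforms -- {peptide: list of isoforms containing it}
      rna_tpm             -- {isoform: transcript abundance from RNA-seq}
    """
    total_tpm = sum(rna_tpm.values())
    prior = {iso: tpm / total_tpm for iso, tpm in rna_tpm.items()}
    abundance = {iso: 0.0 for iso in rna_tpm}
    for peptide, count in peptide_counts.items():
        compatible = peptide_to_isoforms[peptide]
        weight = sum(prior[iso] for iso in compatible)
        if weight == 0:
            continue  # no expressed isoform explains this peptide
        for iso in compatible:
            abundance[iso] += count * prior[iso] / weight
    return abundance


def principal_isoform(abundance, min_fraction=0.5):
    """Return the isoform holding at least min_fraction of the evidence, else None."""
    total = sum(abundance.values())
    if total == 0:
        return None
    top = max(abundance, key=abundance.get)
    return top if abundance[top] / total >= min_fraction else None


# Toy example: one shared peptide, one peptide unique to the minor isoform.
rna_tpm = {"iso1": 90.0, "iso2": 10.0}
peptide_counts = {"shared": 10, "unique2": 1}
peptide_to_isoforms = {"shared": ["iso1", "iso2"], "unique2": ["iso2"]}
abundance = estimate_isoform_abundance(peptide_counts, peptide_to_isoforms, rna_tpm)
print(abundance, principal_isoform(abundance))
```

Without the prior, the shared peptide could not discriminate between the two isoforms; with it, most of the shared evidence goes to the isoform that dominates at the transcript level.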
Affiliation(s)
- Becky C. Carlyle
  - Department of Psychiatry, Yale School of Medicine, Connecticut Mental Health Center, 34 Park St, New Haven, CT 06519
- Robert R. Kitchen
  - Department of Psychiatry, Yale School of Medicine, Connecticut Mental Health Center, 34 Park St, New Haven, CT 06519
  - Department of Molecular Biophysics & Biochemistry, Yale School of Medicine, PO Box 208114, New Haven, CT, 06520
- Jing Zhang
  - Department of Molecular Biophysics & Biochemistry, Yale School of Medicine, PO Box 208114, New Haven, CT, 06520
- Rashaun Wilson
  - Yale/NIDA Neuroproteomics Center, Yale School of Medicine, 300 George Street, New Haven, CT 06510
- Tukiet T Lam
  - Department of Molecular Biophysics & Biochemistry, Yale School of Medicine, PO Box 208114, New Haven, CT, 06520
  - Yale/NIDA Neuroproteomics Center, Yale School of Medicine, 300 George Street, New Haven, CT 06510
  - W.M. Keck Biotechnology Resource Laboratory, Yale School of Medicine, 300 George Street, New Haven, CT 06510
- Joel S Rozowsky
  - Department of Molecular Biophysics & Biochemistry, Yale School of Medicine, PO Box 208114, New Haven, CT, 06520
- Kenneth R Williams
  - Department of Molecular Biophysics & Biochemistry, Yale School of Medicine, PO Box 208114, New Haven, CT, 06520
  - Yale/NIDA Neuroproteomics Center, Yale School of Medicine, 300 George Street, New Haven, CT 06510
- Nenad Sestan
  - Department of Neuroscience and Kavli Institute for Neuroscience, Departments of Genetics and Psychiatry, Section of Comparative Medicine, and Yale Child Study Center, Program in Cellular Neuroscience, Neurodegeneration and Repair, Yale School of Medicine, New Haven, CT 06510
- Mark Gerstein
  - Department of Molecular Biophysics & Biochemistry, Yale School of Medicine, PO Box 208114, New Haven, CT, 06520
- Angus C Nairn
  - Department of Psychiatry, Yale School of Medicine, Connecticut Mental Health Center, 34 Park St, New Haven, CT 06519
2
Subramanian SL, Kitchen RR, Alexander R, Carter BS, Cheung KH, Laurent LC, Pico A, Roberts LR, Roth ME, Rozowsky JS, Su AI, Gerstein MB, Milosavljevic A. Integration of extracellular RNA profiling data using metadata, biomedical ontologies and Linked Data technologies. J Extracell Vesicles 2015; 4:27497. [PMID: 26320941 PMCID: PMC4553261 DOI: 10.3402/jev.v4.27497] [Citation(s) in RCA: 37] [Impact Index Per Article: 4.1] [Received: 02/04/2015] [Revised: 06/26/2015] [Accepted: 07/24/2015] [Indexed: 12/27/2022] Open
Abstract
The large diversity and volume of extracellular RNA (exRNA) data that will form the basis of the exRNA Atlas generated by the Extracellular RNA Communication Consortium pose a substantial data integration challenge. We here present the strategy that is being implemented by the exRNA Data Management and Resource Repository, which employs metadata, biomedical ontologies and Linked Data technologies, such as Resource Description Framework to integrate a diverse set of exRNA profiles into an exRNA Atlas and enable integrative exRNA analysis. We focus on the following three specific data integration tasks: (a) selection of samples from a virtual biorepository for exRNA profiling and for inclusion in the exRNA Atlas; (b) retrieval of a data slice from the exRNA Atlas for integrative analysis and (c) interpretation of exRNA analysis results in the context of pathways and networks. As exRNA profiling gains wide adoption in the research community, we anticipate that the strategies discussed here will increasingly be required to enable data reuse and to facilitate integrative analysis of exRNA data.
Affiliation(s)
- Sai Lakshmi Subramanian
  - Bioinformatics Research Laboratory, Department of Molecular & Human Genetics, Baylor College of Medicine, Houston, TX, USA
- Robert R Kitchen
  - Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT, USA
  - Division of Molecular Psychiatry, Abraham Ribicoff Research Facilities, Connecticut Mental Health Center, Yale University School of Medicine, New Haven, CT, USA
  - Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA
- Roger Alexander
  - Pacific Northwest Diabetes Research Institute, Seattle, WA, USA
- Bob S Carter
  - Division of Neurosurgery, UC San Diego School of Medicine, UC San Diego Health System, La Jolla, CA, USA
- Kei-Hoi Cheung
  - Department of Emergency Medicine, Yale Center for Medical Informatics, Yale University School of Medicine, New Haven, CT, USA
- Louise C Laurent
  - Department of Reproductive Medicine, University of California, San Diego, La Jolla, CA, USA
- Lewis R Roberts
  - Division of Gastroenterology and Hepatology, Mayo Clinic College of Medicine, Rochester, MN, USA
- Matthew E Roth
  - Bioinformatics Research Laboratory, Department of Molecular & Human Genetics, Baylor College of Medicine, Houston, TX, USA
- Joel S Rozowsky
  - Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT, USA
  - Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA
- Andrew I Su
  - Department of Molecular and Experimental Medicine, The Scripps Research Institute, La Jolla, CA, USA
- Mark B Gerstein
  - Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT, USA
  - Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA
  - Department of Computer Science, Yale University, New Haven, CT, USA
- Aleksandar Milosavljevic
  - Bioinformatics Research Laboratory, Department of Molecular & Human Genetics, Baylor College of Medicine, Houston, TX, USA
3
Sboner A, Habegger L, Pflueger D, Terry S, Chen DZ, Rozowsky JS, Tewari AK, Kitabayashi N, Moss BJ, Chee MS, Demichelis F, Rubin MA, Gerstein MB. FusionSeq: a modular framework for finding gene fusions by analyzing paired-end RNA-sequencing data. Genome Biol 2010; 11:R104. [PMID: 20964841 PMCID: PMC3218660 DOI: 10.1186/gb-2010-11-10-r104] [Citation(s) in RCA: 130] [Impact Index Per Article: 9.3] [Received: 04/14/2010] [Revised: 08/12/2010] [Accepted: 10/21/2010] [Indexed: 12/03/2022] Open
Abstract
We have developed FusionSeq to identify fusion transcripts from paired-end RNA-sequencing. FusionSeq includes filters to remove spurious candidate fusions with artifacts, such as misalignment or random pairing of transcript fragments, and it ranks candidates according to several statistics. It also has a module to identify exact sequences at breakpoint junctions. FusionSeq detected known and novel fusions in a specially sequenced calibration data set, including eight cancers with and without known rearrangements.
Affiliation(s)
- Andrea Sboner
  - Program in Computational Biology and Bioinformatics, Yale University, 300 George Street, New Haven, CT 06511, USA.
4
Royce TE, Rozowsky JS, Gerstein MB. Toward a universal microarray: prediction of gene expression through nearest-neighbor probe sequence identification. Nucleic Acids Res 2007; 35:e99. [PMID: 17686789 PMCID: PMC1976448 DOI: 10.1093/nar/gkm549] [Citation(s) in RCA: 61] [Impact Index Per Article: 3.6] [Indexed: 12/02/2022] Open
Abstract
A generic DNA microarray design applicable to any species would greatly benefit comparative genomics. We have addressed the feasibility of such a design by leveraging the great feature densities and relatively unbiased nature of genomic tiling microarrays. Specifically, we first divided each Homo sapiens Refseq-derived gene's spliced nucleotide sequence into all of its possible contiguous 25 nt subsequences. For each of these 25 nt subsequences, we searched a recent human transcript mapping experiment's probe design for the 25 nt probe sequence having the fewest mismatches with the subsequence, but that did not match the subsequence exactly. Signal intensities measured with each gene's nearest-neighbor features were subsequently averaged to predict their gene expression levels in each of the experiment's thirty-three hybridizations. We examined the fidelity of this approach in terms of both sensitivity and specificity for detecting actively transcribed genes, for transcriptional consistency between exons of the same gene, and for reproducibility between tiling array designs. Taken together, our results provide proof-of-principle for probing nucleic acid targets with off-target, nearest-neighbor features.
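The nearest-neighbor lookup described in this abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the brute-force Hamming search, the function names, and the toy 4 nt probes are hypothetical (the paper uses 25 nt probes on a tiling array). For every subsequence of a gene, the sketch picks the probe with the fewest mismatches while skipping exact matches, then averages the intensities of the selected probes.

```python
from statistics import mean


def hamming(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def kmers(seq, k=25):
    """All contiguous k nt subsequences of seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def nearest_neighbor_probe(subseq, probes):
    """Probe with the fewest mismatches to subseq, excluding exact matches."""
    best, best_dist = None, None
    for probe in probes:
        if len(probe) != len(subseq):
            continue
        dist = hamming(subseq, probe)
        if dist == 0:
            continue  # the abstract excludes perfectly matching probes
        if best_dist is None or dist < best_dist:
            best, best_dist = probe, dist
    return best, best_dist


def predict_expression(gene_seq, probe_intensity, k=25):
    """Average the signal of each subsequence's nearest-neighbor probe."""
    signals = []
    for sub in kmers(gene_seq, k):
        probe, _ = nearest_neighbor_probe(sub, probe_intensity)
        if probe is not None:
            signals.append(probe_intensity[probe])
    return mean(signals) if signals else None


# Toy example with 4 nt probes; intensities are made up.
probe_intensity = {"ACGT": 100.0, "ACGA": 10.0, "TTTT": 50.0, "CGTT": 30.0}
print(predict_expression("ACGTA", probe_intensity, k=4))  # → 20.0
```

A brute-force scan is quadratic in practice; a real tiling design with millions of probes would need an index, which is outside the scope of this sketch.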
Affiliation(s)
- Thomas E Royce
  - Interdepartmental Program in Computational Biology and Bioinformatics, Yale University, USA
5
Euskirchen GM, Rozowsky JS, Wei CL, Lee WH, Zhang ZD, Hartman S, Emanuelsson O, Stolc V, Weissman S, Gerstein MB, Ruan Y, Snyder M. Mapping of transcription factor binding regions in mammalian cells by ChIP: comparison of array- and sequencing-based technologies. Genome Res 2007; 17:898-909. [PMID: 17568005 PMCID: PMC1891348 DOI: 10.1101/gr.5583007] [Citation(s) in RCA: 170] [Impact Index Per Article: 10.0] [Indexed: 11/24/2022]
Abstract
Recent progress in mapping transcription factor (TF) binding regions can largely be credited to chromatin immunoprecipitation (ChIP) technologies. We compared strategies for mapping TF binding regions in mammalian cells using two different ChIP schemes: ChIP with DNA microarray analysis (ChIP-chip) and ChIP with DNA sequencing (ChIP-PET). We first investigated parameters central to obtaining robust ChIP-chip data sets by analyzing STAT1 targets in the ENCODE regions of the human genome, and then compared ChIP-chip to ChIP-PET. We devised methods for scoring and comparing results among various tiling arrays and examined parameters such as DNA microarray format, oligonucleotide length, hybridization conditions, and the use of competitor Cot-1 DNA. The best performance was achieved with high-density oligonucleotide arrays, oligonucleotides ≥50 bases (b), the presence of competitor Cot-1 DNA and hybridizations conducted in microfluidics stations. When target identification was evaluated as a function of array number, 80%-86% of targets were identified with three or more arrays. Comparison of ChIP-chip with ChIP-PET revealed strong agreement for the highest ranked targets with less overlap for the low ranked targets. With advantages and disadvantages unique to each approach, we found that ChIP-chip and ChIP-PET are frequently complementary in their relative abilities to detect STAT1 targets for the lower ranked targets; each method detected validated targets that were missed by the other method. The most comprehensive list of STAT1 binding regions is obtained by merging results from ChIP-chip and ChIP-sequencing. Overall, this study provides information for robust identification, scoring, and validation of TF targets using ChIP-based technologies.
Affiliation(s)
- Ghia M. Euskirchen
  - Department of Molecular, Cellular and Developmental Biology, Yale University, New Haven, Connecticut 06520-8103, USA
- Joel S. Rozowsky
  - Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520-8114, USA
- Zhengdong D. Zhang
  - Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520-8114, USA
- Stephen Hartman
  - Department of Molecular, Cellular and Developmental Biology, Yale University, New Haven, Connecticut 06520-8103, USA
- Olof Emanuelsson
  - Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520-8114, USA
- Viktor Stolc
  - Center for Nanotechnology, NASA Ames Research Center, Moffett Field, California 94035, USA
- Sherman Weissman
  - Department of Genetics, Yale University School of Medicine, New Haven, Connecticut 06520-8005, USA
- Mark B. Gerstein
  - Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520-8114, USA
- Yijun Ruan
  - Genome Institute of Singapore, Singapore 138672
- Michael Snyder
  - Department of Molecular, Cellular and Developmental Biology, Yale University, New Haven, Connecticut 06520-8103, USA
  - Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520-8114, USA
  - Corresponding author. Fax: (203) 432-6161
6
Rozowsky JS, Newburger D, Sayward F, Wu J, Jordan G, Korbel JO, Nagalakshmi U, Yang J, Zheng D, Guigó R, Gingeras TR, Weissman S, Miller P, Snyder M, Gerstein MB. The DART classification of unannotated transcription within the ENCODE regions: associating transcription with known and novel loci. Genome Res 2007; 17:732-45. [PMID: 17567993 PMCID: PMC1891334 DOI: 10.1101/gr.5696007] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.2] [Indexed: 11/25/2022]
Abstract
For the approximately 1% of the human genome in the ENCODE regions, only about half of the transcriptionally active regions (TARs) identified with tiling microarrays correspond to annotated exons. Here we categorize this large amount of "unannotated transcription." We use a number of disparate features to classify the 6988 novel TARs: array expression profiles across cell lines and conditions, sequence composition, phylogenetic profiles (presence/absence of syntenic conservation across 17 species), and locations relative to genes. In the classification, we first filter out TARs with unusual sequence composition and those likely resulting from cross-hybridization. We then associate some of those remaining with proximal exons having correlated expression profiles. Finally, we cluster unclassified TARs into putative novel loci, based on similar expression and phylogenetic profiles. To encapsulate our classification, we construct a Database of Active Regions and Tools (DART.gersteinlab.org). DART has special facilities for rapidly handling and comparing many sets of TARs and their heterogeneous features, synchronizing across builds, and interfacing with other resources. Overall, we find that approximately 14% of the novel TARs can be associated with known genes, while approximately 21% can be clustered into approximately 200 novel loci. We observe that TARs associated with genes are enriched in the potential to form structural RNAs and many novel TAR clusters are associated with nearby promoters. To benchmark our classification, we design a set of experiments for testing the connectivity of novel TARs. Overall, we find that 18 of the 46 connections tested validate by RT-PCR and four of five sequenced PCR products confirm connectivity unambiguously.
Affiliation(s)
- Joel S. Rozowsky
  - Molecular Biophysics and Biochemistry Department, Yale University, New Haven, Connecticut 06520-8114, USA
  - Corresponding authors. Fax: (203) 432-5175; fax: (360) 838-7861
- Daniel Newburger
  - Molecular Biophysics and Biochemistry Department, Yale University, New Haven, Connecticut 06520-8114, USA
- Fred Sayward
  - Center for Medical Informatics, Yale University, New Haven, Connecticut 06520-8009, USA
- Jiaqian Wu
  - Molecular, Cellular, and Developmental Biology Department, Yale University, New Haven, Connecticut 06520, USA
- Greg Jordan
  - Molecular Biophysics and Biochemistry Department, Yale University, New Haven, Connecticut 06520-8114, USA
- Jan O. Korbel
  - Molecular Biophysics and Biochemistry Department, Yale University, New Haven, Connecticut 06520-8114, USA
- Ugrappa Nagalakshmi
  - Molecular, Cellular, and Developmental Biology Department, Yale University, New Haven, Connecticut 06520, USA
- Jin Yang
  - Center for Medical Informatics, Yale University, New Haven, Connecticut 06520-8009, USA
- Deyou Zheng
  - Molecular Biophysics and Biochemistry Department, Yale University, New Haven, Connecticut 06520-8114, USA
- Roderic Guigó
  - Grup de Recerca en Informática Biomèdica, Institut Municipal d’Investigació Mèdica/Universitat Pompeu Fabra, 37-49, 08003, Barcelona, Catalonia, Spain
- Sherman Weissman
  - Department of Genetics, Yale University, New Haven, Connecticut 06520, USA
- Perry Miller
  - Center for Medical Informatics, Yale University, New Haven, Connecticut 06520-8009, USA
  - Program in Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut 06520, USA
- Michael Snyder
  - Molecular, Cellular, and Developmental Biology Department, Yale University, New Haven, Connecticut 06520, USA
- Mark B. Gerstein
  - Molecular Biophysics and Biochemistry Department, Yale University, New Haven, Connecticut 06520-8114, USA
  - Program in Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut 06520, USA
  - Corresponding authors. Fax: (203) 432-5175; fax: (360) 838-7861
7
Gerstein MB, Bruce C, Rozowsky JS, Zheng D, Du J, Korbel JO, Emanuelsson O, Zhang ZD, Weissman S, Snyder M. What is a gene, post-ENCODE? History and updated definition. Genome Res 2007; 17:669-81. [PMID: 17567988 DOI: 10.1101/gr.6339607] [Citation(s) in RCA: 457] [Impact Index Per Article: 26.9] [Indexed: 11/24/2022]
Abstract
While sequencing of the human genome surprised us with how many protein-coding genes there are, it did not fundamentally change our perspective on what a gene is. In contrast, the complex patterns of dispersed regulation and pervasive transcription uncovered by the ENCODE project, together with non-genic conservation and the abundance of noncoding RNA genes, have challenged the notion of the gene. To illustrate this, we review the evolution of operational definitions of a gene over the past century--from the abstract elements of heredity of Mendel and Morgan to the present-day ORFs enumerated in the sequence databanks. We then summarize the current ENCODE findings and provide a computational metaphor for the complexity. Finally, we propose a tentative update to the definition of a gene: A gene is a union of genomic sequences encoding a coherent set of potentially overlapping functional products. Our definition side-steps the complexities of regulation and transcription by removing the former altogether from the definition and arguing that final, functional gene products (rather than intermediate transcripts) should be used to group together entities associated with a single gene. It also manifests how integral the concept of biological function is in defining genes.
Affiliation(s)
- Mark B Gerstein
  - Program in Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut 06511, USA.
8
Abstract
MOTIVATION: Increases in microarray feature density allow the construction of so-called tiling microarrays. These arrays, or sets of arrays, contain probes targeting regions of sequenced genomes at regular genomic intervals. The unbiased nature of this approach allows for the identification of novel transcribed sequences, the localization of transcription factor binding sites (ChIP-chip), and high resolution comparative genomic hybridization, among other uses. These applications are quickly growing in popularity as tiling microarrays become more affordable. To reach maximum utility, the tiling microarray platform needs to be developed to the point that 1 nt resolution is achieved and that we have confidence in individual measurements taken at this fine a resolution. Any biases in tiling array signals must be systematically removed to achieve this goal. RESULTS: Towards this end, we investigated the importance of probe sequence composition on the efficacy of tiling microarrays for identifying novel transcription and transcription factor binding sites. We found that intensities are highly sequence dependent and can greatly influence results. We developed three metrics for assessing this sequence dependence and used them in evaluating existing sequence-based normalizations from the tiling microarray literature. In addition, we applied three new techniques for addressing this problem: one method, adapted from similar work on GeneChip brand microarrays, is based on modeling array signal as a linear function of probe sequence; the second method extends this approach by iterative weighting and re-fitting of the model; and the third technique extrapolates the popular quantile normalization algorithm for between-array normalization to probe sequence space. These three methods compare favorably with existing strategies, based on the metrics defined here. AVAILABILITY: http://tiling.gersteinlab.org/sequence_effects/
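One possible reading of the third technique, quantile normalization carried over from "between arrays" to "between probe sequence classes", is sketched below in Python. The abstract does not give the details, so everything here is an assumption: probes are binned by GC count (the `gc_count` key), and each bin's intensity distribution is mapped onto the pooled distribution of all probes by mid-rank quantile. Function names and toy values are hypothetical.

```python
from collections import defaultdict


def gc_count(probe):
    """Number of G or C bases in a probe (assumed sequence feature for binning)."""
    return sum(base in "GC" for base in probe)


def quantile_normalize_by_sequence(probes, intensities, key=gc_count):
    """Give every sequence bin the same intensity distribution (the pooled one)."""
    reference = sorted(intensities)          # pooled reference distribution
    n = len(reference)
    bins = defaultdict(list)
    for idx, probe in enumerate(probes):
        bins[key(probe)].append(idx)
    normalized = [0.0] * n
    for members in bins.values():
        ranked = sorted(members, key=lambda i: intensities[i])
        m = len(ranked)
        for rank, idx in enumerate(ranked):
            quantile = (rank + 0.5) / m      # mid-rank quantile within the bin
            ref_pos = min(n - 1, int(quantile * n))
            normalized[idx] = reference[ref_pos]
    return normalized


# Toy example: GC-rich probes read systematically brighter before normalization.
probes = ["AAAA", "AATT", "GGCC", "GCGC"]
intensities = [1.0, 2.0, 10.0, 20.0]
print(quantile_normalize_by_sequence(probes, intensities))
# → [2.0, 20.0, 2.0, 20.0]
```

After normalization the within-bin ranking is preserved, but the GC-poor and GC-rich bins share one distribution, so composition alone no longer separates them.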
Affiliation(s)
- Thomas E Royce
  - Interdepartmental Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT 06520, USA
9
Emanuelsson O, Nagalakshmi U, Zheng D, Rozowsky JS, Urban AE, Du J, Lian Z, Stolc V, Weissman S, Snyder M, Gerstein MB. Assessing the performance of different high-density tiling microarray strategies for mapping transcribed regions of the human genome. Genome Res 2006; 17:886-97. [PMID: 17119069 PMCID: PMC1891347 DOI: 10.1101/gr.5014606] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.4] [Indexed: 12/13/2022]
Abstract
Genomic tiling microarrays have become a popular tool for interrogating the transcriptional activity of large regions of the genome in an unbiased fashion. There are several key parameters associated with each tiling experiment (e.g., experimental protocols and genomic tiling density). Here, we assess the role of these parameters as they are manifest in different tiling-array platforms used for transcription mapping. First, we analyze how a number of published tiling-array experiments agree with established gene annotation on human chromosome 22. We observe that the transcription detected from high-density arrays correlates substantially better with annotation than that from other array types. Next, we analyze the transcription-mapping performance of the two main high-density oligonucleotide array platforms in the ENCODE regions of the human genome. We hybridize identical biological samples and develop several ways of scoring the arrays and segmenting the genome into transcribed and nontranscribed regions, with the aim of making the platforms most comparable to each other. Finally, we develop a platform comparison approach based on agreement with known annotation. Overall, we find that the performance improves with more data points per locus, coupled with statistical scoring approaches that properly take advantage of this, where this larger number of data points arises from higher genomic tiling density and the use of replicate arrays and mismatches. While we do find significant differences in the performance of the two high-density platforms, we also find that they complement each other to some extent. Finally, our experiments reveal a significant amount of novel transcription outside of known genes, and an appreciable sample of this was validated by independent experiments.
Affiliation(s)
- Olof Emanuelsson
  - Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520-8114, USA
- Ugrappa Nagalakshmi
  - Department of Molecular, Cellular and Developmental Biology, Yale University, New Haven, Connecticut 06520-8103, USA
- Deyou Zheng
  - Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520-8114, USA
- Joel S. Rozowsky
  - Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520-8114, USA
- Alexander E. Urban
  - Department of Molecular, Cellular and Developmental Biology, Yale University, New Haven, Connecticut 06520-8103, USA
  - Department of Genetics, Yale University School of Medicine, New Haven, Connecticut 06520-8005, USA
- Jiang Du
  - Department of Computer Science, Yale University, New Haven, Connecticut 06520-8285, USA
- Zheng Lian
  - Department of Genetics, Yale University School of Medicine, New Haven, Connecticut 06520-8005, USA
- Viktor Stolc
  - Center for Nanotechnology, NASA Ames Research Center, Moffett Field, California 94035, USA
- Sherman Weissman
  - Department of Genetics, Yale University School of Medicine, New Haven, Connecticut 06520-8005, USA
- Michael Snyder
  - Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520-8114, USA
  - Department of Molecular, Cellular and Developmental Biology, Yale University, New Haven, Connecticut 06520-8103, USA
  - Corresponding authors. Fax: (360) 838-7861
- Mark B. Gerstein
  - Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520-8114, USA
  - Department of Computer Science, Yale University, New Haven, Connecticut 06520-8285, USA
  - Corresponding authors. Fax: (360) 838-7861
10
Du J, Rozowsky JS, Korbel JO, Zhang ZD, Royce TE, Schultz MH, Snyder M, Gerstein M. A supervised hidden Markov model framework for efficiently segmenting tiling array data in transcriptional and ChIP-chip experiments: systematically incorporating validated biological knowledge. Bioinformatics 2006; 22:3016-24. [PMID: 17038339 DOI: 10.1093/bioinformatics/btl515] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.4] [Indexed: 11/14/2022]
Abstract
MOTIVATION: Large-scale tiling array experiments are becoming increasingly common in genomics. In particular, the ENCODE project requires the consistent segmentation of many different tiling array datasets into 'active regions' (e.g. finding transfrags from transcriptional data and putative binding sites from ChIP-chip experiments). Previously, such segmentation was done in an unsupervised fashion mainly based on characteristics of the signal distribution in the tiling array data itself. Here we propose a supervised framework for doing this. It has the advantage of explicitly incorporating validated biological knowledge into the model and allowing for formal training and testing. METHODOLOGY: In particular, we use a hidden Markov model (HMM) framework, which is capable of explicitly modeling the dependency between neighboring probes and whose extended version (the generalized HMM) also allows explicit description of state duration density. We introduce a formal definition of the tiling-array analysis problem, and explain how we can use this to describe sampling small genomic regions for experimental validation to build up a gold-standard set for training and testing. We then describe various ideal and practical sampling strategies (e.g. maximizing signal entropy within a selected region versus using gene annotation or known promoters as positives for transcription or ChIP-chip data, respectively). RESULTS: For the practical sampling and training strategies, we show how the size and noise in the validated training data affects the performance of an HMM applied to the ENCODE transcriptional and ChIP-chip experiments. In particular, we show that the HMM framework is able to efficiently process tiling array data as well as or better than previous approaches. For the idealized sampling strategies, we show how we can assess their performance in a simulation framework and how a maximum entropy approach, which samples sub-regions with very different signal intensities, gives the maximally performing gold-standard. This latter result has strong implications for the optimum way medium-scale validation experiments should be carried out to verify the results of the genome-scale tiling array experiments.
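The supervised segmentation idea in this abstract can be illustrated with a small two-state HMM in Python. This is a sketch under assumptions, not the published framework: it uses a plain HMM rather than the generalized HMM with explicit state durations, Gaussian emissions, add-one pseudocounts on transitions, and a uniform start distribution. Parameters are estimated from probes with validated labels (0 = inactive, 1 = active), and the Viterbi algorithm then segments unlabelled signal. All names and toy values are hypothetical.

```python
import math


def train(signals, labels, n_states=2):
    """Estimate HMM parameters from labelled probes (each state needs examples)."""
    trans = [[1.0] * n_states for _ in range(n_states)]  # add-one pseudocounts
    for a, b in zip(labels, labels[1:]):
        trans[a][b] += 1
    trans = [[count / sum(row) for count in row] for row in trans]
    emis = []
    for state in range(n_states):
        xs = [x for x, lab in zip(signals, labels) if lab == state]
        mu = sum(xs) / len(xs)
        var = max(sum((x - mu) ** 2 for x in xs) / len(xs), 1e-3)  # variance floor
        emis.append((mu, var))
    start = [1.0 / n_states] * n_states
    return start, trans, emis


def log_gauss(x, mu, var):
    """Log density of a normal distribution with mean mu and variance var."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)


def viterbi(signals, start, trans, emis):
    """Most likely state path for a sequence of probe signals."""
    n = len(start)
    score = [math.log(start[s]) + log_gauss(signals[0], *emis[s]) for s in range(n)]
    back = []
    for x in signals[1:]:
        prev, score, ptr = score, [], []
        for s in range(n):
            best = max(range(n), key=lambda p: prev[p] + math.log(trans[p][s]))
            ptr.append(best)
            score.append(prev[best] + math.log(trans[best][s]) + log_gauss(x, *emis[s]))
        back.append(ptr)
    state = max(range(n), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]


# Toy gold-standard region followed by segmentation of new probes.
train_signal = [0.1, 0.2, 0.0, 5.0, 5.2, 4.9, 0.1, 0.0]
train_labels = [0, 0, 0, 1, 1, 1, 0, 0]
params = train(train_signal, train_labels)
print(viterbi([0.0, 0.1, 5.1, 5.0, 4.8, 0.2], *params))  # → [0, 0, 1, 1, 1, 0]
```

The transition matrix is what lets neighboring probes inform each other: an isolated noisy probe inside a long active run has to overcome two transition penalties before it is called inactive.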
Affiliation(s)
- Jiang Du
  - Department of Computer Science, Yale University, New Haven, CT 06520, USA
11
Royce TE, Rozowsky JS, Luscombe NM, Emanuelsson O, Yu H, Zhu X, Snyder M, Gerstein MB. [15] Extrapolating Traditional DNA Microarray Statistics to Tiling and Protein Microarray Technologies. Methods Enzymol 2006; 411:282-311. [PMID: 16939796 DOI: 10.1016/s0076-6879(06)11015-0] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.7] [Indexed: 12/27/2022]
Abstract
A credit to microarray technology is its broad application. Two experiments, the tiling microarray experiment and the protein microarray experiment, are exemplars of the versatility of microarrays. With the technology's expanding list of uses, the corresponding bioinformatics must evolve in step. There currently exists a rich literature developing statistical techniques for analyzing traditional gene-centric DNA microarrays, so the first challenge in analyzing the advanced technologies is to identify which of the existing statistical protocols are relevant and where and when revised methods are needed. A second challenge is making these often very technical ideas accessible to the broader microarray community. The aim of this chapter is to present some of the most widely used statistical techniques for normalizing and scoring traditional microarray data and indicate their potential utility for analyzing the newer protein and tiling microarray experiments. In so doing, we will assume little or no prior training in statistics on the part of the reader. Areas covered include background correction, intensity normalization, spatial normalization, and the testing of statistical significance.
Affiliation(s)
- Thomas E Royce
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA
|
12
|
Bertone P, Trifonov V, Rozowsky JS, Schubert F, Emanuelsson O, Karro J, Kao MY, Snyder M, Gerstein M. Design optimization methods for genomic DNA tiling arrays. Genome Res 2005; 16:271-81. [PMID: 16365382 PMCID: PMC1361723 DOI: 10.1101/gr.4452906] [Citation(s) in RCA: 41] [Impact Index Per Article: 2.2] [Indexed: 02/03/2023]
Abstract
A recent development in microarray research entails the unbiased coverage, or tiling, of genomic DNA for the large-scale identification of transcribed sequences and regulatory elements. A central issue in designing tiling arrays is that of arriving at a single-copy tile path, as significant sequence cross-hybridization can result from the presence of non-unique probes on the array. Due to the fragmentation of genomic DNA caused by the widespread distribution of repetitive elements, the problem of obtaining adequate sequence coverage increases with the sizes of subsequence tiles that are to be included in the design. This becomes increasingly problematic when considering complex eukaryotic genomes that contain many thousands of interspersed repeats. The general problem of sequence tiling can be framed as finding an optimal partitioning of non-repetitive subsequences over a prescribed range of tile sizes, on a DNA sequence comprising repetitive and non-repetitive regions. Exact solutions to the tiling problem become computationally infeasible when applied to large genomes, but successive optimizations are developed that allow their practical implementation. These include an efficient method for determining the degree of similarity of many oligonucleotide sequences over large genomes, and two algorithms for finding an optimal tile path composed of longer sequence tiles. The first algorithm, a dynamic programming approach, finds an optimal tiling in linear time and space; the second applies a heuristic search to reduce the space complexity to a constant requirement. A Web resource has also been developed, accessible at http://tiling.gersteinlab.org, to generate optimal tile paths from user-provided DNA sequences.
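The dynamic programming idea mentioned in this abstract can be sketched under a simplified formulation. The abstract does not give the paper's exact objective or recurrence, so everything here is an assumption: the sequence is reduced to a boolean repeat mask, tiles must lie entirely in non-repetitive positions, tile lengths fall in a prescribed range, and the goal is to maximize the number of covered bases. The function name `optimal_tiling` is hypothetical.

```python
def optimal_tiling(repeat_mask, min_len, max_len):
    """Pick non-overlapping tiles that maximize covered non-repetitive bases.

    repeat_mask[i] is True when position i is repetitive. Returns the number
    of covered bases and the tiles as half-open (start, end) intervals.
    """
    n = len(repeat_mask)
    # clean[i]: length of the non-repetitive run ending just before position i.
    clean = [0] * (n + 1)
    for i in range(1, n + 1):
        clean[i] = 0 if repeat_mask[i - 1] else clean[i - 1] + 1

    best = [0] * (n + 1)    # best[i]: max coverage within the first i bases
    choice = [0] * (n + 1)  # length of the tile ending at i (0 = no tile)
    for i in range(1, n + 1):
        best[i] = best[i - 1]
        # A tile ending at i is valid only if it fits inside the clean run.
        for length in range(min_len, min(max_len, clean[i]) + 1):
            candidate = best[i - length] + length
            if candidate > best[i]:
                best[i] = candidate
                choice[i] = length

    # Trace back to recover the chosen tiles.
    tiles = []
    i = n
    while i > 0:
        if choice[i]:
            tiles.append((i - choice[i], i))
            i -= choice[i]
        else:
            i -= 1
    tiles.reverse()
    return best[n], tiles

# Toy sequence: 7 unique bases, 2 repetitive bases, 3 unique bases.
mask = [False] * 7 + [True] * 2 + [False] * 3
covered, tiles = optimal_tiling(mask, min_len=3, max_len=5)
```

In this sketch the running time is proportional to the sequence length times the number of allowed tile sizes, so it is linear in the sequence length for a fixed tile-size range. The constant-space heuristic search described in the abstract is not reproduced here.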
Affiliation(s)
- Paul Bertone
- Department of Molecular, Cellular, and Developmental Biology, Yale University, New Haven, CT 06520, USA. P50 HG02357
|
13
|
Royce TE, Rozowsky JS, Bertone P, Samanta M, Stolc V, Weissman S, Snyder M, Gerstein M. Issues in the analysis of oligonucleotide tiling microarrays for transcript mapping. Trends Genet 2005; 21:466-75. [PMID: 15979196 PMCID: PMC1855044 DOI: 10.1016/j.tig.2005.06.007] [Citation(s) in RCA: 64] [Impact Index Per Article: 3.4] [Received: 04/01/2005] [Revised: 05/17/2005] [Accepted: 06/08/2005] [Indexed: 10/25/2022]
Abstract
Traditional microarrays use probes complementary to known genes to quantitate the differential gene expression between two or more conditions. Genomic tiling microarray experiments differ in that probes that span a genomic region at regular intervals are used to detect the presence or absence of transcription. This difference means the same sets of biases and the methods for addressing them are unlikely to be relevant to both types of experiment. We introduce the informatics challenges arising in the analysis of tiling microarray experiments as open problems to the scientific community and present initial approaches for the analysis of this nascent technology.
Affiliation(s)
- Thomas E Royce
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT 06520, USA
|
14
|
Bertone P, Stolc V, Royce TE, Rozowsky JS, Urban AE, Zhu X, Rinn JL, Tongprasit W, Samanta M, Weissman S, Gerstein M, Snyder M. Global identification of human transcribed sequences with genome tiling arrays. Science 2004; 306:2242-6. [PMID: 15539566 DOI: 10.1126/science.1103388] [Citation(s) in RCA: 772] [Impact Index Per Article: 38.6] [Indexed: 11/02/2022]
Abstract
Elucidating the transcribed regions of the genome constitutes a fundamental aspect of human biology, yet this remains an outstanding problem. To comprehensively identify coding sequences, we constructed a series of high-density oligonucleotide tiling arrays representing sense and antisense strands of the entire nonrepetitive sequence of the human genome. Transcribed sequences were located across the genome via hybridization to complementary DNA samples, reverse-transcribed from polyadenylated RNA obtained from human liver tissue. In addition to identifying many known and predicted genes, we found 10,595 transcribed sequences not detected by other methods. A large fraction of these are located in intergenic regions distal from previously annotated genes and exhibit significant homology to other mammalian proteins.
Affiliation(s)
- Paul Bertone
- Department of Molecular, Cellular, and Developmental Biology, Yale University, New Haven, CT 06520-8103, USA
|
15
|
Rinn JL, Rozowsky JS, Laurenzi IJ, Petersen PH, Zou K, Zhong W, Gerstein M, Snyder M. Major molecular differences between mammalian sexes are involved in drug metabolism and renal function. Dev Cell 2004; 6:791-800. [PMID: 15177028 DOI: 10.1016/j.devcel.2004.05.005] [Citation(s) in RCA: 129] [Impact Index Per Article: 6.5] [Received: 04/13/2004] [Revised: 05/05/2004] [Accepted: 05/05/2004] [Indexed: 11/17/2022]
Abstract
Many anatomical differences exist between males and females; these are manifested on a molecular level by different hormonal environments. Although several molecular differences in adult tissues have been identified, a comprehensive investigation of the gene expression differences between males and females has not been performed. We surveyed the expression patterns of 13,977 mouse genes in male and female hypothalamus, kidney, liver, and reproductive tissues. Extensive differential gene expression was observed not only in the reproductive tissues, but also in the kidney and liver. The differentially expressed genes are involved in drug and steroid metabolism, osmotic regulation, or as yet unresolved cellular roles. In contrast, very few molecular differences were observed between the male and female hypothalamus in both mice and humans. We conclude that there are persistent differences in gene expression between adult males and females. These molecular differences have important implications for the physiological differences between males and females.
Affiliation(s)
- John L Rinn
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT 06520, USA.
|
16
|
Abstract
The nonrelativistic interpretation of quantum field theory achieved by quantization in an infinite momentum frame is spoiled by the inclusion of a mode of the field carrying p⁺ = 0. We therefore explore the viability of doing without such a mode in the context of spontaneous symmetry breaking (SSB), where its presence would seem to be most needed. We show that the physics of SSB in scalar quantum field theory in 1+1 space-time dimensions is accurately described without a zero mode.
Affiliation(s)
- JS Rozowsky
- Institute for Fundamental Theory, Department of Physics, University of Florida, Gainesville, Florida 32611, USA