1
|
Kluin RJC, Kemper K, Kuilman T, de Ruiter JR, Iyer V, Forment JV, Cornelissen-Steijger P, de Rink I, Ter Brugge P, Song JY, Klarenbeek S, McDermott U, Jonkers J, Velds A, Adams DJ, Peeper DS, Krijgsman O. XenofilteR: computational deconvolution of mouse and human reads in tumor xenograft sequence data. BMC Bioinformatics 2018; 19:366. [PMID: 30286710 PMCID: PMC6172735 DOI: 10.1186/s12859-018-2353-5] [Citation(s) in RCA: 52] [Impact Index Per Article: 8.7] [Received: 02/16/2018] [Accepted: 08/30/2018] [Indexed: 12/22/2022] Open
Abstract
BACKGROUND Mouse xenografts from (patient-derived) tumors (PDX) or tumor cell lines are widely used as models to study various biological and preclinical aspects of cancer. However, analyses of their RNA and DNA profiles are challenging, because they comprise reads not only from the grafted human cancer but also from the murine host. The reads of murine origin result in false positives in mutation analysis of DNA samples and obscure gene expression levels when sequencing RNA. Currently available algorithms for separating these reads are limited, and improvements in accuracy and ease of use are necessary. RESULTS We developed the R-package XenofilteR, which separates mouse from human sequence reads based on the edit-distance between a sequence read and reference genome. To assess the accuracy of XenofilteR, we generated sequence data by in silico mixing of mouse and human DNA sequence data. These analyses revealed that XenofilteR removes > 99.9% of sequence reads of mouse origin while retaining human sequences. This allowed for mutation analysis of xenograft samples with accurate variant allele frequencies, and retrieved all non-synonymous somatic tumor mutations. CONCLUSIONS XenofilteR accurately dissects RNA and DNA sequences of mouse and human origin, thereby outperforming currently available tools. XenofilteR is open source and available at https://github.com/PeeperLab/XenofilteR.
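The edit-distance idea in this abstract can be illustrated with a minimal Python sketch. It is not XenofilteR's code (XenofilteR is an R package), and the function names, the `None` convention for unmapped reads, and the handling of ties are all assumptions made for illustration. The sketch only assumes that each read has been aligned to both a human and a mouse reference and that an edit distance is available for each alignment.

```python
from typing import Dict, List, Optional, Tuple


def classify_read(human_dist: Optional[int], mouse_dist: Optional[int]) -> str:
    """Assign a read to a species by comparing edit distances.

    A distance of None means the read did not map to that reference
    (a convention assumed here, not taken from the paper).
    """
    if human_dist is None and mouse_dist is None:
        return "unmapped"
    if mouse_dist is None:
        return "human"
    if human_dist is None:
        return "mouse"
    if human_dist < mouse_dist:
        return "human"
    if mouse_dist < human_dist:
        return "mouse"
    # Equal distances: this sketch reports the tie instead of guessing.
    return "ambiguous"


def keep_human_reads(
    distances: Dict[str, Tuple[Optional[int], Optional[int]]]
) -> List[str]:
    """Return names of reads classified as human.

    `distances` maps a read name to (human_dist, mouse_dist).
    """
    return [
        name
        for name, (h, m) in distances.items()
        if classify_read(h, m) == "human"
    ]


if __name__ == "__main__":
    example = {
        "read1": (0, 5),        # closer to the human reference
        "read2": (6, 1),        # closer to the mouse reference
        "read3": (2, None),     # maps only to human
        "read4": (3, 3),        # tie
    }
    print(keep_human_reads(example))
```

How ties and unmapped reads should be treated is a design choice that the abstract does not specify; a real pipeline would need to settle it explicitly.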
Affiliation(s)
- Roelof J C Kluin
- Central Genomic Facility, Netherlands Cancer Institute, Amsterdam, The Netherlands
| | - Kristel Kemper
- Division of Molecular Oncology and Immunology, Netherlands Cancer Institute, Plesmanlaan 121, 1066 CX Amsterdam, The Netherlands
| | - Thomas Kuilman
- Division of Molecular Oncology and Immunology, Netherlands Cancer Institute, Plesmanlaan 121, 1066 CX Amsterdam, The Netherlands
| | - Julian R de Ruiter
- Division of Molecular Pathology, Netherlands Cancer Institute, Amsterdam, The Netherlands
- Division of Molecular Carcinogenesis, Netherlands Cancer Institute, Amsterdam, The Netherlands
| | - Vivek Iyer
- Experimental Cancer Genetics, Wellcome Trust Sanger Institute, Hinxton, Cambridgeshire, UK
| | - Josep V Forment
- The Wellcome Trust/Cancer Research UK (CRUK) Gurdon Institute, University of Cambridge, Cambridge, UK
- Present address: DNA Damage Response Biology, Bioscience Oncology IMED Biotech Unit, AstraZeneca, Cambridge, CB4 0WG, UK
| | - Paulien Cornelissen-Steijger
- Division of Molecular Oncology and Immunology, Netherlands Cancer Institute, Plesmanlaan 121, 1066 CX Amsterdam, The Netherlands
| | - Iris de Rink
- Central Genomic Facility, Netherlands Cancer Institute, Amsterdam, The Netherlands
| | - Petra Ter Brugge
- Division of Molecular Pathology, Netherlands Cancer Institute, Amsterdam, The Netherlands
| | - Ji-Ying Song
- Division of Experimental Animal Pathology, Netherlands Cancer Institute, Amsterdam, The Netherlands
| | - Sjoerd Klarenbeek
- Division of Experimental Animal Pathology, Netherlands Cancer Institute, Amsterdam, The Netherlands
| | - Ultan McDermott
- Cancer Genome Project, The Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1SA, UK
| | - Jos Jonkers
- Division of Molecular Pathology, Netherlands Cancer Institute, Amsterdam, The Netherlands
| | - Arno Velds
- Central Genomic Facility, Netherlands Cancer Institute, Amsterdam, The Netherlands
| | - David J Adams
- Division of Molecular Carcinogenesis, Netherlands Cancer Institute, Amsterdam, The Netherlands
| | - Daniel S Peeper
- Division of Molecular Oncology and Immunology, Netherlands Cancer Institute, Plesmanlaan 121, 1066 CX Amsterdam, The Netherlands
| | - Oscar Krijgsman
- Division of Molecular Oncology and Immunology, Netherlands Cancer Institute, Plesmanlaan 121, 1066 CX Amsterdam, The Netherlands
| |
|
2
|
Newman JRB, Conesa A, Mika M, New FN, Onengut-Gumuscu S, Atkinson MA, Rich SS, McIntyre LM, Concannon P. Disease-specific biases in alternative splicing and tissue-specific dysregulation revealed by multitissue profiling of lymphocyte gene expression in type 1 diabetes. Genome Res 2017; 27:1807-1815. [PMID: 29025893 PMCID: PMC5668939 DOI: 10.1101/gr.217984.116] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.9] [Received: 11/01/2016] [Accepted: 09/13/2017] [Indexed: 12/22/2022]
Abstract
Genome-wide association studies (GWAS) have identified multiple, shared allelic associations with many autoimmune diseases. However, the pathogenic contributions of variants residing in risk loci remain unresolved. The location of the majority of shared disease-associated variants in noncoding regions suggests they contribute to risk of autoimmunity through effects on gene expression in the immune system. In the current study, we test this hypothesis by applying RNA sequencing to CD4+, CD8+, and CD19+ lymphocyte populations isolated from 81 subjects with type 1 diabetes (T1D). We characterize and compare the expression patterns across these cell types for three gene sets: all genes, the set of genes implicated in autoimmune disease risk by GWAS, and the subset of these genes specifically implicated in T1D. We performed RNA sequencing and aligned the reads to both the human reference genome and a catalog of all possible splicing events developed from the genome, thereby providing a comprehensive evaluation of the roles of gene expression and alternative splicing (AS) in autoimmunity. Autoimmune candidate genes displayed greater expression specificity in the three lymphocyte populations relative to other genes, with significantly increased levels of splicing events, particularly those predicted to have substantial effects on protein isoform structure and function (e.g., intron retention, exon skipping). The majority of single-nucleotide polymorphisms within T1D-associated loci were also associated with one or more cis-expression quantitative trait loci (cis-eQTLs) and/or splicing eQTLs. Our findings highlight a substantial, and previously underrecognized, role for AS in the pathogenesis of autoimmune disorders and particularly for T1D.
Affiliation(s)
- Jeremy R B Newman
- Department of Molecular Genetics and Microbiology, University of Florida, Gainesville, Florida 32610, USA
| | - Ana Conesa
- Department of Microbiology and Cell Science, Institute for Food and Agricultural Sciences, University of Florida, Gainesville, Florida 32610, USA
- Genetics Institute, University of Florida, Gainesville, Florida 32610, USA
| | - Matthew Mika
- Center for Public Health Genomics and Department of Public Health Sciences, University of Virginia, Charlottesville, Virginia 22908, USA
| | - Felicia N New
- Department of Molecular Genetics and Microbiology, University of Florida, Gainesville, Florida 32610, USA
| | - Suna Onengut-Gumuscu
- Center for Public Health Genomics and Department of Public Health Sciences, University of Virginia, Charlottesville, Virginia 22908, USA
| | - Mark A Atkinson
- Diabetes Institute, University of Florida, Gainesville, Florida 32610, USA
- Department of Pathology, Immunology and Laboratory Medicine, University of Florida, Gainesville, Florida 32610, USA
| | - Stephen S Rich
- Center for Public Health Genomics and Department of Public Health Sciences, University of Virginia, Charlottesville, Virginia 22908, USA
| | - Lauren M McIntyre
- Department of Molecular Genetics and Microbiology, University of Florida, Gainesville, Florida 32610, USA
- Genetics Institute, University of Florida, Gainesville, Florida 32610, USA
| | - Patrick Concannon
- Genetics Institute, University of Florida, Gainesville, Florida 32610, USA
- Department of Pathology, Immunology and Laboratory Medicine, University of Florida, Gainesville, Florida 32610, USA
| |
|
3
|
Teng M, Love MI, Davis CA, Djebali S, Dobin A, Graveley BR, Li S, Mason CE, Olson S, Pervouchine D, Sloan CA, Wei X, Zhan L, Irizarry RA. A benchmark for RNA-seq quantification pipelines. Genome Biol 2016; 17:74. [PMID: 27107712 PMCID: PMC4842274 DOI: 10.1186/s13059-016-0940-1] [Citation(s) in RCA: 119] [Impact Index Per Article: 14.9] [Received: 08/06/2015] [Accepted: 04/08/2016] [Indexed: 02/07/2023] Open
Abstract
Obtaining RNA-seq measurements involves a complex data analytical process with a large number of competing algorithms as options. There is much debate about which of these methods provides the best approach. Unfortunately, it is currently difficult to evaluate their performance due in part to a lack of sensitive assessment metrics. We present a series of statistical summaries and plots to evaluate the performance in terms of specificity and sensitivity, available as an R/Bioconductor package (http://bioconductor.org/packages/rnaseqcomp). Using two independent datasets, we assessed seven competing pipelines. Performance was generally poor, with two methods clearly underperforming and RSEM slightly outperforming the rest.
Affiliation(s)
- Mingxiang Teng
- Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, 450 Brookline Avenue, Boston, MA, 02215, USA; Department of Biostatistics, Harvard TH Chan School of Public Health, 677 Huntington Avenue, Boston, MA, 02115, USA; School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
| | - Michael I Love
- Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, 450 Brookline Avenue, Boston, MA, 02215, USA; Department of Biostatistics, Harvard TH Chan School of Public Health, 677 Huntington Avenue, Boston, MA, 02115, USA
| | - Carrie A Davis
- Functional Genomics Group, Cold Spring Harbor Laboratory, 1 Bungtown Road, Cold Spring Harbor, NY, 11724, USA
| | - Sarah Djebali
- Bioinformatics and Genomics Programme, Centre for Genomic Regulation (CRG) and UPF, Doctor Aiguader, 88, Barcelona, 08003, Spain
| | - Alexander Dobin
- Functional Genomics Group, Cold Spring Harbor Laboratory, 1 Bungtown Road, Cold Spring Harbor, NY, 11724, USA
| | - Brenton R Graveley
- Department of Genetics and Genome Sciences, Institute for System Genomics, UConn Health Center, Farmington, CT, 06030, USA
| | - Sheng Li
- Department of Physiology and Biophysics, Weill Cornell Medical College, New York, New York, USA
| | - Christopher E Mason
- Department of Physiology and Biophysics, Weill Cornell Medical College, New York, New York, USA
| | - Sara Olson
- Department of Genetics and Genome Sciences, Institute for System Genomics, UConn Health Center, Farmington, CT, 06030, USA
| | - Dmitri Pervouchine
- Bioinformatics and Genomics Programme, Centre for Genomic Regulation (CRG) and UPF, Doctor Aiguader, 88, Barcelona, 08003, Spain
| | - Cricket A Sloan
- Department of Genetics, Stanford University, 300 Pasteur Drive, MC-5477, Stanford, CA, 94305, USA
| | - Xintao Wei
- Department of Genetics and Genome Sciences, Institute for System Genomics, UConn Health Center, Farmington, CT, 06030, USA
| | - Lijun Zhan
- Department of Genetics and Genome Sciences, Institute for System Genomics, UConn Health Center, Farmington, CT, 06030, USA
| | - Rafael A Irizarry
- Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, 450 Brookline Avenue, Boston, MA, 02215, USA; Department of Biostatistics, Harvard TH Chan School of Public Health, 677 Huntington Avenue, Boston, MA, 02115, USA
| |
|
4
|
Soneson C, Love MI, Robinson MD. Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences. F1000Res 2015; 4:1521. [PMID: 26925227 PMCID: PMC4712774 DOI: 10.12688/f1000research.7563.1] [Citation(s) in RCA: 1603] [Impact Index Per Article: 178.1] [Accepted: 12/14/2015] [Indexed: 01/14/2023] Open
Abstract
High-throughput sequencing of cDNA (RNA-seq) is used extensively to characterize the transcriptome of cells. Many transcriptomic studies aim at comparing either abundance levels or the transcriptome composition between given conditions, and as a first step, the sequencing reads must be used as the basis for abundance quantification of transcriptomic features of interest, such as genes or transcripts. Several different quantification approaches have been proposed, ranging from simple counting of reads that overlap given genomic regions to more complex estimation of underlying transcript abundances. In this paper, we show that gene-level abundance estimates and statistical inference offer advantages over transcript-level analyses, in terms of performance and interpretability. We also illustrate that, while the presence of differential isoform usage can lead to inflated false discovery rates in differential expression analyses on simple count matrices and transcript-level abundance estimates improve the performance in simulated data, the difference is relatively minor in several real data sets. Finally, we provide an R package (tximport) to help users integrate transcript-level abundance estimates from common quantification pipelines into count-based statistical inference engines.
Affiliation(s)
- Charlotte Soneson
- Institute for Molecular Life Sciences, University of Zurich, Zurich, 8057, Switzerland
- SIB Swiss Institute of Bioinformatics, University of Zurich, Zurich, 8057, Switzerland
| | - Michael I. Love
- Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Boston, MA, 02210, USA
- Department of Biostatistics, Harvard TH Chan School of Public Health, Boston, MA, 02115, USA
| | - Mark D. Robinson
- Institute for Molecular Life Sciences, University of Zurich, Zurich, 8057, Switzerland
- SIB Swiss Institute of Bioinformatics, University of Zurich, Zurich, 8057, Switzerland
| |
|
5
|
Soneson C, Love MI, Robinson MD. Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences. F1000Res 2015; 4:1521. [PMID: 26925227 PMCID: PMC4712774 DOI: 10.12688/f1000research.7563.2] [Citation(s) in RCA: 1457] [Impact Index Per Article: 161.9] [Accepted: 02/23/2016] [Indexed: 12/21/2022] Open
Abstract
High-throughput sequencing of cDNA (RNA-seq) is used extensively to characterize the transcriptome of cells. Many transcriptomic studies aim at comparing either abundance levels or the transcriptome composition between given conditions, and as a first step, the sequencing reads must be used as the basis for abundance quantification of transcriptomic features of interest, such as genes or transcripts. Various quantification approaches have been proposed, ranging from simple counting of reads that overlap given genomic regions to more complex estimation of underlying transcript abundances. In this paper, we show that gene-level abundance estimates and statistical inference offer advantages over transcript-level analyses, in terms of performance and interpretability. We also illustrate that the presence of differential isoform usage can lead to inflated false discovery rates in differential gene expression analyses on simple count matrices but that this can be addressed by incorporating offsets derived from transcript-level abundance estimates. We also show that the problem is relatively minor in several real data sets. Finally, we provide an R package (tximport) to help users integrate transcript-level abundance estimates from common quantification pipelines into count-based statistical inference engines.
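As a rough illustration of what "integrating transcript-level abundance estimates into gene-level analysis" can look like, the Python sketch below sums per-transcript estimated counts to gene level and computes an abundance-weighted average transcript length per gene, which is one plausible basis for a length offset. This is not the tximport API (tximport is an R package); the function names, the `tx2gene` mapping, and the choice of offset are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict


def summarize_to_gene(
    tx_counts: Dict[str, float], tx2gene: Dict[str, str]
) -> Dict[str, float]:
    """Sum transcript-level estimated counts into gene-level counts.

    Transcripts missing from the (hypothetical) tx2gene map are skipped.
    """
    gene_counts: Dict[str, float] = defaultdict(float)
    for tx, count in tx_counts.items():
        gene = tx2gene.get(tx)
        if gene is not None:
            gene_counts[gene] += count
    return dict(gene_counts)


def weighted_gene_length(
    tx_abundance: Dict[str, float],
    tx_length: Dict[str, float],
    tx2gene: Dict[str, str],
) -> Dict[str, float]:
    """Abundance-weighted average transcript length per gene.

    Genes whose transcripts all have zero abundance are omitted,
    because the weighted average is undefined for them.
    """
    weighted_sum: Dict[str, float] = defaultdict(float)
    total: Dict[str, float] = defaultdict(float)
    for tx, abundance in tx_abundance.items():
        gene = tx2gene.get(tx)
        if gene is None or tx not in tx_length:
            continue
        weighted_sum[gene] += abundance * tx_length[tx]
        total[gene] += abundance
    return {g: weighted_sum[g] / total[g] for g in total if total[g] > 0}


if __name__ == "__main__":
    tx2gene = {"t1": "gA", "t2": "gA", "t3": "gB"}
    counts = {"t1": 10.0, "t2": 5.0, "t3": 2.0}
    abundance = {"t1": 3.0, "t2": 1.0, "t3": 2.0}
    lengths = {"t1": 1000.0, "t2": 2000.0, "t3": 500.0}
    print(summarize_to_gene(counts, tx2gene))
    print(weighted_gene_length(abundance, lengths, tx2gene))
```

The weighted length changes between samples when isoform usage shifts, which is why a per-sample, per-gene offset of this kind could compensate for differential isoform usage in a count-based model.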
Affiliation(s)
- Charlotte Soneson
- Institute for Molecular Life Sciences, University of Zurich, Zurich, 8057, Switzerland
- SIB Swiss Institute of Bioinformatics, University of Zurich, Zurich, 8057, Switzerland
| | - Michael I. Love
- Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Boston, MA, 02210, USA
- Department of Biostatistics, Harvard TH Chan School of Public Health, Boston, MA, 02115, USA
| | - Mark D. Robinson
- Institute for Molecular Life Sciences, University of Zurich, Zurich, 8057, Switzerland
- SIB Swiss Institute of Bioinformatics, University of Zurich, Zurich, 8057, Switzerland
| |
|
6
|
Soneson C, Love MI, Robinson MD. Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences. F1000Res 2015. [PMID: 26925227 DOI: 10.12688/f1000research.7563.1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/15/2023] Open
|
7
|
Soneson C, Love MI, Robinson MD. Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences. F1000Res 2015. [PMID: 26925227 DOI: 10.5256/f1000research.7563.d114723] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 04/20/2023] Open
|
8
|
Soneson C, Love MI, Robinson MD. Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences. F1000Res 2015. [PMID: 26925227 DOI: 10.5256/f1000research.7563.d114726] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Indexed: 04/20/2023] Open
|
9
|
Soneson C, Love MI, Robinson MD. Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences. F1000Res 2015. [PMID: 26925227 DOI: 10.5256/f1000research.7563.d114724] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 04/20/2023] Open
|
10
|
Soneson C, Love MI, Robinson MD. Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences. F1000Res 2015. [PMID: 26925227 DOI: 10.5256/f1000research.7563.d114722] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 04/20/2023] Open
|
11
|
Soneson C, Love MI, Robinson MD. Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences. F1000Res 2015. [PMID: 26925227 DOI: 10.5256/f1000research.7563.d114730] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Indexed: 04/20/2023] Open
|
12
|
Soneson C, Love MI, Robinson MD. Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences. F1000Res 2015. [PMID: 26925227 DOI: 10.5256/f1000research.7563.d114725] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Indexed: 04/20/2023] Open
|