151
Hicks SC, Okrah K, Paulson JN, Quackenbush J, Irizarry RA, Bravo HC. Smooth quantile normalization. Biostatistics 2019; 19:185-198. [PMID: 29036413 DOI: 10.1093/biostatistics/kxx028] [Citation(s) in RCA: 66] [Impact Index Per Article: 11.0] [Received: 11/01/2016] [Accepted: 05/07/2017] [Indexed: 11/14/2022] Open
Abstract
Between-sample normalization is a critical step in genomic data analysis to remove systematic bias and unwanted technical variation in high-throughput data. Global normalization methods are based on the assumption that observed variability in global properties is due to technical reasons and are unrelated to the biology of interest. For example, some methods correct for differences in sequencing read counts by scaling features to have similar median values across samples, but these fail to reduce other forms of unwanted technical variation. Methods such as quantile normalization transform the statistical distributions across samples to be the same and assume global differences in the distribution are induced by only technical variation. However, it remains unclear how to proceed with normalization if these assumptions are violated, for example, if there are global differences in the statistical distributions between biological conditions or groups, and external information, such as negative or control features, is not available. Here, we introduce a generalization of quantile normalization, referred to as smooth quantile normalization (qsmooth), which is based on the assumption that the statistical distribution of each sample should be the same (or have the same distributional shape) within biological groups or conditions, but allowing that they may differ between groups. We illustrate the advantages of our method on several high-throughput datasets with global differences in distributions corresponding to different biological conditions. We also perform a Monte Carlo simulation study to illustrate the bias-variance tradeoff and root mean squared error of qsmooth compared to other global normalization methods. A software implementation is available from https://github.com/stephaniehicks/qsmooth.
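To make the two extremes described in the abstract concrete, the following is a minimal sketch of ordinary quantile normalization and of the same procedure applied separately within each biological group. It is not the qsmooth algorithm itself, whose weighting between the global and group-level references is not spelled out in the abstract. The function names and the features-by-samples matrix layout are assumptions made for illustration.

```python
import numpy as np

def quantile_normalize(x):
    """Classic quantile normalization of a features-by-samples matrix.

    Every sample (column) is forced onto the same reference distribution,
    taken here as the mean of the k-th smallest values across samples.
    Ties are broken arbitrarily by argsort (a simplification).
    """
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, axis=0)                 # sort order within each sample
    ranks = np.argsort(order, axis=0)             # rank of each feature in its sample
    reference = np.sort(x, axis=0).mean(axis=1)   # shared reference quantiles
    return reference[ranks]

def groupwise_quantile_normalize(x, groups):
    """Quantile-normalize samples within each group only (hypothetical helper).

    Samples in the same group end up with identical distributions, while
    distributions are allowed to differ between groups.
    """
    x = np.asarray(x, dtype=float)
    groups = np.asarray(groups)
    out = np.empty_like(x)
    for g in np.unique(groups):
        cols = groups == g                        # boolean mask of samples in group g
        out[:, cols] = quantile_normalize(x[:, cols])
    return out
```

With a single group, `groupwise_quantile_normalize` reduces to `quantile_normalize`; with one sample per group, it leaves the data unchanged.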
Affiliation(s)
- Stephanie C Hicks
  - Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, 450 Brookline Ave, Boston, MA 02215, USA and Department of Biostatistics, Harvard T.H. Chan School of Public Health, 677 Huntington Ave, Boston, MA 02115, USA
- Kwame Okrah
  - Genentech, Product Development Biostatistics, 1 DNA Way, South San Francisco, CA 94080, USA
- Joseph N Paulson
  - Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, 450 Brookline Ave, Boston, MA 02215, USA and Department of Biostatistics, Harvard T.H. Chan School of Public Health, 677 Huntington Ave, Boston, MA 02115, USA
- John Quackenbush
  - Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, 450 Brookline Ave, Boston, MA 02215, USA and Department of Biostatistics, Harvard T.H. Chan School of Public Health, 677 Huntington Ave, Boston, MA 02115, USA
- Rafael A Irizarry
  - Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, 450 Brookline Ave, Boston, MA 02215, USA and Department of Biostatistics, Harvard T.H. Chan School of Public Health, 677 Huntington Ave, Boston, MA 02115, USA
- Héctor Corrada Bravo
  - Department of Computer Science, University of Maryland, College Park, USA and Center for Bioinformatics and Computational Biology, Institute of Advanced Computer Studies, University of Maryland, 8314 Paint Branch Dr., College Park, MD 20742, USA
152
Genome-wide identification of mRNA 5-methylcytosine in mammals. Nat Struct Mol Biol 2019; 26:380-388. [PMID: 31061524 DOI: 10.1038/s41594-019-0218-x] [Citation(s) in RCA: 189] [Impact Index Per Article: 31.5] [Received: 10/23/2018] [Accepted: 03/27/2019] [Indexed: 02/07/2023]
Abstract
Accurate and systematic transcriptome-wide detection of 5-methylcytosine (m5C) has proved challenging, and there are conflicting views about the prevalence of this modification in mRNAs. Here we report an experimental and computational framework that robustly identified mRNA m5C sites and determined sequence motifs and structural features associated with the modification using a set of high-confidence sites. We developed a quantitative atlas of RNA m5C sites in human and mouse tissues based on our framework. In a given tissue, we typically identified several hundred exonic m5C sites. About 62-70% of the sites had low methylation levels (<20% methylation), while 8-10% of the sites were moderately or highly methylated (>40% methylation). Cross-species analysis revealed that species, rather than tissue type, was the primary determinant of methylation levels, indicating strong cis-directed regulation of RNA methylation. Combined, these data provide a valuable resource for identifying the regulation and functions of RNA methylation.
153
Pérez-Rubio P, Lottaz C, Engelmann JC. FastqPuri: high-performance preprocessing of RNA-seq data. BMC Bioinformatics 2019; 20:226. [PMID: 31053060 PMCID: PMC6500068 DOI: 10.1186/s12859-019-2799-0] [Citation(s) in RCA: 25] [Impact Index Per Article: 4.2] [Received: 10/16/2018] [Accepted: 04/09/2019] [Indexed: 12/23/2022] Open
Abstract
Background RNA sequencing (RNA-seq) has become the standard means of analyzing gene and transcript expression in high-throughput. While previously sequence alignment was a time demanding step, fast alignment methods and even more so transcript counting methods which avoid mapping and quantify gene and transcript expression by evaluating whether a read is compatible with a transcript, have led to significant speed-ups in data analysis. Now, the most time demanding step in the analysis of RNA-seq data is preprocessing the raw sequence data, such as running quality control and adapter, contamination and quality filtering before transcript or gene quantification. To do so, many researchers chain different tools, but a comprehensive, flexible and fast software that covers all preprocessing steps is currently missing. Results We here present FastqPuri, a light-weight and highly efficient preprocessing tool for fastq data. FastqPuri provides sequence quality reports on the sample and dataset level with new plots which facilitate decision making for subsequent quality filtering. Moreover, FastqPuri efficiently removes adapter sequences and sequences from biological contamination from the data. It accepts both single- and paired-end data in uncompressed or compressed fastq files. FastqPuri can be run stand-alone and is suitable to be run within pipelines. We benchmarked FastqPuri against existing tools and found that FastqPuri is superior in terms of speed, memory usage, versatility and comprehensiveness. Conclusions FastqPuri is a new tool which covers all aspects of short read sequence data preprocessing. It was designed for RNA-seq data to meet the needs for fast preprocessing of fastq data to allow transcript and gene counting, but it is suitable to process any short read sequencing data of which high sequence quality is needed, such as for genome assembly or SNV (single nucleotide variant) detection. 
FastqPuri is most flexible in filtering undesired biological sequences by offering two approaches to optimize speed and memory usage dependent on the total size of the potential contaminating sequences. FastqPuri is available at https://github.com/jengelmann/FastqPuri. It is implemented in C and R and licensed under GPL v3. Electronic supplementary material The online version of this article (10.1186/s12859-019-2799-0) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Paula Pérez-Rubio
  - Statistical Bioinformatics, Institute of Functional Genomics, University of Regensburg, Am BioPark 9, Regensburg, 93053, Germany
- Claudio Lottaz
  - Statistical Bioinformatics, Institute of Functional Genomics, University of Regensburg, Am BioPark 9, Regensburg, 93053, Germany
- Julia C Engelmann
  - Department of Marine Microbiology and Biogeochemistry, NIOZ Royal Netherlands Institute for Sea Research and Utrecht University, P.O. Box 59, Den Burg, 1790 AB, The Netherlands
154
Chung RH, Kang CY. A multi-omics data simulator for complex disease studies and its application to evaluate multi-omics data analysis methods for disease classification. Gigascience 2019; 8:giz045. [PMID: 31029063 PMCID: PMC6486474 DOI: 10.1093/gigascience/giz045] [Citation(s) in RCA: 31] [Impact Index Per Article: 5.2] [Received: 10/10/2018] [Revised: 03/05/2019] [Accepted: 03/28/2019] [Indexed: 01/16/2023] Open
Abstract
BACKGROUND An integrative multi-omics analysis approach that combines multiple types of omics data including genomics, epigenomics, transcriptomics, proteomics, metabolomics, and microbiomics has become increasingly popular for understanding the pathophysiology of complex diseases. Although many multi-omics analysis methods have been developed for complex disease studies, only a few simulation tools that simulate multiple types of omics data and model their relationships with disease status are available, and these tools have their limitations in simulating the multi-omics data. RESULTS We developed the multi-omics data simulator OmicsSIMLA, which simulates genomics (i.e., single-nucleotide polymorphisms [SNPs] and copy number variations), epigenomics (i.e., bisulphite sequencing), transcriptomics (i.e., RNA sequencing), and proteomics (i.e., normalized reverse phase protein array) data at the whole-genome level. Furthermore, the relationships between different types of omics data, such as methylation quantitative trait loci (SNPs influencing methylation), expression quantitative trait loci (SNPs influencing gene expression), and expression quantitative trait methylations (methylations influencing gene expression), were modeled. More importantly, the relationships between these multi-omics data and the disease status were modeled as well. We used OmicsSIMLA to simulate a multi-omics dataset for breast cancer under a hypothetical disease model and used the data to compare the performance among existing multi-omics analysis methods in terms of disease classification accuracy and runtime. We also used OmicsSIMLA to simulate a multi-omics dataset with a scale similar to an ovarian cancer multi-omics dataset. The neural network-based multi-omics analysis method ATHENA was applied to both the real and simulated data and the results were compared.
Our results demonstrated that complex disease mechanisms can be simulated by OmicsSIMLA, and ATHENA showed the highest prediction accuracy when the effects of multi-omics features (e.g., SNPs, copy number variations, and gene expression levels) on the disease were strong. Furthermore, similar results can be obtained from ATHENA when analyzing the simulated and real ovarian multi-omics data. CONCLUSIONS OmicsSIMLA will be useful to evaluate the performance of different multi-omics analysis methods. Sample sizes and power can also be calculated by OmicsSIMLA when planning a new multi-omics disease study.
Affiliation(s)
- Ren-Hua Chung
  - Division of Biostatistics and Bioinformatics, Institute of Population Health Sciences, National Health Research Institutes, No. 35, Keyan Road, Zhunan, 350, Taiwan
- Chen-Yu Kang
  - Division of Biostatistics and Bioinformatics, Institute of Population Health Sciences, National Health Research Institutes, No. 35, Keyan Road, Zhunan, 350, Taiwan
155
Sherman TD, Kagohara LT, Cao R, Cheng R, Satriano M, Considine M, Krigsfeld G, Ranaweera R, Tang Y, Jablonski SA, Stein-O'Brien G, Gaykalova DA, Weiner LM, Chung CH, Fertig EJ. CancerInSilico: An R/Bioconductor package for combining mathematical and statistical modeling to simulate time course bulk and single cell gene expression data in cancer. PLoS Comput Biol 2019; 14:e1006935. [PMID: 31002670 PMCID: PMC6504085 DOI: 10.1371/journal.pcbi.1006935] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 07/04/2018] [Revised: 05/07/2019] [Accepted: 03/11/2019] [Indexed: 11/18/2022] Open
Abstract
Bioinformatics techniques to analyze time course bulk and single cell omics data are advancing. The absence of a known ground truth of the dynamics of molecular changes challenges benchmarking their performance on real data. Realistic simulated time-course datasets are essential to assess the performance of time course bioinformatics algorithms. We develop an R/Bioconductor package, CancerInSilico, to simulate bulk and single cell transcriptional data from a known ground truth obtained from mathematical models of cellular systems. This package contains a general R infrastructure for running cell-based models and simulating gene expression data based on the model states. We show how to use this package to simulate a gene expression data set and consequently benchmark analysis methods on this data set with a known ground truth. The package is freely available via Bioconductor: http://bioconductor.org/packages/CancerInSilico/
Affiliation(s)
- Thomas D. Sherman
  - Department of Oncology, Division of Biostatistics and Bioinformatics, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University, Baltimore, MD, United States of America
- Luciane T. Kagohara
  - Department of Oncology, Division of Biostatistics and Bioinformatics, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University, Baltimore, MD, United States of America
- Raymon Cao
  - Department of Oncology, Division of Biostatistics and Bioinformatics, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University, Baltimore, MD, United States of America
- Raymond Cheng
  - Science, Math and Computer Science Magnet Program, Poolesville High School, Poolesville, MD, United States of America
- Matthew Satriano
  - Department of Mathematics, University of Waterloo, Waterloo, Ontario, Canada
- Michael Considine
  - Department of Oncology, Division of Biostatistics and Bioinformatics, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University, Baltimore, MD, United States of America
- Gabriel Krigsfeld
  - Department of Oncology, Division of Biostatistics and Bioinformatics, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University, Baltimore, MD, United States of America
- Yong Tang
  - Lombardi Comprehensive Cancer Center, Georgetown University, Washington, DC, United States of America
- Sandra A. Jablonski
  - Lombardi Comprehensive Cancer Center, Georgetown University, Washington, DC, United States of America
- Genevieve Stein-O'Brien
  - Department of Oncology, Division of Biostatistics and Bioinformatics, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University, Baltimore, MD, United States of America
  - Institute of Genetic Medicine, Johns Hopkins University, Baltimore, MD, United States of America
- Daria A. Gaykalova
  - Department of Otolaryngology-Head and Neck Surgery, Johns Hopkins University School of Medicine, Baltimore, MD, United States of America
- Louis M. Weiner
  - Lombardi Comprehensive Cancer Center, Georgetown University, Washington, DC, United States of America
- Elana J. Fertig
  - Department of Oncology, Division of Biostatistics and Bioinformatics, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University, Baltimore, MD, United States of America
  - Department of Applied Mathematics and Statistics, Johns Hopkins University, Baltimore, MD, United States of America
  - Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, United States of America
156
Aguiar VRC, César J, Delaneau O, Dermitzakis ET, Meyer D. Expression estimation and eQTL mapping for HLA genes with a personalized pipeline. PLoS Genet 2019; 15:e1008091. [PMID: 31009447 PMCID: PMC6497317 DOI: 10.1371/journal.pgen.1008091] [Citation(s) in RCA: 54] [Impact Index Per Article: 9.0] [Received: 07/11/2018] [Revised: 05/02/2019] [Accepted: 03/13/2019] [Indexed: 01/07/2023] Open
Abstract
The HLA (Human Leukocyte Antigens) genes are well-documented targets of balancing selection, and variation at these loci is associated with many disease phenotypes. Variation in expression levels also influences disease susceptibility and resistance, but little information exists about the regulation and population-level patterns of expression. This results from the difficulty in mapping short reads originated from these highly polymorphic loci, and in accounting for the existence of several paralogues. We developed a computational pipeline to accurately estimate expression for HLA genes based on RNA-seq, improving both locus-level and allele-level estimates. First, reads are aligned to all known HLA sequences in order to infer HLA genotypes, then quantification of expression is carried out using a personalized index. We use simulations to show that expression estimates obtained in this way are not biased due to divergence from the reference genome. We applied our pipeline to the GEUVADIS dataset, and compared the quantifications to those obtained with reference transcriptome. Although the personalized pipeline recovers more reads, we found that using the reference transcriptome produces estimates similar to the personalized pipeline (r ≥ 0.87) with the exception of HLA-DQA1. We describe the impact of the HLA-personalized approach on downstream analyses for nine classical HLA loci (HLA-A, HLA-C, HLA-B, HLA-DRA, HLA-DRB1, HLA-DQA1, HLA-DQB1, HLA-DPA1, HLA-DPB1). Although the influence of the HLA-personalized approach is modest for eQTL mapping, the p-values and the causality of the eQTLs obtained are better than when the reference transcriptome is used. We investigate how the eQTLs we identified explain variation in expression among lineages of HLA alleles. Finally, we discuss possible causes underlying differences between expression estimates obtained using RNA-seq, antibody-based approaches and qPCR.
Affiliation(s)
- Vitor R. C. Aguiar
  - Department of Genetics and Evolutionary Biology, Institute of Biosciences, University of São Paulo, São Paulo, Brazil
- Jônatas César
  - Department of Genetics and Evolutionary Biology, Institute of Biosciences, University of São Paulo, São Paulo, Brazil
- Olivier Delaneau
  - Department of Genetic Medicine and Development, University of Geneva Medical School, Geneva, Switzerland
- Emmanouil T. Dermitzakis
  - Department of Genetic Medicine and Development, University of Geneva Medical School, Geneva, Switzerland
- Diogo Meyer
  - Department of Genetics and Evolutionary Biology, Institute of Biosciences, University of São Paulo, São Paulo, Brazil
157
Owen N, Moosajee M. RNA-sequencing in ophthalmology research: considerations for experimental design and analysis. Ther Adv Ophthalmol 2019; 11:2515841419835460. [PMID: 30911735 PMCID: PMC6421592 DOI: 10.1177/2515841419835460] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 10/03/2018] [Accepted: 02/08/2019] [Indexed: 12/13/2022] Open
Abstract
High-throughput, massively parallel sequence analysis has revolutionized the way that researchers design and execute scientific investigations. Vast amounts of sequence data can be generated in short periods of time. Regarding ophthalmology and vision research, extensive interrogation of patient samples for underlying causative DNA mutations has resulted in the discovery of many new genes relevant to eye disease. However, such analysis remains functionally limited. RNA-sequencing accurately snapshots thousands of genes, capturing many subtypes of RNA molecules, and has become the gold standard for transcriptome gene expression quantification. RNA-sequencing has the potential to advance our understanding of eye development and disease; it can reveal new candidates to improve our molecular diagnosis rates and highlight therapeutic targets for intervention. But with a wide range of applications, the design of such experiments can be problematic, no single optimal pipeline exists, and therefore, several considerations must be undertaken for optimal study design. We review the key steps involved in RNA-sequencing experimental design and the downstream bioinformatic pipelines used for differential gene expression. We provide guidance on the application of RNA-sequencing to ophthalmology and sources of open-access eye-related data sets.
Affiliation(s)
- Nicholas Owen
  - Development, Ageing and Disease Theme, UCL Institute of Ophthalmology, University College London, London, UK
158
Soneson C, Love MI, Patro R, Hussain S, Malhotra D, Robinson MD. A junction coverage compatibility score to quantify the reliability of transcript abundance estimates and annotation catalogs. Life Sci Alliance 2019; 2:2/1/e201800175. [PMID: 30655364 PMCID: PMC6337739 DOI: 10.26508/lsa.201800175] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.3] [Received: 08/24/2018] [Revised: 01/07/2019] [Accepted: 01/08/2019] [Indexed: 02/01/2023] Open
Abstract
Comparison of observed exon–exon junction counts to those predicted from estimated transcript abundances can identify genes with misannotated or misquantified transcripts. Most methods for statistical analysis of RNA-seq data take a matrix of abundance estimates for some type of genomic features as their input, and consequently the quality of any obtained results is directly dependent on the quality of these abundances. Here, we present the junction coverage compatibility score, which provides a way to evaluate the reliability of transcript-level abundance estimates and the accuracy of transcript annotation catalogs. It works by comparing the observed number of reads spanning each annotated splice junction in a genomic region to the predicted number of junction-spanning reads, inferred from the estimated transcript abundances and the genomic coordinates of the corresponding annotated transcripts. We show that although most genes show good agreement between the observed and predicted junction coverages, there is a small set of genes that do not. Genes with poor agreement are found regardless of the method used to estimate transcript abundances, and the corresponding transcript abundances should be treated with care in any downstream analyses.
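The comparison of observed and predicted junction coverage can be illustrated with a small sketch. The abstract does not give the exact formula of the junction coverage compatibility score, so the function below is only an assumed, simplified discrepancy measure: predicted counts are rescaled to the observed total for the gene, and the summed absolute difference is divided by the observed total. The name `junction_discrepancy` and the choice of scaling are hypothetical.

```python
def junction_discrepancy(observed, predicted):
    """Relative mismatch between observed and predicted junction read counts.

    `observed` and `predicted` list one count per annotated junction of a gene,
    in the same order. Returns 0.0 for perfect agreement (up to a global scale
    factor) and larger values for poorer agreement. Returns None when either
    total is zero, because the score is undefined without junction reads.
    NOTE: an illustrative measure, not the published score.
    """
    total_obs = sum(observed)
    total_pred = sum(predicted)
    if total_obs == 0 or total_pred == 0:
        return None
    scale = total_obs / total_pred  # put predictions on the observed scale
    mismatch = sum(abs(o - scale * p) for o, p in zip(observed, predicted))
    return mismatch / total_obs
```

Under this assumed definition, a gene whose predicted junction counts are proportional to the observed ones scores 0, whereas a gene whose reads fall on junctions that the estimated abundances do not predict scores close to 2.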
Affiliation(s)
- Charlotte Soneson
  - Institute of Molecular Life Sciences, University of Zurich, Zurich, Switzerland
  - SIB Swiss Institute of Bioinformatics, University of Zurich, Zurich, Switzerland
- Michael I Love
  - Department of Biostatistics, University of North Carolina-Chapel Hill, Chapel Hill, NC, USA
  - Department of Genetics, University of North Carolina-Chapel Hill, Chapel Hill, NC, USA
- Rob Patro
  - Department of Computer Science, Stony Brook University, NY, USA
- Shobbir Hussain
  - Department of Biology and Biochemistry, University of Bath, Bath, UK
- Dheeraj Malhotra
  - F. Hoffmann-La Roche Ltd, Pharma Research and Early Development, Neuroscience, Ophthalmology and Rare Diseases, Roche Innovation Center Basel, Basel, Switzerland
- Mark D Robinson
  - Institute of Molecular Life Sciences, University of Zurich, Zurich, Switzerland
  - SIB Swiss Institute of Bioinformatics, University of Zurich, Zurich, Switzerland
159
Alasoo K, Rodrigues J, Danesh J, Freitag DF, Paul DS, Gaffney DJ. Genetic effects on promoter usage are highly context-specific and contribute to complex traits. eLife 2019; 8:e41673. [PMID: 30618377 PMCID: PMC6349408 DOI: 10.7554/elife.41673] [Citation(s) in RCA: 44] [Impact Index Per Article: 7.3] [Received: 09/05/2018] [Accepted: 01/08/2019] [Indexed: 12/12/2022] Open
Abstract
Genetic variants regulating RNA splicing and transcript usage have been implicated in both common and rare diseases. Although transcript usage quantitative trait loci (tuQTLs) have been mapped across multiple cell types and contexts, it is challenging to distinguish between the main molecular mechanisms controlling transcript usage: promoter choice, splicing and 3' end choice. Here, we analysed RNA-seq data from human macrophages exposed to three inflammatory and one metabolic stimulus. In addition to conventional gene-level and transcript-level analyses, we also directly quantified promoter usage, splicing and 3' end usage. We found that promoters, splicing and 3' ends were predominantly controlled by independent genetic variants enriched in distinct genomic features. Promoter usage QTLs were also 50% more likely to be context-specific than other tuQTLs and constituted 25% of the transcript-level colocalisations with complex traits. Thus, promoter usage might be an underappreciated molecular mechanism mediating complex trait associations in a context-specific manner.
Affiliation(s)
- Kaur Alasoo
  - Institute of Computer Science, University of Tartu, Tartu, Estonia
  - Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, United Kingdom
- Julia Rodrigues
  - Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, United Kingdom
- John Danesh
  - Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, United Kingdom
  - BHF Cardiovascular Epidemiology Unit, Department of Public Health and Primary Care, University of Cambridge, Cambridge, United Kingdom
  - British Heart Foundation Centre of Excellence, Division of Cardiovascular Medicine, Addenbrooke’s Hospital, Cambridge, United Kingdom
  - National Institute for Health Research Blood and Transplant Unit (NIHR BTRU) in Donor Health and Genomics, Department of Public Health and Primary Care, University of Cambridge, Cambridge, United Kingdom
- Daniel F Freitag
  - Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, United Kingdom
  - British Heart Foundation Centre of Excellence, Division of Cardiovascular Medicine, Addenbrooke’s Hospital, Cambridge, United Kingdom
- Dirk S Paul
  - Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, United Kingdom
  - British Heart Foundation Centre of Excellence, Division of Cardiovascular Medicine, Addenbrooke’s Hospital, Cambridge, United Kingdom
- Daniel J Gaffney
  - Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, United Kingdom
160
Garanina IA, Fisunov GY, Govorun VM. BAC-BROWSER: The Tool for Visualization and Analysis of Prokaryotic Genomes. Front Microbiol 2018; 9:2827. [PMID: 30519231 PMCID: PMC6258810 DOI: 10.3389/fmicb.2018.02827] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.9] [Received: 08/09/2018] [Accepted: 11/05/2018] [Indexed: 11/13/2022] Open
Abstract
Prokaryotes are actively studied objects in the scope of genomic regulation. Microbiologists need special tools for complex data analysis to study and identify regulatory mechanisms in bacteria and archaea. We developed a tool, BAC-BROWSER, specifically for visualization and analysis of small prokaryotic genomes. BAC-BROWSER provides tools for different types of analysis to study a wide set of regulatory mechanisms of prokaryotes:
- transcriptional regulation by transcription factors (TFs): analysis of TFs, their targets, and binding sites
- other regulatory motifs: promoters, terminators and ribosome binding sites
- transcriptional regulation by variation of operon structure, alternative starts or ends of transcription
- non-coding RNAs, antisense RNAs
- RNA secondary structure, riboswitches
- GC content, GC skew, codon usage
BAC-BROWSER incorporates free programs that accelerate the verification of obtained results: primer design and oligocalculator, vector visualization, and a tool for synthetic gene construction. The program is designed for the Windows operating system and is freely available for download at http://smdb.rcpcm.org/tools/index.html.
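Two of the quantities listed above, GC content and GC skew, have standard definitions that are easy to show in code. The sketch below is independent of BAC-BROWSER, whose implementation is not described in the abstract; the function names and the fixed-step window scheme are assumptions for illustration. GC skew is computed as (G - C) / (G + C).

```python
def gc_content(seq):
    """Fraction of G and C bases in a sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)

def gc_skew(seq):
    """GC skew, (G - C) / (G + C); defined as 0.0 when there is no G or C."""
    seq = seq.upper()
    g, c = seq.count("G"), seq.count("C")
    if g + c == 0:
        return 0.0
    return (g - c) / (g + c)

def windowed(seq, size, step, stat):
    """Apply `stat` to consecutive full windows of `seq` (hypothetical helper).

    Returns a list of (window_start, value) pairs; a trailing partial window
    is skipped.
    """
    return [
        (start, stat(seq[start:start + size]))
        for start in range(0, len(seq) - size + 1, step)
    ]
```

For example, `windowed(genome, 1000, 500, gc_skew)` would give a skew profile along a genome string held in a variable `genome`; the window size and step here are arbitrary.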
Affiliation(s)
- Irina A Garanina
  - Federal Research and Clinical Centre of Physical-Chemical Medicine, Moscow, Russia
  - Shemyakin-Ovchinnikov Institute of Bioorganic Chemistry, Russian Academy of Sciences, Moscow, Russia
- Gleb Y Fisunov
  - Federal Research and Clinical Centre of Physical-Chemical Medicine, Moscow, Russia
- Vadim M Govorun
  - Federal Research and Clinical Centre of Physical-Chemical Medicine, Moscow, Russia
  - Moscow Institute of Physics and Technology, Dolgoprudny, Russia
161
Lee D, Cheng A, Lawlor N, Bolisetty M, Ucar D. Detection of correlated hidden factors from single cell transcriptomes using Iteratively Adjusted-SVA (IA-SVA). Sci Rep 2018; 8:17040. [PMID: 30451954 PMCID: PMC6242813 DOI: 10.1038/s41598-018-35365-9] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Received: 12/04/2017] [Accepted: 11/01/2018] [Indexed: 01/01/2023] Open
Abstract
Single cell RNA-sequencing (scRNA-seq) precisely characterizes gene expression levels and dissects variation in expression associated with the state (technical or biological) and the type of the cell, which is averaged out in bulk measurements. Multiple and correlated sources contribute to gene expression variation in single cells, which makes their estimation difficult with the existing methods developed for batch correction (e.g., surrogate variable analysis (SVA)) that estimate orthogonal transformations of these sources. We developed iteratively adjusted surrogate variable analysis (IA-SVA) that can estimate hidden factors even when they are correlated with other sources of variation by identifying a set of genes associated with each hidden factor in an iterative manner. Analysis of scRNA-seq data from human cells showed that IA-SVA could accurately capture hidden variation arising from technical (e.g., stacked doublet cells) or biological sources (e.g., cell type or cell-cycle stage). Furthermore, IA-SVA delivers a set of genes associated with the detected hidden source to be used in downstream data analyses. As a proof of concept, IA-SVA recapitulated known marker genes for islet cell subsets (e.g., alpha, beta), which improved the grouping of subsets into distinct clusters. Taken together, IA-SVA is an effective and novel method to dissect multiple and correlated sources of variation in scRNA-seq data.
Affiliation(s)
- Donghyung Lee
  - The Jackson Laboratory for Genomic Medicine, Farmington, 06032, CT, USA
- Anthony Cheng
  - The Jackson Laboratory for Genomic Medicine, Farmington, 06032, CT, USA
  - Department of Genetics and Genome Sciences, University of Connecticut Health Center, Farmington, 06030, CT, USA
- Nathan Lawlor
  - The Jackson Laboratory for Genomic Medicine, Farmington, 06032, CT, USA
- Duygu Ucar
  - The Jackson Laboratory for Genomic Medicine, Farmington, 06032, CT, USA
  - Department of Genetics and Genome Sciences, University of Connecticut Health Center, Farmington, 06030, CT, USA
  - Institute of Systems Genomics, University of Connecticut Health Center, Farmington, 06030, CT, USA
162
Ye CJ, Chen J, Villani AC, Gate RE, Subramaniam M, Bhangale T, Lee MN, Raj T, Raychowdhury R, Li W, Rogel N, Simmons S, Imboywa SH, Chipendo PI, McCabe C, Lee MH, Frohlich IY, Stranger BE, De Jager PL, Regev A, Behrens T, Hacohen N. Genetic analysis of isoform usage in the human anti-viral response reveals influenza-specific regulation of ERAP2 transcripts under balancing selection. Genome Res 2018; 28:1812-1825. [PMID: 30446528 PMCID: PMC6280757 DOI: 10.1101/gr.240390.118] [Citation(s) in RCA: 55] [Impact Index Per Article: 7.9] [Received: 06/08/2018] [Accepted: 10/09/2018] [Indexed: 02/02/2023]
Abstract
While genetic variants are known to be associated with overall gene abundance in stimulated immune cells, less is known about their effects on alternative isoform usage. By analyzing RNA-seq profiles of monocyte-derived dendritic cells from 243 individuals, we uncovered thousands of unannotated isoforms synthesized in response to influenza infection and type 1 interferon stimulation. We identified more than a thousand quantitative trait loci (QTLs) associated with alternate isoform usage (isoQTLs), many of which are independent of expression QTLs (eQTLs) for the same gene. Compared with eQTLs, isoQTLs are enriched for splice sites and untranslated regions, but depleted of sequences upstream of annotated transcription start sites. Both eQTLs and isoQTLs explain a significant proportion of the disease heritability attributed to common genetic variants. At the ERAP2 locus, we shed light on the function of the gene and how two frequent, highly differentiated haplotypes with intermediate frequencies could be maintained by balancing selection. At baseline and following type 1 interferon stimulation, the major haplotype is associated with low ERAP2 expression caused by nonsense-mediated decay, while the minor haplotype, known to increase Crohn's disease risk, is associated with high ERAP2 expression. In response to influenza infection, we found two uncharacterized isoforms expressed from the major haplotype, likely the result of multiple perfectly linked variants affecting the transcription and splicing at the locus. Thus, genetic variants at a single locus could modulate independent gene regulatory processes in innate immune responses and, in the case of ERAP2, may confer a historical fitness advantage in response to virus.
Affiliation(s)
- Chun Jimmie Ye
  - Institute for Human Genetics, Institute for Health and Computational Sciences, Department of Biostatistics and Epidemiology, Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, California 94143, USA
- Jenny Chen
  - Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA
  - Division of Health Sciences and Technology, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA
- Alexandra-Chloé Villani
  - Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA
  - Department of Medicine, Massachusetts General Hospital Cancer Center, Boston, Massachusetts 02114, USA
- Rachel E Gate
  - Institute for Human Genetics, Institute for Health and Computational Sciences, Department of Biostatistics and Epidemiology, Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, California 94143, USA
  - Biomedical Informatics Program, University of California, San Francisco, California 94143, USA
- Meena Subramaniam
  - Institute for Human Genetics, Institute for Health and Computational Sciences, Department of Biostatistics and Epidemiology, Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, California 94143, USA
  - Biomedical Informatics Program, University of California, San Francisco, California 94143, USA
- Tushar Bhangale
  - Genentech Incorporated, South San Francisco, California 94080, USA
- Mark N Lee
  - Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA
  - Department of Medicine, Massachusetts General Hospital Cancer Center, Boston, Massachusetts 02114, USA
  - Harvard Medical School, Boston, Massachusetts 02116, USA
- Towfique Raj
  - Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA
  - Harvard Medical School, Boston, Massachusetts 02116, USA
  - Departments of Neurology and Psychiatry, Brigham and Women's Hospital, Boston, Massachusetts 02115, USA
- Weibo Li
  - Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA
- Noga Rogel
  - Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA
- Sean Simmons
  - Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA
- Cristin McCabe
  - Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA
  - Departments of Neurology and Psychiatry, Brigham and Women's Hospital, Boston, Massachusetts 02115, USA
- Michelle H Lee
  - Harvard Medical School, Boston, Massachusetts 02116, USA
- Barbara E Stranger
  - Section of Genetic Medicine, Department of Medicine, Institute for Genomics and Systems Biology, Center for Data Intensive Science, The University of Chicago, Chicago, Illinois 60637, USA
- Philip L De Jager
  - Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA
  - Harvard Medical School, Boston, Massachusetts 02116, USA
  - Departments of Neurology and Psychiatry, Brigham and Women's Hospital, Boston, Massachusetts 02115, USA
- Aviv Regev
  - Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA
  - Department of Biology, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA
  - Howard Hughes Medical Institute, Chevy Chase, Maryland 20815, USA
- Tim Behrens
  - Genentech Incorporated, South San Francisco, California 94080, USA
- Nir Hacohen
  - Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA
  - Department of Medicine, Massachusetts General Hospital Cancer Center, Boston, Massachusetts 02114, USA
|
163
|
Westoby J, Herrera MS, Ferguson-Smith AC, Hemberg M. Simulation-based benchmarking of isoform quantification in single-cell RNA-seq. Genome Biol 2018; 19:191. [PMID: 30404663 PMCID: PMC6223048 DOI: 10.1186/s13059-018-1571-5] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.3] [Received: 01/16/2018] [Accepted: 10/19/2018] [Indexed: 11/18/2022]
Abstract
Single-cell RNA-seq has the potential to facilitate isoform quantification as the confounding factor of a mixed population of cells is eliminated. However, best practice for using existing quantification methods has not been established. We carry out a benchmark for five popular isoform quantification tools. Performance is generally good for simulated data based on SMARTer and SMART-seq2 data. The reduction in performance compared with bulk RNA-seq is small. An important biological insight comes from our analysis of real data which shows that genes that express two isoforms in bulk RNA-seq predominantly express one or neither isoform in individual cells.
Affiliation(s)
- Jennifer Westoby
  - Department of Genetics, University of Cambridge, Downing Street, Cambridge, CB2 3EH UK
- Marcela Sjöberg Herrera
  - Departamento de Biología Celular y Molecular, Facultad de Ciencias Biológicas, Pontificia Universidad Católica de Chile, Av. Libertador Bernardo O’Higgins 340, 8331150 Santiago, Chile
- Martin Hemberg
  - Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SA UK
|
164
|
Rigaill G, Balzergue S, Brunaud V, Blondet E, Rau A, Rogier O, Caius J, Maugis-Rabusseau C, Soubigou-Taconnat L, Aubourg S, Lurin C, Martin-Magniette ML, Delannoy E. Synthetic data sets for the identification of key ingredients for RNA-seq differential analysis. Brief Bioinform 2018; 19:65-76. [PMID: 27742662 DOI: 10.1093/bib/bbw092] [Citation(s) in RCA: 23] [Impact Index Per Article: 3.3] [Received: 02/03/2016] [Indexed: 12/16/2022]
Abstract
Numerous statistical pipelines are now available for the differential analysis of gene expression measured with RNA-sequencing technology. Most of them are based on similar statistical frameworks after normalization, differing primarily in the choice of data distribution, mean and variance estimation strategy and data filtering. We propose an evaluation of the impact of these choices when few biological replicates are available through the use of synthetic data sets. This framework is based on real data sets and allows the exploration of various scenarios differing in the proportion of non-differentially expressed genes. Hence, it provides an evaluation of the key ingredients of the differential analysis, free of the biases associated with the simulation of data using parametric models. Our results show the relevance of a proper modeling of the mean by using linear or generalized linear modeling. Once the mean is properly modeled, the impact of the other parameters on the performance of the test is much less important. Finally, we propose to use the simple visualization of the raw P-value histogram as a practical evaluation criterion of the performance of differential analysis methods on real data sets.
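The raw P-value histogram mentioned at the end of this abstract can be illustrated with a minimal sketch. The code below is not the authors' implementation; the function names, the bin count, and the simulated mixture of null and differentially expressed genes are all assumptions made for illustration. It relies only on the standard observation that P-values from a well-calibrated test are roughly uniform for non-differential genes, with an excess near zero when true differences exist.

```python
import random


def pvalue_histogram(pvalues, n_bins=20):
    """Count raw p-values in n_bins equal-width bins covering [0, 1]."""
    counts = [0] * n_bins
    for p in pvalues:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"p-value out of range: {p}")
        # A p-value of exactly 1.0 is assigned to the last bin.
        counts[min(int(p * n_bins), n_bins - 1)] += 1
    return counts


def render_histogram(counts, width=40):
    """Return a plain-text bar chart, one line per bin."""
    n_bins = len(counts)
    peak = max(counts) or 1
    lines = []
    for i, c in enumerate(counts):
        bar = "#" * round(width * c / peak)
        lines.append(f"[{i / n_bins:.2f}, {(i + 1) / n_bins:.2f}) {c:6d} {bar}")
    return "\n".join(lines)


# Hypothetical input: 9000 null genes (uniform p-values) plus 1000
# differentially expressed genes (p-values concentrated near zero).
rng = random.Random(0)
simulated = [rng.random() for _ in range(9000)]
simulated += [rng.betavariate(0.5, 10.0) for _ in range(1000)]
print(render_histogram(pvalue_histogram(simulated)))
```

A flat histogram with a single peak in the first bins is the pattern one would hope to see; a hump in the middle or a peak near 1 would suggest that the model for the test is poorly suited to the data.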
|
165
|
Sterne-Weiler T, Weatheritt RJ, Best AJ, Ha KC, Blencowe BJ. Efficient and Accurate Quantitative Profiling of Alternative Splicing Patterns of Any Complexity on a Laptop. Mol Cell 2018; 72:187-200.e6. [DOI: 10.1016/j.molcel.2018.08.018] [Citation(s) in RCA: 84] [Impact Index Per Article: 12.0] [Received: 04/13/2018] [Revised: 06/24/2018] [Accepted: 08/09/2018] [Indexed: 01/08/2023]
|
166
|
Event Analysis: Using Transcript Events To Improve Estimates of Abundance in RNA-seq Data. G3-GENES GENOMES GENETICS 2018; 8:2923-2940. [PMID: 30021829 PMCID: PMC6118309 DOI: 10.1534/g3.118.200373] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Indexed: 02/07/2023]
Abstract
Alternative splicing leverages genomic content by allowing the synthesis of multiple transcripts and, by implication, protein isoforms, from a single gene. However, estimating the abundance of transcripts produced in a given tissue from short sequencing reads is difficult and can result in both the construction of transcripts that do not exist, and the failure to identify true transcripts. An alternative approach is to catalog the events that make up isoforms (splice junctions and exons). We present here the Event Analysis (EA) approach, where we project transcripts onto the genome and identify overlapping/unique regions and junctions. In addition, all possible logical junctions are assembled into a catalog. Transcripts are filtered before quantitation based on simple measures: the proportion of the events detected, and the coverage. We find that mapping to a junction catalog is more efficient at detecting novel junctions than mapping in a splice-aware manner. We identify 99.8% of true transcripts while iReckon identifies 82% of the true transcripts and creates more transcripts not included in the simulation than were initially used in the simulation. Using PacBio Iso-seq data from a mouse neural progenitor cell model, EA detects 60% of the novel junctions that are combinations of existing exons while only 43% are detected by STAR. EA further detects ∼5,000 annotated junctions missed by STAR. Filtering transcripts based on the proportion of the transcript detected and the number of reads on average supporting that transcript captures 95% of the PacBio transcriptome. Filtering the reference transcriptome before quantitation results in a more stable estimate of isoform abundance, with improved correlation between replicates. This was particularly evident when EA is applied to an RNA-seq study of type 1 diabetes (T1D), where the coefficient of variation among subjects (n = 81) in the transcript abundance estimates was substantially reduced compared to the estimation using the full reference. EA focuses on individual transcriptional events. These events can be quantitated and analyzed directly or used to identify the probable set of expressed transcripts. Simple rules based on detected events and coverage used in filtering result in a dramatic improvement in isoform estimation without the use of ancillary data (e.g., ChIP, long reads) that may not be available for many studies.
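The filtering rule described in this abstract (keep a transcript only if enough of its events are detected and its coverage is adequate) can be sketched as follows. This is an illustration under stated assumptions, not the EA software: the `TranscriptEvidence` record, the function names, and both default thresholds are hypothetical, since the abstract does not give the cutoffs used.

```python
from dataclasses import dataclass


@dataclass
class TranscriptEvidence:
    transcript_id: str
    events_total: int      # exonic regions and junctions annotated for the transcript
    events_detected: int   # events with at least one supporting read
    mean_coverage: float   # average read support over the transcript's events


def keep_transcript(t, min_prop_detected=0.75, min_mean_coverage=5.0):
    """Apply the two simple measures; both thresholds are hypothetical."""
    if t.events_total == 0:
        return False
    proportion = t.events_detected / t.events_total
    return proportion >= min_prop_detected and t.mean_coverage >= min_mean_coverage


def filter_reference(transcripts, **thresholds):
    """Return the IDs of transcripts retained for quantitation."""
    return [t.transcript_id for t in transcripts if keep_transcript(t, **thresholds)]


# Hypothetical evidence table for four annotated transcripts.
evidence = [
    TranscriptEvidence("tx_a", events_total=10, events_detected=9, mean_coverage=12.0),
    TranscriptEvidence("tx_b", events_total=10, events_detected=4, mean_coverage=30.0),
    TranscriptEvidence("tx_c", events_total=8, events_detected=8, mean_coverage=1.5),
    TranscriptEvidence("tx_d", events_total=0, events_detected=0, mean_coverage=0.0),
]
print(filter_reference(evidence))
```

With the default thresholds only `tx_a` survives: `tx_b` has too few detected events and `tx_c` has too little coverage. The reduced list would then serve as the reference for quantitation in place of the full annotation.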
|
167
|
Abstract
Single-cell RNAseq and alternative splicing studies have recently become two of the most prominent applications of RNAseq. However, the combination of both is still challenging, and few research efforts have been dedicated to the intersection between them. Cell-level insight on isoform expression is required to fully understand the biology of alternative splicing, but it is still an open question to what extent isoform expression analysis at the single-cell level is actually feasible. Here, we establish a set of four conditions that are required for a successful single-cell-level isoform study and evaluate how these conditions are met by these technologies in published research.
Affiliation(s)
- Ángeles Arzalluz-Luque
  - Genomics of Gene Expression Laboratory, Centro de Investigación Principe Felipe (CIPF), 46012, Valencia, Spain
- Ana Conesa
  - Genomics of Gene Expression Laboratory, Centro de Investigación Principe Felipe (CIPF), 46012, Valencia, Spain
  - Department of Microbiology and Cell Science, Institute for Food and Agricultural Sciences, Genetics Institute, University of Florida, Gainesville, Florida, 32611, USA
|
168
|
Quinn TP, Crowley TM, Richardson MF. Benchmarking differential expression analysis tools for RNA-Seq: normalization-based vs. log-ratio transformation-based methods. BMC Bioinformatics 2018; 19:274. [PMID: 30021534 PMCID: PMC6052553 DOI: 10.1186/s12859-018-2261-8] [Citation(s) in RCA: 42] [Impact Index Per Article: 6.0] [Received: 12/11/2017] [Accepted: 06/25/2018] [Indexed: 12/20/2022]
Abstract
BACKGROUND: Count data generated by next-generation sequencing assays do not measure absolute transcript abundances. Instead, the data are constrained to an arbitrary "library size" by the sequencing depth of the assay, and typically must be normalized prior to statistical analysis. The constrained nature of these data means one could alternatively use a log-ratio transformation in lieu of normalization, as often done when testing for differential abundance (DA) of operational taxonomic units (OTUs) in 16S rRNA data. Therefore, we benchmark how well the ALDEx2 package, a transformation-based DA tool, detects differential expression in high-throughput RNA-sequencing data (RNA-Seq), compared to conventional RNA-Seq methods such as edgeR and DESeq2. RESULTS: To evaluate the performance of log-ratio transformation-based tools, we apply the ALDEx2 package to two simulated, and two real, RNA-Seq data sets. One of the latter was previously used to benchmark dozens of conventional RNA-Seq differential expression methods, enabling us to directly compare transformation-based approaches. We show that ALDEx2, widely used in meta-genomics research, identifies differentially expressed genes (and transcripts) from RNA-Seq data with high precision and, given sufficient sample sizes, high recall too (regardless of the alignment and quantification procedure used). Although we show that the choice in log-ratio transformation can affect performance, ALDEx2 has high precision (i.e., few false positives) across all transformations. Finally, we present a novel, iterative log-ratio transformation (now implemented in ALDEx2) that further improves performance in simulations. CONCLUSIONS: Our results suggest that log-ratio transformation-based methods can work to measure differential expression from RNA-Seq data, provided that certain assumptions are met. Moreover, these methods have very high precision (i.e., few false positives) in simulations and perform well on real data too. With previously demonstrated applicability to 16S rRNA data, ALDEx2 can thus serve as a single tool for data from multiple sequencing modalities.
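To make the idea of a log-ratio transformation concrete, the sketch below computes a plain centred log-ratio (CLR) for one sample. It is not ALDEx2's implementation, which the abstract says supports several transformations; the function name and the pseudocount default are assumptions. The point it illustrates is that log-ratios depend only on relative abundances, so the arbitrary library size cancels out.

```python
import math


def clr(counts, pseudocount=0.5):
    """Centred log-ratio: log2 of each value minus the mean log2 of the sample.

    The pseudocount (a hypothetical default) avoids log(0) for zero counts.
    """
    shifted = [c + pseudocount for c in counts]
    if any(x <= 0 for x in shifted):
        raise ValueError("all values must be positive after adding the pseudocount")
    logs = [math.log2(x) for x in shifted]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]


# The same relative abundances sequenced at a tenfold different depth.
shallow = clr([2, 4, 8], pseudocount=0)
deep = clr([20, 40, 80], pseudocount=0)
print(shallow)
print(deep)
```

Without a pseudocount the two samples give the same transformed values up to floating-point error, and the values of each sample sum to zero. A nonzero pseudocount breaks exact scale invariance slightly, which is one reason the choice of transformation can matter in practice.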
Affiliation(s)
- Thomas P. Quinn
  - Centre for Molecular and Medical Research, School of Medicine, Deakin University, Geelong, 3220 Australia
  - Bioinformatics Core Research Group, Deakin University, Geelong, 3220 Australia
- Tamsyn M. Crowley
  - Centre for Molecular and Medical Research, School of Medicine, Deakin University, Geelong, 3220 Australia
  - Bioinformatics Core Research Group, Deakin University, Geelong, 3220 Australia
  - Poultry Hub Australia, University of New England, Armidale, 2351 Australia
- Mark F. Richardson
  - Bioinformatics Core Research Group, Deakin University, Geelong, 3220 Australia
  - Centre for Integrative Ecology, School of Life and Environmental Science, Deakin University, Geelong, 3220 Australia
|
169
|
Merleev AA, Marusina AI, Ma C, Elder JT, Tsoi LC, Raychaudhuri SP, Weidinger S, Wang EA, Adamopoulos IE, Luxardi G, Gudjonsson JE, Shimoda M, Maverakis E. Meta-analysis of RNA sequencing datasets reveals an association between TRAJ23, psoriasis, and IL-17A. JCI Insight 2018; 3:120682. [PMID: 29997305 DOI: 10.1172/jci.insight.120682] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.6] [Received: 02/26/2018] [Accepted: 05/23/2018] [Indexed: 12/20/2022]
Abstract
Numerous studies of relatively few patients have linked T cell receptor (TCR) genes to psoriasis but have yielded dramatically conflicting results. To resolve these discrepancies, we have chosen to mine RNA-Seq datasets for patterns of TCR gene segment usage in psoriasis. A meta-analysis of 3 existing and 1 unpublished datasets revealed a statistically significant link between the relative expression of TRAJ23 and psoriasis and the psoriasis-associated cytokine IL-17A. TRGV5, a TCR-γ segment, was also associated with psoriasis but correlated instead with IL-36A, other IL-36 family members, and IL-17C (not IL-17A). In contrast, TRAJ39 was strongly associated with healthy skin. T cell diversity measurements and analysis of CDR3 sequences were also conducted, revealing no psoriasis-associated public CDR3 sequences. Finally, in comparison with the expression of TCR-αβ genes, the expression of TCR-γδ genes was relatively low but mildly elevated in psoriatic skin. These results have implications for the development of targeted therapies for psoriasis and other autoimmune diseases. Also, the techniques employed in this study have applications in other fields, such as cancer immunology and infectious disease.
Affiliation(s)
- Alexander A Merleev
  - Department of Dermatology, School of Medicine, UCD, Sacramento, California, USA
- Alina I Marusina
  - Department of Dermatology, School of Medicine, UCD, Sacramento, California, USA
- Chelsea Ma
  - Department of Dermatology, School of Medicine, UCD, Sacramento, California, USA
- James T Elder
  - Department of Dermatology, University of Michigan, Ann Arbor, Michigan, USA
- Lam C Tsoi
  - Department of Dermatology, University of Michigan, Ann Arbor, Michigan, USA
- Siba P Raychaudhuri
  - Department of Internal Medicine, Division of Rheumatology, Allergy & Clinical immunology, UCD School of Medicine, Sacramento, California, USA
  - VA Medical Center Sacramento, Division of Rheumatology & Immunology, Sacramento, California, USA
- Stephan Weidinger
  - Department of Dermatology and Allergy, University Hospital Schleswig-Holstein, Kiel, Germany
- Elizabeth A Wang
  - Department of Dermatology, School of Medicine, UCD, Sacramento, California, USA
- Iannis E Adamopoulos
  - Department of Internal Medicine, Division of Rheumatology, Allergy & Clinical immunology, UCD School of Medicine, Sacramento, California, USA
- Guillaume Luxardi
  - Department of Dermatology, School of Medicine, UCD, Sacramento, California, USA
- Michiko Shimoda
  - Department of Dermatology, School of Medicine, UCD, Sacramento, California, USA
- Emanual Maverakis
  - Department of Dermatology, School of Medicine, UCD, Sacramento, California, USA
  - Department of Medical Microbiology and Immunology, School of Medicine, UCD, California, USA
|
170
|
Love MI, Soneson C, Patro R. Swimming downstream: statistical analysis of differential transcript usage following Salmon quantification. F1000Res 2018; 7:952. [PMID: 30356428 PMCID: PMC6178912 DOI: 10.12688/f1000research.15398.3] [Citation(s) in RCA: 41] [Impact Index Per Article: 5.9] [Accepted: 09/27/2018] [Indexed: 12/30/2022]
Abstract
Detection of differential transcript usage (DTU) from RNA-seq data is an important bioinformatic analysis that complements differential gene expression analysis. Here we present a simple workflow using a set of existing R/Bioconductor packages for analysis of DTU. We show how these packages can be used downstream of RNA-seq quantification using the Salmon software package. The entire pipeline is fast, benefiting from inference steps by Salmon to quantify expression at the transcript level. The workflow includes live, runnable code chunks for analysis using DRIMSeq and DEXSeq, as well as for performing two-stage testing of DTU using the stageR package, a statistical framework to screen at the gene level and then confirm which transcripts within the significant genes show evidence of DTU. We evaluate these packages and other related packages on a simulated dataset with parameters estimated from real data.
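The quantity underlying DTU is the share of a gene's expression contributed by each of its transcripts. The sketch below computes those within-gene proportions from transcript-level counts; it is written in Python for illustration and is not the R/Bioconductor workflow the abstract describes. The function name, the `tx2gene` mapping, and the example counts are hypothetical, and no statistical test is performed.

```python
from collections import defaultdict


def transcript_proportions(counts, tx2gene):
    """Divide each transcript's count by the total count of its gene."""
    gene_totals = defaultdict(float)
    for tx, c in counts.items():
        gene_totals[tx2gene[tx]] += c
    proportions = {}
    for tx, c in counts.items():
        total = gene_totals[tx2gene[tx]]
        # Genes with no reads have undefined proportions.
        proportions[tx] = c / total if total > 0 else float("nan")
    return proportions


# Hypothetical transcript-to-gene map and counts for two conditions.
tx2gene = {"tx1": "geneA", "tx2": "geneA", "tx3": "geneB"}
control = {"tx1": 30, "tx2": 10, "tx3": 5}
treated = {"tx1": 40, "tx2": 40, "tx3": 5}

print(transcript_proportions(control, tx2gene))
print(transcript_proportions(treated, tx2gene))
```

In this toy example the usage of `tx1` within `geneA` drops from 0.75 to 0.50 between conditions, while `geneB` has a single transcript and cannot show DTU. Deciding whether such a shift is significant across replicates is the job of the testing packages named in the abstract.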
Affiliation(s)
- Michael I. Love
  - Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC, 27516, USA
  - Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC, 27516, USA
- Charlotte Soneson
  - Institute of Molecular Life Sciences, University of Zurich, Zurich, Switzerland
  - SIB Swiss Institute of Bioinformatics, Zurich, Switzerland
- Rob Patro
  - Department of Computer Science, Stony Brook University, Stony Brook, NY, 11794, USA
|
171
|
Love MI, Soneson C, Patro R. Swimming downstream: statistical analysis of differential transcript usage following Salmon quantification. F1000Res 2018; 7:952. [PMID: 30356428 PMCID: PMC6178912 DOI: 10.12688/f1000research.15398.1] [Citation(s) in RCA: 71] [Impact Index Per Article: 10.1] [Accepted: 06/22/2018] [Indexed: 12/25/2022]
|
172
|
Love MI, Soneson C, Patro R. Swimming downstream: statistical analysis of differential transcript usage following Salmon quantification. F1000Res 2018; 7:952. [PMID: 30356428 PMCID: PMC6178912 DOI: 10.12688/f1000research.15398.2] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Accepted: 09/10/2018] [Indexed: 09/29/2023]
|
173
|
Robinson S, Nevalainen J, Pinna G, Campalans A, Radicella JP, Guyon L. Incorporating interaction networks into the determination of functionally related hit genes in genomic experiments with Markov random fields. Bioinformatics 2018; 33:i170-i179. [PMID: 28881978 PMCID: PMC5870666 DOI: 10.1093/bioinformatics/btx244] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.6] [Indexed: 12/24/2022]
Abstract
Motivation: Incorporating gene interaction data into the identification of ‘hit’ genes in genomic experiments is a well-established approach leveraging the ‘guilt by association’ assumption to obtain a network-based hit list of functionally related genes. We aim to develop a method to allow for multivariate gene scores and multiple hit labels in order to extend the analysis of genomic screening data within such an approach. Results: We propose a Markov random field-based method to achieve our aim and show that the particular advantages of our method compared with those currently used lead to new insights in previously analysed data as well as for our own motivating data. Our method additionally achieves the best performance in an independent simulation experiment. The real data applications we consider comprise a survival analysis and differential expression experiment and a cell-based RNA interference functional screen. Availability and implementation: We provide all of the data and code related to the results in the paper. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Sean Robinson
  - CEA, BIG, Biologie à Grande Echelle, F-38054 Grenoble, France
  - Université Grenoble-Alpes, F-38000 Grenoble, France
  - INSERM, U1038, F-38054 Grenoble, France
  - Department of Mathematics and Statistics, University of Turku, Turku, Finland
  - Industrial Biotechnology, VTT Technical Research Centre of Finland, Turku, Finland
- Jaakko Nevalainen
  - Department of Mathematics and Statistics, University of Turku, Turku, Finland
  - School of Health Sciences, University of Tampere, Tampere, Finland
- Guillaume Pinna
  - Plateforme ARN Interférence (PArI), DSV/ISVFJ/SBIGEM/UMR 9198 I2BC, CEA Saclay, Gif-sur-Yvette, France
- Anna Campalans
  - Institute of Molecular and Cellular Radiobiology, CEA, Fontenay-aux-Roses, France
  - INSERM, U967, Fontenay-aux-Roses, France
  - Université Paris Diderot, U967, Fontenay-aux-Roses, France
  - Université Paris Sud, U967, Fontenay-aux-Roses, France
- J Pablo Radicella
  - Institute of Molecular and Cellular Radiobiology, CEA, Fontenay-aux-Roses, France
  - INSERM, U967, Fontenay-aux-Roses, France
  - Université Paris Diderot, U967, Fontenay-aux-Roses, France
  - Université Paris Sud, U967, Fontenay-aux-Roses, France
- Laurent Guyon
  - CEA, BIG, Biologie à Grande Echelle, F-38054 Grenoble, France
  - Université Grenoble-Alpes, F-38000 Grenoble, France
  - INSERM, U1038, F-38054 Grenoble, France
|
174
|
Li M, Xie X, Zhou J, Sheng M, Yin X, Ko EA, Zhou T, Gu W. Quantifying circular RNA expression from RNA-seq data using model-based framework. Bioinformatics 2018; 33:2131-2139. [PMID: 28334396 DOI: 10.1093/bioinformatics/btx129] [Citation(s) in RCA: 51] [Impact Index Per Article: 7.3] [Received: 06/28/2016] [Accepted: 03/07/2017] [Indexed: 11/13/2022]
Abstract
Motivation: Circular RNAs (circRNAs) are a class of non-coding RNAs that are widely expressed in various cell lines and tissues of many organisms. Although the exact function of many circRNAs is largely unknown, the cell type- and tissue-specific circRNA expression has implicated their crucial functions in many biological processes. Hence, the quantification of circRNA expression from high-throughput RNA-seq data is becoming important to ascertain. Although many model-based methods have been developed to quantify linear RNA expression from RNA-seq data, these methods are not applicable to circRNA quantification. Results: Here, we proposed a novel strategy that transforms circular transcripts to pseudo-linear transcripts and estimates the expression values of both circular and linear transcripts using an existing model-based algorithm, Sailfish. The new strategy can accurately estimate transcript expression of both linear and circular transcripts from RNA-seq data. Several factors, such as gene length, amount of expression and the ratio of circular to linear transcripts, had impacts on quantification performance of circular transcripts. In comparison to count-based tools, the new computational framework had superior performance in estimating the amount of circRNA expression from both simulated and real ribosomal RNA-depleted (rRNA-depleted) RNA-seq datasets. On the other hand, the consideration of circular transcripts in expression quantification from rRNA-depleted RNA-seq data showed substantially increased accuracy of linear transcript expression. Our proposed strategy was implemented in a program named Sailfish-cir. Availability and Implementation: Sailfish-cir is freely available at https://github.com/zerodel/Sailfish-cir. Contact: tongz@medicine.nevada.edu or wanjun.gu@gmail.com. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Musheng Li
- State Key Laboratory of Bioelectronics, School of Biological Sciences and Medical Engineering, Southeast University, Nanjing, Jiangsu, China
| | - Xueying Xie
- State Key Laboratory of Bioelectronics, School of Biological Sciences and Medical Engineering, Southeast University, Nanjing, Jiangsu, China
| | - Jing Zhou
- Research Center for Learning Sciences, Southeast University, Nanjing, Jiangsu, China
| | - Mengying Sheng
- State Key Laboratory of Bioelectronics, School of Biological Sciences and Medical Engineering, Southeast University, Nanjing, Jiangsu, China
| | - Xiaofeng Yin
- Department of Orthopedics and Trauma, Peking University People's Hospital, Beijing, China
| | - Eun-A Ko
- Department of Physiology and Cell Biology, The University of Nevada School of Medicine, Reno, NV, USA
| | - Tong Zhou
- Department of Physiology and Cell Biology, The University of Nevada School of Medicine, Reno, NV, USA
| | - Wanjun Gu
- State Key Laboratory of Bioelectronics, School of Biological Sciences and Medical Engineering, Southeast University, Nanjing, Jiangsu, China
|
175
|
Ha KCH, Blencowe BJ, Morris Q. QAPA: a new method for the systematic analysis of alternative polyadenylation from RNA-seq data. Genome Biol 2018; 19:45. [PMID: 29592814 PMCID: PMC5874996 DOI: 10.1186/s13059-018-1414-4] [Citation(s) in RCA: 122] [Impact Index Per Article: 17.4] [Received: 06/30/2017] [Accepted: 02/28/2018] [Indexed: 12/21/2022] Open
Abstract
Alternative polyadenylation (APA) affects most mammalian genes. The genome-wide investigation of APA has been hampered by an inability to reliably profile it using conventional RNA-seq. We describe 'Quantification of APA' (QAPA), a method that infers APA from conventional RNA-seq data. QAPA is faster and more sensitive than other methods. Application of QAPA reveals discrete, temporally coordinated APA programs during neurogenesis and that there is little overlap between genes regulated by alternative splicing and those by APA. Modeling of these data uncovers an APA sequence code. QAPA thus enables the discovery and characterization of programs of regulated APA using conventional RNA-seq.
Affiliation(s)
- Kevin C H Ha
- Department of Molecular Genetics, University of Toronto, 1 King's College Circle, Toronto, ON, M5A 1A8, Canada; Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, 160 College Street, Toronto, ON, M5S 3E1, Canada
| | - Benjamin J Blencowe
- Department of Molecular Genetics, University of Toronto, 1 King's College Circle, Toronto, ON, M5A 1A8, Canada; Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, 160 College Street, Toronto, ON, M5S 3E1, Canada
| | - Quaid Morris
- Department of Molecular Genetics, University of Toronto, 1 King's College Circle, Toronto, ON, M5A 1A8, Canada; Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, 160 College Street, Toronto, ON, M5S 3E1, Canada; Department of Computer Science, University of Toronto, 10 King's College Road, Toronto, ON, M5S 3G4, Canada; Vector Institute, 661 University Avenue, Toronto, ON, M5G 1M1, Canada
|
176
|
Simion P, Belkhir K, François C, Veyssier J, Rink JC, Manuel M, Philippe H, Telford MJ. A software tool 'CroCo' detects pervasive cross-species contamination in next generation sequencing data. BMC Biol 2018; 16:28. [PMID: 29506533 PMCID: PMC5838952 DOI: 10.1186/s12915-018-0486-7] [Citation(s) in RCA: 59] [Impact Index Per Article: 8.4] [Received: 08/17/2017] [Accepted: 01/11/2018] [Indexed: 01/20/2023] Open
Abstract
Background: Multiple RNA samples are frequently processed together and often mixed before multiplex sequencing in the same sequencing run. While different samples can be separated post sequencing using sample barcodes, the possibility of cross contamination between biological samples from different species that have been processed or sequenced in parallel has the potential to be extremely deleterious for downstream analyses. Results: We present CroCo, a software package for identifying and removing such cross contaminants from assembled transcriptomes. Using multiple, recently published sequence datasets, we show that cross contamination is consistently present at varying levels in real data. Using real and simulated data, we demonstrate that CroCo detects contaminants efficiently and correctly. Using a real example from a molecular phylogenetic dataset, we show that contaminants, if not eliminated, can have a decisive, deleterious impact on downstream comparative analyses. Conclusions: Cross contamination is pervasive in new and published datasets and, if undetected, can have serious deleterious effects on downstream analyses. CroCo is a database-independent, multi-platform tool, designed for ease of use, that efficiently and accurately detects and removes cross contamination in assembled transcriptomes to avoid these problems. We suggest that the use of CroCo should become a standard cleaning step when processing multiple samples for transcriptome sequencing. Electronic supplementary material: The online version of this article (10.1186/s12915-018-0486-7) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Paul Simion
- Institut des Sciences de l'Evolution (ISEM), UMR 5554, CNRS, IRD, EPHE, Université de Montpellier, Montpellier, France; Sorbonne Université, CNRS, Institut de Biologie Paris-Seine (IBPS), Evolution Paris-Seine (UMR7138), Case 05, 7 Quai St Bernard, 75005, Paris, France
| | - Khalid Belkhir
- Institut des Sciences de l'Evolution (ISEM), UMR 5554, CNRS, IRD, EPHE, Université de Montpellier, Montpellier, France
| | - Clémentine François
- Institut des Sciences de l'Evolution (ISEM), UMR 5554, CNRS, IRD, EPHE, Université de Montpellier, Montpellier, France
| | - Julien Veyssier
- Institut des Sciences de l'Evolution (ISEM), UMR 5554, CNRS, IRD, EPHE, Université de Montpellier, Montpellier, France
| | - Jochen C Rink
- Max Planck Institute of Molecular Cell Biology and Genetics, Pfotenhauerstrasse 108, 01307, Dresden, Germany
| | - Michaël Manuel
- Sorbonne Université, CNRS, Institut de Biologie Paris-Seine (IBPS), Evolution Paris-Seine (UMR7138), Case 05, 7 Quai St Bernard, 75005, Paris, France
| | - Hervé Philippe
- Centre de Théorisation et de Modélisation de la Biodiversité, Station d'Ecologie Théorique et Expérimentale, UMR CNRS 5321, Moulis, 09200, France; Département de Biochimie, Centre Robert-Cedergren, Université de Montréal, Montréal, H3C 3J7, Québec, Canada
| | - Maximilian J Telford
- Centre for Life's Origins and Evolution, Department of Genetics Evolution and Environment, University College London, Darwin Building, Gower Street, London, WC1E 6BT, UK.
|
177
|
Fang H, Huang YF, Radhakrishnan A, Siepel A, Lyon GJ, Schatz MC. Scikit-ribo Enables Accurate Estimation and Robust Modeling of Translation Dynamics at Codon Resolution. Cell Syst 2018; 6:180-191.e4. [PMID: 29361467 PMCID: PMC5832574 DOI: 10.1016/j.cels.2017.12.007] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.9] [Received: 07/07/2017] [Revised: 09/24/2017] [Accepted: 12/08/2017] [Indexed: 10/18/2022]
Abstract
Ribosome profiling (Ribo-seq) is a powerful technique for measuring protein translation; however, sampling errors and biological biases are prevalent and poorly understood. Addressing these issues, we present Scikit-ribo (https://github.com/schatzlab/scikit-ribo), an open-source analysis package for accurate genome-wide A-site prediction and translation efficiency (TE) estimation from Ribo-seq and RNA sequencing data. Scikit-ribo accurately identifies A-site locations and reproduces codon elongation rates using several digestion protocols (r = 0.99). Next, we show that the commonly used reads per kilobase of transcript per million mapped reads-derived TE estimation is prone to biases, especially for low-abundance genes. Scikit-ribo introduces a codon-level generalized linear model with ridge penalty that correctly estimates TE, while accommodating variable codon elongation rates and mRNA secondary structure. This corrects the TE errors for over 2,000 genes in S. cerevisiae, which we validate using mass spectrometry of protein abundances (r = 0.81), and allows us to determine the Kozak-like sequence directly from Ribo-seq. We conclude with an analysis of coverage requirements needed for robust codon-level analysis and quantify the artifacts that can occur from cycloheximide treatment.
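To make the abstract's point about RPKM-derived TE concrete, the sketch below computes translation efficiency as the ratio of Ribo-seq RPKM to RNA-seq RPKM and shows how a single extra read changes the estimate for a low-abundance gene. The function names and the read counts are illustrative assumptions; this is the simple ratio estimator the abstract criticizes, not Scikit-ribo's codon-level model.

```python
def rpkm(read_count: float, length_bp: float, total_mapped_reads: float) -> float:
    """Reads per kilobase of transcript per million mapped reads."""
    return read_count * 1e9 / (length_bp * total_mapped_reads)


def rpkm_te(ribo_count, rna_count, length_bp, ribo_total, rna_total) -> float:
    """Ratio-based translation efficiency: Ribo-seq RPKM over RNA-seq RPKM."""
    rna = rpkm(rna_count, length_bp, rna_total)
    if rna == 0:
        return float("nan")  # undefined without RNA-seq coverage
    return rpkm(ribo_count, length_bp, ribo_total) / rna


LIBRARY_SIZE = 10_000_000  # made-up library size, same for both assays

# High-abundance gene: one extra RNA-seq read barely moves the estimate.
te_high = rpkm_te(2000, 1000, 1500, LIBRARY_SIZE, LIBRARY_SIZE)
te_high_shifted = rpkm_te(2000, 1001, 1500, LIBRARY_SIZE, LIBRARY_SIZE)

# Low-abundance gene: one extra RNA-seq read halves the estimate.
te_low = rpkm_te(2, 1, 1500, LIBRARY_SIZE, LIBRARY_SIZE)
te_low_shifted = rpkm_te(2, 2, 1500, LIBRARY_SIZE, LIBRARY_SIZE)

print(te_high, te_high_shifted, te_low, te_low_shifted)
```

Both genes have the same nominal TE, but the low-count estimate is dominated by sampling noise, which is the kind of instability a penalized model is meant to dampen.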
Affiliation(s)
- Han Fang
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11724, USA; Department of Applied Mathematics & Statistics, Stony Brook University, Stony Brook, NY 11794, USA
| | - Yi-Fei Huang
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11724, USA
| | - Aditya Radhakrishnan
- Department of Molecular Biology and Genetics, Johns Hopkins University, Baltimore, MD 21205, USA
| | - Adam Siepel
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11724, USA
| | - Gholson J Lyon
- Stanley Institute for Cognitive Genomics, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11724, USA
| | - Michael C Schatz
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11724, USA; Departments of Computer Science and Biology, Johns Hopkins University, Baltimore, MD 21211, USA.
|
178
|
Liu R, Dickerson J. Strawberry: Fast and accurate genome-guided transcript reconstruction and quantification from RNA-Seq. PLoS Comput Biol 2017; 13:e1005851. [PMID: 29176847 PMCID: PMC5720828 DOI: 10.1371/journal.pcbi.1005851] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.8] [Received: 05/27/2017] [Revised: 12/07/2017] [Accepted: 10/26/2017] [Indexed: 12/14/2022] Open
Abstract
We propose a novel method and software tool, Strawberry, for transcript reconstruction and quantification from RNA-Seq data under the guidance of genome alignment and independent of gene annotation. Strawberry consists of two modules: assembly and quantification. The novelty of Strawberry is that the two modules use different optimization frameworks but utilize the same data graph structure, which allows a highly efficient, expandable and accurate algorithm for dealing with large data. The assembly module parses aligned reads into splicing graphs, and uses network flow algorithms to select the most likely transcripts. The quantification module uses a latent class model to assign read counts from the nodes of splicing graphs to transcripts. Strawberry simultaneously estimates the transcript abundances and corrects for sequencing bias through an EM algorithm. Based on simulations, Strawberry outperforms Cufflinks and StringTie in terms of both assembly and quantification accuracies. Under the evaluation of a real data set, the estimated transcript expression by Strawberry has the highest correlation with Nanostring probe counts, an independent experimental measure of transcript expression. Availability: Strawberry is written in C++14, and is available as open source software at https://github.com/ruolin/strawberry under the MIT license.
Transcript assembly and quantification are important bioinformatics applications of RNA-Seq. The difficulty of solving these problems arises from the ambiguity in assigning reads uniquely to isoforms. This challenge is twofold: statistically, it requires a high-dimensional mixture model, and computationally, it needs to process datasets that commonly consist of tens of millions of reads. Existing algorithms either use very complex models that are too slow or use no model and rely on heuristics, and are thus less accurate. Strawberry seeks to achieve a good balance between model complexity and speed. Strawberry effectively leverages a graph-based algorithm to utilize all possible information from paired-end reads and, to our knowledge, is the first to apply a flow network algorithm to the constrained assembly problem. We are also the first to formulate the quantification problem as a latent class model. All of these features not only lead to a more flexible and complex quantification model but also yield software that is easier to maintain and extend. In this method paper, we have shown that the Strawberry method is novel, accurate, fast and scalable using both simulated data and real data.
Affiliation(s)
- Ruolin Liu
- Department of Electrical and Computational Engineering, Iowa State University, Ames, Iowa, United States of America
| | - Julie Dickerson
- Department of Electrical and Computational Engineering, Iowa State University, Ames, Iowa, United States of America
|
179
|
Systematic Identification and Molecular Characteristics of Long Noncoding RNAs in Pig Tissues. BIOMED RESEARCH INTERNATIONAL 2017; 2017:6152582. [PMID: 29062838 PMCID: PMC5618743 DOI: 10.1155/2017/6152582] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.4] [Received: 04/19/2017] [Revised: 07/26/2017] [Accepted: 08/08/2017] [Indexed: 12/15/2022]
Abstract
Long noncoding RNAs (lncRNAs) are non-protein-coding RNAs that are involved in a variety of biological processes. The pig is an important farm animal and an ideal biomedical model. In this study, we performed a genome-wide scan for lncRNAs in multiple tissue types from pigs. A total of 118 million paired-end 90 nt clean reads were obtained via strand-specific RNA sequencing, 80.4% of which were aligned to the pig reference genome. We developed a stringent bioinformatics pipeline to identify 2,139 high-quality multiexonic lncRNAs. The characteristic analysis revealed that the novel lncRNAs showed relatively shorter transcript length, fewer exons, and lower expression levels in comparison with protein-coding genes (PCGs). The guanine-cytosine (GC) content of the protein-coding exons and introns was significantly higher than that of the lncRNAs. Moreover, the single nucleotide polymorphism (SNP) density of lncRNAs was significantly higher than that of PCGs. Conservation analysis revealed that most lncRNAs were evolutionarily conserved among pigs, humans, and mice, such as CUFF.253988.1, which shares homology with human long noncoding RNA MALAT1. The findings of our study significantly increase the number of known lncRNAs in pigs.
|
180
|
Ziegenhain C, Vieth B, Parekh S, Reinius B, Guillaumet-Adkins A, Smets M, Leonhardt H, Heyn H, Hellmann I, Enard W. Comparative Analysis of Single-Cell RNA Sequencing Methods. Mol Cell 2017; 65:631-643.e4. [PMID: 28212749 DOI: 10.1016/j.molcel.2017.01.023] [Citation(s) in RCA: 949] [Impact Index Per Article: 118.6] [Received: 08/08/2016] [Revised: 12/01/2016] [Accepted: 01/17/2017] [Indexed: 02/06/2023]
Abstract
Single-cell RNA sequencing (scRNA-seq) offers new possibilities to address biological and medical questions. However, systematic comparisons of the performance of diverse scRNA-seq protocols are lacking. We generated data from 583 mouse embryonic stem cells to evaluate six prominent scRNA-seq methods: CEL-seq2, Drop-seq, MARS-seq, SCRB-seq, Smart-seq, and Smart-seq2. While Smart-seq2 detected the most genes per cell and across cells, CEL-seq2, Drop-seq, MARS-seq, and SCRB-seq quantified mRNA levels with less amplification noise due to the use of unique molecular identifiers (UMIs). Power simulations at different sequencing depths showed that Drop-seq is more cost-efficient for transcriptome quantification of large numbers of cells, while MARS-seq, SCRB-seq, and Smart-seq2 are more efficient when analyzing fewer cells. Our quantitative comparison offers the basis for an informed choice among six prominent scRNA-seq methods, and it provides a framework for benchmarking further improvements of scRNA-seq protocols.
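The abstract attributes the lower amplification noise of several protocols to unique molecular identifiers. A minimal sketch of the idea: reads that share a cell barcode, gene and UMI are treated as PCR copies of one molecule and counted once. The tuple layout and the function name `umi_counts` are assumptions for illustration, and the sketch ignores UMI sequencing errors, which real pipelines usually correct.

```python
from collections import defaultdict


def umi_counts(reads):
    """Count distinct UMIs per (cell barcode, gene) pair.

    `reads` is an iterable of (cell_barcode, gene, umi) tuples.
    """
    seen = defaultdict(set)
    for cell, gene, umi in reads:
        seen[(cell, gene)].add(umi)
    return {key: len(umis) for key, umis in seen.items()}


# Toy reads: the first three are PCR duplicates of a single molecule.
reads = [
    ("cell1", "Actb", "AACGT"),
    ("cell1", "Actb", "AACGT"),
    ("cell1", "Actb", "AACGT"),
    ("cell1", "Actb", "GGTCA"),
    ("cell2", "Actb", "AACGT"),
]
molecule_counts = umi_counts(reads)
read_count_cell1 = sum(1 for cell, gene, _ in reads if (cell, gene) == ("cell1", "Actb"))
print(molecule_counts, read_count_cell1)
```

Here cell1 has four reads for the gene but only two molecules, so the UMI count is not inflated by how strongly one molecule happened to amplify.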
Affiliation(s)
- Christoph Ziegenhain
- Anthropology & Human Genomics, Department of Biology II, Ludwig-Maximilians University, Großhaderner Straße 2, 82152 Martinsried, Germany
| | - Beate Vieth
- Anthropology & Human Genomics, Department of Biology II, Ludwig-Maximilians University, Großhaderner Straße 2, 82152 Martinsried, Germany
| | - Swati Parekh
- Anthropology & Human Genomics, Department of Biology II, Ludwig-Maximilians University, Großhaderner Straße 2, 82152 Martinsried, Germany
| | - Björn Reinius
- Ludwig Institute for Cancer Research, Box 240, 171 77 Stockholm, Sweden; Department of Cell and Molecular Biology, Karolinska Institutet, 171 77 Stockholm, Sweden
| | - Amy Guillaumet-Adkins
- CNAG-CRG, Centre for Genomic Regulation (CRG), Barcelona Institute of Science and Technology (BIST), 08028 Barcelona, Spain; Universitat Pompeu Fabra (UPF), 08002 Barcelona, Spain
| | - Martha Smets
- Department of Biology II and Center for Integrated Protein Science Munich (CIPSM), Ludwig-Maximilians University, Großhaderner Straße 2, 82152 Martinsried, Germany
| | - Heinrich Leonhardt
- Department of Biology II and Center for Integrated Protein Science Munich (CIPSM), Ludwig-Maximilians University, Großhaderner Straße 2, 82152 Martinsried, Germany
| | - Holger Heyn
- CNAG-CRG, Centre for Genomic Regulation (CRG), Barcelona Institute of Science and Technology (BIST), 08028 Barcelona, Spain; Universitat Pompeu Fabra (UPF), 08002 Barcelona, Spain
| | - Ines Hellmann
- Anthropology & Human Genomics, Department of Biology II, Ludwig-Maximilians University, Großhaderner Straße 2, 82152 Martinsried, Germany
| | - Wolfgang Enard
- Anthropology & Human Genomics, Department of Biology II, Ludwig-Maximilians University, Großhaderner Straße 2, 82152 Martinsried, Germany.
|
181
|
Zhang C, Zhang B, Lin LL, Zhao S. Evaluation and comparison of computational tools for RNA-seq isoform quantification. BMC Genomics 2017; 18:583. [PMID: 28784092 PMCID: PMC5547501 DOI: 10.1186/s12864-017-4002-1] [Citation(s) in RCA: 113] [Impact Index Per Article: 14.1] [Received: 04/19/2017] [Accepted: 08/01/2017] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: Alternatively spliced transcript isoforms are commonly observed in higher eukaryotes. The expression levels of these isoforms are key for understanding normal functions in healthy tissues and the progression of disease states. However, accurate quantification of expression at the transcript level is limited with current RNA-seq technologies because of, for example, limited read length and the cost of deep sequencing. RESULTS: A large number of tools have been developed to tackle this problem, and we performed a comprehensive evaluation of these tools using both experimental and simulated RNA-seq datasets. We found that recently developed alignment-free tools are both fast and accurate. The accuracy of all methods was mainly influenced by the complexity of gene structures, and caution must be taken when interpreting quantification results for short transcripts. Using TP53 gene simulation, we discovered that both sequencing depth and the relative abundance of different isoforms affect quantification accuracy. CONCLUSIONS: Our comprehensive evaluation helps data analysts to make an informed choice when selecting computational tools for isoform quantification.
Affiliation(s)
- Chi Zhang
- Early Clinical Development, Pfizer Worldwide R&D, Cambridge, MA, 02139, USA
| | - Baohong Zhang
- Early Clinical Development, Pfizer Worldwide R&D, Cambridge, MA, 02139, USA
| | - Lih-Ling Lin
- Inflammation and Immunology RU, Pfizer Worldwide R&D, Cambridge, MA, 02139, USA
| | - Shanrong Zhao
- Early Clinical Development, Pfizer Worldwide R&D, Cambridge, MA, 02139, USA.
|
182
|
Zakeri M, Srivastava A, Almodaresi F, Patro R. Improved data-driven likelihood factorizations for transcript abundance estimation. Bioinformatics 2017; 33:i142-i151. [PMID: 28881996 PMCID: PMC5870700 DOI: 10.1093/bioinformatics/btx262] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.9] [Indexed: 12/18/2022] Open
Abstract
MOTIVATION: Many methods for transcript-level abundance estimation reduce the computational burden associated with the iterative algorithms they use by adopting an approximate factorization of the likelihood function they optimize. This leads to considerably faster convergence of the optimization procedure, since each round of, e.g., the EM algorithm can execute much more quickly. However, these approximate factorizations of the likelihood function simplify calculations at the expense of discarding certain information that can be useful for accurate transcript abundance estimation. RESULTS: We demonstrate that model simplifications (i.e. factorizations of the likelihood function) adopted by certain abundance estimation methods can lead to a diminished ability to accurately estimate the abundances of highly related transcripts. In particular, considering factorizations based on transcript-fragment compatibility alone can result in a loss of accuracy compared to the per-fragment, unsimplified model. However, we show that such shortcomings are not an inherent limitation of approximately factorizing the underlying likelihood function. By considering the appropriate conditional fragment probabilities, and adopting improved, data-driven factorizations of this likelihood, we demonstrate that such approaches can achieve accuracy nearly indistinguishable from methods that consider the complete (i.e. per-fragment) likelihood, while retaining the computational efficiency of the compatibility-based factorizations. AVAILABILITY AND IMPLEMENTATION: Our data-driven factorizations are incorporated into a branch of the Salmon transcript quantification tool: https://github.com/COMBINE-lab/salmon/tree/factorizations. CONTACT: rob.patro@cs.stonybrook.edu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
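The compatibility-based factorization discussed in this abstract can be illustrated with a textbook EM over equivalence classes, where fragments are grouped by the set of transcripts they are compatible with and each EM round loops over classes rather than individual fragments. The sketch below is a generic version under stated assumptions (made-up counts, equal effective lengths, a hypothetical function `em_abundance`); it is not Salmon's implementation and does not include the paper's data-driven conditional probabilities.

```python
def em_abundance(eq_classes, eff_lengths, n_iter=200):
    """Estimate expected fragment counts per transcript with a simple EM.

    eq_classes: list of (tuple_of_transcript_ids, fragment_count).
    eff_lengths: dict mapping transcript id -> effective length.
    """
    transcripts = sorted(eff_lengths)
    total = sum(count for _, count in eq_classes)
    # Start from a uniform allocation of all fragments.
    alpha = {t: total / len(transcripts) for t in transcripts}
    for _ in range(n_iter):
        new_alpha = dict.fromkeys(transcripts, 0.0)
        for members, count in eq_classes:
            # E-step: share the class count in proportion to alpha / length.
            weights = {t: alpha[t] / eff_lengths[t] for t in members}
            denom = sum(weights.values())
            if denom == 0:
                continue
            for t, w in weights.items():
                # M-step accumulation: expected fragments assigned to t.
                new_alpha[t] += count * w / denom
        alpha = new_alpha
    return alpha


# Toy data: 30 fragments unique to A, 10 unique to B, 60 compatible with both.
eq_classes = [(("A",), 30), (("A", "B"), 60), (("B",), 10)]
eff_lengths = {"A": 1000.0, "B": 1000.0}
alpha = em_abundance(eq_classes, eff_lengths)
print(alpha)
```

Because every fragment in a class is treated identically, the ambiguous class is split using only the current abundance estimates, which is the information loss the abstract describes for highly related transcripts.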
Affiliation(s)
- Mohsen Zakeri
- Department of Computer Science, Stony Brook University, Stony Brook, NY, USA
| | - Avi Srivastava
- Department of Computer Science, Stony Brook University, Stony Brook, NY, USA
| | - Fatemeh Almodaresi
- Department of Computer Science, Stony Brook University, Stony Brook, NY, USA
| | - Rob Patro
- Department of Computer Science, Stony Brook University, Stony Brook, NY, USA
|
183
|
Pimentel H, Bray NL, Puente S, Melsted P, Pachter L. Differential analysis of RNA-seq incorporating quantification uncertainty. Nat Methods 2017; 14:687-690. [PMID: 28581496 DOI: 10.1101/058164] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.8] [Received: 01/27/2017] [Accepted: 05/04/2017] [Indexed: 05/22/2023]
Abstract
We describe sleuth (http://pachterlab.github.io/sleuth), a method for the differential analysis of gene expression data that utilizes bootstrapping in conjunction with response error linear modeling to decouple biological variance from inferential variance. sleuth is implemented in an interactive shiny app that utilizes kallisto quantifications and bootstraps for fast and accurate analysis of data from RNA-seq experiments.
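As a rough illustration of what decoupling biological variance from inferential variance can mean, the sketch below treats the variance of per-sample estimates as a total, estimates the inferential part as the average within-sample bootstrap variance, and takes the non-negative remainder as biological variance. This is a deliberately simplified assumption for a single transcript with made-up log-scale values; it is not sleuth's actual response error model, which the abstract does not detail.

```python
from statistics import mean, variance


def decompose_variance(bootstraps_per_sample):
    """Split variance into (biological, inferential) parts for one transcript.

    bootstraps_per_sample: one list of bootstrap abundance estimates
    (log scale) per biological replicate.
    """
    point_estimates = [mean(b) for b in bootstraps_per_sample]
    total_var = variance(point_estimates)  # variance across replicates
    inferential_var = mean([variance(b) for b in bootstraps_per_sample])
    biological_var = max(total_var - inferential_var, 0.0)
    return biological_var, inferential_var


# Three replicates, three bootstrap estimates each (hypothetical values).
replicates = [
    [4.9, 5.0, 5.1],
    [5.9, 6.0, 6.1],
    [6.9, 7.0, 7.1],
]
biological_var, inferential_var = decompose_variance(replicates)
print(biological_var, inferential_var)
```

A transcript whose bootstrap estimates scatter widely within each sample would show a large inferential term, so apparent differences between samples would be discounted accordingly.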
Affiliation(s)
- Harold Pimentel
- Department of Computer Science, University of California, Berkeley, Berkeley, California, USA
| | - Nicolas L Bray
- Innovative Genomics Institute and Department of Molecular & Cell Biology, University of California, Berkeley, Berkeley, California, USA
| | - Suzette Puente
- Department of Statistics, University of California, Berkeley, Berkeley, California, USA
| | - Páll Melsted
- Faculty of Industrial Engineering, Mechanical Engineering and Computer Science, University of Iceland, Reykjavík, Iceland
| | - Lior Pachter
- Division of Biology and Biological Engineering, Caltech, Pasadena, California, USA
|
184
|
Patro R, Duggal G, Love MI, Irizarry RA, Kingsford C. Salmon provides fast and bias-aware quantification of transcript expression. Nat Methods 2017; 14:417-419. [PMID: 28263959 PMCID: PMC5600148 DOI: 10.1038/nmeth.4197] [Citation(s) in RCA: 6965] [Impact Index Per Article: 870.6] [Received: 08/29/2016] [Accepted: 01/22/2017] [Indexed: 12/12/2022]
Abstract
We introduce Salmon, a lightweight method for quantifying transcript abundance from RNA-seq reads. Salmon combines a new dual-phase parallel inference algorithm and feature-rich bias models with an ultra-fast read mapping procedure. It is the first transcriptome-wide quantifier to correct for fragment GC-content bias, which, as we demonstrate here, substantially improves the accuracy of abundance estimates and the sensitivity of subsequent differential expression analysis.
Affiliation(s)
- Rob Patro
- Department of Computer Science, Stony Brook University, Stony Brook, New York, USA
| | | | - Michael I Love
- Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Cambridge, Massachusetts, USA
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Cambridge, Massachusetts, USA
| | - Rafael A Irizarry
- Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Cambridge, Massachusetts, USA
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Cambridge, Massachusetts, USA
| | - Carl Kingsford
- Computational Biology Department, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
|
186
|
Collado-Torres L, Nellore A, Frazee AC, Wilks C, Love MI, Langmead B, Irizarry RA, Leek JT, Jaffe AE. Flexible expressed region analysis for RNA-seq with derfinder. Nucleic Acids Res 2017; 45:e9. [PMID: 27694310 PMCID: PMC5314792 DOI: 10.1093/nar/gkw852] [Citation(s) in RCA: 42] [Impact Index Per Article: 5.3] [Received: 05/16/2016] [Revised: 08/25/2016] [Accepted: 09/15/2016] [Indexed: 12/20/2022] Open
Abstract
Differential expression analysis of RNA sequencing (RNA-seq) data typically relies on reconstructing transcripts or counting reads that overlap known gene structures. We previously introduced an intermediate statistical approach called differentially expressed region (DER) finder that seeks to identify contiguous regions of the genome showing differential expression signal at single base resolution without relying on existing annotation or potentially inaccurate transcript assembly. We present the derfinder software that improves our annotation-agnostic approach to RNA-seq analysis by: (i) implementing a computationally efficient bump-hunting approach to identify DERs that permits genome-scale analyses in a large number of samples, (ii) introducing a flexible statistical modeling framework, including multi-group and time-course analyses and (iii) introducing a new set of data visualizations for expressed region analysis. We apply this approach to public RNA-seq data from the Genotype-Tissue Expression (GTEx) project and BrainSpan project to show that derfinder permits the analysis of hundreds of samples at base resolution in R, identifies expression outside of known gene boundaries and can be used to visualize expressed regions at base resolution. In simulations, our base resolution approaches enable discovery in the presence of incomplete annotation and are nearly as powerful as feature-level methods when the annotation is complete. derfinder analysis using expressed region-level and single base-level approaches provides a compromise between full transcript reconstruction and feature-level analysis. The package is available from Bioconductor at www.bioconductor.org/packages/derfinder.
Affiliation(s)
- Leonardo Collado-Torres
- Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD 21205, USA
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD 21205, USA
- Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, MD 21205, USA
| | - Abhinav Nellore
- Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD 21205, USA
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD 21205, USA
- Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218, USA
| | - Alyssa C Frazee
- Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD 21205, USA
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD 21205, USA
| | - Christopher Wilks
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD 21205, USA
- Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218, USA
| | - Michael I Love
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA 02115, USA
- Dana-Farber Cancer Institute, Harvard University, Boston, MA 02215, USA
| | - Ben Langmead
- Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD 21205, USA
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD 21205, USA
- Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218, USA
| | - Rafael A Irizarry
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA 02115, USA
- Dana-Farber Cancer Institute, Harvard University, Boston, MA 02215, USA
| | - Jeffrey T Leek
- Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD 21205, USA
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD 21205, USA
| | - Andrew E Jaffe
- Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD 21205, USA
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD 21205, USA
- Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, MD 21205, USA
- Department of Mental Health, Johns Hopkins University, Baltimore, MD 21205, USA
|
187
|
Love MI, Hogenesch JB, Irizarry RA. Modeling of RNA-seq fragment sequence bias reduces systematic errors in transcript abundance estimation. Nat Biotechnol 2016. [PMID: 27669167 DOI: 10.1101/025767] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.3] [Indexed: 05/02/2023]
Abstract
We find that current computational methods for estimating transcript abundance from RNA-seq data can lead to hundreds of false-positive results. We show that these systematic errors stem largely from a failure to model fragment GC content bias. Sample-specific biases associated with fragment sequence features lead to misidentification of transcript isoforms. We introduce alpine, a method for estimating sample-specific bias-corrected transcript abundance. By incorporating fragment sequence features, alpine greatly increases the accuracy of transcript abundance estimates, enabling a fourfold reduction in the number of false positives for reported changes in expression compared with Cufflinks. Using simulated data, we also show that alpine retains the ability to discover true positives, similar to other approaches. The method is available as an R/Bioconductor package that includes data visualization tools useful for bias discovery.
Affiliation(s)
- Michael I Love
- Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Boston, Massachusetts, USA
- Department of Biostatistics, Harvard TH Chan School of Public Health, Boston, Massachusetts, USA
| | - John B Hogenesch
- Department of Pharmacology, Institute for Translational Medicine and Therapeutics, University of Pennsylvania School of Medicine, Philadelphia, Pennsylvania, USA
| | - Rafael A Irizarry
- Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Boston, Massachusetts, USA
- Department of Biostatistics, Harvard TH Chan School of Public Health, Boston, Massachusetts, USA
| |
|
188
|
Alignment-free Transcriptomic and Metatranscriptomic Comparison Using Sequencing Signatures with Variable Length Markov Chains. Sci Rep 2016; 6:37243. [PMID: 27876823 PMCID: PMC5120338 DOI: 10.1038/srep37243] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.8] [Received: 04/08/2016] [Accepted: 10/27/2016] [Indexed: 11/08/2022] Open
Abstract
The comparison between microbial sequencing data is critical to understand the dynamics of microbial communities. The alignment-based tools analyzing metagenomic datasets require reference sequences and read alignments. The available alignment-free dissimilarity approaches model the background sequences with Fixed Order Markov Chain (FOMC) yielding promising results for the comparison of microbial communities. However, in FOMC, the number of parameters grows exponentially with the increase of the order of Markov Chain (MC). Under a fixed high order of MC, the parameters might not be accurately estimated owing to the limitation of sequencing depth. In our study, we investigate an alternative to FOMC to model background sequences with the data-driven Variable Length Markov Chain (VLMC) in metatranscriptomic data. The VLMC originally designed for long sequences was extended to apply to high-throughput sequencing reads and the strategies to estimate the corresponding parameters were developed. The flexible number of parameters in VLMC avoids estimating the vast number of parameters of high-order MC under limited sequencing depth. Different from the manual selection in FOMC, VLMC determines the MC order adaptively. Several beta diversity measures based on VLMC were applied to compare the bacterial RNA-Seq and metatranscriptomic datasets. Experiments show that VLMC outperforms FOMC to model the background sequences in transcriptomic and metatranscriptomic samples. A software pipeline is available at https://d2vlmc.codeplex.com.
|
189
|
Modeling of RNA-seq fragment sequence bias reduces systematic errors in transcript abundance estimation. Nat Biotechnol 2016; 34:1287-1291. [PMID: 27669167 PMCID: PMC5143225 DOI: 10.1038/nbt.3682] [Citation(s) in RCA: 114] [Impact Index Per Article: 12.7] [Received: 09/01/2015] [Accepted: 08/22/2016] [Indexed: 11/17/2022]
|
190
|
Sun X, Dalpiaz D, Wu D, Liu JS, Zhong W, Ma P. Statistical inference for time course RNA-Seq data using a negative binomial mixed-effect model. BMC Bioinformatics 2016; 17:324. [PMID: 27565575 PMCID: PMC5002174 DOI: 10.1186/s12859-016-1180-9] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.0] [Received: 03/03/2016] [Accepted: 08/11/2016] [Indexed: 02/05/2023] Open
Abstract
Background: Accurate identification of differentially expressed (DE) genes in time course RNA-Seq data is crucial for understanding the dynamics of transcriptional regulatory network. However, most of the available methods treat gene expressions at different time points as replicates and test the significance of the mean expression difference between treatments or conditions irrespective of time. They thus fail to identify many DE genes with different profiles across time. In this article, we propose a negative binomial mixed-effect model (NBMM) to identify DE genes in time course RNA-Seq data. In the NBMM, mean gene expression is characterized by a fixed effect, and time dependency is described by random effects. The NBMM is very flexible and can be fitted to both unreplicated and replicated time course RNA-Seq data via a penalized likelihood method. By comparing gene expression profiles over time, we further classify the DE genes into two subtypes to enhance the understanding of expression dynamics. A significance test for detecting DE genes is derived using a Kullback-Leibler distance ratio. Additionally, a significance test for gene sets is developed using a gene set score. Results: Simulation analysis shows that the NBMM outperforms currently available methods for detecting DE genes and gene sets. Moreover, our real data analysis of fruit fly developmental time course RNA-Seq data demonstrates the NBMM identifies biologically relevant genes which are well justified by gene ontology analysis. Conclusions: The proposed method is powerful and efficient to detect biologically relevant DE genes and gene sets in time course RNA-Seq data. Electronic supplementary material: The online version of this article (doi:10.1186/s12859-016-1180-9) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Xiaoxiao Sun
- Department of Statistics, University of Georgia, 101 Cedar Street, Athens, 30602, USA
| | - David Dalpiaz
- Department of Statistics, University of Illinois at Urbana-Champaign, 725 South Wright Street, Champaign, 61820, USA
| | - Di Wu
- Department of Statistics, Harvard University, One Oxford Street, Cambridge, 02138, USA
| | - Jun S Liu
- Department of Statistics, Harvard University, One Oxford Street, Cambridge, 02138, USA
| | - Wenxuan Zhong
- Department of Statistics, University of Georgia, 101 Cedar Street, Athens, 30602, USA
| | - Ping Ma
- Department of Statistics, University of Georgia, 101 Cedar Street, Athens, 30602, USA
| |
|
191
|
Ziemann M, Kaspi A, El-Osta A. Evaluation of microRNA alignment techniques. RNA 2016; 22:1120-38. [PMID: 27284164 PMCID: PMC4931105 DOI: 10.1261/rna.055509.115] [Citation(s) in RCA: 35] [Impact Index Per Article: 3.9] [Received: 12/01/2015] [Accepted: 05/04/2016] [Indexed: 05/26/2023]
Abstract
Genomic alignment of small RNA (smRNA) sequences such as microRNAs poses considerable challenges due to their short length (∼21 nucleotides [nt]) as well as the large size and complexity of plant and animal genomes. While several tools have been developed for high-throughput mapping of longer mRNA-seq reads (>30 nt), there are few that are specifically designed for mapping of smRNA reads including microRNAs. The accuracy of these mappers has not been systematically determined in the case of smRNA-seq. In addition, it is unknown whether these aligners accurately map smRNA reads containing sequence errors and polymorphisms. By using simulated read sets, we determine the alignment sensitivity and accuracy of 16 short-read mappers and quantify their robustness to mismatches, indels, and nontemplated nucleotide additions. These were explored in the context of a plant genome (Oryza sativa, ∼500 Mbp) and a mammalian genome (Homo sapiens, ∼3.1 Gbp). Analysis of simulated and real smRNA-seq data demonstrates that mapper selection impacts differential expression results and interpretation. These results will inform on best practice for smRNA mapping and enable more accurate smRNA detection and quantification of expression and RNA editing.
Affiliation(s)
- Mark Ziemann
- Epigenetics in Human Health and Disease Laboratory, Baker IDI Heart and Diabetes Institute, The Alfred Medical Research and Education Precinct, Melbourne, Victoria 3004, Australia
- Epigenomics Profiling Facility, Baker IDI Heart and Diabetes Institute, The Alfred Medical Research and Education Precinct, Melbourne, Victoria 3004, Australia
| | - Antony Kaspi
- Epigenetics in Human Health and Disease Laboratory, Baker IDI Heart and Diabetes Institute, The Alfred Medical Research and Education Precinct, Melbourne, Victoria 3004, Australia
- Epigenomics Profiling Facility, Baker IDI Heart and Diabetes Institute, The Alfred Medical Research and Education Precinct, Melbourne, Victoria 3004, Australia
| | - Assam El-Osta
- Epigenetics in Human Health and Disease Laboratory, Baker IDI Heart and Diabetes Institute, The Alfred Medical Research and Education Precinct, Melbourne, Victoria 3004, Australia
- Epigenomics Profiling Facility, Baker IDI Heart and Diabetes Institute, The Alfred Medical Research and Education Precinct, Melbourne, Victoria 3004, Australia
| |
|
192
|
MetaTrans: an open-source pipeline for metatranscriptomics. Sci Rep 2016; 6:26447. [PMID: 27211518 PMCID: PMC4876386 DOI: 10.1038/srep26447] [Citation(s) in RCA: 54] [Impact Index Per Article: 6.0] [Received: 12/08/2015] [Accepted: 04/29/2016] [Indexed: 01/08/2023] Open
Abstract
To date, meta-omic approaches use high-throughput sequencing technologies, which produce a huge amount of data, thus challenging modern computers. Here we present MetaTrans, an efficient open-source pipeline to analyze the structure and functions of active microbial communities using the power of multi-threading computers. The pipeline is designed to perform two types of RNA-Seq analyses: taxonomic and gene expression. It performs quality-control assessment, rRNA removal, maps reads against functional databases and also handles differential gene expression analysis. Its efficacy was validated by analyzing data from synthetic mock communities, data from a previous study and data generated from twelve human fecal samples. Compared to an existing web application server, MetaTrans shows more efficiency in terms of runtime (around 2 hours per million of transcripts) and presents adapted tools to compare gene expression levels. It has been tested with a human gut microbiome database but also proposes an option to use a general database in order to analyze other ecosystems. For the installation and use of the pipeline, we provide a detailed guide at the following website (www.metatrans.org).
|
193
|
Germain PL, Vitriolo A, Adamo A, Laise P, Das V, Testa G. RNAontheBENCH: computational and empirical resources for benchmarking RNAseq quantification and differential expression methods. Nucleic Acids Res 2016; 44:5054-67. [PMID: 27190234 PMCID: PMC4914128 DOI: 10.1093/nar/gkw448] [Citation(s) in RCA: 28] [Impact Index Per Article: 3.1] [Received: 01/18/2016] [Accepted: 05/09/2016] [Indexed: 11/13/2022] Open
Abstract
RNA sequencing (RNAseq) has become the method of choice for transcriptome analysis, yet no consensus exists as to the most appropriate pipeline for its analysis, with current benchmarks suffering important limitations. Here, we address these challenges through a rich benchmarking resource harnessing (i) two RNAseq datasets including ERCC ExFold spike-ins; (ii) Nanostring measurements of a panel of 150 genes on the same samples; (iii) a set of internal, genetically-determined controls; (iv) a reanalysis of the SEQC dataset; and (v) a focus on relative quantification (i.e. across-samples). We use this resource to compare different approaches to each step of RNAseq analysis, from alignment to differential expression testing. We show that methods providing the best absolute quantification do not necessarily provide good relative quantification across samples, that count-based methods are superior for gene-level relative quantification, and that the new generation of pseudo-alignment-based software performs as well as established methods, at a fraction of the computing time. We also assess the impact of library type and size on quantification and differential expression analysis. Finally, we have created a R package and a web platform to enable the simple and streamlined application of this resource to the benchmarking of future methods.
Affiliation(s)
- Pierre-Luc Germain
- European Institute of Oncology, Department of Experimental Oncology, Via Adamello 16, 20139 Milano, Italy
| | - Alessandro Vitriolo
- European Institute of Oncology, Department of Experimental Oncology, Via Adamello 16, 20139 Milano, Italy
- University of Milan, Department of Oncology and Hemato-Oncology, Via Festa del Perdono 7, 20122 Milano, Italy
| | - Antonio Adamo
- European Institute of Oncology, Department of Experimental Oncology, Via Adamello 16, 20139 Milano, Italy
| | - Pasquale Laise
- European Institute of Oncology, Department of Experimental Oncology, Via Adamello 16, 20139 Milano, Italy
| | - Vivek Das
- European Institute of Oncology, Department of Experimental Oncology, Via Adamello 16, 20139 Milano, Italy
- University of Milan, Department of Oncology and Hemato-Oncology, Via Festa del Perdono 7, 20122 Milano, Italy
| | - Giuseppe Testa
- European Institute of Oncology, Department of Experimental Oncology, Via Adamello 16, 20139 Milano, Italy
- University of Milan, Department of Oncology and Hemato-Oncology, Via Festa del Perdono 7, 20122 Milano, Italy
| |
|
194
|
Teng M, Love MI, Davis CA, Djebali S, Dobin A, Graveley BR, Li S, Mason CE, Olson S, Pervouchine D, Sloan CA, Wei X, Zhan L, Irizarry RA. A benchmark for RNA-seq quantification pipelines. Genome Biol 2016; 17:74. [PMID: 27107712 PMCID: PMC4842274 DOI: 10.1186/s13059-016-0940-1] [Citation(s) in RCA: 127] [Impact Index Per Article: 14.1] [Received: 08/06/2015] [Accepted: 04/08/2016] [Indexed: 02/07/2023] Open
Abstract
Obtaining RNA-seq measurements involves a complex data analytical process with a large number of competing algorithms as options. There is much debate about which of these methods provides the best approach. Unfortunately, it is currently difficult to evaluate their performance due in part to a lack of sensitive assessment metrics. We present a series of statistical summaries and plots to evaluate the performance in terms of specificity and sensitivity, available as a R/Bioconductor package (http://bioconductor.org/packages/rnaseqcomp). Using two independent datasets, we assessed seven competing pipelines. Performance was generally poor, with two methods clearly underperforming and RSEM slightly outperforming the rest.
Affiliation(s)
- Mingxiang Teng
- Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, 450 Brookline Avenue, Boston, MA, 02215, USA
- Department of Biostatistics, Harvard TH Chan School of Public Health, 677 Huntington Avenue, Boston, MA, 02115, USA
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
| | - Michael I Love
- Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, 450 Brookline Avenue, Boston, MA, 02215, USA
- Department of Biostatistics, Harvard TH Chan School of Public Health, 677 Huntington Avenue, Boston, MA, 02115, USA
| | - Carrie A Davis
- Functional Genomics Group, Cold Spring Harbor Laboratory, 1 Bungtown Road, Cold Spring Harbor, NY, 11724, USA
| | - Sarah Djebali
- Bioinformatics and Genomics Programme, Centre for Genomic Regulation (CRG) and UPF, Doctor Aiguader, 88, Barcelona, 08003, Spain
| | - Alexander Dobin
- Functional Genomics Group, Cold Spring Harbor Laboratory, 1 Bungtown Road, Cold Spring Harbor, NY, 11724, USA
| | - Brenton R Graveley
- Department of Genetics and Genome Sciences, Institute for System Genomics, UConn Health Center, Farmington, CT, 06030, USA
| | - Sheng Li
- Department of Physiology and Biophysics, Weill Cornell Medical College, New York, New York, USA
| | - Christopher E Mason
- Department of Physiology and Biophysics, Weill Cornell Medical College, New York, New York, USA
| | - Sara Olson
- Department of Genetics and Genome Sciences, Institute for System Genomics, UConn Health Center, Farmington, CT, 06030, USA
| | - Dmitri Pervouchine
- Bioinformatics and Genomics Programme, Centre for Genomic Regulation (CRG) and UPF, Doctor Aiguader, 88, Barcelona, 08003, Spain
| | - Cricket A Sloan
- Department of Genetics, Stanford University, 300 Pasteur Drive, MC-5477, Stanford, CA, 94305, USA
| | - Xintao Wei
- Department of Genetics and Genome Sciences, Institute for System Genomics, UConn Health Center, Farmington, CT, 06030, USA
| | - Lijun Zhan
- Department of Genetics and Genome Sciences, Institute for System Genomics, UConn Health Center, Farmington, CT, 06030, USA
| | - Rafael A Irizarry
- Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, 450 Brookline Avenue, Boston, MA, 02215, USA
- Department of Biostatistics, Harvard TH Chan School of Public Health, 677 Huntington Avenue, Boston, MA, 02115, USA
| |
|
195
|
Hirsch CD, Springer NM, Hirsch CN. Genomic limitations to RNA sequencing expression profiling. Plant J 2015; 84:491-503. [PMID: 26331235 DOI: 10.1111/tpj.13014] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.8] [Received: 04/28/2015] [Accepted: 08/25/2015] [Indexed: 05/24/2023]
Abstract
The field of genomics has grown rapidly with the advent of massively parallel sequencing technologies, allowing for novel biological insights with regards to genomic, transcriptomic, and epigenomic variation. One widely utilized application of high-throughput sequencing is transcriptional profiling using RNA sequencing (RNAseq). Understanding the limitations of a technology is critical for accurate biological interpretations, and clear interpretation of RNAseq data can be difficult in species with complex genomes. To understand the limitations of accurate profiling of expression levels we simulated RNAseq reads from annotated gene models in several plant species including Arabidopsis, brachypodium, maize, potato, rice, soybean, and tomato. The simulated reads were aligned using various parameters such as unique versus multiple read alignments. This allowed the identification of genes recalcitrant to RNAseq analyses by having over- and/or under-estimated expression levels. In maize, over 25% of genes deviated by more than 20% from the expected count values, suggesting the need for cautious interpretation of RNAseq data for certain genes. The reasons identified for deviation from expected expression varied between species due to differences in genome structure including, but not limited to, genes encoding short transcripts, overlapping gene models, and gene family size. Utilizing existing empirical datasets we demonstrate the potential for biological misinterpretation resulting from inclusion of 'flagged genes' in analyses. While RNAseq is a powerful tool for understanding biology, there are limitations to this technology that need to be understood in order to improve our biological interpretations.
Affiliation(s)
- Cory D Hirsch
- Department of Plant Biology, University of Minnesota, St Paul, MN, 55108, USA
| | - Nathan M Springer
- Department of Plant Biology, University of Minnesota, St Paul, MN, 55108, USA
| | - Candice N Hirsch
- Department of Agronomy and Plant Genetics, University of Minnesota, St Paul, MN, 55108, USA
| |
|