101
|
Shi X, Neuwald AF, Wang X, Wang TL, Hilakivi-Clarke L, Clarke R, Xuan J. IntAPT: integrated assembly of phenotype-specific transcripts from multiple RNA-seq profiles. Bioinformatics 2021; 37:650-658. [PMID: 33016988 PMCID: PMC8097681 DOI: 10.1093/bioinformatics/btaa852] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/19/2019] [Revised: 08/27/2020] [Accepted: 09/21/2020] [Indexed: 11/14/2022]
Abstract
MOTIVATION: High-throughput RNA sequencing has revolutionized the scope and depth of transcriptome analysis. Accurate reconstruction of a phenotype-specific transcriptome is challenging due to the noise and variability of RNA-seq data. This requires computational identification of transcripts from multiple samples of the same phenotype, given the underlying consensus transcript structure. RESULTS: We present a Bayesian method, integrated assembly of phenotype-specific transcripts (IntAPT), that identifies phenotype-specific isoforms from multiple RNA-seq profiles. IntAPT features a novel two-layer Bayesian model to capture the presence of isoforms at the group layer and to quantify the abundance of isoforms at the sample layer. A spike-and-slab prior is used to model the isoform expression and to enforce the sparsity of expressed isoforms. Dependencies between the existence of isoforms and their expression are modeled explicitly to facilitate parameter estimation. Model parameters are estimated iteratively using Gibbs sampling to infer the joint posterior distribution, from which the presence and abundance of isoforms can reliably be determined. Studies using both simulations and real datasets show that IntAPT consistently outperforms existing methods. Experimental results demonstrate that, despite sequencing errors, IntAPT exhibits a robust performance among multiple samples, resulting in notably improved identification of expressed isoforms of low abundance. AVAILABILITY AND IMPLEMENTATION: The IntAPT package is available at http://github.com/henryxushi/IntAPT. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Xu Shi
  - Bradley Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University, Arlington, VA 22203, USA; Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT 06520, USA
- Andrew F Neuwald
  - Institute for Genome Sciences and Department of Biochemistry & Molecular Biology, University of Maryland School of Medicine, Baltimore, MD 21201, USA
- Xiao Wang
  - Bradley Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University, Arlington, VA 22203, USA
- Tian-Li Wang
  - Department of Pathology, Johns Hopkins Medical Institutions, Baltimore, MD 21231, USA
- Robert Clarke
  - Hormel Institute, University of Minnesota, 801 16th Ave NE, Austin, MN 55912, USA
- Jianhua Xuan
  - Bradley Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University, Arlington, VA 22203, USA
|
102
|
Stupnikov A, McInerney CE, Savage KI, McIntosh SA, Emmert-Streib F, Kennedy R, Salto-Tellez M, Prise KM, McArt DG. Robustness of differential gene expression analysis of RNA-seq. Comput Struct Biotechnol J 2021; 19:3470-3481. [PMID: 34188784 PMCID: PMC8214188 DOI: 10.1016/j.csbj.2021.05.040] [Citation(s) in RCA: 40] [Impact Index Per Article: 10.0] [Received: 12/03/2020] [Revised: 05/25/2021] [Accepted: 05/25/2021] [Indexed: 01/05/2023]
Abstract
RNA-sequencing (RNA-seq) is a relatively new technology that lacks standardisation. RNA-seq can be used for Differential Gene Expression (DGE) analysis; however, no consensus exists as to which methodology ensures robust and reproducible results. Indeed, it is broadly acknowledged that DGE methods provide disparate results. Despite obstacles, RNA-seq assays are in advanced development for clinical use but further optimisation will be needed. Herein, five DGE models (DESeq2, voom + limma, edgeR, EBSeq, NOISeq) for gene-level detection were investigated for robustness to sequencing alterations using a controlled analysis of fixed count matrices. Two breast cancer datasets were analysed with full and reduced sample sizes. DGE model robustness was compared between filtering regimes and for different expression levels (high, low) using unbiased metrics. Test sensitivity estimated as relative False Discovery Rate (FDR), concordance between model outputs, and comparisons of a ‘population’ of slopes of relative FDRs across different library sizes, generated using linear regressions, were examined. Patterns of relative DGE model robustness proved dataset-agnostic and reliable for drawing conclusions when sample sizes were sufficiently large. Overall, the non-parametric method NOISeq was the most robust, followed by edgeR, voom, EBSeq and DESeq2. Our rigorous appraisal provides information for method selection for molecular diagnostics. Metrics may prove useful towards improving the standardisation of RNA-seq for precision medicine.
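The slope-based comparison mentioned in this abstract can be illustrated with a minimal sketch. It assumes each DGE model has one relative FDR value per library-size fraction, fits an ordinary least-squares line to those points, and treats a slope closer to zero as more robust. The function names, the input layout and the ranking rule are illustrative assumptions, not the authors' code.

```python
def ols_slope(x, y):
    """Slope of the ordinary least-squares line through the points (x, y)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need at least two paired observations")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


def rank_by_robustness(library_fractions, rel_fdr_by_model):
    """Order models from flattest to steepest relative-FDR slope.

    Assumption: a slope nearer zero means the model's output changes
    less as the library size is altered.
    """
    slopes = {
        model: ols_slope(library_fractions, values)
        for model, values in rel_fdr_by_model.items()
    }
    return sorted(slopes, key=lambda model: abs(slopes[model]))


# Hypothetical inputs, for illustration only (not results from the paper).
fractions = [0.25, 0.5, 0.75, 1.0]
example = {
    "model_a": [0.40, 0.30, 0.20, 0.10],
    "model_b": [0.20, 0.18, 0.16, 0.14],
}
print(rank_by_robustness(fractions, example))
```

The paper examines a population of such slopes rather than a single fit per model, so a fuller version would repeat the fit across subsamples and compare the resulting distributions.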
Affiliation(s)
- A Stupnikov
  - Department of Biological and Medical Physics, Moscow Institute of Physics and Technology, Dolgoprudny, Russian Federation; Patrick G. Johnson Centre for Cancer Research, Queen's University, Belfast, Northern Ireland, UK
- C E McInerney
  - Patrick G. Johnson Centre for Cancer Research, Queen's University, Belfast, Northern Ireland, UK
- K I Savage
  - Patrick G. Johnson Centre for Cancer Research, Queen's University, Belfast, Northern Ireland, UK
- S A McIntosh
  - Patrick G. Johnson Centre for Cancer Research, Queen's University, Belfast, Northern Ireland, UK
- F Emmert-Streib
  - Predictive Society and Data Analytics Lab, Faculty of Information Technology and Communication Sciences, Tampere University, Tampere, Finland
- R Kennedy
  - Patrick G. Johnson Centre for Cancer Research, Queen's University, Belfast, Northern Ireland, UK
- M Salto-Tellez
  - Patrick G. Johnson Centre for Cancer Research, Queen's University, Belfast, Northern Ireland, UK
- K M Prise
  - Patrick G. Johnson Centre for Cancer Research, Queen's University, Belfast, Northern Ireland, UK
- D G McArt
  - Patrick G. Johnson Centre for Cancer Research, Queen's University, Belfast, Northern Ireland, UK
|
103
|
Sarantopoulou D, Brooks TG, Nayak S, Mrčela A, Lahens NF, Grant GR. Comparative evaluation of full-length isoform quantification from RNA-Seq. BMC Bioinformatics 2021; 22:266. [PMID: 34034652 PMCID: PMC8145802 DOI: 10.1186/s12859-021-04198-1] [Citation(s) in RCA: 24] [Impact Index Per Article: 6.0] [Received: 05/29/2020] [Accepted: 05/16/2021] [Indexed: 11/18/2022]
Abstract
Background: Full-length isoform quantification from RNA-Seq is a key goal in transcriptomics analyses and has been an area of active development since the beginning. The fundamental difficulty stems from the fact that RNA transcripts are long, while RNA-Seq reads are short. Results: Here we use simulated benchmarking data that reflects many properties of real data, including polymorphisms, intron signal and non-uniform coverage, allowing for systematic comparative analyses of isoform quantification accuracy and its impact on differential expression analysis. Genome, transcriptome and pseudo alignment-based methods are included; and a simple approach is included as a baseline control. Conclusions: Salmon, kallisto, RSEM, and Cufflinks exhibit the highest accuracy on idealized data, while on more realistic data they do not perform dramatically better than the simple approach. We determine the structural parameters with the greatest impact on quantification accuracy to be length and sequence compression complexity and not so much the number of isoforms. The effect of incomplete annotation on performance is also investigated. Overall, the tested methods show sufficient divergence from the truth to suggest that full-length isoform quantification and isoform level DE should still be employed selectively. Supplementary Information: The online version contains supplementary material available at 10.1186/s12859-021-04198-1.
Affiliation(s)
- Dimitra Sarantopoulou
  - Institute for Translational Medicine and Therapeutics, University of Pennsylvania, Philadelphia, PA, USA; National Institute on Aging, National Institutes of Health, Baltimore, MD, USA
- Thomas G Brooks
  - Institute for Translational Medicine and Therapeutics, University of Pennsylvania, Philadelphia, PA, USA
- Soumyashant Nayak
  - Institute for Translational Medicine and Therapeutics, University of Pennsylvania, Philadelphia, PA, USA
- Antonijo Mrčela
  - Institute for Translational Medicine and Therapeutics, University of Pennsylvania, Philadelphia, PA, USA
- Nicholas F Lahens
  - Institute for Translational Medicine and Therapeutics, University of Pennsylvania, Philadelphia, PA, USA
- Gregory R Grant
  - Institute for Translational Medicine and Therapeutics, University of Pennsylvania, Philadelphia, PA, USA; Department of Genetics, University of Pennsylvania, Philadelphia, PA, USA
|
104
|
Dent CI, Singh S, Mukherjee S, Mishra S, Sarwade RD, Shamaya N, Loo KP, Harrison P, Sureshkumar S, Powell D, Balasubramanian S. Quantifying splice-site usage: a simple yet powerful approach to analyze splicing. NAR Genom Bioinform 2021; 3:lqab041. [PMID: 34017946 PMCID: PMC8121094 DOI: 10.1093/nargab/lqab041] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Received: 12/18/2020] [Revised: 03/24/2021] [Accepted: 04/28/2021] [Indexed: 02/07/2023]
Abstract
RNA splicing, and variations in this process referred to as alternative splicing, are critical aspects of gene regulation in eukaryotes. From environmental responses in plants to being a primary link between genetic variation and disease in humans, splicing differences confer extensive phenotypic changes across diverse organisms (1–3). Regulation of splicing occurs through differential selection of splice sites in a splicing reaction, which results in variation in the abundance of isoforms and/or splicing events. However, genomic determinants that influence splice-site selection remain largely unknown. While traditional approaches for analyzing splicing rely on quantifying variant transcripts (i.e. isoforms) or splicing events (i.e. intron retention, exon skipping etc.) (4), recent approaches focus on analyzing complex/mutually exclusive splicing patterns (5–8). However, none of these approaches explicitly measure individual splice-site usage, which can provide valuable information about splice-site choice and its regulation. Here, we present a simple approach to quantify the empirical usage of individual splice sites reflecting their strength, which determines their selection in a splicing reaction. Splice-site strength/usage, as a quantitative phenotype, allows us to directly link genetic variation with usage of individual splice-sites. We demonstrate the power of this approach in defining the genomic determinants of splice-site choice through GWAS. Our pilot analysis with more than a thousand splice sites hints that sequence divergence in cis rather than trans is associated with variations in splicing among accessions of Arabidopsis thaliana. This approach allows deciphering principles of splicing and has broad implications from agriculture to medicine.
Affiliation(s)
- Craig I Dent
  - School of Biological Sciences, Monash University, VIC 3800, Australia
- Shilpi Singh
  - School of Biological Sciences, Monash University, VIC 3800, Australia
- Shikhar Mishra
  - School of Biological Sciences, Monash University, VIC 3800, Australia
- Rucha D Sarwade
  - School of Biological Sciences, Monash University, VIC 3800, Australia
- Nawar Shamaya
  - School of Biological Sciences, Monash University, VIC 3800, Australia
- Kok Ping Loo
  - School of Biological Sciences, Monash University, VIC 3800, Australia
- Paul Harrison
  - Monash Bioinformatics Platform, Monash University, VIC 3800, Australia
- David Powell
  - Monash Bioinformatics Platform, Monash University, VIC 3800, Australia
|
105
|
Ma C, Zheng H, Kingsford C. Exact transcript quantification over splice graphs. Algorithms Mol Biol 2021; 16:5. [PMID: 33971903 PMCID: PMC8112020 DOI: 10.1186/s13015-021-00184-7] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 02/10/2021] [Accepted: 04/19/2021] [Indexed: 11/10/2022]
Abstract
Background: The probability of sequencing a set of RNA-seq reads can be directly modeled using the abundances of splice junctions in splice graphs instead of the abundances of a list of transcripts. We call this model graph quantification, which was first proposed by Bernard et al. (Bioinformatics 30:2447–55, 2014). The model can be viewed as a generalization of transcript expression quantification where every full path in the splice graph is a possible transcript. However, the previous graph quantification model assumes the length of single-end reads or paired-end fragments is fixed. Results: We provide an improvement of this model to handle variable-length reads or fragments and incorporate bias correction. We prove that our model is equivalent to running a transcript quantifier with exactly the set of all compatible transcripts. The key to our method is constructing an extension of the splice graph based on Aho-Corasick automata. The proof of equivalence is based on a novel reparameterization of the read generation model of a state-of-the-art transcript quantification method. Conclusion: We propose a new approach for graph quantification, which is useful for modeling scenarios where the reference transcriptome is incomplete or not available and can be further used in transcriptome assembly or alternative splicing analysis. Supplementary Information: The online version contains supplementary material available at 10.1186/s13015-021-00184-7.
|
106
|
Chung M, Bruno VM, Rasko DA, Cuomo CA, Muñoz JF, Livny J, Shetty AC, Mahurkar A, Dunning Hotopp JC. Best practices on the differential expression analysis of multi-species RNA-seq. Genome Biol 2021; 22:121. [PMID: 33926528 PMCID: PMC8082843 DOI: 10.1186/s13059-021-02337-8] [Citation(s) in RCA: 52] [Impact Index Per Article: 13.0] [Received: 10/21/2020] [Accepted: 04/01/2021] [Indexed: 02/07/2023]
Abstract
Advances in transcriptome sequencing allow for simultaneous interrogation of differentially expressed genes from multiple species originating from a single RNA sample, termed dual or multi-species transcriptomics. Compared to single-species differential expression analysis, the design of multi-species differential expression experiments must account for the relative abundances of each organism of interest within the sample, often requiring enrichment methods and yielding differences in total read counts across samples. The analysis of multi-species transcriptomics datasets requires modifications to the alignment, quantification, and downstream analysis steps compared to the single-species analysis pipelines. We describe best practices for multi-species transcriptomics and differential gene expression.
Affiliation(s)
- Matthew Chung
  - Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD 21201 USA
  - Department of Microbiology and Immunology, University of Maryland School of Medicine, Baltimore, MD 21201 USA
- Vincent M. Bruno
  - Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD 21201 USA
  - Department of Microbiology and Immunology, University of Maryland School of Medicine, Baltimore, MD 21201 USA
- David A. Rasko
  - Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD 21201 USA
  - Department of Microbiology and Immunology, University of Maryland School of Medicine, Baltimore, MD 21201 USA
- Christina A. Cuomo
  - Infectious Disease and Microbiome Program, Broad Institute, Cambridge, MA 02142 USA
- José F. Muñoz
  - Infectious Disease and Microbiome Program, Broad Institute, Cambridge, MA 02142 USA
- Jonathan Livny
  - Infectious Disease and Microbiome Program, Broad Institute, Cambridge, MA 02142 USA
- Amol C. Shetty
  - Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD 21201 USA
- Anup Mahurkar
  - Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD 21201 USA
- Julie C. Dunning Hotopp
  - Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD 21201 USA
  - Department of Microbiology and Immunology, University of Maryland School of Medicine, Baltimore, MD 21201 USA
  - Greenebaum Cancer Center, University of Maryland, Baltimore, MD 21201 USA
|
107
|
Melnick M, Gonzales P, LaRocca TJ, Song Y, Wuu J, Benatar M, Oskarsson B, Petrucelli L, Dowell RD, Link CD, Prudencio M. Application of a bioinformatic pipeline to RNA-seq data identifies novel viruslike sequence in human blood. G3-GENES GENOMES GENETICS 2021; 11:6259144. [PMID: 33914880 PMCID: PMC8661426 DOI: 10.1093/g3journal/jkab141] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 01/26/2021] [Accepted: 04/20/2021] [Indexed: 12/11/2022]
Abstract
Numerous reports have suggested that infectious agents could play a role in neurodegenerative diseases, but specific etiological agents have not been convincingly demonstrated. To search for candidate agents in an unbiased fashion, we have developed a bioinformatic pipeline that identifies microbial sequences in mammalian RNA-seq data, including sequences with no significant nucleotide similarity hits in GenBank. Effectiveness of the pipeline was tested using publicly available RNA-seq data and in a reconstruction experiment using synthetic data. We then applied this pipeline to a novel RNA-seq dataset generated from a cohort of 120 samples from amyotrophic lateral sclerosis patients and controls, and identified sequences corresponding to known bacteria and viruses, as well as novel virus-like sequences. The presence of these novel virus-like sequences, which were identified in subsets of both patients and controls, was confirmed by quantitative RT-PCR. We believe this pipeline will be a useful tool for the identification of potential etiological agents in the many RNA-seq datasets currently being generated.
Affiliation(s)
- Marko Melnick
  - Integrative Physiology, University of Colorado, Boulder, Colorado, 80303, USA
- Patrick Gonzales
  - Integrative Physiology, University of Colorado, Boulder, Colorado, 80303, USA
- Thomas J LaRocca
  - Department of Health and Exercise Science, Center for Healthy Aging, Colorado State University, Fort Collins, Colorado, 80523, USA
- Yuping Song
  - Department of Neuroscience, Mayo Clinic, 4500 San Pablo Road, Jacksonville, Florida, 32224, USA
- Joanne Wuu
  - Department of Neurology, University of Miami, Miami, Florida, 33136, USA
- Michael Benatar
  - Department of Neurology, University of Miami, Miami, Florida, 33136, USA
- Björn Oskarsson
  - Department of Neurology, Mayo Clinic, 4500 San Pablo Road, Jacksonville FL, 32224, USA
- Leonard Petrucelli
  - Department of Neuroscience, Mayo Clinic, 4500 San Pablo Road, Jacksonville, Florida, 32224, USA; Neuroscience Graduate Program, Mayo Clinic Graduate School of Biomedical Sciences, Jacksonville, Florida, 32224, USA
- Robin D Dowell
  - BioFrontiers Institute and Department of Molecular, Cellular and Developmental Biology, University of Colorado, Boulder, Colorado, 80303, USA
- Christopher D Link
  - Integrative Physiology, University of Colorado, Boulder, Colorado, 80303, USA; Institute for Behavioral Genetics, University of Colorado, Boulder, Colorado, 80303, USA
- Mercedes Prudencio
  - Department of Neuroscience, Mayo Clinic, 4500 San Pablo Road, Jacksonville, Florida, 32224, USA; Neuroscience Graduate Program, Mayo Clinic Graduate School of Biomedical Sciences, Jacksonville, Florida, 32224, USA
|
108
|
Wolf SA, Epping L, Andreotti S, Reinert K, Semmler T. SCORE: Smart Consensus Of RNA Expression-a consensus tool for detecting differentially expressed genes in bacteria. Bioinformatics 2021; 37:426-428. [PMID: 32717040 DOI: 10.1093/bioinformatics/btaa681] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/14/2020] [Revised: 06/11/2020] [Accepted: 07/24/2020] [Indexed: 11/13/2022]
Abstract
SUMMARY: RNA-sequencing (RNA-Seq) is the current method of choice for studying bacterial transcriptomes. To date, many computational pipelines have been developed to predict differentially expressed genes from RNA-Seq data, but no gold-standard has been widely accepted. We present the Snakemake-based tool Smart Consensus Of RNA Expression (SCORE) which uses a consensus approach founded on a selection of well-established tools for differential gene expression analysis. This allows SCORE to increase the overall prediction accuracy and to merge varying results into a single, human-readable output. SCORE performs all steps for the analysis of bacterial RNA-Seq data, from read preprocessing to the overrepresentation analysis of significantly associated ontologies. Development of consensus approaches like SCORE will help to streamline future RNA-Seq workflows and will fundamentally contribute to the creation of new gold-standards for the analysis of these types of data. AVAILABILITY AND IMPLEMENTATION: https://github.com/SiWolf/SCORE. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
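The general idea of a consensus call across several differential expression tools can be sketched as a simple vote. The abstract does not state the rule SCORE applies, so the support threshold, the function name and the tool labels below are assumptions for illustration, not SCORE's implementation.

```python
from collections import Counter


def consensus_calls(calls_by_tool, min_support):
    """Return the genes reported as differentially expressed by at least
    `min_support` tools, sorted by gene identifier.

    `calls_by_tool` maps a tool label to an iterable of gene identifiers.
    """
    counts = Counter(
        gene
        for genes in calls_by_tool.values()
        for gene in set(genes)  # count each gene at most once per tool
    )
    return sorted(gene for gene, n in counts.items() if n >= min_support)


# Hypothetical calls from three unnamed tools.
calls = {
    "tool_a": ["g1", "g2"],
    "tool_b": ["g2", "g3"],
    "tool_c": ["g1", "g2"],
}
print(consensus_calls(calls, min_support=2))
```

Raising `min_support` trades sensitivity for agreement: with three tools, a threshold of 3 keeps only genes that every tool reports.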
Affiliation(s)
- Silver A Wolf
  - Microbial Genomics, Robert Koch Institute, Berlin 13353, Germany
- Lennard Epping
  - Microbial Genomics, Robert Koch Institute, Berlin 13353, Germany
- Sandro Andreotti
  - Department of Mathematics and Computer Science, Freie Universität, Berlin 14195, Germany
- Knut Reinert
  - Department of Mathematics and Computer Science, Freie Universität, Berlin 14195, Germany
- Torsten Semmler
  - Microbial Genomics, Robert Koch Institute, Berlin 13353, Germany
|
109
|
Gerber S, Schratt G, Germain PL. Streamlining differential exon and 3' UTR usage with diffUTR. BMC Bioinformatics 2021; 22:189. [PMID: 33849458 PMCID: PMC8045333 DOI: 10.1186/s12859-021-04114-7] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 02/12/2021] [Accepted: 03/30/2021] [Indexed: 12/13/2022]
Abstract
Background: Despite the importance of alternative poly-adenylation and 3′ UTR length for a variety of biological phenomena, there are limited means of detecting UTR changes from standard transcriptomic data. Results: We present the diffUTR Bioconductor package which streamlines and improves upon differential exon usage (DEU) analyses, and leverages existing DEU tools and alternative poly-adenylation site databases to enable differential 3′ UTR usage analysis. We demonstrate the diffUTR features and show that it is more flexible and more accurate than state-of-the-art alternatives, both in simulations and in real data. Conclusions: diffUTR enables differential 3′ UTR analysis and more generally facilitates DEU and the exploration of their results. Supplementary Information: The online version contains supplementary material available at 10.1186/s12859-021-04114-7.
Affiliation(s)
- Stefan Gerber
  - Group of Computational Neurogenomics, D-HEST Institute for Neurosciences, ETH Zürich, Winterthurerstrasse 190, 8057, Zurich, Switzerland; Lab of Systems Neuroscience, D-HEST Institute for Neurosciences, ETH Zürich, Winterthurerstrasse 190, 8057, Zurich, Switzerland
- Gerhard Schratt
  - Lab of Systems Neuroscience, D-HEST Institute for Neurosciences, ETH Zürich, Winterthurerstrasse 190, 8057, Zurich, Switzerland
- Pierre-Luc Germain
  - Group of Computational Neurogenomics, D-HEST Institute for Neurosciences, ETH Zürich, Winterthurerstrasse 190, 8057, Zurich, Switzerland; Lab of Statistical Bioinformatics, DMLS, University of Zürich, Winterthurerstrasse 190, 8057, Zurich, Switzerland; SIB Swiss Institute of Bioinformatics, Zurich, Switzerland
|
110
|
Behera S, Voshall A, Moriyama EN. Plant Transcriptome Assembly: Review and Benchmarking. Bioinformatics 2021. [DOI: 10.36255/exonpublications.bioinformatics.2021.ch7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/20/2022]
|
111
|
Liu P, Ewald J, Galvez JH, Head J, Crump D, Bourque G, Basu N, Xia J. Ultrafast functional profiling of RNA-seq data for nonmodel organisms. Genome Res 2021; 31:713-720. [PMID: 33731361 PMCID: PMC8015844 DOI: 10.1101/gr.269894.120] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Received: 08/05/2020] [Accepted: 02/18/2021] [Indexed: 12/02/2022]
Abstract
Computational time and cost remain a major bottleneck for RNA-seq data analysis of nonmodel organisms without reference genomes. To address this challenge, we have developed Seq2Fun, a novel, all-in-one, ultrafast tool to directly perform functional quantification of RNA-seq reads without transcriptome de novo assembly. The pipeline starts with raw read quality control: sequencing error correction, removing poly(A) tails, and joining overlapped paired-end reads. It then conducts a DNA-to-protein search by translating each read into all possible amino acid fragments and subsequently identifies possible homologous sequences in a well-curated protein database. Finally, the pipeline generates several informative outputs including gene abundance tables, pathway and species hit tables, an HTML report to visualize the results, and an output of clean reads annotated with mapped genes ready for downstream analysis. Seq2Fun does not have any intermediate steps of file writing and loading, making I/O very efficient. Seq2Fun is written in C++ and can run on a personal computer with a limited number of CPUs and memory. It can process >2,000,000 reads/min and is >120 times faster than conventional workflows based on de novo assembly, while maintaining high accuracy in our various test data sets.
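The step of translating each read into all possible amino acid fragments can be illustrated with a six-frame translation: three reading frames on the read and three on its reverse complement. This is a minimal Python sketch of that general technique using the standard genetic code. Seq2Fun itself is written in C++, and the function names and the handling of ambiguous bases here are assumptions, not its actual implementation.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons (e.g. with N) become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(read):
    """Return the six translations: forward frames 0-2, then reverse frames 0-2."""
    read = read.upper()
    frames = []
    for strand in (read, reverse_complement(read)):
        for offset in range(3):
            frames.append(translate(strand[offset:]))
    return frames


print(six_frame_translations("ATGGCC"))
```

Each translated fragment would then be searched against a protein database; that search step is outside the scope of this sketch.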
Affiliation(s)
- Peng Liu
  - Faculty of Agricultural and Environmental Sciences, McGill University, Montreal, Quebec H9X 3V9, Canada
- Jessica Ewald
  - Faculty of Agricultural and Environmental Sciences, McGill University, Montreal, Quebec H9X 3V9, Canada
- Jose Hector Galvez
  - Department of Human Genetics, McGill University, Montreal, Quebec H3A 0C7, Canada; Canadian Center for Computational Genomics, McGill University, Montreal, Quebec H3A 0G1, Canada
- Jessica Head
  - Faculty of Agricultural and Environmental Sciences, McGill University, Montreal, Quebec H9X 3V9, Canada
- Doug Crump
  - Environment and Climate Change Canada, National Wildlife Research Centre, Ottawa, Ontario K1A 0H3, Canada
- Guillaume Bourque
  - Department of Human Genetics, McGill University, Montreal, Quebec H3A 0C7, Canada; Canadian Center for Computational Genomics, McGill University, Montreal, Quebec H3A 0G1, Canada
- Niladri Basu
  - Faculty of Agricultural and Environmental Sciences, McGill University, Montreal, Quebec H9X 3V9, Canada
- Jianguo Xia
  - Faculty of Agricultural and Environmental Sciences, McGill University, Montreal, Quebec H9X 3V9, Canada; Department of Human Genetics, McGill University, Montreal, Quebec H3A 0C7, Canada
|
112
|
Sarkar H, Srivastava A, Bravo HC, Love MI, Patro R. Terminus enables the discovery of data-driven, robust transcript groups from RNA-seq data. Bioinformatics 2021; 36:i102-i110. [PMID: 32657377 PMCID: PMC7355257 DOI: 10.1093/bioinformatics/btaa448] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Indexed: 11/13/2022]
Abstract
MOTIVATION: Advances in sequencing technology, inference algorithms and differential testing methodology have enabled transcript-level analysis of RNA-seq data. Yet, the inherent inferential uncertainty in transcript-level abundance estimation, even among the most accurate approaches, means that robust transcript-level analysis often remains a challenge. Conversely, gene-level analysis remains a common and robust approach for understanding RNA-seq data, but it coarsens the resulting analysis to the level of genes, even if the data strongly support specific transcript-level effects. RESULTS: We introduce a new data-driven approach for grouping together transcripts in an experiment based on their inferential uncertainty. Transcripts that share large numbers of ambiguously-mapping fragments with other transcripts, in complex patterns, often cannot have their abundances confidently estimated. Yet, the total transcriptional output of that group of transcripts will have greatly reduced inferential uncertainty, thus allowing more robust and confident downstream analysis. Our approach, implemented in the tool terminus, groups together transcripts in a data-driven manner allowing transcript-level analysis where it can be confidently supported, and deriving transcriptional groups where the inferential uncertainty is too high to support a transcript-level result. AVAILABILITY AND IMPLEMENTATION: Terminus is implemented in Rust, and is freely available and open source. It can be obtained from https://github.com/COMBINE-lab/Terminus. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Hirak Sarkar
  - Department of Computer Science, University of Maryland, College Park, MD 20742, USA; Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD 20742, USA
- Avi Srivastava
  - Department of Computer Science, Stony Brook University, Stony Brook, NY 11794, USA
- Héctor Corrada Bravo
  - Department of Computer Science, University of Maryland, College Park, MD 20742, USA; Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD 20742, USA
- Michael I Love
  - Department of Biostatistics, University of North Carolina-Chapel Hill, Chapel Hill, NC 27516, USA; Department of Genetics, University of North Carolina-Chapel Hill, Chapel Hill, NC 27514, USA
- Rob Patro
  - Department of Computer Science, University of Maryland, College Park, MD 20742, USA; Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD 20742, USA
|
113
|
Sánchez-Ramírez S, Weiss JG, Thomas CG, Cutter AD. Widespread misregulation of inter-species hybrid transcriptomes due to sex-specific and sex-chromosome regulatory evolution. PLoS Genet 2021; 17:e1009409. [PMID: 33667233 PMCID: PMC7968742 DOI: 10.1371/journal.pgen.1009409] [Citation(s) in RCA: 26] [Impact Index Per Article: 6.5] [Received: 11/27/2020] [Revised: 03/17/2021] [Accepted: 02/09/2021] [Indexed: 01/04/2023]
Abstract
When gene regulatory networks diverge between species, their dysfunctional expression in inter-species hybrid individuals can create genetic incompatibilities that generate the developmental defects responsible for intrinsic post-zygotic reproductive isolation. Both cis- and trans-acting regulatory divergence can be hastened by directional selection through adaptation, sexual selection, and inter-sexual conflict, in addition to cryptic evolution under stabilizing selection. Dysfunctional sex-biased gene expression, in particular, may provide an important source of sexually-dimorphic genetic incompatibilities. Here, we characterize and compare male and female/hermaphrodite transcriptome profiles for sibling nematode species Caenorhabditis briggsae and C. nigoni, along with allele-specific expression in their F1 hybrids, to deconvolve features of expression divergence and regulatory dysfunction. Despite evidence of widespread stabilizing selection on gene expression, misexpression of sex-biased genes pervades F1 hybrids of both sexes. This finding implicates greater fragility of male genetic networks to produce dysfunctional organismal phenotypes. Spermatogenesis genes are especially prone to high divergence in both expression and coding sequences, consistent with a "faster male" model for Haldane's rule and elevated sterility of hybrid males. Moreover, underdominant expression pervades male-biased genes compared to female-biased and sex-neutral genes and an excess of cis-trans compensatory regulatory divergence for X-linked genes underscores a "large-X effect" for hybrid male expression dysfunction. Extensive regulatory divergence in sex determination pathway genes likely contributes to demasculinization of XX hybrids. The evolution of genetic incompatibilities due to regulatory versus coding sequence divergence, however, is expected to arise in an uncorrelated fashion. This study identifies important differences between the sexes in how regulatory networks diverge to contribute to sex-biases in how genetic incompatibilities manifest during the speciation process.
Affiliation(s)
- Santiago Sánchez-Ramírez
- Department of Ecology and Evolutionary Biology, University of Toronto, Toronto, Canada
- * E-mail: (SSR); (ADC)
- Jörg G. Weiss
- Department of Ecology and Evolutionary Biology, University of Toronto, Toronto, Canada
- Cristel G. Thomas
- Department of Ecology and Evolutionary Biology, University of Toronto, Toronto, Canada
- Los Alamos National Laboratory, Los Alamos, New Mexico, United States of America
- Asher D. Cutter
- Department of Ecology and Evolutionary Biology, University of Toronto, Toronto, Canada
- * E-mail: (SSR); (ADC)
114
|
Sokolowski DJ, Faykoo-Martinez M, Erdman L, Hou H, Chan C, Zhu H, Holmes MM, Goldenberg A, Wilson MD. Single-cell mapper (scMappR): using scRNA-seq to infer the cell-type specificities of differentially expressed genes. NAR Genom Bioinform 2021; 3:lqab011. [PMID: 33655208 PMCID: PMC7902236 DOI: 10.1093/nargab/lqab011] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Received: 07/24/2020] [Revised: 12/23/2020] [Accepted: 02/04/2021] [Indexed: 12/11/2022] Open
Abstract
RNA sequencing (RNA-seq) is widely used to identify differentially expressed genes (DEGs) and reveal biological mechanisms underlying complex biological processes. RNA-seq is often performed on heterogeneous samples and the resulting DEGs do not necessarily indicate the cell-types where the differential expression occurred. While single-cell RNA-seq (scRNA-seq) methods solve this problem, technical and cost constraints currently limit its widespread use. Here we present single cell Mapper (scMappR), a method that assigns cell-type specificity scores to DEGs obtained from bulk RNA-seq by leveraging cell-type expression data generated by scRNA-seq and existing deconvolution methods. After evaluating scMappR with simulated RNA-seq data and benchmarking scMappR using RNA-seq data obtained from sorted blood cells, we asked if scMappR could reveal known cell-type specific changes that occur during kidney regeneration. scMappR appropriately assigned DEGs to cell-types involved in kidney regeneration, including a relatively small population of immune cells. While scMappR can work with user-supplied scRNA-seq data, we curated scRNA-seq expression matrices for ∼100 human and mouse tissues to facilitate its stand-alone use with bulk RNA-seq data from these species. Overall, scMappR is a user-friendly R package that complements traditional differential gene expression analysis of bulk RNA-seq data.
Affiliation(s)
- Dustin J Sokolowski
- Department of Molecular Genetics, University of Toronto, Toronto, ON, M5S 1A8, Canada
- Lauren Erdman
- Genetics and Genome Biology, SickKids Research Institute, Toronto, ON, M5G 0A4, Canada
- Huayun Hou
- Department of Molecular Genetics, University of Toronto, Toronto, ON, M5S 1A8, Canada
- Cadia Chan
- Department of Molecular Genetics, University of Toronto, Toronto, ON, M5S 1A8, Canada
- Helen Zhu
- Department of Medical Biophysics, University of Toronto, Toronto, ON, M5G 1L7, Canada
- Melissa M Holmes
- Department of Cell and Systems Biology, University of Toronto, Toronto, ON, M5S 3G5, Canada
- Anna Goldenberg
- Genetics and Genome Biology, SickKids Research Institute, Toronto, ON, M5G 0A4, Canada
- Michael D Wilson
- Department of Molecular Genetics, University of Toronto, Toronto, ON, M5S 1A8, Canada
115
|
Chen SY, Liu CJ, Zhang Q, Guo AY. An ultra-sensitive T-cell receptor detection method for TCR-Seq and RNA-Seq data. Bioinformatics 2021; 36:4255-4262. [PMID: 32399561 DOI: 10.1093/bioinformatics/btaa432] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 02/12/2020] [Revised: 04/14/2020] [Accepted: 05/06/2020] [Indexed: 12/30/2022] Open
Abstract
MOTIVATION T-cell receptors (TCRs) function to recognize antigens and play vital roles in T-cell immunology. Surveying TCR repertoires by characterizing complementarity-determining region 3 (CDR3) is a key issue. Due to the high diversity of CDR3 and technological limitations, accurate characterization of CDR3 repertoires remains a great challenge. RESULTS We propose a computational method named CATT for ultra-sensitive and precise detection of TCR CDR3 sequences. CATT can be applied to TCR sequencing, RNA-Seq and single-cell TCR(RNA)-Seq data to characterize CDR3 repertoires. CATT integrates a de Bruijn graph-based micro-assembly algorithm, a data-driven error correction model and a Bayesian inference algorithm to characterize CDR3 repertoires self-adaptively and ultra-sensitively with high performance. Benchmarks on in silico and experimental datasets demonstrated that CATT showed superior recall and precision compared with existing tools, especially for data with short read length and small size and for single-cell sequencing data. Thus, CATT will be a useful tool for TCR analysis in cancer and immunology research. AVAILABILITY AND IMPLEMENTATION http://bioinfo.life.hust.edu.cn/CATT or https://github.com/GuoBioinfoLab/CATT. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Si-Yi Chen
- Department of Bioinformatics and Systems Biology, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China
- Chun-Jie Liu
- Department of Bioinformatics and Systems Biology, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China
- Qiong Zhang
- Department of Bioinformatics and Systems Biology, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China; Department of Biotechnology, College of Life Sciences, Anhui Normal University, Wuhu, China
- An-Yuan Guo
- Department of Bioinformatics and Systems Biology, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China
116
|
Varabyou A, Salzberg SL, Pertea M. Effects of transcriptional noise on estimates of gene and transcript expression in RNA sequencing experiments. Genome Res 2021; 31:301-308. [PMID: 33361112 PMCID: PMC7849408 DOI: 10.1101/gr.266213.120] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Received: 05/21/2020] [Accepted: 12/18/2020] [Indexed: 12/25/2022]
Abstract
RNA sequencing is widely used to measure gene expression across a vast range of animal and plant tissues and conditions. Most studies of computational methods for gene expression analysis use simulated data to evaluate the accuracy of these methods. These simulations typically include reads generated from known genes at varying levels of expression. Until now, simulations did not include reads from noisy transcripts, which might include erroneous transcription, erroneous splicing, and other processes that affect transcription in living cells. Here we examine the effects of realistic amounts of transcriptional noise on the ability of leading computational methods to assemble and quantify the genes and transcripts in an RNA sequencing experiment. We show that the inclusion of noise leads to systematic errors in the ability of these programs to measure expression, including systematic underestimates of transcript abundance levels and large increases in the number of false-positive genes and transcripts. Our results also suggest that alignment-free computational methods sometimes fail to detect transcripts expressed at relatively low levels.
Affiliation(s)
- Ales Varabyou
- Center for Computational Biology, Johns Hopkins University, Baltimore, Maryland 21211, USA
- Department of Computer Science, Johns Hopkins University, Baltimore, Maryland 21218, USA
- Steven L Salzberg
- Center for Computational Biology, Johns Hopkins University, Baltimore, Maryland 21211, USA
- Department of Computer Science, Johns Hopkins University, Baltimore, Maryland 21218, USA
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, Maryland 21218, USA
- Department of Biostatistics, Johns Hopkins University, Baltimore, Maryland 21205, USA
- Mihaela Pertea
- Center for Computational Biology, Johns Hopkins University, Baltimore, Maryland 21211, USA
- Department of Computer Science, Johns Hopkins University, Baltimore, Maryland 21218, USA
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, Maryland 21218, USA
117
|
Parada GE, Munita R, Georgakopoulos-Soares I, Fernandes HJR, Kedlian VR, Metzakopian E, Andres ME, Miska EA, Hemberg M. MicroExonator enables systematic discovery and quantification of microexons across mouse embryonic development. Genome Biol 2021; 22:43. [PMID: 33482885 PMCID: PMC7821500 DOI: 10.1186/s13059-020-02246-2] [Citation(s) in RCA: 18] [Impact Index Per Article: 4.5] [Received: 02/19/2020] [Accepted: 12/15/2020] [Indexed: 12/12/2022] Open
Abstract
BACKGROUND Microexons, exons that are ≤ 30 nucleotides, are a highly conserved and dynamically regulated set of cassette exons. They have key roles in nervous system development and function, as evidenced by recent results demonstrating the impact of microexons on behaviour and cognition. However, microexons are often overlooked due to the difficulty of detecting them using standard RNA-seq aligners. RESULTS Here, we present MicroExonator, a novel pipeline for reproducible de novo discovery and quantification of microexons. We process 289 RNA-seq datasets from eighteen mouse tissues corresponding to nine embryonic and postnatal stages, providing the most comprehensive survey of microexons available for mice. We detect 2984 microexons, 332 of which are differentially spliced throughout mouse embryonic brain development, including 29 that are not present in mouse transcript annotation databases. Unsupervised clustering of microexons based on their inclusion patterns segregates brain tissues by developmental time, and further analysis suggests a key function for microexons in axon growth and synapse formation. Finally, we analyse single-cell RNA-seq data from the mouse visual cortex, and for the first time, we report differential inclusion between neuronal subpopulations, suggesting that some microexons could be cell type-specific. CONCLUSIONS MicroExonator facilitates the investigation of microexons in transcriptome studies, particularly when analysing large volumes of data. As a proof of principle, we use MicroExonator to analyse a large collection of both mouse bulk and single-cell RNA-seq datasets. The analyses enabled the discovery of previously uncharacterized microexons, and our study provides a comprehensive microexon inclusion catalogue during mouse development.
Affiliation(s)
- Guillermo E Parada
- Wellcome Sanger Institute, Wellcome Genome Campus, Cambridge, CB10 1SA, UK
- Wellcome Trust Cancer Research UK Gurdon Institute, University of Cambridge, Tennis Court Road, Cambridge, CB2 1QN, UK
- Roberto Munita
- Department of Cellular and Molecular Biology, Faculty of Biological Sciences, Pontificia Universidad Católica de Chile, Santiago, Chile
- Ilias Georgakopoulos-Soares
- Wellcome Sanger Institute, Wellcome Genome Campus, Cambridge, CB10 1SA, UK
- Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, CA, 94158, USA
- Hugo J R Fernandes
- UK Dementia Research Institute, Department of Clinical Neurosciences, University of Cambridge, Cambridge, CB2 0AH, UK
- Veronika R Kedlian
- Wellcome Sanger Institute, Wellcome Genome Campus, Cambridge, CB10 1SA, UK
- Emmanouil Metzakopian
- UK Dementia Research Institute, Department of Clinical Neurosciences, University of Cambridge, Cambridge, CB2 0AH, UK
- Maria Estela Andres
- Department of Cellular and Molecular Biology, Faculty of Biological Sciences, Pontificia Universidad Católica de Chile, Santiago, Chile
- Eric A Miska
- Wellcome Sanger Institute, Wellcome Genome Campus, Cambridge, CB10 1SA, UK
- Wellcome Trust Cancer Research UK Gurdon Institute, University of Cambridge, Tennis Court Road, Cambridge, CB2 1QN, UK
- Department of Genetics, University of Cambridge, Downing Street, Cambridge, CB2 3EH, UK
- Martin Hemberg
- Wellcome Sanger Institute, Wellcome Genome Campus, Cambridge, CB10 1SA, UK
- Wellcome Trust Cancer Research UK Gurdon Institute, University of Cambridge, Tennis Court Road, Cambridge, CB2 1QN, UK
118
|
Abstract
Quantification tools for RNA sequencing (RNA-Seq) analyses are often designed and tested using human transcriptomics data sets, in which full-length transcript sequences are well annotated. For prokaryotic transcriptomics experiments, full-length transcript sequences are seldom known, and coding sequences must instead be used for quantification steps in RNA-Seq analyses. However, operons confound accurate quantification of coding sequences since a single transcript does not necessarily equate to a single gene. Here, we introduce FADU (Feature Aggregate Depth Utility), a quantification tool designed specifically for prokaryotic RNA-Seq analyses. FADU assigns partial count values proportional to the length of the fragment overlapping the target feature. To assess the ability of FADU to quantify genes in prokaryotic transcriptomics analyses, we compared its performance to those of eXpress, featureCounts, HTSeq, kallisto, and Salmon across three paired-end read data sets of (i) Ehrlichia chaffeensis, (ii) Escherichia coli, and (iii) the Wolbachia endosymbiont wBm. Across each of the three data sets, we find that FADU can more accurately quantify operonic genes by deriving proportional counts for multigene fragments within operons. FADU is available at https://github.com/IGS/FADU. IMPORTANCE Most currently available quantification tools for transcriptomics analyses have been designed for human data sets, in which full-length transcript sequences, including the untranslated regions, are well annotated. In most prokaryotic systems, full-length transcript sequences have yet to be characterized, leading to prokaryotic transcriptomics analyses being performed based on only the coding sequences. In contrast to eukaryotes, prokaryotes contain polycistronic transcripts, and when genes are quantified based on coding sequences instead of transcript sequences, this leads to an increased abundance of improperly assigned ambiguous multigene fragments, specifically those mapping to multiple genes in operons. Here, we describe FADU, a quantification tool for prokaryotic RNA-Seq analyses designed to assign proportional counts with the purpose of better quantifying operonic genes while minimizing the pitfalls associated with improperly assigning fragment counts from ambiguous transcripts.
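The proportional-count idea described in this abstract can be illustrated with a minimal Python sketch. It is not FADU's implementation: the function name `proportional_counts`, the half-open interval convention, and the choice to normalize each overlap by the fragment length are all assumptions made for illustration.

```python
def proportional_counts(fragment, features):
    """Split one fragment's count across the features it overlaps.

    fragment: (start, end) half-open interval on the genome.
    features: dict mapping a feature name to its (start, end) interval.
    Returns a dict of partial counts. Each value is the fraction of the
    fragment that overlaps the feature (an assumed normalization, not
    confirmed by the abstract).
    """
    frag_start, frag_end = fragment
    frag_len = frag_end - frag_start
    if frag_len <= 0:
        raise ValueError("fragment must have positive length")
    counts = {}
    for name, (feat_start, feat_end) in features.items():
        # Length of the intersection of the two intervals (negative if disjoint).
        overlap = min(frag_end, feat_end) - max(frag_start, feat_start)
        if overlap > 0:
            counts[name] = overlap / frag_len
    return counts


# Hypothetical operon: one 200 bp fragment spanning two adjacent genes.
example = proportional_counts(
    (100, 300), {"geneA": (0, 150), "geneB": (180, 400)}
)
```

Here the fragment overlaps the hypothetical `geneA` by 50 bp and `geneB` by 120 bp, so each gene receives a partial count rather than the fragment being discarded as ambiguous or counted once per gene.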
119
|
Shao W, Wang T. Transcript assembly improves expression quantification of transposable elements in single-cell RNA-seq data. Genome Res 2021; 31:88-100. [PMID: 33355230 PMCID: PMC7849386 DOI: 10.1101/gr.265173.120] [Citation(s) in RCA: 33] [Impact Index Per Article: 8.3] [Received: 04/27/2020] [Accepted: 11/24/2020] [Indexed: 12/28/2022]
Abstract
Transposable elements (TEs) are an integral part of the host transcriptome. TE-containing noncoding RNAs (ncRNAs) show considerable tissue specificity and play important roles during development, including stem cell maintenance and cell differentiation. Recent advances in single-cell RNA-seq (scRNA-seq) revolutionized cell type-specific gene expression analysis. However, effective scRNA-seq quantification tools tailored for TEs are lacking, limiting our ability to dissect TE expression dynamics at single-cell resolution. To address this issue, we established a TE expression quantification pipeline that is compatible with scRNA-seq data generated across multiple technology platforms. We constructed TE-containing ncRNA references using bulk RNA-seq data and showed that quantifying TE expression at the transcript level effectively reduces noise. As proof of principle, we applied this strategy to mouse embryonic stem cells and successfully captured the expression profile of endogenous retroviruses in single cells. We further expanded our analysis to scRNA-seq data from early stages of mouse embryogenesis. Our results illustrated the dynamic TE expression at preimplantation stages and revealed 146 TE-containing ncRNA transcripts with substantial tissue specificity during gastrulation and early organogenesis.
Affiliation(s)
- Wanqing Shao
- Department of Genetics, Washington University School of Medicine, St. Louis, Missouri 63110, USA
- Edison Family Center for Genome Sciences and Systems Biology, Washington University School of Medicine, St. Louis, Missouri 63110, USA
- Ting Wang
- Department of Genetics, Washington University School of Medicine, St. Louis, Missouri 63110, USA
- Edison Family Center for Genome Sciences and Systems Biology, Washington University School of Medicine, St. Louis, Missouri 63110, USA
- McDonnell Genome Institute, Washington University School of Medicine, St. Louis, Missouri 63108, USA
120
|
Duan Y, Zhang W, Cheng Y, Shi M, Xia XQ. A systematic evaluation of bioinformatics tools for identification of long noncoding RNAs. RNA (NEW YORK, N.Y.) 2021; 27:80-98. [PMID: 33055239 PMCID: PMC7749630 DOI: 10.1261/rna.074724.120] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Received: 01/07/2020] [Accepted: 10/07/2020] [Indexed: 06/11/2023]
Abstract
High-throughput RNA sequencing unveiled the complexity of the transcriptome and significantly increased the records of long noncoding RNAs (lncRNAs), which were reported to participate in a variety of biological processes. Identification of lncRNAs is a key step in lncRNA analysis, and numerous bioinformatics tools have been developed for this purpose in recent years. While these tools allow us to identify lncRNAs more efficiently and accurately, they may produce inconsistent results, making selection a confusing issue. We compared the performance of 41 analysis models based on 14 software packages and different data sets, including high-quality data and low-quality data from 33 species. In addition, computational efficiency, robustness, and joint prediction of the models were explored. As practical guidance, key points for lncRNA identification under different situations were summarized. In this investigation, no single model was superior to the others under all test conditions. The performance of a model relied to a great extent on the source of transcripts and the quality of assemblies. As general references, FEELnc_all_cl, CPC, and CPAT_mouse work well in most species while COME, CNCI, and lncScore are good choices for model organisms. Since these tools are sensitive to different factors such as the species involved and the quality of assembly, researchers must carefully select the appropriate tool based on the actual data. Alternatively, our test suggests that joint prediction could behave better than any single model if proper models were chosen. All scripts/data used in this research can be accessed at http://bioinfo.ihb.ac.cn/elit.
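The abstract does not say how the joint prediction was computed. One common way to combine several classifiers is a vote over their calls, sketched below in Python purely as an illustration; the function name `joint_prediction`, the `min_agree` parameter, and the model labels are hypothetical and are not taken from the study.

```python
def joint_prediction(votes, min_agree=None):
    """Combine per-model lncRNA calls for one transcript by voting.

    votes: dict mapping a model name to True (called lncRNA) or False.
    min_agree: number of positive calls required; defaults to a strict
               majority. This voting rule is an assumption, not the
               study's documented procedure.
    """
    n_models = len(votes)
    if n_models == 0:
        raise ValueError("at least one model vote is required")
    if min_agree is None:
        min_agree = n_models // 2 + 1
    # True counts as 1 and False as 0 when summed.
    return sum(votes.values()) >= min_agree


# Hypothetical calls from three models for a single transcript.
call = joint_prediction({"model_1": True, "model_2": True, "model_3": False})
```

Raising `min_agree` toward the number of models favors precision, while lowering it toward 1 favors recall, which is one reason the choice of models and threshold matters for a combined call.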
Affiliation(s)
- You Duan
- Institute of Hydrobiology, Chinese Academy of Sciences, Wuhan 430072, China
- University of Chinese Academy of Sciences, Beijing 100049, China
- Wanting Zhang
- Institute of Hydrobiology, Chinese Academy of Sciences, Wuhan 430072, China
- The Innovative Academy of Seed Design, Chinese Academy of Sciences, Beijing 100101, China
- Yingyin Cheng
- Institute of Hydrobiology, Chinese Academy of Sciences, Wuhan 430072, China
- The Innovative Academy of Seed Design, Chinese Academy of Sciences, Beijing 100101, China
- Mijuan Shi
- Institute of Hydrobiology, Chinese Academy of Sciences, Wuhan 430072, China
- The Innovative Academy of Seed Design, Chinese Academy of Sciences, Beijing 100101, China
- Xiao-Qin Xia
- Institute of Hydrobiology, Chinese Academy of Sciences, Wuhan 430072, China
- University of Chinese Academy of Sciences, Beijing 100049, China
- The Innovative Academy of Seed Design, Chinese Academy of Sciences, Beijing 100101, China
121
|
Banerjee S, Velásquez-Zapata V, Fuerst G, Elmore JM, Wise RP. NGPINT: a next-generation protein-protein interaction software. Brief Bioinform 2020; 22:6046042. [PMID: 33367498 DOI: 10.1093/bib/bbaa351] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 09/29/2020] [Revised: 10/23/2020] [Accepted: 11/02/2020] [Indexed: 12/27/2022] Open
Abstract
Mapping protein-protein interactions at a proteome scale is critical to understanding how cellular signaling networks respond to stimuli. Since eukaryotic genomes encode thousands of proteins, testing their interactions one-by-one is a challenging prospect. High-throughput yeast-two hybrid (Y2H) assays that employ next-generation sequencing to interrogate complementary DNA (cDNA) libraries represent an alternative approach that optimizes scale, cost and effort. We present NGPINT, a robust and scalable software to identify all putative interactors of a protein using Y2H in batch culture. NGPINT combines diverse tools to align sequence reads to target genomes, reconstruct prey fragments and compute gene enrichment under reporter selection. Central to this pipeline is the identification of fusion reads containing sequences derived from both the Y2H expression plasmid and the cDNA of interest. To reduce false positives, these fusion reads are evaluated as to whether the cDNA fragment forms an in-frame translational fusion with the Y2H transcription factor. NGPINT successfully recognized 95% of interactions in simulated test runs. As proof of concept, NGPINT was tested using published data sets and it recognized all validated interactions. NGPINT can process interaction data from any biosystem with an available genome or transcriptome reference, thus facilitating the discovery of protein-protein interactions in model and non-model organisms.
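The in-frame test mentioned in this abstract comes down to modular arithmetic on reading frames. The Python sketch below is a simplified illustration and not NGPINT's code: the function name `is_in_frame` and both offset parameters are hypothetical, and it assumes the junction position and the annotated coding sequence start are already known. It also ignores stop codons upstream of the coding sequence.

```python
def is_in_frame(junction_offset, cdna_cds_offset):
    """Check whether a cDNA fragment continues the vector's reading frame.

    junction_offset: number of nucleotides from the first base of the
        transcription factor's start codon to the first cDNA base of the
        fusion read (0-based position of that base in the fused ORF).
    cdna_cds_offset: 0-based position of that same cDNA base relative to
        the first base of the gene's annotated coding sequence; negative
        values mean the fragment starts in the 5' UTR.
    The fusion is in frame when both positions fall on the same codon phase.
    """
    return (junction_offset - cdna_cds_offset) % 3 == 0


# Hypothetical fusions: the cDNA starts at CDS base 0 or 1 after 300 vector bases.
in_frame = is_in_frame(300, 0)
out_of_frame = is_in_frame(300, 1)
```

Python's `%` operator returns a non-negative result for a positive modulus, so fragments that begin upstream of the coding sequence are handled without a special case.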
Affiliation(s)
- Sagnik Banerjee
- Program in Bioinformatics & Computational Biology, Iowa State University, Ames, IA, 50011, USA; Department of Statistics, Iowa State University, Ames, IA, 50011, USA
- Valeria Velásquez-Zapata
- Program in Bioinformatics & Computational Biology, Iowa State University, Ames, IA, 50011, USA; Department of Plant Pathology & Microbiology, Iowa State University, Ames, IA, 50011, USA
- Gregory Fuerst
- Department of Plant Pathology & Microbiology, Iowa State University, Ames, IA, 50011, USA; Corn Insects and Crop Genetics Research, USDA-Agricultural Research Service, Ames, IA, 50011, USA
- J Mitch Elmore
- Department of Plant Pathology & Microbiology, Iowa State University, Ames, IA, 50011, USA; Corn Insects and Crop Genetics Research, USDA-Agricultural Research Service, Ames, IA, 50011, USA
- Roger P Wise
- Program in Bioinformatics & Computational Biology, Iowa State University, Ames, IA, 50011, USA; Department of Plant Pathology & Microbiology, Iowa State University, Ames, IA, 50011, USA; Corn Insects and Crop Genetics Research, USDA-Agricultural Research Service, Ames, IA, 50011, USA
122
|
Spinozzi G, Tini V, Adorni A, Falini B, Martelli MP. ARPIR: automatic RNA-Seq pipelines with interactive report. BMC Bioinformatics 2020; 21:574. [PMID: 33349239 PMCID: PMC7751108 DOI: 10.1186/s12859-020-03846-2] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 10/20/2020] [Accepted: 10/27/2020] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND RNA-Seq is an increasingly used methodology to study both coding and non-coding RNA expression. There are many software tools available for each phase of the RNA-Seq analysis and each of them uses different algorithms. Furthermore, the analysis consists of several steps: alignment (primary analysis), quantification, differential analysis (secondary analysis) and any tertiary analysis. Dealing with each step separately can therefore be time-consuming, in addition to requiring computing knowledge. For this reason, an automated pipeline that allows the entire analysis to be managed through a single initial command, and that is easy to use even for those without computer skills, can be useful. Faced with the vast availability of RNA-Seq analysis tools, it is first necessary to select a limited number of pipelines to include. For this purpose, we compared eight pipelines obtained by combining the most used tools, and for each one we evaluated peak RAM usage, run time, sensitivity and specificity. RESULTS The pipeline with the shortest run times, lowest RAM consumption and highest sensitivity is the one consisting of HISAT2 for alignment, featureCounts for quantification and edgeR for differential analysis. Here, we developed ARPIR, an automated pipeline that uses this pipeline by default but also allows users to choose, among different tools, those of the best-performing pipelines. CONCLUSIONS ARPIR allows the analysis of RNA-Seq data from groups undergoing different treatments, allowing multiple comparisons in a single launch, and can be used for either paired-end or single-end analysis. All prerequisites can be installed via a configuration script and the analysis can be launched via a graphical interface or by a template script. In addition, ARPIR performs a final tertiary analysis that includes Gene Ontology and Pathway analysis. The results can be viewed in an interactive Shiny App and exported in a report (pdf, word or html formats). ARPIR is an efficient and easy-to-use tool for RNA-Seq analysis from quality control to Pathway analysis that allows you to choose between different pipelines.
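Sensitivity and specificity, two of the criteria used to compare the pipelines, have standard definitions that are easy to compute once a reference set of truly differentially expressed genes is available. The Python sketch below assumes such a reference set exists; the abstract does not describe how the authors obtained theirs, and the function name and gene identifiers are hypothetical.

```python
def sensitivity_specificity(called, truth, universe):
    """Compute sensitivity and specificity for a set of gene calls.

    called:   genes a pipeline reports as differentially expressed.
    truth:    genes that are truly differentially expressed (assumed known).
    universe: all genes tested; called and truth must be subsets of it.
    """
    called, truth, universe = set(called), set(truth), set(universe)
    tp = len(called & truth)             # true positives
    fn = len(truth - called)             # false negatives
    fp = len(called - truth)             # false positives
    tn = len(universe - called - truth)  # true negatives
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Hypothetical example with ten genes, four of them truly differential.
genes = [f"g{i}" for i in range(1, 11)]
sens, spec = sensitivity_specificity(
    called={"g1", "g2", "g3", "g5"},
    truth={"g1", "g2", "g3", "g4"},
    universe=genes,
)
```

In this example three of the four true genes are recovered and one of the six non-differential genes is wrongly called, giving a sensitivity of 3/4 and a specificity of 5/6.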
Affiliation(s)
- Giulio Spinozzi
- Department of Medicine, Section of Hematology, University of Perugia, Perugia, Italy
- Valentina Tini
- Department of Medicine, Section of Hematology, University of Perugia, Perugia, Italy
- Alessia Adorni
- Department of Medicine, Section of Hematology, University of Perugia, Perugia, Italy
- Brunangelo Falini
- Department of Medicine, Section of Hematology, University of Perugia, Perugia, Italy
- Maria Paola Martelli
- Department of Medicine, Section of Hematology, University of Perugia, Perugia, Italy
123
|
Puntambekar S, Newhouse R, San-Miguel J, Chauhan R, Vernaz G, Willis T, Wayland MT, Umrania Y, Miska EA, Prabakaran S. Evolutionary divergence of novel open reading frames in cichlids speciation. Sci Rep 2020; 10:21570. [PMID: 33299045 PMCID: PMC7726158 DOI: 10.1038/s41598-020-78555-0] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 04/02/2020] [Accepted: 11/26/2020] [Indexed: 01/02/2023] Open
Abstract
Novel open reading frames (nORFs) with coding potential may arise from noncoding DNA. Not much is known about their emergence, functional role, fixation in a population or contribution to adaptive radiation. Cichlid fishes exhibit extensive phenotypic diversification and speciation. Encounters with new environments alone are not sufficient to explain this striking diversity of cichlid radiation because other taxa coexistent with the Cichlidae demonstrate lower species richness. Wagner et al. analyzed cichlid diversification in 46 African lakes and reported that both extrinsic environmental factors and intrinsic lineage-specific traits related to sexual selection have strongly influenced the cichlid radiation, which indicates the existence of unknown molecular mechanisms responsible for rapid phenotypic diversification, such as emergence of novel open reading frames (nORFs). In this study, we integrated transcriptomic and proteomic signatures from two tissues of two cichlid species, identified nORFs and performed evolutionary analysis on these nORF regions. Our results suggest that the time scale of speciation of the two species and evolutionary divergence of these nORF genomic regions are similar and indicate a potential role for these nORFs in speciation of the cichlid fishes.
Collapse
Affiliation(s)
- Shraddha Puntambekar
- Department of Biology, Indian Institute of Science Education and Research, Pune, Maharashtra, 411008, India
| | - Rachel Newhouse
- Department of Genetics, University of Cambridge, Downing Site, Cambridge, CB2 3EH, UK
| | - Jaime San-Miguel
- Department of Genetics, University of Cambridge, Downing Site, Cambridge, CB2 3EH, UK
| | - Ruchi Chauhan
- Department of Genetics, University of Cambridge, Downing Site, Cambridge, CB2 3EH, UK
| | - Grégoire Vernaz
- Department of Genetics, University of Cambridge, Downing Site, Cambridge, CB2 3EH, UK
- The Wellcome Trust/CRUK Gurdon Institute, University of Cambridge, Cambridge, CB2 1QN, UK
- Wellcome Sanger Institute, Wellcome Genome Campus, Cambridge, CB10 1SA, UK
| | - Thomas Willis
- Department of Genetics, University of Cambridge, Downing Site, Cambridge, CB2 3EH, UK
| | - Matthew T Wayland
- Department of Zoology, University of Cambridge, Downing Site, Cambridge, CB2 3EH, UK
| | - Yagnesh Umrania
- Cambridge Centre for Proteomics, Department of Biochemistry, University of Cambridge, Tennis Court Road, Cambridge, CB2 1QR, UK
| | - Eric A Miska
- Department of Genetics, University of Cambridge, Downing Site, Cambridge, CB2 3EH, UK
- Wellcome Sanger Institute, Wellcome Genome Campus, Cambridge, CB10 1SA, UK
- Cambridge Centre for Proteomics, Department of Biochemistry, University of Cambridge, Tennis Court Road, Cambridge, CB2 1QR, UK
| | - Sudhakaran Prabakaran
- Department of Biology, Indian Institute of Science Education and Research, Pune, Maharashtra, 411008, India.
- Department of Genetics, University of Cambridge, Downing Site, Cambridge, CB2 3EH, UK.
- St. Edmund's College, University of Cambridge, Cambridge, CB3 0BN, UK.
| |
|
124
|
Chen L, Lang K, Mei Y, Shi Z, He K, Li F, Xiao H, Ye G, Han Z. FastD: Fast detection of insecticide target-site mutations and overexpressed detoxification genes in insect populations from RNA-Seq data. Ecol Evol 2020; 10:14346-14358. [PMID: 33391720 PMCID: PMC7771117 DOI: 10.1002/ece3.7037] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/16/2019] [Revised: 08/26/2020] [Accepted: 09/21/2020] [Indexed: 11/24/2022]
Abstract
Target-site mutations and detoxification gene overexpression are two major mechanisms conferring insecticide resistance. Molecular assays used to detect these genetic resistance markers are time-consuming and have high false-positive rates. RNA-Seq data contain information on the variations within expressed genomic regions and on the expression of detoxification genes. However, there is currently no corresponding method to detect resistance markers. Here, we collected 66 reported resistance mutations of four insecticide targets (AChE, VGSC, RyR, and nAChR) from 82 insect species. Next, we obtained 403 sequences of the four target genes and 12,665 sequences of three kinds of detoxification genes: P450s, GSTs, and CCEs. Then, we developed a Perl program, FastD, to detect target-site mutations and overexpressed detoxification genes from RNA-Seq data and constructed a web server for FastD (http://www.insect-genome.com/fastd). The evaluation of FastD on simulated RNA-Seq data showed high sensitivity and specificity. We applied FastD to detect resistance markers in 15 populations of six insects: Plutella xylostella, Aphis gossypii, Anopheles arabiensis, Musca domestica, Leptinotarsa decemlineata and Apis mellifera. Results showed that 11 RyR mutations in P. xylostella, one nAChR mutation in A. gossypii, one VGSC mutation in A. arabiensis and five VGSC mutations in M. domestica had a frequency difference >40% between resistant and susceptible populations, including the previously confirmed mutations G4946E in RyR, R81T in nAChR and L1014F in VGSC. In addition, 49 detoxification genes were found to be overexpressed in resistant populations compared with susceptible populations, including the previously confirmed detoxification genes CYP6BG1, CYP6CY22, CYP6CY13, CYP6P3, CYP6M2, CYP6P4 and CYP4G16. The candidate target-site mutations and detoxification genes are worth further validation. Resistance estimates based on confirmed markers were consistent with population phenotypes, confirming the reliability of this program in predicting population resistance at the omics level.
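The >40% frequency-difference criterion mentioned in the abstract can be illustrated with a minimal sketch. The read counts, function names and dictionary layout below are assumptions made for illustration only; this is not FastD's actual (Perl) implementation.

```python
def allele_frequency(alt_reads, total_reads):
    """Fraction of reads covering a site that carry the mutant allele."""
    if total_reads == 0:
        raise ValueError("site has no read coverage")
    return alt_reads / total_reads


def flag_candidates(resistant, susceptible, min_diff=0.40):
    """Return mutations whose frequency differs by more than min_diff.

    Both arguments map a mutation name to (alt_reads, total_reads).
    """
    flagged = {}
    for name, (r_alt, r_total) in resistant.items():
        s_alt, s_total = susceptible[name]
        diff = allele_frequency(r_alt, r_total) - allele_frequency(s_alt, s_total)
        if abs(diff) > min_diff:
            flagged[name] = diff
    return flagged


# Hypothetical read counts (not taken from the paper).
resistant = {"G4946E": (90, 100), "L1014F": (30, 100)}
susceptible = {"G4946E": (5, 100), "L1014F": (20, 100)}
candidates = flag_candidates(resistant, susceptible)
```

With these made-up counts only the first mutation passes the threshold; the second differs by about 10 percentage points and is not flagged.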
Affiliation(s)
- Longfei Chen
- Institute of Insect Sciences, College of Agriculture and Biotechnology, Zhejiang University, Hangzhou, China
- Department of Entomology, Nanjing Agricultural University, Nanjing, China
| | - Kun Lang
- Institute of Insect Sciences, College of Agriculture and Biotechnology, Zhejiang University, Hangzhou, China
- Department of Entomology, Nanjing Agricultural University, Nanjing, China
| | - Yang Mei
- Institute of Insect Sciences, College of Agriculture and Biotechnology, Zhejiang University, Hangzhou, China
| | - Zhenmin Shi
- Institute of Insect Sciences, College of Agriculture and Biotechnology, Zhejiang University, Hangzhou, China
| | - Kang He
- Institute of Insect Sciences, College of Agriculture and Biotechnology, Zhejiang University, Hangzhou, China
| | - Fei Li
- Institute of Insect Sciences, College of Agriculture and Biotechnology, Zhejiang University, Hangzhou, China
| | - Huamei Xiao
- Institute of Insect Sciences, College of Agriculture and Biotechnology, Zhejiang University, Hangzhou, China
- Key Laboratory of Crop Growth and Development Regulation of Jiangxi Province, College of Life Sciences and Resource Environment, Yichun University, Yichun, China
| | - Gongyin Ye
- Institute of Insect Sciences, College of Agriculture and Biotechnology, Zhejiang University, Hangzhou, China
| | - Zhaojun Han
- Department of Entomology, Nanjing Agricultural University, Nanjing, China
| |
|
125
|
Zhang Y, Parmigiani G, Johnson WE. ComBat-seq: batch effect adjustment for RNA-seq count data. NAR Genom Bioinform 2020; 2:lqaa078. [PMID: 33015620 PMCID: PMC7518324 DOI: 10.1093/nargab/lqaa078] [Citation(s) in RCA: 697] [Impact Index Per Article: 139.4] [Received: 04/19/2020] [Revised: 08/09/2020] [Accepted: 09/17/2020] [Indexed: 12/25/2022]
Abstract
The benefit of integrating batches of genomic data to increase statistical power is often hindered by batch effects, or unwanted variation in data caused by differences in technical factors across batches. It is therefore critical to effectively address batch effects in genomic data to overcome these challenges. Many existing methods for batch effects adjustment assume the data follow a continuous, bell-shaped Gaussian distribution. However in RNA-seq studies the data are typically skewed, over-dispersed counts, so this assumption is not appropriate and may lead to erroneous results. Negative binomial regression models have been used previously to better capture the properties of counts. We developed a batch correction method, ComBat-seq, using a negative binomial regression model that retains the integer nature of count data in RNA-seq studies, making the batch adjusted data compatible with common differential expression software packages that require integer counts. We show in realistic simulations that the ComBat-seq adjusted data results in better statistical power and control of false positives in differential expression compared to data adjusted by the other available methods. We further demonstrated in a real data example that ComBat-seq successfully removes batch effects and recovers the biological signal in the data.
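To make the phrase "skewed, over-dispersed counts" concrete, the sketch below simulates negative binomial counts and compares their variance with the Poisson value. It is only an illustration of the count distribution, not of the ComBat-seq adjustment itself; the mean, dispersion and helper name are arbitrary assumptions. It relies on the standard parameterization in which the variance equals mean + dispersion × mean².

```python
import numpy as np


def simulate_nb_counts(mean, dispersion, size, rng):
    """Draw negative binomial counts with variance mean + dispersion * mean**2."""
    n = 1.0 / dispersion      # NumPy's "number of successes" parameter
    p = n / (n + mean)        # NumPy's success probability
    return rng.negative_binomial(n, p, size=size)


rng = np.random.default_rng(0)
counts = simulate_nb_counts(mean=100.0, dispersion=0.2, size=100_000, rng=rng)

sample_mean = counts.mean()
sample_var = counts.var()
# Theoretical variance: 100 + 0.2 * 100**2 = 2100, far above the Poisson value of 100.
expected_var = 100.0 + 0.2 * 100.0**2
```

The simulated values stay integers, which is the property the abstract says downstream differential expression packages require.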
Affiliation(s)
- Yuqing Zhang
- Department of Bioinformatics and Clinical Data Science, Gilead Sciences, Inc., 333 Lakeside Dr, Foster City, CA 94404, USA
| | - Giovanni Parmigiani
- Department of Data Sciences, Dana-Farber Cancer Institute, 450 Brookline Ave, Boston, MA 02215, USA
| | - W Evan Johnson
- Division of Computational Biomedicine, Boston University School of Medicine, 72 East Concord Street, Boston, MA 02118, USA
| |
|
126
|
Srivastava A, Malik L, Sarkar H, Zakeri M, Almodaresi F, Soneson C, Love MI, Kingsford C, Patro R. Alignment and mapping methodology influence transcript abundance estimation. Genome Biol 2020; 21:239. [PMID: 32894187 PMCID: PMC7487471 DOI: 10.1186/s13059-020-02151-8] [Citation(s) in RCA: 90] [Impact Index Per Article: 18.0] [Received: 10/31/2019] [Accepted: 08/19/2020] [Indexed: 01/23/2023]
Abstract
Background The accuracy of transcript quantification using RNA-seq data depends on many factors, such as the choice of alignment or mapping method and the quantification model being adopted. While the choice of quantification model has been shown to be important, considerably less attention has been given to comparing the effect of various read alignment approaches on quantification accuracy. Results We investigate the influence of mapping and alignment on the accuracy of transcript quantification in both simulated and experimental data, as well as the effect on subsequent differential expression analysis. We observe that, even when the quantification model itself is held fixed, the effect of choosing a different alignment methodology, or aligning reads using different parameters, on quantification estimates can sometimes be large and can affect downstream differential expression analyses as well. These effects can go unnoticed when assessment is focused too heavily on simulated data, where the alignment task is often simpler than in experimentally acquired samples. We also introduce a new alignment methodology, called selective alignment, to overcome the shortcomings of lightweight approaches without incurring the computational cost of traditional alignment. Conclusion We observe that, on experimental datasets, the performance of lightweight mapping and alignment-based approaches varies significantly, and highlight some of the underlying factors. We show this variation both in terms of quantification and downstream differential expression analysis. In all comparisons, we also show the improved performance of our proposed selective alignment method and suggest best practices for performing RNA-seq quantification.
Affiliation(s)
- Avi Srivastava
- Department of Computer Science, Stony Brook University, Stony Brook, USA
| | - Laraib Malik
- Department of Computer Science, Stony Brook University, Stony Brook, USA
| | - Hirak Sarkar
- Department of Computer Science, University of Maryland, College Park, USA
| | - Mohsen Zakeri
- Department of Computer Science, University of Maryland, College Park, USA
| | - Fatemeh Almodaresi
- Department of Computer Science, University of Maryland, College Park, USA
| | - Charlotte Soneson
- Friedrich Miescher Institute for Biomedical Research, Basel, Switzerland.,SIB Swiss Institute of Bioinformatics, Basel, Switzerland
| | - Michael I Love
- Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, USA.,Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, USA
| | - Carl Kingsford
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, USA
| | - Rob Patro
- Department of Computer Science, University of Maryland, College Park, USA.
| |
|
127
|
Germain PL, Sonrel A, Robinson MD. pipeComp, a general framework for the evaluation of computational pipelines, reveals performant single cell RNA-seq preprocessing tools. Genome Biol 2020; 21:227. [PMID: 32873325 PMCID: PMC7465801 DOI: 10.1186/s13059-020-02136-7] [Citation(s) in RCA: 58] [Impact Index Per Article: 11.6] [Received: 03/02/2020] [Accepted: 08/06/2020] [Indexed: 11/13/2022]
Abstract
We present pipeComp (https://github.com/plger/pipeComp), a flexible R framework for pipeline comparison handling interactions between analysis steps and relying on multi-level evaluation metrics. We apply it to the benchmark of single-cell RNA-sequencing analysis pipelines using simulated and real datasets with known cell identities, covering common methods of filtering, doublet detection, normalization, feature selection, denoising, dimensionality reduction, and clustering. pipeComp can easily integrate any other step, tool, or evaluation metric, allowing extensible benchmarks and easy applications to other fields, as we demonstrate through a study of the impact of removal of unwanted variation on differential expression analysis.
Affiliation(s)
- Pierre-Luc Germain
- Department of Molecular Life Sciences, University of Zürich, Winterthurerstrasse 190, Zürich, 8057 Switzerland
- SIB Swiss Institute of Bioinformatics, Zürich, Switzerland
- D-HEST Institute for Neurosciences, ETH Zürich, Winterthurerstrasse 190, Zürich, 8057 Switzerland
| | - Anthony Sonrel
- Department of Molecular Life Sciences, University of Zürich, Winterthurerstrasse 190, Zürich, 8057 Switzerland
- SIB Swiss Institute of Bioinformatics, Zürich, Switzerland
| | - Mark D. Robinson
- Department of Molecular Life Sciences, University of Zürich, Winterthurerstrasse 190, Zürich, 8057 Switzerland
- SIB Swiss Institute of Bioinformatics, Zürich, Switzerland
| |
|
128
|
Chen X, Zhang B, Wang T, Bonni A, Zhao G. Robust principal component analysis for accurate outlier sample detection in RNA-Seq data. BMC Bioinformatics 2020; 21:269. [PMID: 32600248 PMCID: PMC7324992 DOI: 10.1186/s12859-020-03608-0] [Citation(s) in RCA: 48] [Impact Index Per Article: 9.6] [Received: 01/06/2020] [Accepted: 06/16/2020] [Indexed: 01/07/2023]
Abstract
BACKGROUND High-throughput RNA sequencing is a powerful approach to study gene expression. Because data acquisition involves complex multi-step protocols, a sample may deviate extremely from other samples of the same treatment group owing to technical variation or true biological differences. The high dimensionality of the data, combined with few biological replicates, makes it challenging to accurately detect those samples, and this issue is currently not well studied in the literature. Robust statistics is a family of theories and techniques that aim to detect outliers by first fitting the majority of the data and then flagging data points that deviate from it. Robust statistics have been widely used in multivariate data analysis for outlier detection in chemometrics and engineering. Here we apply robust statistics to RNA-seq data analysis. RESULTS We report the use of two robust principal component analysis (rPCA) methods, PcaHubert and PcaGrid, to detect outlier samples in multiple simulated and real biological RNA-seq data sets with positive control outlier samples. PcaGrid achieved 100% sensitivity and 100% specificity in all the tests using positive control outliers with varying degrees of divergence. We applied rPCA methods and classical principal component analysis (cPCA) on an RNA-Seq data set profiling gene expression of the external granule layer in the cerebellum of control and conditional SnoN knockout mice. Both rPCA methods detected the same two outlier samples but cPCA failed to detect any. We performed differentially expressed gene detection before and after outlier removal as well as with and without batch effect modeling. We validated gene expression changes using quantitative reverse transcription PCR and used the result as reference to compare the performance of eight different data analysis strategies. Removing outliers without batch effect modeling performed the best in terms of detecting biologically relevant differentially expressed genes. CONCLUSIONS rPCA implemented in the PcaGrid function is an accurate and objective method to detect outlier samples. It is well suited for high-dimensional data with small sample sizes like RNA-seq data. Outlier removal can significantly improve the performance of differential gene detection and downstream functional analysis.
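The general principle of robust statistics described in the abstract (fit the bulk of the data, then flag points that deviate from it) can be shown in one dimension with the median and the median absolute deviation (MAD). This is a deliberately simplified stand-in, not PcaHubert or PcaGrid, which operate on high-dimensional data; the per-sample values, the function names and both thresholds are assumptions chosen for illustration.

```python
import numpy as np


def robust_z_scores(values):
    """Z-scores based on median and MAD instead of mean and standard deviation."""
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    mad = np.median(np.abs(values - median))
    if mad == 0:
        raise ValueError("MAD is zero; robust z-scores are undefined")
    # 1.4826 makes the MAD comparable to a standard deviation for normal data.
    return (values - median) / (1.4826 * mad)


def classical_z_scores(values):
    """Ordinary z-scores; the outlier itself inflates the mean and the spread."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()


# Hypothetical per-sample summary values; the last sample is a planted outlier.
sample_scores = [10.1, 9.8, 10.3, 9.9, 10.0, 25.0]
robust_flags = np.abs(robust_z_scores(sample_scores)) > 3.5
classical_flags = np.abs(classical_z_scores(sample_scores)) > 3.0
```

In this toy example the robust score flags the planted outlier while the classical z-score does not, because the outlier distorts the very mean and standard deviation it is measured against.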
Affiliation(s)
- Xiaoying Chen
- Department of Neuroscience, Washington University School of Medicine, St. Louis, MO, USA
| | - Bo Zhang
- Center of Regenerative Medicine, Department of Developmental Biology, Washington University School of Medicine, St. Louis, MO, USA
| | - Ting Wang
- Department of Genetics, Washington University School of Medicine, St. Louis, MO, USA
- The Edison Family Center for Genome Sciences and Systems Biology, Washington University School of Medicine, St. Louis, MO, USA
| | - Azad Bonni
- Department of Neuroscience, Washington University School of Medicine, St. Louis, MO, USA
| | - Guoyan Zhao
- Department of Neuroscience, Washington University School of Medicine, St. Louis, MO, USA.
| |
|
129
|
Boonekamp FJ, Dashko S, Duiker D, Gehrmann T, van den Broek M, den Ridder M, Pabst M, Robert V, Abeel T, Postma ED, Daran JM, Daran-Lapujade P. Design and Experimental Evaluation of a Minimal, Innocuous Watermarking Strategy to Distinguish Near-Identical DNA and RNA Sequences. ACS Synth Biol 2020; 9:1361-1375. [PMID: 32413257 PMCID: PMC7309318 DOI: 10.1021/acssynbio.0c00045] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Indexed: 11/28/2022]
Abstract
The construction of powerful cell factories requires intensive and extensive remodelling of microbial genomes. Considering the rapidly increasing number of these synthetic biology endeavors, there is an increasing need for DNA watermarking strategies that enable the discrimination between synthetic and native gene copies. While it is well documented that codon usage can affect translation, and most likely mRNA stability in eukaryotes, remarkably few quantitative studies explore the impact of watermarking on transcription, protein expression, and physiology in the popular model and industrial yeast Saccharomyces cerevisiae. The present study, using S. cerevisiae as eukaryotic paradigm, designed, implemented, and experimentally validated a systematic strategy to watermark DNA with minimal alteration of yeast physiology. The 13 genes encoding proteins involved in the major pathway for sugar utilization (i.e., glycolysis and alcoholic fermentation) were simultaneously watermarked in a yeast strain using the previously published pathway swapping strategy. Carefully swapping codons of these naturally codon optimized, highly expressed genes, did not affect yeast physiology and did not alter transcript abundance, protein abundance, and protein activity besides a mild effect on Gpm1. The markerQuant bioinformatics method could reliably discriminate native from watermarked genes and transcripts. Furthermore, presence of watermarks enabled selective CRISPR/Cas genome editing, specifically targeting the native gene copy while leaving the synthetic, watermarked variant intact. This study offers a validated strategy to simply watermark genes in S. cerevisiae.
Affiliation(s)
- Francine J. Boonekamp
- Department of Biotechnology, Delft University of Technology, van der Maasweg 9, 2629HZ Delft, The Netherlands
| | - Sofia Dashko
- Department of Biotechnology, Delft University of Technology, van der Maasweg 9, 2629HZ Delft, The Netherlands
| | - Donna Duiker
- Department of Biotechnology, Delft University of Technology, van der Maasweg 9, 2629HZ Delft, The Netherlands
| | - Thies Gehrmann
- Westerdijk Institute, Uppsalalaan 8, 3584 CT Utrecht, The Netherlands
| | - Marcel van den Broek
- Department of Biotechnology, Delft University of Technology, van der Maasweg 9, 2629HZ Delft, The Netherlands
| | - Maxime den Ridder
- Department of Biotechnology, Delft University of Technology, van der Maasweg 9, 2629HZ Delft, The Netherlands
| | - Martin Pabst
- Department of Biotechnology, Delft University of Technology, van der Maasweg 9, 2629HZ Delft, The Netherlands
| | - Vincent Robert
- Westerdijk Institute, Uppsalalaan 8, 3584 CT Utrecht, The Netherlands
| | - Thomas Abeel
- Intelligent Systems − Delft Bioinformatics Lab, Delft University of Technology, Van Mourik Broekmanweg 6, 2628XE Delft, The Netherlands
| | - Eline D. Postma
- Department of Biotechnology, Delft University of Technology, van der Maasweg 9, 2629HZ Delft, The Netherlands
| | - Jean-Marc Daran
- Department of Biotechnology, Delft University of Technology, van der Maasweg 9, 2629HZ Delft, The Netherlands
| | - Pascale Daran-Lapujade
- Department of Biotechnology, Delft University of Technology, van der Maasweg 9, 2629HZ Delft, The Netherlands
| |
|
130
|
Naraine R, Abaffy P, Sidova M, Tomankova S, Pocherniaieva K, Smolik O, Kubista M, Psenicka M, Sindelka R. NormQ: RNASeq normalization based on RT-qPCR derived size factors. Comput Struct Biotechnol J 2020; 18:1173-1181. [PMID: 32514328 PMCID: PMC7264052 DOI: 10.1016/j.csbj.2020.05.010] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 12/30/2019] [Revised: 05/07/2020] [Accepted: 05/07/2020] [Indexed: 02/04/2023]
Abstract
The merit of RNASeq data relies heavily on correct normalization. However, most methods assume that the majority of transcripts show no differential expression between conditions. This assumption may not always be correct, especially when one condition results in overexpression. We present a new method (NormQ) to normalize the RNASeq library size, using the relative proportion observed from RT-qPCR of selected marker genes. The method was compared against the popular median-of-ratios method, using simulated and real datasets. NormQ produced more matches to differentially expressed genes in the simulated dataset and more distribution profile matches for both simulated and real datasets.
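For orientation, the median-of-ratios method that serves as the comparator can be sketched as follows. This is a minimal NumPy version of the widely used size-factor calculation, not NormQ itself (whose RT-qPCR-based details are not given in the abstract); the function name and the toy count matrix are assumptions.

```python
import numpy as np


def median_of_ratios_size_factors(counts):
    """Size factors for a genes x samples matrix of raw counts.

    Each sample's factor is the median, over genes, of the ratio between
    its count and the gene's geometric mean across all samples.
    """
    counts = np.asarray(counts, dtype=float)
    expressed = np.all(counts > 0, axis=1)   # genes with a zero count are skipped
    if not expressed.any():
        raise ValueError("no gene has non-zero counts in every sample")
    log_counts = np.log(counts[expressed])
    log_geo_mean = log_counts.mean(axis=1, keepdims=True)
    return np.exp(np.median(log_counts - log_geo_mean, axis=0))


# Toy data: sample 2 was sequenced twice as deeply as sample 1.
counts = np.array([
    [10, 20],
    [100, 200],
    [50, 100],
    [0, 7],      # ignored because of the zero count
])
size_factors = median_of_ratios_size_factors(counts)
```

Because the median is taken over all retained genes, the estimate is only trustworthy when most of them are not differentially expressed, which is exactly the assumption the abstract questions.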
Affiliation(s)
- Ravindra Naraine
- Laboratory of Gene Expression, Institute of Biotechnology of the Czech Academy of Sciences - BIOCEV, Prumyslova 595, Vestec 252 50, Czech Republic
| | - Pavel Abaffy
- Laboratory of Gene Expression, Institute of Biotechnology of the Czech Academy of Sciences - BIOCEV, Prumyslova 595, Vestec 252 50, Czech Republic
| | - Monika Sidova
- Laboratory of Gene Expression, Institute of Biotechnology of the Czech Academy of Sciences - BIOCEV, Prumyslova 595, Vestec 252 50, Czech Republic
| | - Silvie Tomankova
- Laboratory of Gene Expression, Institute of Biotechnology of the Czech Academy of Sciences - BIOCEV, Prumyslova 595, Vestec 252 50, Czech Republic
| | - Kseniia Pocherniaieva
- University of South Bohemia in Ceske Budejovice, Faculty of Fisheries and Protection of Waters, South Bohemian Research Center of Aquaculture and Biodiversity of Hydrocenoses, Research Institute of Fish Culture and Hydrobiology, Vodnany, Czech Republic
| | - Ondrej Smolik
- Laboratory of Gene Expression, Institute of Biotechnology of the Czech Academy of Sciences - BIOCEV, Prumyslova 595, Vestec 252 50, Czech Republic
- Department of Cell Biology, Faculty of Science, Charles University, Prague, Czech Republic
| | - Mikael Kubista
- Laboratory of Gene Expression, Institute of Biotechnology of the Czech Academy of Sciences - BIOCEV, Prumyslova 595, Vestec 252 50, Czech Republic
| | - Martin Psenicka
- University of South Bohemia in Ceske Budejovice, Faculty of Fisheries and Protection of Waters, South Bohemian Research Center of Aquaculture and Biodiversity of Hydrocenoses, Research Institute of Fish Culture and Hydrobiology, Vodnany, Czech Republic
| | - Radek Sindelka
- Laboratory of Gene Expression, Institute of Biotechnology of the Czech Academy of Sciences - BIOCEV, Prumyslova 595, Vestec 252 50, Czech Republic
| |
|
131
|
Wilson-Sánchez D, Lup SD, Sarmiento-Mañús R, Ponce MR, Micol JL. Next-generation forward genetic screens: using simulated data to improve the design of mapping-by-sequencing experiments in Arabidopsis. Nucleic Acids Res 2020; 47:e140. [PMID: 31544937 PMCID: PMC6868388 DOI: 10.1093/nar/gkz806] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Received: 03/30/2019] [Revised: 09/07/2019] [Accepted: 09/10/2019] [Indexed: 12/25/2022]
Abstract
Forward genetic screens have successfully identified many genes and continue to be powerful tools for dissecting biological processes in Arabidopsis and other model species. Next-generation sequencing technologies have revolutionized the time-consuming process of identifying the mutations that cause a phenotype of interest. However, due to the cost of such mapping-by-sequencing experiments, special attention should be paid to experimental design and technical decisions so that the read data allow the desired mutation to be mapped. Here, we simulated different mapping-by-sequencing scenarios. We first evaluated which short-read technology was best suited for analyzing gene-rich genomic regions in Arabidopsis and determined the minimum sequencing depth required to confidently call single nucleotide variants. We also designed ways to discriminate mutagenesis-induced mutations from background single nucleotide polymorphisms in mutants isolated in Arabidopsis non-reference lines. In addition, we simulated bulked segregant mapping populations for identifying point mutations and monitored how the size of the mapping population and the sequencing depth affect mapping precision. Finally, we provide the computational basis of a protocol that we already used to map T-DNA insertions with paired-end Illumina-like reads, using very low sequencing depths and pooling several mutants together; this approach can also be used with single-end reads as well as to map any other insertional mutagen. All these simulations proved useful for designing experiments that allowed us to map several mutations in Arabidopsis.
Affiliation(s)
- David Wilson-Sánchez
- Instituto de Bioingeniería, Universidad Miguel Hernández, Campus de Elche, 03202 Elche, Spain
| | - Samuel Daniel Lup
- Instituto de Bioingeniería, Universidad Miguel Hernández, Campus de Elche, 03202 Elche, Spain
| | - Raquel Sarmiento-Mañús
- Instituto de Bioingeniería, Universidad Miguel Hernández, Campus de Elche, 03202 Elche, Spain
| | - María Rosa Ponce
- Instituto de Bioingeniería, Universidad Miguel Hernández, Campus de Elche, 03202 Elche, Spain
| | - José Luis Micol
- Instituto de Bioingeniería, Universidad Miguel Hernández, Campus de Elche, 03202 Elche, Spain
| |
|
132
|
Marcelino VR, Clausen PTLC, Buchmann JP, Wille M, Iredell JR, Meyer W, Lund O, Sorrell TC, Holmes EC. CCMetagen: comprehensive and accurate identification of eukaryotes and prokaryotes in metagenomic data. Genome Biol 2020; 21:103. [PMID: 32345331 PMCID: PMC7189439 DOI: 10.1186/s13059-020-02014-2] [Citation(s) in RCA: 87] [Impact Index Per Article: 17.4] [Received: 05/31/2019] [Accepted: 04/13/2020] [Indexed: 01/19/2023]
Abstract
There is an increasing demand for accurate and fast metagenome classifiers that can identify not only bacteria but all members of a microbial community. We used a recently developed concept in read mapping to develop a highly accurate metagenomic classification pipeline named CCMetagen. The pipeline substantially outperforms other commonly used software in identifying bacteria and fungi and can efficiently use the entire NCBI nucleotide collection as a reference to detect species with incomplete genome data from all biological kingdoms. CCMetagen is user-friendly, and the results can be easily integrated into microbial community analysis software for streamlined and automated microbiome studies.
Affiliation(s)
- Vanessa R Marcelino
- Marie Bashir Institute for Infectious Diseases and Biosecurity and Faculty of Medicine and Health, Sydney Medical School, Westmead Clinical School, The University of Sydney, Sydney, NSW, 2006, Australia.
- Centre for Infectious Diseases and Microbiology, Westmead Institute for Medical Research, Westmead, NSW, 2145, Australia.
- School of Life & Environmental Sciences, Charles Perkins Centre, The University of Sydney, Sydney, NSW, 2006, Australia.
| | - Philip T L C Clausen
- National Food Institute, Technical University of Denmark, 2800, Kgs Lyngby, Denmark
| | - Jan P Buchmann
- School of Life & Environmental Sciences, Charles Perkins Centre, The University of Sydney, Sydney, NSW, 2006, Australia
| | - Michelle Wille
- WHO Collaborating Centre for Reference and Research on Influenza, The Peter Doherty Institute for Infection and Immunity, Melbourne, VIC, 3000, Australia
| | - Jonathan R Iredell
- Marie Bashir Institute for Infectious Diseases and Biosecurity and Faculty of Medicine and Health, Sydney Medical School, Westmead Clinical School, The University of Sydney, Sydney, NSW, 2006, Australia
- Centre for Infectious Diseases and Microbiology, Westmead Institute for Medical Research, Westmead, NSW, 2145, Australia
- Westmead Hospital (Research and Education Network), Westmead, NSW, 2145, Australia
| | - Wieland Meyer
- Marie Bashir Institute for Infectious Diseases and Biosecurity and Faculty of Medicine and Health, Sydney Medical School, Westmead Clinical School, The University of Sydney, Sydney, NSW, 2006, Australia
- Westmead Hospital (Research and Education Network), Westmead, NSW, 2145, Australia
- Molecular Mycology Research Laboratory, Centre for Infectious Diseases and Microbiology, Westmead Institute for Medical Research, Westmead, NSW, 2145, Australia
| | - Ole Lund
- National Food Institute, Technical University of Denmark, 2800, Kgs Lyngby, Denmark
| | - Tania C Sorrell
- Marie Bashir Institute for Infectious Diseases and Biosecurity and Faculty of Medicine and Health, Sydney Medical School, Westmead Clinical School, The University of Sydney, Sydney, NSW, 2006, Australia
- Centre for Infectious Diseases and Microbiology, Westmead Institute for Medical Research, Westmead, NSW, 2145, Australia
| | - Edward C Holmes
- Marie Bashir Institute for Infectious Diseases and Biosecurity and Faculty of Medicine and Health, Sydney Medical School, Westmead Clinical School, The University of Sydney, Sydney, NSW, 2006, Australia
- School of Life & Environmental Sciences, Charles Perkins Centre, The University of Sydney, Sydney, NSW, 2006, Australia
| |
|
133
|
Ozuna A, Liberto D, Joyce RM, Arnvig KB, Nobeli I. baerhunter: an R package for the discovery and analysis of expressed non-coding regions in bacterial RNA-seq data. Bioinformatics 2020; 36:966-969. [PMID: 31418770 DOI: 10.1093/bioinformatics/btz643] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 04/17/2019] [Revised: 07/29/2019] [Accepted: 08/13/2019] [Indexed: 12/12/2022]
Abstract
SUMMARY Standard bioinformatics pipelines for the analysis of bacterial transcriptomic data commonly ignore non-coding but functional elements e.g. small RNAs, long antisense RNAs or untranslated regions (UTRs) of mRNA transcripts. The root of this problem is the use of incomplete genome annotation files. Here, we present baerhunter, a coverage-based method implemented in R, that automates the discovery of expressed non-coding RNAs and UTRs from RNA-seq reads mapped to a reference genome. The core algorithm is part of a pipeline that facilitates downstream analysis of both coding and non-coding features. The method is simple, easy to extend and customize and, in limited tests with simulated and real data, compares favourably against the currently most popular alternative. AVAILABILITY AND IMPLEMENTATION The baerhunter R package is available from: https://github.com/irilenia/baerhunter. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
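A coverage-based search for expressed but unannotated regions, as described in general terms above, might look like the following sketch. It is written in Python for illustration and is not the baerhunter algorithm (which is an R package); the thresholds, the function name and the toy coverage vector are all hypothetical.

```python
def expressed_unannotated_regions(coverage, annotated, min_depth=10, min_length=3):
    """Find runs of unannotated positions whose read depth reaches min_depth.

    coverage  : per-base read depth along one strand of the genome
    annotated : per-base flags, True where an annotated feature already exists
    Returns half-open (start, end) intervals at least min_length bases long.
    """
    regions = []
    start = None
    for pos, (depth, is_annotated) in enumerate(zip(coverage, annotated)):
        inside = depth >= min_depth and not is_annotated
        if inside and start is None:
            start = pos                      # a new candidate region opens here
        elif not inside and start is not None:
            if pos - start >= min_length:
                regions.append((start, pos))
            start = None
    # Close a region that runs to the end of the sequence.
    if start is not None and len(coverage) - start >= min_length:
        regions.append((start, len(coverage)))
    return regions


# Toy example: positions 5-8 belong to an annotated gene.
coverage = [0, 12, 15, 20, 3, 30, 30, 30, 30, 11, 12]
annotated = [False] * 5 + [True] * 4 + [False] * 2
regions = expressed_unannotated_regions(coverage, annotated)
```

Here the run at positions 1 to 3 is reported, while the two covered bases after the annotated gene are discarded as too short.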
Affiliation(s)
- A Ozuna
  - Department of Biological Sciences, Institute of Structural and Molecular Biology, London, WC1E 7HX, UK
- D Liberto
  - Department of Biological Sciences, Institute of Structural and Molecular Biology, London, WC1E 7HX, UK
- R M Joyce
  - Department of Biological Sciences, Institute of Structural and Molecular Biology, London, WC1E 7HX, UK
- K B Arnvig
  - Institute of Structural and Molecular Biology, Division of Biosciences, University College London, London, WC1E 6BT, UK
- I Nobeli
  - Department of Biological Sciences, Institute of Structural and Molecular Biology, London, WC1E 7HX, UK

134
Górczak K, Claesen J, Burzykowski T. A Conceptual Framework for Abundance Estimation of Genomic Targets in the Presence of Ambiguous Short Sequencing Reads. J Comput Biol 2020; 27:1232-1247. [PMID: 31895597 DOI: 10.1089/cmb.2019.0272] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 11/13/2022] Open
Abstract
RNA sequencing (RNA-seq) is widely used to study gene, transcript, or exon expression. To quantify the expression level, millions of short sequenced reads need to be mapped back to a reference genome or transcriptome. Read mapping makes it possible to find a location to which a read is identical or similar. Based upon this alignment, expression summaries, that is, read counts, are generated. However, reads may be matched to multiple locations. Such ambiguously mapped reads are often ignored in the analysis, which is a potential loss of information and may cause bias in expression estimation. We present the general principles underlying multiread allocation and unbiased estimation of the expression level of genes, exons, or transcripts in the presence of multiple mapped reads. The underlying principles are derived from a theoretical concept that identifies important sources of information such as the number of uniquely mapped reads, the total target length, and the length of the shared target regions. We show with simulation studies that methods incorporating some or all of the aforementioned sources of information estimate the expression levels of genes, exons, and/or transcripts with a higher precision and accuracy than methods that do not use this information. We identify important sources of information that should be taken into account by methods that estimate the abundance of genes, exons, and/or transcripts to achieve good precision and accuracy.
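To make the three information sources named in the abstract concrete (unique read counts, total target length, and shared region length), here is a minimal sketch of one simple allocation rule: ambiguous reads are split in proportion to each target's unique-read density over its non-shared length. This rule is an assumption chosen for illustration and is not claimed to be the estimator derived in the paper; the function name and arguments are hypothetical.

```python
from typing import Dict


def allocate_multireads(
    unique_reads: Dict[str, int],
    total_len: Dict[str, int],
    shared_len: int,
    n_shared: int,
) -> Dict[str, float]:
    """Split n_shared ambiguous reads between targets in proportion to
    unique-read density (unique reads per base of non-shared sequence)."""
    density = {}
    for target, count in unique_reads.items():
        unique_len = total_len[target] - shared_len
        if unique_len <= 0:
            raise ValueError(f"{target} has no non-shared sequence")
        density[target] = count / unique_len

    total_density = sum(density.values())
    estimates = {}
    for target, count in unique_reads.items():
        if total_density > 0:
            share = density[target] / total_density
        else:
            share = 1.0 / len(unique_reads)   # no unique evidence: split evenly
        estimates[target] = count + n_shared * share
    return estimates


estimates = allocate_multireads(
    unique_reads={"A": 90, "B": 30},
    total_len={"A": 1000, "B": 700},
    shared_len=100,
    n_shared=40,
)
print(estimates)
```

Here target A has a unique density of 90/900 = 0.1 reads per base and B has 30/600 = 0.05, so A receives two thirds of the 40 shared reads and B one third; the allocated counts still sum to the 160 reads observed.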
Affiliation(s)
- Katarzyna Górczak
  - Interuniversity Institute for Biostatistics and statistical Bioinformatics, Hasselt University, Diepenbeek, Belgium
  - Department of Mathematical and Statistical Methods, Poznań University of Life Sciences, Poznań, Poland
- Jürgen Claesen
  - Interuniversity Institute for Biostatistics and statistical Bioinformatics, Hasselt University, Diepenbeek, Belgium
  - Microbiology Unit, Belgian Nuclear Research Centre (SCK•CEN), Mol, Belgium
- Tomasz Burzykowski
  - Interuniversity Institute for Biostatistics and statistical Bioinformatics, Hasselt University, Diepenbeek, Belgium
  - Department of Statistics and Medical Informatics, Medical University of Bialystok, Bialystok, Poland

135
Yang A, Kishore A, Phipps B, Ho JWK. Cloud accelerated alignment and assembly of full-length single-cell RNA-seq data using Falco. BMC Genomics 2019; 20:927. [PMID: 31888474 PMCID: PMC6936136 DOI: 10.1186/s12864-019-6341-6] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 11/17/2019] [Accepted: 11/26/2019] [Indexed: 12/18/2022] Open
Abstract
BACKGROUND: Read alignment and transcript assembly are the core of RNA-seq analysis for transcript isoform discovery. Nonetheless, current tools are not designed to be scalable for analysis of full-length bulk or single cell RNA-seq (scRNA-seq) data. The previous version of our cloud-based tool Falco focuses only on RNA-seq read counting and does not allow for more flexible steps such as alignment and read assembly. RESULTS: The Falco framework can harness the parallel and distributed computing environment in modern cloud platforms to accelerate read alignment and transcript assembly of full-length bulk RNA-seq and scRNA-seq data. There are two new modes in Falco: alignment-only and transcript assembly. In the alignment-only mode, Falco can speed up the alignment process by 2.5-16.4x based on two public scRNA-seq datasets when compared to alignment on a highly optimised standalone computer. Furthermore, it also provides a 10x average speed-up compared to alignment using Rail-RNA, a published cloud-enabled tool for read alignment. In the transcript assembly mode, Falco can speed up the transcript assembly process by 1.7-16.5x compared to performing transcript assembly on a highly optimised computer. CONCLUSION: Falco is a significantly updated open source big data processing framework that enables scalable and accelerated alignment and assembly of full-length scRNA-seq data on the cloud. The source code can be found at https://github.com/VCCRI/Falco.
Affiliation(s)
- Andrian Yang
  - Victor Chang Cardiac Research Institute, 405 Liverpool St, Darlinghurst, 2010, New South Wales, Australia
  - St. Vincent's Clinical School, University of New South Wales, Darlinghurst, 2010, New South Wales, Australia
- Abhinav Kishore
  - Victor Chang Cardiac Research Institute, 405 Liverpool St, Darlinghurst, 2010, New South Wales, Australia
- Benjamin Phipps
  - Victor Chang Cardiac Research Institute, 405 Liverpool St, Darlinghurst, 2010, New South Wales, Australia
- Joshua W K Ho
  - Victor Chang Cardiac Research Institute, 405 Liverpool St, Darlinghurst, 2010, New South Wales, Australia
  - St. Vincent's Clinical School, University of New South Wales, Darlinghurst, 2010, New South Wales, Australia
  - School of Biomedical Sciences, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Pokfulam, Hong Kong, China

136
Ma C, Kingsford C. Detecting, Categorizing, and Correcting Coverage Anomalies of RNA-Seq Quantification. Cell Syst 2019; 9:589-599.e7. [PMID: 31786209 DOI: 10.1016/j.cels.2019.10.005] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 02/06/2019] [Revised: 07/09/2019] [Accepted: 10/17/2019] [Indexed: 11/13/2022]
Abstract
Because of incomplete reference transcriptomes, incomplete sequencing bias models, or other modeling defects, algorithms to infer isoform expression from RNA sequencing (RNA-seq) sometimes do not accurately model expression. We present a computational method to detect instances where a quantification algorithm could not completely explain the input reads. Our approach identifies regions where the read coverage significantly deviates from expectation. We call these regions "expression anomalies." We further present a method to attribute their cause to either the incompleteness of the reference transcriptome or algorithmic mistakes. We detect anomalies for 30 GEUVADIS and 16 Human Body Map samples. By correcting anomalies when possible, we reduce the number of falsely predicted instances of differential expression. Anomalies that cannot be corrected are suspected to indicate the existence of isoforms unannotated by the reference. We detected 88 common anomalies of this type and find that they tend to have a lower-than-expected coverage toward their 3' ends.
Affiliation(s)
- Cong Ma
  - Computational Biology Department, School of Computer Science, Carnegie Mellon University, 5000 Forbes Ave., Pittsburgh, PA 15213, USA
- Carl Kingsford
  - Computational Biology Department, School of Computer Science, Carnegie Mellon University, 5000 Forbes Ave., Pittsburgh, PA 15213, USA

137
Zheng H, Brennan K, Hernaez M, Gevaert O. Benchmark of long non-coding RNA quantification for RNA sequencing of cancer samples. Gigascience 2019; 8:giz145. [PMID: 31808800 PMCID: PMC6897288 DOI: 10.1093/gigascience/giz145] [Citation(s) in RCA: 28] [Impact Index Per Article: 4.7] [Received: 04/08/2019] [Revised: 09/30/2019] [Accepted: 11/15/2019] [Indexed: 12/14/2022] Open
Abstract
BACKGROUND: Long non-coding RNAs (lncRNAs) are emerging as important regulators of various biological processes. While many studies have exploited public resources such as RNA sequencing (RNA-Seq) data in The Cancer Genome Atlas to study lncRNAs in cancer, it is crucial to choose the optimal method for accurate expression quantification. RESULTS: In this study, we compared the performance of pseudoalignment methods Kallisto and Salmon, alignment-based transcript quantification method RSEM, and alignment-based gene quantification methods HTSeq and featureCounts, in combination with read aligners STAR, Subread, and HISAT2, in lncRNA quantification, by applying them to both un-stranded and stranded RNA-Seq datasets. Full transcriptome annotation, including protein-coding and non-coding RNAs, greatly improves the specificity of lncRNA expression quantification. Pseudoalignment methods and RSEM outperform HTSeq and featureCounts for lncRNA quantification at both sample- and gene-level comparison, regardless of RNA-Seq protocol type, choice of aligners, and transcriptome annotation. Pseudoalignment methods and RSEM detect more lncRNAs and correlate highly with simulated ground truth. In contrast, HTSeq and featureCounts often underestimate lncRNA expression. Antisense lncRNAs are poorly quantified by alignment-based gene quantification methods, which can be improved using stranded protocols and pseudoalignment methods. CONCLUSIONS: Considering the consistency with ground truth and computational resources, pseudoalignment methods Kallisto or Salmon in combination with full transcriptome annotation is our recommended strategy for RNA-Seq analysis for lncRNAs.
Affiliation(s)
- Hong Zheng
  - Stanford Center for Biomedical Informatics Research, Department of Medicine, Stanford University, 1265 Welch Road, Stanford, 94305, CA, USA
- Kevin Brennan
  - Stanford Center for Biomedical Informatics Research, Department of Medicine, Stanford University, 1265 Welch Road, Stanford, 94305, CA, USA
- Mikel Hernaez
  - Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, 1206 W. Gregory Dr, Urbana, 61805, IL, USA
- Olivier Gevaert
  - Stanford Center for Biomedical Informatics Research, Department of Medicine, Stanford University, 1265 Welch Road, Stanford, 94305, CA, USA
  - Department of Biomedical Data Science, Stanford University, 1265 Welch Road, Stanford, 94305, CA, USA

138
Li WV, Li S, Tong X, Deng L, Shi H, Li JJ. AIDE: annotation-assisted isoform discovery with high precision. Genome Res 2019; 29:2056-2072. [PMID: 31694868 PMCID: PMC6886511 DOI: 10.1101/gr.251108.119] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Received: 04/01/2019] [Accepted: 09/27/2019] [Indexed: 02/06/2023]
Abstract
Genome-wide accurate identification and quantification of full-length mRNA isoforms is crucial for investigating transcriptional and posttranscriptional regulatory mechanisms of biological phenomena. Despite continuing efforts in developing effective computational tools to identify or assemble full-length mRNA isoforms from second-generation RNA-seq data, it remains a challenge to accurately identify mRNA isoforms from short sequence reads owing to the substantial information loss in RNA-seq experiments. Here, we introduce a novel statistical method, annotation-assisted isoform discovery (AIDE), the first approach that directly controls false isoform discoveries by implementing the testing-based model selection principle. Solving the isoform discovery problem in a stepwise and conservative manner, AIDE prioritizes the annotated isoforms and precisely identifies novel isoforms whose addition significantly improves the explanation of observed RNA-seq reads. We evaluate the performance of AIDE based on multiple simulated and real RNA-seq data sets followed by PCR-Sanger sequencing validation. Our results show that AIDE effectively leverages the annotation information to compensate for the information loss owing to short read lengths. AIDE achieves the highest precision in isoform discovery and the lowest error rates in isoform abundance estimation, compared with three state-of-the-art methods: Cufflinks, SLIDE, and StringTie. As a robust bioinformatics tool for transcriptome analysis, AIDE enables researchers to discover novel transcripts with high confidence.
Affiliation(s)
- Wei Vivian Li
  - Department of Biostatistics and Epidemiology, Rutgers School of Public Health, Rutgers, The State University of New Jersey, Piscataway, New Jersey 08854, USA
  - Department of Statistics, University of California, Los Angeles, California 90095, USA
- Shan Li
  - Laboratory of Tumor Targeted and Immune Therapy, Clinical Research Center for Breast, State Key Laboratory of Biotherapy, West China Hospital, Sichuan University and Collaborative Innovation Center, Chengdu 610041, China
- Xin Tong
  - Department of Data Sciences and Operations, Marshall School of Business, University of Southern California, Los Angeles, California 90089, USA
- Ling Deng
  - Laboratory of Molecular Diagnosis of Cancer, Clinical Research Center for Breast, West China Hospital, Sichuan University, Chengdu 610041, China
- Hubing Shi
  - Laboratory of Tumor Targeted and Immune Therapy, Clinical Research Center for Breast, State Key Laboratory of Biotherapy, West China Hospital, Sichuan University and Collaborative Innovation Center, Chengdu 610041, China
- Jingyi Jessica Li
  - Department of Statistics, University of California, Los Angeles, California 90095, USA
  - Department of Human Genetics, University of California, Los Angeles, California 90095, USA

139
Song L, Sabunciyan S, Yang G, Florea L. A multi-sample approach increases the accuracy of transcript assembly. Nat Commun 2019; 10:5000. [PMID: 31676772 PMCID: PMC6825223 DOI: 10.1038/s41467-019-12990-0] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.0] [Received: 02/16/2019] [Accepted: 10/11/2019] [Indexed: 01/21/2023] Open
Abstract
Transcript assembly from RNA-seq reads is a critical step in gene expression and subsequent functional analyses. Here we present PsiCLASS, an accurate and efficient transcript assembler based on an approach that simultaneously analyzes multiple RNA-seq samples. PsiCLASS combines mixture statistical models for exonic feature selection across multiple samples with splice graph based dynamic programming algorithms and a weighted voting scheme for transcript selection. PsiCLASS achieves a significantly better sensitivity-precision tradeoff, and renders precision up to 2-3 fold higher than the StringTie system and Scallop plus TACO, the two best current approaches. PsiCLASS is efficient and scalable, assembling 667 GEUVADIS samples in 9 h, and has robust accuracy with large numbers of samples. Transcript assembly is an important step in the analysis of RNA-seq data, and its accuracy influences downstream quantification, detection and characterization of alternative splice variants. Here, the authors develop PsiCLASS, a transcript assembler leveraging simultaneous analysis of multiple RNA-seq samples.
Affiliation(s)
- Li Song
  - McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins School of Medicine, Baltimore, MD, USA
  - Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
  - Department of Data Sciences, Dana Farber Cancer Institute, Boston, MA, USA
- Sarven Sabunciyan
  - Department of Pediatrics, Johns Hopkins School of Medicine, Baltimore, MD, USA
- Guangyu Yang
  - McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins School of Medicine, Baltimore, MD, USA
  - Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
- Liliana Florea
  - McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins School of Medicine, Baltimore, MD, USA
  - Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
  - Department of Medicine, Johns Hopkins School of Medicine, Baltimore, MD, USA

140
Zhu A, Srivastava A, Ibrahim JG, Patro R, Love MI. Nonparametric expression analysis using inferential replicate counts. Nucleic Acids Res 2019; 47:e105. [PMID: 31372651 PMCID: PMC6765120 DOI: 10.1093/nar/gkz622] [Citation(s) in RCA: 59] [Impact Index Per Article: 9.8] [Received: 05/01/2019] [Revised: 06/11/2019] [Accepted: 07/11/2019] [Indexed: 11/13/2022] Open
Abstract
A primary challenge in the analysis of RNA-seq data is to identify differentially expressed genes or transcripts while controlling for technical biases. Ideally, a statistical testing procedure should incorporate the inherent uncertainty of the abundance estimates arising from the quantification step. Most popular methods for RNA-seq differential expression analysis fit a parametric model to the counts for each gene or transcript, and a subset of methods can incorporate uncertainty. Previous work has shown that nonparametric models for RNA-seq differential expression may have better control of the false discovery rate, and adapt well to new data types without requiring reformulation of a parametric model. Existing nonparametric models do not take into account inferential uncertainty, leading to an inflated false discovery rate, in particular at the transcript level. We propose a nonparametric model for differential expression analysis using inferential replicate counts, extending the existing SAMseq method to account for inferential uncertainty. We compare our method, Swish, with popular differential expression analysis methods. Swish has improved control of the false discovery rate, in particular for transcripts with high inferential uncertainty. We apply Swish to a single-cell RNA-seq dataset, assessing differential expression between sub-populations of cells, and compare its performance to the Wilcoxon test.
Affiliation(s)
- Anqi Zhu
  - Department of Biostatistics, University of North Carolina-Chapel Hill, 135 Dauer Drive, Chapel Hill, NC 27599, USA
- Avi Srivastava
  - Department of Computer Science, Stony Brook University, Computer Science Building, Engineering Dr, Stony Brook, NY 11794, USA
- Joseph G Ibrahim
  - Department of Biostatistics, University of North Carolina-Chapel Hill, 135 Dauer Drive, Chapel Hill, NC 27599, USA
- Rob Patro
  - Department of Computer Science, Stony Brook University, Computer Science Building, Engineering Dr, Stony Brook, NY 11794, USA
- Michael I Love
  - Department of Biostatistics, University of North Carolina-Chapel Hill, 135 Dauer Drive, Chapel Hill, NC 27599, USA
  - Department of Genetics, University of North Carolina-Chapel Hill, 120 Mason Farm Rd, Chapel Hill, NC 27514, USA

141
Zhou L, Chi-Hau Sue A, Bin Goh WW. Examining the practical limits of batch effect-correction algorithms: When should you care about batch effects? J Genet Genomics 2019; 46:433-443. [PMID: 31611172 DOI: 10.1016/j.jgg.2019.08.002] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.7] [Received: 05/11/2019] [Revised: 08/02/2019] [Accepted: 08/04/2019] [Indexed: 12/20/2022]
Abstract
Batch effects are technical sources of variation and can confound analysis. While many performance ranking exercises have been conducted to establish the best batch effect-correction algorithm (BECA), we hold the viewpoint that the notion of best is context-dependent. Moreover, alternative questions beyond the simplistic notion of "best" are also interesting: are BECAs robust against various degrees of confounding and if so, what is the limit? Using two different methods for simulating class (phenotype) and batch effects and taking various representative datasets across both genomics (RNA-Seq) and proteomics platforms, we demonstrate that under situations where sample classes and batch factors are moderately confounded, most BECAs are remarkably robust and only weakly affected by upstream normalization procedures. This observation is consistently supported across the multitude of test datasets. BECAs do have limits: When sample classes and batch factors are strongly confounded, BECA performance declines, with variable performance in precision, recall and also batch correction. We also report that while conventional normalization methods have minimal impact on batch effect correction, they do not affect downstream statistical feature selection, and in strongly confounded scenarios, may even outperform BECAs. In other words, removing batch effects is no guarantee of optimal functional analysis. Overall, this study suggests that simplistic performance ranking exercises are quite trivial, and all BECAs are compromises in some context or another.
Affiliation(s)
- Longjian Zhou
  - School of Pharmaceutical Science and Technology, Tianjin University, Tianjin, 30072, China
- Andrew Chi-Hau Sue
  - School of Pharmaceutical Science and Technology, Tianjin University, Tianjin, 30072, China
- Wilson Wen Bin Goh
  - School of Biological Sciences, Nanyang Technological University, 60 Nanyang Drive, 637551, Singapore

142
Kerkvliet J, de Fouchier A, van Wijk M, Groot AT. The Bellerophon pipeline, improving de novo transcriptomes and removing chimeras. Ecol Evol 2019; 9:10513-10521. [PMID: 31624564 PMCID: PMC6787812 DOI: 10.1002/ece3.5571] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Received: 03/07/2019] [Revised: 07/22/2019] [Accepted: 07/28/2019] [Indexed: 12/31/2022] Open
Abstract
Transcriptome quality control is an important step in RNA-Seq experiments. However, the quality of de novo assembled transcriptomes is difficult to assess, due to the lack of a reference genome to compare the assembly to. We developed a method to assess and improve the quality of de novo assembled transcriptomes by focusing on the removal of chimeric sequences. These chimeric sequences can result from faultily assembled contigs that merge two transcripts into one. The developed method is incorporated into a pipeline, which we named Bellerophon, that is broadly applicable and easy to use. Bellerophon first uses the quality assessment tool TransRate to indicate the quality, after which it uses a transcripts per million (TPM) filter to remove lowly expressed contigs and CD-HIT-EST to remove highly identical contigs. To validate the quality of this method, we performed three benchmark experiments: (1) a computational creation of chimeras, (2) identification of chimeric contigs in a transcriptome assembly, (3) a simulated RNA-Seq experiment using a known reference transcriptome. Overall, the Bellerophon pipeline was able to remove between 40% and 91.9% of the chimeras in transcriptome assemblies and removed more chimeric than nonchimeric contigs. Thus, the Bellerophon sequence of filtration steps is a broadly applicable solution to improve transcriptome assemblies.
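The TPM filtering step mentioned in the abstract can be sketched as follows. The TPM formula itself is standard (length-normalized read rate, rescaled so all contigs sum to one million); the function names and the `min_tpm` cut-off are assumptions for illustration, since the abstract does not state the threshold or implementation Bellerophon uses.

```python
from typing import Dict, List


def tpm(counts: Dict[str, float], lengths: Dict[str, int]) -> Dict[str, float]:
    """Transcripts per million: reads per base, rescaled to sum to 1e6."""
    rate = {contig: counts[contig] / lengths[contig] for contig in counts}
    total = sum(rate.values())
    if total == 0:
        return {contig: 0.0 for contig in counts}
    return {contig: r / total * 1e6 for contig, r in rate.items()}


def filter_by_tpm(
    counts: Dict[str, float],
    lengths: Dict[str, int],
    min_tpm: float,          # hypothetical cut-off chosen by the user
) -> List[str]:
    """Return the contigs whose TPM is at least min_tpm, in input order."""
    values = tpm(counts, lengths)
    return [contig for contig in counts if values[contig] >= min_tpm]


counts = {"c1": 5000, "c2": 4999, "c3": 1}
lengths = {"c1": 1000, "c2": 1000, "c3": 1000}
print(tpm(counts, lengths))
print(filter_by_tpm(counts, lengths, min_tpm=500.0))
```

With these toy counts, contig `c3` has a TPM of roughly 100 and is dropped at the cut-off of 500, while `c1` and `c2` are retained.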
Affiliation(s)
- Jesse Kerkvliet
  - Institute for Biodiversity and Ecosystem Dynamics, University of Amsterdam, Amsterdam, The Netherlands
- Arthur de Fouchier
  - Institute for Biodiversity and Ecosystem Dynamics, University of Amsterdam, Amsterdam, The Netherlands
  - Department of Entomology, Max Planck Institute for Chemical Ecology, Jena, Germany
  - Present address: Laboratory of Experimental and Comparative Ethology, Université Paris 13, Sorbonne Paris Cité, Villetaneuse, France
- Michiel van Wijk
  - Institute for Biodiversity and Ecosystem Dynamics, University of Amsterdam, Amsterdam, The Netherlands
- Astrid Tatjana Groot
  - Institute for Biodiversity and Ecosystem Dynamics, University of Amsterdam, Amsterdam, The Netherlands
  - Department of Entomology, Max Planck Institute for Chemical Ecology, Jena, Germany

143
Gunady MK, Mount SM, Corrada Bravo H. Yanagi: Fast and interpretable segment-based alternative splicing and gene expression analysis. BMC Bioinformatics 2019; 20:421. [PMID: 31409274 PMCID: PMC6693274 DOI: 10.1186/s12859-019-2947-6] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 09/27/2018] [Accepted: 06/12/2019] [Indexed: 12/13/2022] Open
Abstract
Background: Ultra-fast pseudo-alignment approaches are the tool of choice in transcript-level RNA sequencing (RNA-seq) analyses. Unfortunately, these methods couple the tasks of pseudo-alignment and transcript quantification. This coupling precludes the direct use of pseudo-alignment in other expression analyses, including alternative splicing or differential gene expression analysis, without including a non-essential transcript quantification step. Results: In this paper, we introduce a transcriptome segmentation approach to decouple these two tasks. We propose an efficient algorithm to generate maximal disjoint segments given a transcriptome reference library on which ultra-fast pseudo-alignment can be used to produce per-sample segment counts. We show how to apply these maximally unambiguous count statistics in two specific expression analyses (alternative splicing and gene differential expression) without the need for a transcript quantification step. Our experiments based on simulated and experimental data showed that the use of segment counts, like other methods that rely on local coverage statistics, provides an advantage over approaches that rely on transcript quantification in detecting and correctly estimating local splicing in the case of incomplete transcript annotations. Conclusions: The transcriptome segmentation approach implemented in Yanagi exploits the computational and space efficiency of pseudo-alignment approaches. It significantly expands their applicability and interpretability in a variety of RNA-seq analyses by providing the means to model and capture local coverage variation in these analyses. Electronic supplementary material: The online version of this article (10.1186/s12859-019-2947-6) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Mohamed K Gunady
  - Department of Computer Science, University of Maryland, College Park, Maryland, USA
  - Center for Bioinformatics and Computational Biology, University of Maryland, College Park, Maryland, USA
- Stephen M Mount
  - Department of Cell Biology and Molecular Genetics, University of Maryland, College Park, Maryland, USA
- Héctor Corrada Bravo
  - Department of Computer Science, University of Maryland, College Park, Maryland, USA
  - Center for Bioinformatics and Computational Biology, University of Maryland, College Park, Maryland, USA

144
Deng W, Mou T, Kalari KR, Niu N, Wang L, Pawitan Y, Vu TN. Alternating EM algorithm for a bilinear model in isoform quantification from RNA-seq data. Bioinformatics 2019; 36:805-812. [PMID: 31400221 PMCID: PMC9883676 DOI: 10.1093/bioinformatics/btz640] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 08/15/2018] [Revised: 06/13/2019] [Accepted: 08/09/2019] [Indexed: 02/02/2023] Open
Abstract
MOTIVATION: Estimation of isoform-level gene expression from RNA-seq data depends on simplifying assumptions, such as uniform read distribution, that are easily violated in real data. Such violations typically lead to biased estimates. Most existing methods provide bias correction step(s), which are based on biological considerations, such as GC content, and applied to single samples separately. The main problem is that not all biases are known. RESULTS: We have developed a novel method called XAEM based on a more flexible and robust statistical model. Existing methods are essentially based on a linear model Xβ, where the design matrix X is known and is computed based on the simplifying assumptions. In contrast, XAEM considers Xβ as a bilinear model with both X and β unknown. Joint estimation of X and β is made possible by a simultaneous analysis of multi-sample RNA-seq data. Compared to existing methods, XAEM automatically performs empirical correction of potentially unknown biases. We use an alternating expectation-maximization (AEM) algorithm, alternating between estimation of X and β. For speed, XAEM utilizes quasi-mapping for read alignment, thus leading to a fast algorithm. Overall, XAEM performs favorably compared to recent advanced methods. For simulated datasets, XAEM obtains higher accuracy for multiple-isoform genes. In a differential-expression analysis of a real single-cell RNA-seq dataset, XAEM achieves substantially better rediscovery rates in independent validation sets. AVAILABILITY AND IMPLEMENTATION: The method and pipeline are implemented as a tool and freely available for use at http://fafner.meb.ki.se/biostatwiki/xaem/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Wenjiang Deng
  - Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm 17177, Sweden
- Tian Mou
  - Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm 17177, Sweden
- Nifang Niu
  - Department of Molecular Pharmacology and Experimental Therapeutics, Mayo Clinic, Rochester, MN 55905, USA
- Liewei Wang
  - Department of Molecular Pharmacology and Experimental Therapeutics, Mayo Clinic, Rochester, MN 55905, USA

145
Anwar MZ, Lanzen A, Bang-Andreasen T, Jacobsen CS. To assemble or not to resemble - A validated Comparative Metatranscriptomics Workflow (CoMW). Gigascience 2019; 8:giz096. [PMID: 31363751 PMCID: PMC6667343 DOI: 10.1093/gigascience/giz096] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.0] [Received: 01/10/2019] [Revised: 05/15/2019] [Accepted: 07/16/2019] [Indexed: 01/08/2023] Open
Abstract
BACKGROUND: Metatranscriptomics has been used widely for investigation and quantification of microbial communities' activity in response to external stimuli. By assessing the genes expressed, metatranscriptomics provides an understanding of the interactions between different major functional guilds and the environment. Here, we present a de novo assembly-based Comparative Metatranscriptomics Workflow (CoMW) implemented in a modular, reproducible structure. Metatranscriptomics typically uses short sequence reads, which can either be directly aligned to external reference databases ("assembly-free approach") or first assembled into contigs before alignment ("assembly-based approach"). We also compare CoMW (assembly-based implementation) with an assembly-free alternative workflow, using simulated and real-world metatranscriptomes from Arctic and temperate terrestrial environments. We evaluate their accuracy in precision and recall using generic and specialized hierarchical protein databases. RESULTS: CoMW provided significantly fewer false-positive results, resulting in more precise identification and quantification of functional genes in metatranscriptomes. Using the comprehensive database M5nr, the assembly-based approach identified genes with only 0.6% false-positive results at thresholds ranging from inclusive to stringent, compared with the assembly-free approach, which yielded up to 15% false-positive results. Using specialized databases (carbohydrate-active enzyme and nitrogen cycle), the assembly-based approach identified and quantified genes with 3-5 times fewer false-positive results. We also evaluated the impact of both approaches on real-world datasets. CONCLUSIONS: We present an open source de novo assembly-based CoMW. Our benchmarking findings support assembling short reads into contigs before alignment to a reference database because this provides higher precision and minimizes false-positive results.
Affiliation(s)
- Muhammad Zohaib Anwar
  - Department of Environmental Science, Aarhus University RISØ Campus, Frederiksborgvej 399, 4000 Roskilde, Denmark
- Anders Lanzen
  - AZTI, Herrera Kaia, Portualdea z/g, 20110 Pasaia, Basque Country, Spain
  - IKERBASQUE, Basque Foundation for Science, 48011 Bilbao, Spain
- Toke Bang-Andreasen
  - Department of Environmental Science, Aarhus University RISØ Campus, Frederiksborgvej 399, 4000 Roskilde, Denmark
  - Department of Biology, University of Copenhagen, Ole Maaloes Vej 5, 2200 Copenhagen, Denmark
- Carsten Suhr Jacobsen
  - Department of Environmental Science, Aarhus University RISØ Campus, Frederiksborgvej 399, 4000 Roskilde, Denmark
|
146
|
Raghupathy N, Choi K, Vincent MJ, Beane GL, Sheppard KS, Munger SC, Korstanje R, Pardo-Manuel de Villena F, Churchill GA. Hierarchical analysis of RNA-seq reads improves the accuracy of allele-specific expression. Bioinformatics 2019; 34:2177-2184. [PMID: 29444201 DOI: 10.1093/bioinformatics/bty078] [Citation(s) in RCA: 58] [Impact Index Per Article: 9.7] [Received: 09/01/2017] [Accepted: 02/09/2018] [Indexed: 02/06/2023]
Abstract
Motivation Allele-specific expression (ASE) refers to the differential abundance of the allelic copies of a transcript. RNA sequencing (RNA-seq) can provide quantitative estimates of ASE for genes with transcribed polymorphisms. When short-read sequences are aligned to a diploid transcriptome, read-mapping ambiguities confound our ability to directly count reads. Multi-mapping reads aligning equally well to multiple genomic locations, isoforms or alleles can comprise the majority (>85%) of reads. Discarding them can result in biases and substantial loss of information. Methods have been developed that use weighted allocation of read counts but these methods treat the different types of multi-reads equivalently. We propose a hierarchical approach to allocation of read counts that first resolves ambiguities among genes, then among isoforms, and lastly between alleles. We have implemented our model in EMASE software (Expectation-Maximization for Allele Specific Expression) to estimate total gene expression, isoform usage and ASE based on this hierarchical allocation. Results Methods that align RNA-seq reads to a diploid transcriptome incorporating known genetic variants improve estimates of ASE and total gene expression compared to methods that use reference genome alignments. Weighted allocation methods outperform methods that discard multi-reads. Hierarchical allocation of reads improves estimation of ASE even when data are simulated from a non-hierarchical model. Analysis of RNA-seq data from F1 hybrid mice using EMASE reveals widespread ASE associated with cis-acting polymorphisms and a small number of parent-of-origin effects. Availability and implementation EMASE software is available at https://github.com/churchill-lab/emase. Supplementary information Supplementary data are available at Bioinformatics online.
|
147
|
Sarkar H, Srivastava A, Patro R. Minnow: a principled framework for rapid simulation of dscRNA-seq data at the read level. Bioinformatics 2019; 35:i136-i144. [PMID: 31510649 PMCID: PMC6612833 DOI: 10.1093/bioinformatics/btz351] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Indexed: 11/18/2022]
Abstract
SUMMARY With the advancements of high-throughput single-cell RNA-sequencing protocols, there has been a rapid increase in the tools available to perform an array of analyses on the gene expression data that results from such studies. For example, there exist methods for pseudo-time series analysis, differential cell usage, cell-type detection, RNA-velocity in single cells, etc. Most analysis pipelines validate their results using known marker genes (which are not widely available for all types of analysis) and by using simulated data from gene-count-level simulators. Typically, the impact of using different read-alignment or unique molecular identifier (UMI) deduplication methods has not been widely explored. Assessments based on simulation tend to start at the level of assuming a simulated count matrix, ignoring the effect that different approaches for resolving UMI counts from the raw read data may produce. Here, we present minnow, a comprehensive sequence-level droplet-based single-cell RNA-sequencing (dscRNA-seq) experiment simulation framework. Minnow accounts for important sequence-level characteristics of experimental scRNA-seq datasets and models effects such as polymerase chain reaction amplification, cellular barcodes (CB) and UMI selection and sequence fragmentation and sequencing. It also closely matches the gene-level ambiguity characteristics that are observed in real scRNA-seq experiments. Using minnow, we explore the performance of some common processing pipelines to produce gene-by-cell count matrices from droplet-based scRNA-seq data, demonstrate the effect that realistic levels of gene-level sequence ambiguity can have on accurate quantification and show a typical use-case of minnow in assessing the output generated by different quantification pipelines on the simulated experiment. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Hirak Sarkar
  - Department of Computer Science, Stony Brook University, Stony Brook, NY, USA
- Avi Srivastava
  - Department of Computer Science, Stony Brook University, Stony Brook, NY, USA
- Rob Patro
  - Department of Computer Science, Stony Brook University, Stony Brook, NY, USA
|
148
|
Afsari B, Guo T, Considine M, Florea L, Kagohara LT, Stein-O'Brien GL, Kelley D, Flam E, Zambo KD, Ha PK, Geman D, Ochs MF, Califano JA, Gaykalova DA, Favorov AV, Fertig EJ. Splice Expression Variation Analysis (SEVA) for inter-tumor heterogeneity of gene isoform usage in cancer. Bioinformatics 2019; 34:1859-1867. [PMID: 29342249 DOI: 10.1093/bioinformatics/bty004] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Received: 06/09/2017] [Accepted: 01/10/2018] [Indexed: 12/22/2022]
Abstract
Motivation Current bioinformatics methods to detect changes in gene isoform usage in distinct phenotypes compare the relative expected isoform usage in phenotypes. These statistics model differences in isoform usage in normal tissues, which have stable regulation of gene splicing. Pathological conditions, such as cancer, can have broken regulation of splicing that increases the heterogeneity of the expression of splice variants. Inferring events with such differential heterogeneity in gene isoform usage requires new statistical approaches. Results We introduce Splice Expression Variability Analysis (SEVA) to model increased heterogeneity of splice variant usage between conditions (e.g. tumor and normal samples). SEVA uses a rank-based multivariate statistic that compares the variability of junction expression profiles within one condition to the variability within another. Simulated data show that SEVA is unique in modeling heterogeneity of gene isoform usage, and we benchmark SEVA's performance against EBSeq, DiffSplice and rMATS, which model differential isoform usage instead of heterogeneity. We confirm the accuracy of SEVA in identifying known splice variants in head and neck cancer and perform cross-study validation of novel splice variants. A novel comparison of splice variant heterogeneity between subtypes of head and neck cancer demonstrated unanticipated similarity between the heterogeneity of gene isoform usage in HPV-positive and HPV-negative subtypes and anticipated increased heterogeneity among HPV-negative samples with mutations in genes that regulate the splice variant machinery. These results show that SEVA accurately models differential heterogeneity of gene isoform usage from RNA-seq data. Availability and implementation SEVA is implemented in the R/Bioconductor package GSReg. Contact bahman@jhu.edu or favorov@sensi.org or ejfertig@jhmi.edu. Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Bahman Afsari
  - Division of Biostatistics and Bioinformatics, Department of Oncology, Sidney Kimmel Comprehensive Cancer Center
- Theresa Guo
  - Department of Otolaryngology-Head and Neck Surgery
- Michael Considine
  - Division of Biostatistics and Bioinformatics, Department of Oncology, Sidney Kimmel Comprehensive Cancer Center
- Liliana Florea
  - McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University, Baltimore, MD 21205, USA
- Luciane T Kagohara
  - Division of Biostatistics and Bioinformatics, Department of Oncology, Sidney Kimmel Comprehensive Cancer Center
- Genevieve L Stein-O'Brien
  - Division of Biostatistics and Bioinformatics, Department of Oncology, Sidney Kimmel Comprehensive Cancer Center
- Dylan Kelley
  - Department of Otolaryngology-Head and Neck Surgery
- Emily Flam
  - Department of Otolaryngology-Head and Neck Surgery
- Patrick K Ha
  - Department of Otolaryngology-Head and Neck Surgery, University of California, San Francisco, CA 94158, USA
- Donald Geman
  - Department of Applied Mathematics & Statistics, Johns Hopkins University, Baltimore, MD 21218, USA
- Michael F Ochs
  - Department of Mathematics & Statistics, The College of New Jersey, Ewing, NJ 08628, USA
- Joseph A Califano
  - Division of Otolaryngology, Department of Surgery, University of California, San Diego, CA 92093, USA
- Alexander V Favorov
  - Division of Biostatistics and Bioinformatics, Department of Oncology, Sidney Kimmel Comprehensive Cancer Center
  - Laboratory of Systems Biology and Computational Genetics, Vavilov Institute of General Genetics, RAS, Moscow 119333, Russia
- Elana J Fertig
  - Division of Biostatistics and Bioinformatics, Department of Oncology, Sidney Kimmel Comprehensive Cancer Center
|
149
|
Korthauer K, Kimes PK, Duvallet C, Reyes A, Subramanian A, Teng M, Shukla C, Alm EJ, Hicks SC. A practical guide to methods controlling false discoveries in computational biology. Genome Biol 2019; 20:118. [PMID: 31164141 PMCID: PMC6547503 DOI: 10.1186/s13059-019-1716-1] [Citation(s) in RCA: 238] [Impact Index Per Article: 39.7] [Received: 11/02/2018] [Accepted: 05/10/2019] [Indexed: 01/06/2023]
Abstract
BACKGROUND In high-throughput studies, hundreds to millions of hypotheses are typically tested. Statistical methods that control the false discovery rate (FDR) have emerged as popular and powerful tools for error rate control. While classic FDR methods use only p values as input, more modern FDR methods have been shown to increase power by incorporating complementary information as informative covariates to prioritize, weight, and group hypotheses. However, there is currently no consensus on how the modern methods compare to one another. We investigate the accuracy, applicability, and ease of use of two classic and six modern FDR-controlling methods by performing a systematic benchmark comparison using simulation studies as well as six case studies in computational biology. RESULTS Methods that incorporate informative covariates are modestly more powerful than classic approaches, and do not underperform classic approaches, even when the covariate is completely uninformative. The majority of methods are successful at controlling the FDR, with the exception of two modern methods under certain settings. Furthermore, we find that the improvement of the modern FDR methods over the classic methods increases with the informativeness of the covariate, total number of hypothesis tests, and proportion of truly non-null hypotheses. CONCLUSIONS Modern FDR methods that use an informative covariate provide advantages over classic FDR-controlling procedures, with the relative gain dependent on the application and informativeness of available covariates. We present our findings as a practical guide and provide recommendations to aid researchers in their choice of methods to correct for false discoveries.
Affiliation(s)
- Keegan Korthauer
  - Department of Data Sciences, Dana-Farber Cancer Institute, 450 Brookline Avenue, Boston, 02215 USA
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, 677 Huntington Avenue, Boston, 02215 USA
- Patrick K. Kimes
  - Department of Data Sciences, Dana-Farber Cancer Institute, 450 Brookline Avenue, Boston, 02215 USA
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, 677 Huntington Avenue, Boston, 02215 USA
- Claire Duvallet
  - Department of Biological Engineering, MIT, 77 Massachusetts Avenue, Cambridge, USA
  - Center for Microbiome Informatics and Therapeutics, MIT, 77 Massachusetts Avenue, Cambridge, USA
- Alejandro Reyes
  - Department of Data Sciences, Dana-Farber Cancer Institute, 450 Brookline Avenue, Boston, 02215 USA
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, 677 Huntington Avenue, Boston, 02215 USA
- Mingxiang Teng
  - Department of Biostatistics & Bioinformatics, Moffitt Cancer Center, 12902 Magnolia Drive, Tampa, 33612 USA
- Chinmay Shukla
  - Biological and Biomedical Sciences Program, Harvard University, Boston, USA
- Eric J. Alm
  - Department of Biological Engineering, MIT, 77 Massachusetts Avenue, Cambridge, USA
  - Center for Microbiome Informatics and Therapeutics, MIT, 77 Massachusetts Avenue, Cambridge, USA
  - Broad Institute, 415 Main Street, Cambridge, USA
- Stephanie C. Hicks
  - Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, 615 N. Wolfe Street, Baltimore, 21205 USA
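Illustrative note on the Korthauer et al. entry above: the abstract contrasts "classic" FDR methods, which use only p values, with covariate-aware ones. As a minimal sketch of what a classic procedure looks like, the block below implements the Benjamini-Hochberg step-up rule in plain Python. The abstract does not name the specific methods benchmarked, so choosing Benjamini-Hochberg here is an assumption, and the function name `benjamini_hochberg` and its parameters are hypothetical rather than taken from the paper or its code.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return one boolean per input p value: True if that hypothesis is
    rejected by the Benjamini-Hochberg step-up procedure at level alpha.

    Illustrative sketch only; not the benchmark code from the cited paper.
    """
    m = len(p_values)
    if m == 0:
        return []

    # Indices of the p values, ordered from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Step-up rule: find the largest 1-based rank k such that
    # p_(k) <= (k / m) * alpha.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_k = rank

    # Reject every hypothesis whose rank is at most max_k, even if its own
    # p value missed its own threshold (that is what "step-up" means).
    rejected = [False] * m
    for idx in order[:max_k]:
        rejected[idx] = True
    return rejected


if __name__ == "__main__":
    pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
    print(benjamini_hochberg(pvals, alpha=0.05))
```

The covariate-aware methods described in the abstract differ in that they would take a second input per hypothesis (the informative covariate) and use it to weight or group the tests; this sketch deliberately omits that.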
|
150
|
Singer JM, Fu DY, Hughey JJ. Simphony: simulating large-scale, rhythmic data. PeerJ 2019; 7:e6985. [PMID: 31198637 PMCID: PMC6535214 DOI: 10.7717/peerj.6985] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Received: 12/19/2018] [Accepted: 04/15/2019] [Indexed: 12/26/2022]
Abstract
Simulated data are invaluable for assessing a computational method’s ability to distinguish signal from noise. Although many biological systems show rhythmicity, there is no general-purpose tool to simulate large-scale, rhythmic data. Here we present Simphony, an R package for simulating data from experiments in which the abundances of rhythmic and non-rhythmic features (e.g., genes) are measured at multiple time points in multiple conditions. Simphony has parameters for specifying experimental design and each feature’s rhythmic properties (e.g., amplitude and phase). In addition, Simphony can sample measurements from Gaussian and negative binomial distributions, the latter of which approximates read counts from RNA-seq data. We show an example of using Simphony to evaluate the accuracy of rhythm detection. Our results suggest that Simphony will aid experimental design and computational method development. Simphony is thoroughly documented and freely available at https://github.com/hugheylab/simphony.
Affiliation(s)
- Jordan M Singer
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, United States of America
- Darwin Y Fu
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, United States of America
- Jacob J Hughey
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, United States of America
  - Department of Biological Sciences, Vanderbilt University, Nashville, TN, United States of America
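Illustrative note on the Singer et al. entry above: the abstract describes simulating features with rhythmic properties such as amplitude and phase, sampled at multiple time points with Gaussian noise. The sketch below shows one common way to express that idea, a cosine curve plus Gaussian noise, in plain Python. It is not Simphony's API (Simphony is an R package), and the function name, parameter names, and the 24-hour default period are all assumptions made for illustration. The negative binomial sampling mentioned in the abstract is not covered here.

```python
import math
import random


def simulate_rhythmic_feature(times, amplitude=1.0, phase=0.0, period=24.0,
                              baseline=0.0, sd=1.0, rng=None):
    """Simulate one feature measured at the given time points.

    The expected value follows a cosine curve that peaks at t == phase;
    each measurement adds independent Gaussian noise with standard
    deviation sd. Setting amplitude=0 yields a non-rhythmic feature.
    Hypothetical helper for illustration, not part of Simphony.
    """
    if rng is None:
        rng = random.Random()
    values = []
    for t in times:
        mean = baseline + amplitude * math.cos(2 * math.pi * (t - phase) / period)
        values.append(rng.gauss(mean, sd))
    return values


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so repeated runs give the same draws
    timepoints = list(range(0, 48, 4))  # every 4 h over two cycles
    rhythmic = simulate_rhythmic_feature(timepoints, amplitude=2.0, phase=6.0,
                                         baseline=10.0, sd=0.5, rng=rng)
    flat = simulate_rhythmic_feature(timepoints, amplitude=0.0,
                                     baseline=10.0, sd=0.5, rng=rng)
    print(rhythmic)
    print(flat)
```

Passing an explicit `random.Random` instance keeps simulations reproducible, which matters when the simulated data are used to compare rhythm-detection methods.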
|