1
Górczak K, Claesen J, Burzykowski T. A Conceptual Framework for Abundance Estimation of Genomic Targets in the Presence of Ambiguous Short Sequencing Reads. J Comput Biol 2020; 27:1232-1247. [PMID: 31895597 DOI: 10.1089/cmb.2019.0272] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 11/13/2022] Open
Abstract
RNA sequencing (RNA-seq) is widely used to study gene, transcript, or exon expression. To quantify the expression level, millions of short sequenced reads need to be mapped back to a reference genome or transcriptome. Read mapping makes it possible to find a location to which a read is identical or similar. Based upon this alignment, expression summaries, that is, read counts, are generated. However, reads may be matched to multiple locations. Such ambiguously mapped reads are often ignored in the analysis, which is a potential loss of information and may cause bias in expression estimation. We present the general principles underlying multiread allocation and unbiased estimation of the expression level of genes, exons, or transcripts in the presence of multiply mapped reads. The underlying principles are derived from a theoretical concept that identifies important sources of information such as the number of uniquely mapped reads, the total target length, and the length of the shared target regions. We show with simulation studies that methods incorporating some or all of the aforementioned sources of information estimate the expression levels of genes, exons, and/or transcripts with a higher precision and accuracy than methods that do not use this information. We identify important sources of information that should be taken into account by methods that estimate the abundance of genes, exons, and/or transcripts to achieve good precision and accuracy.
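As a rough illustration of how two of the information sources named in this abstract (unique read counts and target length) could drive multiread allocation, the sketch below splits each group of ambiguous reads in proportion to the unique-read density of the candidate targets. This is a generic proportional heuristic written for illustration, not the framework proposed in the paper; the function name, the input layout, and the equal-split fallback are all assumptions, and the length of shared regions is deliberately ignored.

```python
from typing import Dict, List, Tuple


def allocate_multireads(
    unique_counts: Dict[str, int],
    lengths: Dict[str, int],
    multiread_groups: List[Tuple[Tuple[str, ...], int]],
) -> Dict[str, float]:
    """Hypothetical helper: add multireads to unique counts.

    unique_counts: target -> number of uniquely mapped reads
    lengths: target -> target length in bases
    multiread_groups: (candidate targets, number of reads mapping to all of them)
    """
    estimates = {target: float(count) for target, count in unique_counts.items()}
    for targets, n_reads in multiread_groups:
        # Unique reads per base for each candidate target.
        densities = [unique_counts[t] / lengths[t] for t in targets]
        total = sum(densities)
        for target, density in zip(targets, densities):
            # Fall back to an equal split when no candidate has unique reads.
            share = density / total if total > 0 else 1.0 / len(targets)
            estimates[target] += n_reads * share
    return estimates


example = allocate_multireads(
    unique_counts={"A": 90, "B": 10},
    lengths={"A": 1000, "B": 1000},
    multiread_groups=[(("A", "B"), 50)],
)
print(example)
```

With equal lengths, the 50 shared reads in this toy input are split 9:1, following the unique counts.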
Affiliation(s)
- Katarzyna Górczak
- Interuniversity Institute for Biostatistics and statistical Bioinformatics, Hasselt University, Diepenbeek, Belgium; Department of Mathematical and Statistical Methods, Poznań University of Life Sciences, Poznań, Poland
- Jürgen Claesen
- Interuniversity Institute for Biostatistics and statistical Bioinformatics, Hasselt University, Diepenbeek, Belgium; Microbiology Unit, Belgian Nuclear Research Centre (SCK•CEN), Mol, Belgium
- Tomasz Burzykowski
- Interuniversity Institute for Biostatistics and statistical Bioinformatics, Hasselt University, Diepenbeek, Belgium; Department of Statistics and Medical Informatics, Medical University of Bialystok, Bialystok, Poland
2
Soneson C, Love MI, Patro R, Hussain S, Malhotra D, Robinson MD. A junction coverage compatibility score to quantify the reliability of transcript abundance estimates and annotation catalogs. Life Sci Alliance 2019; 2:2/1/e201800175. [PMID: 30655364 PMCID: PMC6337739 DOI: 10.26508/lsa.201800175] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.3] [Received: 08/24/2018] [Revised: 01/07/2019] [Accepted: 01/08/2019] [Indexed: 02/01/2023] Open
Abstract
Comparison of observed exon–exon junction counts to those predicted from estimated transcript abundances can identify genes with misannotated or misquantified transcripts. Most methods for statistical analysis of RNA-seq data take a matrix of abundance estimates for some type of genomic features as their input, and consequently the quality of any obtained results is directly dependent on the quality of these abundances. Here, we present the junction coverage compatibility score, which provides a way to evaluate the reliability of transcript-level abundance estimates and the accuracy of transcript annotation catalogs. It works by comparing the observed number of reads spanning each annotated splice junction in a genomic region to the predicted number of junction-spanning reads, inferred from the estimated transcript abundances and the genomic coordinates of the corresponding annotated transcripts. We show that although most genes show good agreement between the observed and predicted junction coverages, there is a small set of genes that do not. Genes with poor agreement are found regardless of the method used to estimate transcript abundances, and the corresponding transcript abundances should be treated with care in any downstream analyses.
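The comparison described in this abstract can be illustrated with a small discrepancy measure: predicted junction coverages are rescaled to the observed total, and the summed absolute differences are divided by that total. The sketch below is only a plausible form of such a score under those assumptions; the function name, the dictionary inputs, and the exact normalization are hypothetical and should not be read as the published definition.

```python
from typing import Dict, Optional


def junction_discrepancy(
    observed: Dict[str, float], predicted: Dict[str, float]
) -> Optional[float]:
    """Hypothetical score: 0 means the predicted junction coverage profile
    matches the observed one after rescaling; larger values mean worse agreement."""
    total_obs = sum(observed.values())
    if total_obs == 0:
        return None  # no junction-spanning reads, score undefined
    total_pred = sum(predicted.get(j, 0.0) for j in observed)
    # Rescale predictions so both profiles have the same total coverage.
    scale = total_obs / total_pred if total_pred > 0 else 0.0
    diff = sum(abs(observed[j] - scale * predicted.get(j, 0.0)) for j in observed)
    return diff / total_obs


print(junction_discrepancy({"j1": 10, "j2": 30}, {"j1": 20, "j2": 20}))
print(junction_discrepancy({"j1": 10, "j2": 30}, {"j1": 5, "j2": 15}))
```

In the second call the predicted profile has the same shape as the observed one, so the discrepancy is zero even though the absolute predicted values are half the observed ones.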
Affiliation(s)
- Charlotte Soneson
- Institute of Molecular Life Sciences, University of Zurich, Zurich, Switzerland; SIB Swiss Institute of Bioinformatics, University of Zurich, Zurich, Switzerland
- Michael I Love
- Department of Biostatistics, University of North Carolina-Chapel Hill, Chapel Hill, NC, USA; Department of Genetics, University of North Carolina-Chapel Hill, Chapel Hill, NC, USA
- Rob Patro
- Department of Computer Science, Stony Brook University, NY, USA
- Shobbir Hussain
- Department of Biology and Biochemistry, University of Bath, Bath, UK
- Dheeraj Malhotra
- F. Hoffmann-La Roche Ltd, Pharma Research and Early Development, Neuroscience, Ophthalmology and Rare Diseases, Roche Innovation Center Basel, Basel, Switzerland
- Mark D Robinson
- Institute of Molecular Life Sciences, University of Zurich, Zurich, Switzerland; SIB Swiss Institute of Bioinformatics, University of Zurich, Zurich, Switzerland
3
miR-MaGiC improves quantification accuracy for small RNA-seq. BMC Res Notes 2018; 11:296. [PMID: 29764489 PMCID: PMC5952827 DOI: 10.1186/s13104-018-3418-2] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Received: 03/22/2018] [Accepted: 05/09/2018] [Indexed: 12/17/2022] Open
Abstract
Objective: Many tools have been developed to profile microRNA (miRNA) expression from small RNA-seq data. These tools must contend with several issues: the small size of miRNAs, the small number of unique miRNAs, the fact that similar miRNAs can be transcribed from multiple loci, and the presence of miRNA isoforms known as isomiRs. Methods failing to address these issues can return misleading information. We propose a novel quantification method designed to address these concerns. Results: We present miR-MaGiC, a novel miRNA quantification method, implemented as a cross-platform tool in Java. miR-MaGiC performs stringent mapping to a core region of each miRNA and defines a meaningful set of target miRNA sequences by collapsing the miRNA space to “functional groups”. We hypothesize that these two features, mapping stringency and collapsing, provide more optimal quantification to a more meaningful unit (i.e., miRNA family). We test miR-MaGiC and several published methods on 210 small RNA-seq libraries, evaluating each method’s ability to accurately reflect global miRNA expression profiles. We define accuracy as total counts close to the total number of input reads originating from miRNAs. We find that miR-MaGiC, which incorporates both stringency and collapsing, provides the most accurate counts.
4
Webster PJ, Littlejohns AT, Gaunt HJ, Young RS, Rode B, Ritchie JE, Stead LF, Harrison S, Droop A, Martin HL, Tomlinson DC, Hyman AJ, Appleby HL, Boxall S, Bruns AF, Li J, Prasad RK, Lodge JPA, Burke DA, Beech DJ. Upregulated WEE1 protects endothelial cells of colorectal cancer liver metastases. Oncotarget 2018; 8:42288-42299. [PMID: 28178688 PMCID: PMC5522067 DOI: 10.18632/oncotarget.15039] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Received: 10/02/2016] [Accepted: 01/09/2017] [Indexed: 12/26/2022] Open
Abstract
Surgical resection of colorectal cancer liver metastases (CLM) can be curative, yet 80% of patients are unsuitable for this treatment. As angiogenesis is a determinant of CLM progression, we isolated endothelial cells from CLM and sought a mechanism that is upregulated, essential for the angiogenic properties of these cells, and relevant to emerging therapeutic options. Matched CLM endothelial cells (CLMECs) and endothelial cells of normal adjacent liver (LiECs) were superficially similar, but transcriptome sequencing revealed molecular differences, one of which was unexpected upregulation and functional significance of the checkpoint kinase WEE1. Western blotting confirmed that WEE1 protein was upregulated in CLMECs. Knockdown of WEE1 by targeted short interfering RNA or the WEE1 inhibitor AZD1775 suppressed proliferation and migration of CLMECs. Investigation of the underlying mechanism suggested induction of double-stranded DNA breaks due to nucleotide shortage, which then led to caspase 3-dependent apoptosis. The implication for CLMEC tube formation was striking, with AZD1775 inhibiting tube branch points by 83%. WEE1 inhibitors might therefore be a therapeutic option for CLM and could be considered more broadly as anti-angiogenic agents in cancer treatment.
Affiliation(s)
- Hannah J Gaunt
- School of Medicine, University of Leeds, Leeds LS2 9JT, UK
- Baptiste Rode
- School of Medicine, University of Leeds, Leeds LS2 9JT, UK
- Lucy F Stead
- School of Medicine, University of Leeds, Leeds LS2 9JT, UK
- Sally Harrison
- School of Medicine, University of Leeds, Leeds LS2 9JT, UK
- Alastair Droop
- School of Medicine, University of Leeds, Leeds LS2 9JT, UK; MRC Medical Bioinformatics Centre, University of Leeds, Leeds LS2 9NL, UK
- Heather L Martin
- School of Biological Sciences, University of Leeds, Leeds LS2 9JT, UK
- Adam J Hyman
- School of Medicine, University of Leeds, Leeds LS2 9JT, UK
- Sally Boxall
- School of Biological Sciences, University of Leeds, Leeds LS2 9JT, UK
- Jing Li
- School of Medicine, University of Leeds, Leeds LS2 9JT, UK
- Raj K Prasad
- Department of Hepatobiliary and Transplant Surgery, St. James's University Hospital, Leeds LS9 7TF, UK
- J Peter A Lodge
- Department of Hepatobiliary and Transplant Surgery, St. James's University Hospital, Leeds LS9 7TF, UK
- Dermot A Burke
- School of Medicine, University of Leeds, Leeds LS2 9JT, UK; Department of Colorectal Surgery, St. James's University Hospital, Leeds LS9 7TF, UK
- David J Beech
- School of Medicine, University of Leeds, Leeds LS2 9JT, UK
5
Bisgin H, Gong B, Wang Y, Tong W. Evaluation of Bioinformatics Approaches for Next-Generation Sequencing Analysis of microRNAs with a Toxicogenomics Study Design. Front Genet 2018; 9:22. [PMID: 29467792 PMCID: PMC5808213 DOI: 10.3389/fgene.2018.00022] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.0] [Received: 11/16/2017] [Accepted: 01/17/2018] [Indexed: 12/18/2022] Open
Abstract
MicroRNAs (miRNAs) are key post-transcriptional regulators that affect protein translation by targeting mRNAs. Their role in disease etiology and toxicity is well recognized. Given the rapid advancement of next-generation sequencing techniques, miRNA profiling has been increasingly conducted with RNA-seq, namely miRNA-seq. Analysis of miRNA-seq data requires several steps: (1) mapping the reads to miRBase, (2) considering mismatches during the hairpin alignment (windowing), and (3) counting the reads (quantification). The choice made in each step with respect to the parameter settings could affect miRNA quantification, detection of differentially expressed miRNAs (DEMs), and novel miRNA identification. Furthermore, these parameters do not act in isolation, and their joint effects impact miRNA-seq results and interpretation. In toxicogenomics, the variation associated with parameter setting should not overpower the treatment effect (such as the dose/time-dependent effect). In this study, four commonly used miRNA-seq analysis tools (i.e., miRDeep2, miRExpress, miRNAkey, sRNAbench) were comparatively evaluated with a standard toxicogenomics study design. We tested 30 different parameter settings on miRNA-seq data generated from thioacetamide-treated rat liver samples for three dose levels across four time points, followed by four normalization options. Because both miRExpress and miRNAkey yielded larger variation than that of the treatment effects across multiple parameter settings, our analyses mainly focused on the side-by-side comparison between miRDeep2 and sRNAbench. While the miRNAs detected by miRDeep2 were almost a subset of those detected by sRNAbench, the number of DEMs identified by both tools was comparable under the same parameter settings and normalization method. Change in the number of nucleotides out of the mature sequence in the hairpin alignment (window option) yielded the largest variation for miRNA quantification and DEM detection. However, such a variation is relatively small compared to the treatment effect when the study focused on DEMs that are more critical for interpreting the toxicological effect. While the normalization methods introduced a large variation in DEMs, the toxic behavior of thioacetamide showed consistency in the trend of time-dose responses. Overall, miRDeep2 was found to be preferable over other choices when the window option allowed up to three nucleotides from both ends.
Affiliation(s)
- Halil Bisgin
- Department of Computer Science, Engineering, and Physics, University of Michigan-Flint, Flint, MI, United States
- Binsheng Gong
- Division of Bioinformatics and Biostatistics, National Center for Toxicological Research (FDA), Jefferson, AR, United States
- Yuping Wang
- Division of Bioinformatics and Biostatistics, National Center for Toxicological Research (FDA), Jefferson, AR, United States
- Weida Tong
- Division of Bioinformatics and Biostatistics, National Center for Toxicological Research (FDA), Jefferson, AR, United States
6
Ju CJT, Zhao Z, Wang W. Efficient Approach to Correct Read Alignment for Pseudogene Abundance Estimates. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2017; 14:522-533. [PMID: 27429446 PMCID: PMC5514313 DOI: 10.1109/tcbb.2016.2591533] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Indexed: 05/22/2023]
Abstract
RNA-Sequencing has been the leading technology to quantify expression of thousands of genes simultaneously. The data analysis of an RNA-Seq experiment starts from aligning short reads to the reference genome/transcriptome or reconstructed transcriptome. However, current aligners lack the sensitivity to distinguish reads that come from homologous regions of a genome. One group of these homologies is the paralog pseudogenes. Pseudogenes arise from duplication of a set of protein coding genes, and have been considered as degraded paralogs in the genome due to their loss of functionality. Recent studies have provided evidence to support their novel regulatory roles in biological processes. With the growing interest in quantifying the expression level of pseudogenes in different tissues or cell lines, it is critical to have a sensitive method that can correctly align ambiguous reads and accurately estimate the expression level among homologous genes. Previously in PseudoLasso, we proposed a linear regression approach to learn read alignment behaviors, and to leverage this knowledge for abundance estimation and alignment correction. In this paper, we extend the work of PseudoLasso by grouping the homologous genomic regions into different communities using a community detection algorithm, followed by building a linear regression model separately for each community. The results show that this approach is able to retain the same accuracy as PseudoLasso. By breaking the genome into smaller homologous communities, the running time is improved from quadratic growth to linear with respect to the number of genes.
7
Ockendon NF, O'Connell LA, Bush SJ, Monzón-Sandoval J, Barnes H, Székely T, Hofmann HA, Dorus S, Urrutia AO. Optimization of next-generation sequencing transcriptome annotation for species lacking sequenced genomes. Mol Ecol Resour 2015; 16:446-58. [PMID: 26358618 PMCID: PMC4982090 DOI: 10.1111/1755-0998.12465] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.1] [Received: 09/01/2014] [Revised: 08/01/2015] [Accepted: 08/14/2015] [Indexed: 01/10/2023]
Abstract
Next‐generation sequencing methods, such as RNA‐seq, have permitted the exploration of gene expression in a range of organisms which have been studied in ecological contexts but lack a sequenced genome. However, the efficacy and accuracy of RNA‐seq annotation methods using reference genomes from related species have yet to be robustly characterized. Here we conduct a comprehensive power analysis employing RNA‐seq data from Drosophila melanogaster in conjunction with 11 additional genomes from related Drosophila species to compare annotation methods and quantify the impact of evolutionary divergence between transcriptome and the reference genome. Our analyses demonstrate that, regardless of the level of sequence divergence, direct genome mapping (DGM), where transcript short reads are aligned directly to the reference genome, significantly outperforms the widely used de novo and guided assembly‐based methods in both the quantity and accuracy of gene detection. Our analysis also reveals that DGM recovers a more representative profile of Gene Ontology functional categories, which are often used to interpret emergent patterns in genomewide expression analyses. Lastly, analysis of available primate RNA‐seq data demonstrates the applicability of our observations across diverse taxa. Our quantification of annotation accuracy and reduced gene detection associated with sequence divergence thus provides empirically derived guidelines for the design of future gene expression studies in species without sequenced genomes.
Affiliation(s)
- Nina F Ockendon
- Department of Biology and Biochemistry, University of Bath, Bath, BA2 7AY, UK; Milner Centre, University of Bath, Bath, BA2 7AY, UK
- Lauren A O'Connell
- FAS Centre for Systems Biology, Harvard University, Cambridge, MA, 02138, USA
- Stephen J Bush
- Department of Biology and Biochemistry, University of Bath, Bath, BA2 7AY, UK; Milner Centre, University of Bath, Bath, BA2 7AY, UK
- Jimena Monzón-Sandoval
- Department of Biology and Biochemistry, University of Bath, Bath, BA2 7AY, UK; Milner Centre, University of Bath, Bath, BA2 7AY, UK
- Holly Barnes
- Department of Biology and Biochemistry, University of Bath, Bath, BA2 7AY, UK; Milner Centre, University of Bath, Bath, BA2 7AY, UK
- Tamás Székely
- Department of Biology and Biochemistry, University of Bath, Bath, BA2 7AY, UK; Milner Centre, University of Bath, Bath, BA2 7AY, UK
- Hans A Hofmann
- Center for Computational Biology and Bioinformatics, Department of Integrative Biology, The University of Texas, Austin, TX, 78712, USA
- Steve Dorus
- Department of Biology, Syracuse University, Syracuse, NY, 13244, USA
- Araxi O Urrutia
- Department of Biology and Biochemistry, University of Bath, Bath, BA2 7AY, UK; Milner Centre, University of Bath, Bath, BA2 7AY, UK
8
Lee S, Seo CH, Alver BH, Lee S, Park PJ. EMSAR: estimation of transcript abundance from RNA-seq data by mappability-based segmentation and reclustering. BMC Bioinformatics 2015; 16:278. [PMID: 26335049 PMCID: PMC4559005 DOI: 10.1186/s12859-015-0704-z] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.5] [Received: 07/02/2015] [Accepted: 08/13/2015] [Indexed: 11/10/2022] Open
Abstract
Background: RNA-seq has been widely used for genome-wide expression profiling. RNA-seq data typically consists of tens of millions of short sequenced reads from different transcripts. However, due to sequence similarity among genes and among isoforms, the source of a given read is often ambiguous. Existing approaches for estimating expression levels from RNA-seq reads tend to compromise between accuracy and computational cost. Results: We introduce a new approach for quantifying transcript abundance from RNA-seq data. EMSAR (Estimation by Mappability-based Segmentation And Reclustering) groups reads according to the set of transcripts to which they are mapped and finds maximum likelihood estimates using a joint Poisson model for each optimal set of segments of transcripts. The method uses nearly all mapped reads, including those mapped to multiple genes. With an efficient transcriptome indexing based on modified suffix arrays, EMSAR minimizes the use of CPU time and memory while achieving accuracy comparable to the best existing methods. Conclusions: EMSAR is a method for quantifying transcripts from RNA-seq data with high accuracy and low computational cost. EMSAR is available at https://github.com/parklab/emsar.
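The first step named in this abstract, grouping reads by the set of transcripts they map to, can be pictured with a few lines of Python. This is a minimal sketch of the grouping idea only, assuming a simple read-to-transcripts dictionary as input; it does not reproduce EMSAR's segmentation, suffix-array indexing, or Poisson model, and the function and variable names are hypothetical.

```python
from collections import Counter
from typing import Dict, FrozenSet, Iterable


def group_reads_by_transcript_set(
    alignments: Dict[str, Iterable[str]]
) -> Counter:
    """Count reads per distinct set of compatible transcripts.

    alignments: read id -> transcript ids the read maps to (assumed input format).
    Reads with no alignments are skipped.
    """
    groups: Counter = Counter()
    for transcripts in alignments.values():
        key: FrozenSet[str] = frozenset(transcripts)
        if key:
            groups[key] += 1
    return groups


toy_alignments = {
    "r1": ["T1"],
    "r2": ["T1", "T2"],
    "r3": ["T2", "T1"],  # same set as r2, order does not matter
    "r4": ["T2"],
    "r5": [],            # unmapped read, ignored
}
for transcript_set, n_reads in group_reads_by_transcript_set(toy_alignments).items():
    print(sorted(transcript_set), n_reads)
```

Working with per-set counts rather than individual reads keeps the ambiguous reads in the analysis while shrinking the data a downstream likelihood model would need to consider.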
Affiliation(s)
- Soohyun Lee
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- Chae Hwa Seo
- Emerging Technology Center, DNA link, Seoul, South Korea
- Burak Han Alver
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- Sanghyuk Lee
- Emerging Technology Center, DNA link, Seoul, South Korea; Ewha Womans University, Seoul, Korea
- Peter J Park
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA; Informatics Program, Boston Children's Hospital and Division of Genetics, Brigham and Women's Hospital, Boston, MA, USA
9
10
Stead LF, Egan P, Devery A, Conway C, Daly C, Berri S, Wood H, Belvedere O, Papagiannopoulos K, Ryan A, Rabbitts P. An integrated inspection of the somatic mutations in a lung squamous cell carcinoma using next-generation sequencing. PLoS One 2013; 8:e78823. [PMID: 24244370 PMCID: PMC3823931 DOI: 10.1371/journal.pone.0078823] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Received: 07/17/2013] [Accepted: 09/16/2013] [Indexed: 01/10/2023] Open
Abstract
Squamous cell carcinoma (SCC) of the lung kills over 350,000 people annually worldwide, and is the main lung cancer histotype with no targeted treatments. High-coverage whole-genome sequencing of the other main subtypes, small-cell and adenocarcinoma, gave insights into carcinogenic mechanisms and disease etiology. The genomic complexity within the lung SCC subtype, as revealed by The Cancer Genome Atlas, means this subtype is likely to benefit from a more integrated approach in which the transcriptional consequences of somatic mutations are simultaneously inspected. Here we present such an approach: the integrated analysis of deep sequencing data from both the whole genome and whole transcriptome (coding and non-coding) of LUDLU-1, an SCC lung cell line. Our results show that LUDLU-1 lacks the mutational signature that has been previously associated with tobacco exposure in other lung cancer subtypes, and suggest that DNA-repair efficiency is adversely affected; LUDLU-1 contains somatic mutations in TP53 and BRCA2, allelic imbalance in the expression of two cancer-associated BRCA1 germline polymorphisms and reduced transcription of a potentially endogenous PARP2 inhibitor. Functional assays were performed and compared with a control lung cancer cell line. LUDLU-1 did not exhibit radiosensitisation or an increase in sensitivity to PARP inhibitors. However, LUDLU-1 did exhibit small but significant differences with respect to cisplatin sensitivity. Our research shows how integrated analyses of high-throughput data can generate hypotheses to be tested in the lab.
Affiliation(s)
- Lucy F. Stead
- Leeds Institute of Cancer and Pathology, University of Leeds, Leeds, West Yorkshire, United Kingdom
- Philip Egan
- Leeds Institute of Cancer and Pathology, University of Leeds, Leeds, West Yorkshire, United Kingdom
- Aoife Devery
- Gray Institute for Radiation Oncology and Biology, University of Oxford, Oxford, Oxfordshire, United Kingdom
- Caroline Conway
- Leeds Institute of Cancer and Pathology, University of Leeds, Leeds, West Yorkshire, United Kingdom
- Catherine Daly
- Leeds Institute of Cancer and Pathology, University of Leeds, Leeds, West Yorkshire, United Kingdom
- Stefano Berri
- Leeds Institute of Cancer and Pathology, University of Leeds, Leeds, West Yorkshire, United Kingdom
- Henry Wood
- Leeds Institute of Cancer and Pathology, University of Leeds, Leeds, West Yorkshire, United Kingdom
- Ornella Belvedere
- Leeds Institute of Cancer and Pathology, University of Leeds, Leeds, West Yorkshire, United Kingdom
- Kostas Papagiannopoulos
- Department of Thoracic Surgery, St. James’s University Hospital, Leeds, West Yorkshire, United Kingdom
- Anderson Ryan
- Gray Institute for Radiation Oncology and Biology, University of Oxford, Oxford, Oxfordshire, United Kingdom
- Pamela Rabbitts
- Leeds Institute of Cancer and Pathology, University of Leeds, Leeds, West Yorkshire, United Kingdom
11
Dao P, Numanagić I, Lin YY, Hach F, Karakoc E, Donmez N, Collins C, Eichler EE, Sahinalp SC. ORMAN: optimal resolution of ambiguous RNA-Seq multimappings in the presence of novel isoforms. Bioinformatics 2013; 30:644-51. [PMID: 24130305 DOI: 10.1093/bioinformatics/btt591] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.2] [Indexed: 11/13/2022]
Abstract
Motivation: RNA-Seq technology is promising to uncover many novel alternative splicing events, gene fusions and other variations in RNA transcripts. For an accurate detection and quantification of transcripts, it is important to resolve the mapping ambiguity for those RNA-Seq reads that can be mapped to multiple loci: >17% of the reads from mouse RNA-Seq data and 50% of the reads from some plant RNA-Seq data have multiple mapping loci. In this study, we show how to resolve the mapping ambiguity in the presence of novel transcriptomic events such as exon skipping and novel indels towards accurate downstream analysis. We introduce ORMAN (Optimal Resolution of Multimapping Ambiguity of RNA-Seq Reads), which aims to compute the minimum number of potential transcript products for each gene and to assign each multimapping read to one of these transcripts based on the estimated distribution of the region covering the read. ORMAN achieves this objective through a combinatorial optimization formulation, which is solved through well-known approximation algorithms, integer linear programs and heuristics. Results: On a simulated RNA-Seq dataset including a random subset of transcripts from the UCSC database, the performance of several state-of-the-art methods for identifying and quantifying novel transcripts, such as Cufflinks, IsoLasso and CLIIQ, is significantly improved through the use of ORMAN. Furthermore, in an experiment using real RNA-Seq reads, we show that ORMAN is able to resolve multimapping to produce coverage values that are similar to the original distribution, even in genes with highly non-uniform coverage. Availability: ORMAN is available at http://orman.sf.net
Affiliation(s)
- Phuong Dao
- School of Computing Science, Simon Fraser University, Burnaby, BC, Canada, Department of Genome Sciences, University of Washington, Seattle, WA, USA, Vancouver Prostate Centre & Department of Urologic Sciences, University of British Columbia, Vancouver, BC, Canada and Division of Computer Science, School of Informatics and Computing, Indiana University, Bloomington, IN, USA
12
The transcriptional consequences of somatic amplifications, deletions, and rearrangements in a human lung squamous cell carcinoma. Neoplasia 2013; 14:1075-86. [PMID: 23226101 DOI: 10.1593/neo.121380] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.3] [Received: 08/24/2012] [Revised: 09/25/2012] [Accepted: 09/28/2012] [Indexed: 12/16/2022] Open
Abstract
Lung cancer causes more deaths, worldwide, than any other cancer. Several histologic subtypes exist. Currently, there is a dearth of targeted therapies for treating one of the main subtypes: squamous cell carcinoma (SCC). As for many cancers, lung SCC karyotypes are often highly anomalous owing to large somatic structural variants, some of which are seen repeatedly in lung SCC, indicating a potential causal association for genes therein. We chose to characterize a lung SCC genome to unprecedented detail and integrate our findings with the concurrently characterized transcriptome. We aimed to ascertain how somatic structural changes affected gene expression within the cell in ways that could confer a pathogenic phenotype. We sequenced the genomes of a lung SCC cell line (LUDLU-1) and its matched lymphocyte cell line (AGLCL) to more than 50x coverage. We also sequenced the transcriptomes of LUDLU-1 and a normal bronchial epithelium cell line (LIMM-NBE1), resulting in more than 600 million aligned reads per sample, including both coding and non-coding RNA (ncRNA), in a strand-directional manner. We also captured small RNA (<30 bp). We discovered significant, but weak, correlations between copy number and expression for protein-coding genes, antisense transcripts, long intergenic ncRNA, and microRNA (miRNA). We found that miRNA undergo the largest change in overall expression pattern between the normal bronchial epithelium and the tumor cell line. We found evidence of transcription across the novel genomic sequence created from six somatic structural variants. For each part of our integrated analysis, we highlight candidate genes that have undergone the largest expression changes.
13
Jimenez-Lopez JC, Gachomo EW, Sharma S, Kotchoni SO. Genome sequencing and next-generation sequence data analysis: A comprehensive compilation of bioinformatics tools and databases. Am J Mol Biol 2013. [DOI: 10.4236/ajmb.2013.32016] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Indexed: 11/20/2022]
14
Li L, Huang D, Cheung MK, Nong W, Huang Q, Kwan HS. BSRD: a repository for bacterial small regulatory RNA. Nucleic Acids Res 2012. [PMID: 23203879 PMCID: PMC3531160 DOI: 10.1093/nar/gks1264] [Citation(s) in RCA: 88] [Impact Index Per Article: 6.8] [Indexed: 11/28/2022] Open
Abstract
In bacteria, small regulatory non-coding RNAs (sRNAs) are the most abundant class of post-transcriptional regulators. They are involved in diverse processes including quorum sensing, stress response, virulence and carbon metabolism. Recent developments in high-throughput techniques, such as genomic tiling arrays and RNA-Seq, have allowed efficient detection and characterization of bacterial sRNAs. However, a comprehensive repository to host sRNAs and their annotations is not available. Existing databases suffer from a limited number of bacterial species or sRNAs included. In addition, these databases do not have tools to integrate or analyse high-throughput sequencing data. Here, we have developed BSRD (http://kwanlab.bio.cuhk.edu.hk/BSRD), a comprehensive bacterial sRNAs database, as a repository for published bacterial sRNA sequences with annotations and expression profiles. BSRD contains over nine times more experimentally validated sRNAs than any other available databases. BSRD also provides combinatorial regulatory networks of transcription factors and sRNAs with their common targets. We have built and implemented in BSRD a novel RNA-Seq analysis platform, sRNADeep, to characterize sRNAs in large-scale transcriptome sequencing projects. We will update BSRD regularly.
Affiliation(s)
- Lei Li
- Biology Programme, School of Life Sciences, The Chinese University of Hong Kong, Hong Kong SAR, China
|
15
|
Xuan J, Yu Y, Qing T, Guo L, Shi L. Next-generation sequencing in the clinic: promises and challenges. Cancer Lett 2012; 340:284-95. [PMID: 23174106 DOI: 10.1016/j.canlet.2012.11.025] [Citation(s) in RCA: 199] [Impact Index Per Article: 15.3] [Received: 08/31/2012] [Revised: 11/13/2012] [Accepted: 11/13/2012] [Indexed: 02/06/2023]
Abstract
The advent of next generation sequencing (NGS) technologies has revolutionized the field of genomics, enabling fast and cost-effective generation of genome-scale sequence data with exquisite resolution and accuracy. Over the past years, rapid technological advances led by academic institutions and companies have continued to broaden NGS applications from research to the clinic. A recent crop of discoveries have highlighted the medical impact of NGS technologies on Mendelian and complex diseases, particularly cancer. However, the ever-increasing pace of NGS adoption presents enormous challenges in terms of data processing, storage, management and interpretation as well as sequencing quality control, which hinder the translation from sequence data into clinical practice. In this review, we first summarize the technical characteristics and performance of current NGS platforms. We further highlight advances in the applications of NGS technologies towards the development of clinical diagnostics and therapeutics. Common issues in NGS workflows are also discussed to guide the selection of NGS platforms and pipelines for specific research purposes.
Affiliation(s)
- Jiekun Xuan
- School of Pharmacy, Fudan University, 826 Zhangheng Road, Shanghai 201203, China; National Center for Toxicological Research, US Food and Drug Administration, 3900 NCTR Road, Jefferson, AR 72079, USA
|
16
|
Bonfert T, Csaba G, Zimmer R, Friedel CC. A context-based approach to identify the most likely mapping for RNA-seq experiments. BMC Bioinformatics 2012; 13 Suppl 6:S9. [PMID: 22537048 PMCID: PMC3358662 DOI: 10.1186/1471-2105-13-s6-s9] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Indexed: 11/24/2022] Open
Abstract
Background: Sequencing of mRNA (RNA-seq) by next generation sequencing technologies is widely used for analyzing the transcriptomic state of a cell. Here, one of the main challenges is the mapping of a sequenced read to its transcriptomic origin. As a simple alignment to the genome will fail to identify reads crossing splice junctions and a transcriptome alignment will miss novel splice sites, several approaches have been developed for this purpose. Most of these approaches have two drawbacks. First, each read is assigned to a location independent of whether the corresponding gene is expressed or not, i.e. information from other reads is not taken into account. Second, in case of multiple possible mappings, the mapping with the fewest mismatches is usually chosen, which may lead to wrong assignments due to sequencing errors. Results: To address these problems, we developed ContextMap, which efficiently uses information on the context of a read, i.e. reads mapping to the same expressed region. The context information is used to resolve possible ambiguities and, thus, a much larger degree of ambiguity can be allowed in the initial stage in order to detect all possible candidate positions. Although ContextMap can be used as a stand-alone version using either a genome or transcriptome as input, the version presented in this article is focused on refining initial mappings provided by other mapping algorithms. Evaluation results on simulated sequencing reads showed that the application of ContextMap to either TopHat or MapSplice mappings improved the mapping accuracy of both initial mappings considerably. Conclusions: In this article, we show that the context of reads mapping to nearby locations provides valuable information for identifying the best unique mapping for a read. Using our method, mappings provided by other state-of-the-art methods can be refined and alignment accuracy can be further improved. Availability: http://www.bio.ifi.lmu.de/ContextMap.
Affiliation(s)
- Thomas Bonfert
- Institute for Informatics, Ludwig-Maximilians-University Munich, Amalienstr. 17, 80333 Munich, Germany
|
17
|
Rozov R, Halperin E, Shamir R. MGMR: leveraging RNA-Seq population data to optimize expression estimation. BMC Bioinformatics 2012; 13 Suppl 6:S2. [PMID: 22537041 PMCID: PMC3358656 DOI: 10.1186/1471-2105-13-s6-s2] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 11/23/2022] Open
Abstract
Background: RNA-Seq is a technique that uses Next Generation Sequencing to identify transcripts and estimate transcription levels. When applying this technique for quantification, one must contend with reads that align to multiple positions in the genome (multireads). Previous efforts to resolve multireads have shown that RNA-Seq expression estimation can be improved using probabilistic allocation of reads to genes. These methods use a probabilistic generative model for data generation and resolve ambiguity using likelihood-based approaches. In many instances, RNA-seq experiments are performed in the context of a population. The generative models of current methods do not take into account such population information, and it is an open question whether this information can improve quantification of the individual samples. Results: In order to explore the contribution of population level information in RNA-seq quantification, we apply a hierarchical probabilistic generative model, which assumes that expression levels of different individuals are sampled from a Dirichlet distribution with parameters specific to the population, and reads are sampled from the distribution of expression levels. We introduce an optimization procedure for the estimation of the model parameters, and use HapMap data and simulated data to demonstrate that the model yields a significant improvement in the accuracy of expression levels of paralogous genes. Conclusions: We provide a proof of principle of the benefit of drawing on population commonalities to estimate expression. The results of our experiments demonstrate this approach can be beneficial, primarily for estimation at the gene level.
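The generative side of the hierarchical model described in this abstract can be illustrated with a minimal sketch. It only simulates data (per-individual expression drawn from a population-level Dirichlet, then reads drawn from that expression); it does not reproduce the authors' optimization procedure. The function names, the example `alpha` values, and the use of gene indices as reads are assumptions made for illustration.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one probability vector from Dirichlet(alpha) via normalized gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def simulate_population(alpha, n_individuals, n_reads, seed=0):
    """Simulate (expression, read counts) per individual.

    alpha: hypothetical population-level Dirichlet parameters, one per gene.
    Each read is represented only by the index of the gene it came from.
    """
    rng = random.Random(seed)
    genes = range(len(alpha))
    population = []
    for _ in range(n_individuals):
        # Individual expression levels share the population parameters.
        expression = sample_dirichlet(alpha, rng)
        # Reads are sampled from the individual's expression distribution.
        reads = rng.choices(genes, weights=expression, k=n_reads)
        counts = [reads.count(g) for g in genes]
        population.append((expression, counts))
    return population

population = simulate_population([5.0, 3.0, 2.0], n_individuals=4, n_reads=1000)
```

Larger `alpha` values make individuals more alike, which is the kind of population commonality such a model could exploit during estimation.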
Affiliation(s)
- Roye Rozov
- The Blavatnik School of Computer Science, Tel-Aviv University, Tel Aviv 69978, Israel
|
18
|
Isakov O, Ronen R, Kovarsky J, Gabay A, Gan I, Modai S, Shomron N. Novel insight into the non-coding repertoire through deep sequencing analysis. Nucleic Acids Res 2012; 40:e86. [PMID: 22406831 PMCID: PMC3367215 DOI: 10.1093/nar/gks228] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.3] [Indexed: 11/21/2022] Open
Abstract
Non-coding RNAs (ncRNA) account for a large portion of the transcribed genomic output. This diverse family of untranslated RNA molecules plays a crucial role in cellular function. The use of ‘deep sequencing’ technology (also known as ‘next generation sequencing’) to infer transcript expression levels in general, and ncRNA specifically, is becoming increasingly common in molecular and clinical laboratories. We developed software termed ‘RandA’ (which stands for ncRNA Read-and-Analyze) that performs comprehensive ncRNA profiling and differential expression analysis on deep sequencing generated data through a graphical user interface running on a local personal computer. Using RandA, we reveal the complexity of the ncRNA repertoire in a given cell population. We further demonstrate the relevance of such an extensive ncRNA analysis by elucidating a multitude of characterizing features in pathogen-infected mammalian cells. RandA is available for download at http://ibis.tau.ac.il/RandA.
Affiliation(s)
- Ofer Isakov
- Department of Cell and Developmental Biology, Sackler Faculty of Medicine, Tel Aviv University, Tel Aviv 69978, Israel
|
19
|
Optimizing a massive parallel sequencing workflow for quantitative miRNA expression analysis. PLoS One 2012; 7:e31630. [PMID: 22363693 PMCID: PMC3282730 DOI: 10.1371/journal.pone.0031630] [Citation(s) in RCA: 31] [Impact Index Per Article: 2.4] [Received: 06/22/2011] [Accepted: 01/14/2012] [Indexed: 11/19/2022] Open
Abstract
Background: Massive Parallel Sequencing methods (MPS) can extend and improve the knowledge obtained by conventional microarray technology, both for mRNAs and short non-coding RNAs, e.g. miRNAs. The processing methods used to extract and interpret the information are an important aspect of dealing with the vast amounts of data generated from short read sequencing. Although the number of computational tools for MPS data analysis is constantly growing, their strengths and weaknesses as part of a complex analytical pipeline have not yet been well investigated. Primary findings: A benchmark MPS miRNA dataset, resembling a situation in which miRNAs are spiked in biological replication experiments, was assembled by merging a publicly available MPS spike-in miRNA data set with MPS data derived from healthy donor peripheral blood mononuclear cells. Using this data set we observed that short-read counts are strongly underestimated in the case of duplicated miRNAs if the whole genome is used as reference. Furthermore, the sensitivity of miRNA detection is strongly dependent on the primary tool used in the analysis. Among the six aligners tested, specifically devoted to miRNA detection, SHRiMP and MicroRazerS show the highest sensitivity. Differential expression estimation is quite efficient. Among the five tools investigated, two (DESeq, baySeq) show very good specificity and sensitivity in the detection of differential expression. Conclusions: The results provided by our analysis allow the definition of a clear and simple optimized analytical workflow for the digital quantitative analysis of miRNAs.
|
20
|
Derrien T, Estellé J, Marco Sola S, Knowles DG, Raineri E, Guigó R, Ribeca P. Fast computation and applications of genome mappability. PLoS One 2012; 7:e30377. [PMID: 22276185 PMCID: PMC3261895 DOI: 10.1371/journal.pone.0030377] [Citation(s) in RCA: 327] [Impact Index Per Article: 25.2] [Received: 01/28/2011] [Accepted: 12/19/2011] [Indexed: 01/17/2023] Open
Abstract
We present a fast mapping-based algorithm to compute the mappability of each region of a reference genome up to a specified number of mismatches. Knowing the mappability of a genome is crucial for the interpretation of massively parallel sequencing experiments. We investigate the properties of the mappability of eukaryotic DNA/RNA both as a whole and at the level of the gene family, providing for various organisms tracks which allow the mappability information to be visually explored. In addition, we show that mappability varies greatly between species and gene classes. Finally, we suggest several practical applications where mappability can be used to refine the analysis of high-throughput sequencing data (SNP calling, gene expression quantification and paired-end experiments). This work highlights mappability as an important concept which deserves to be taken into full account, in particular when massively parallel sequencing technologies are employed. The GEM mappability program belongs to the GEM (GEnome Multitool) suite of programs, which can be freely downloaded for any use from its website (http://gemlibrary.sourceforge.net).
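As a rough illustration of the mappability concept, the sketch below assumes that the mappability of a position is the inverse of the number of times the k-mer starting there occurs in the reference. It is a naive counting approach restricted to exact matches on the forward strand, not the fast mapping-based GEM algorithm, and it ignores the mismatch tolerance mentioned in the abstract. The function name and the toy sequence are hypothetical.

```python
from collections import Counter

def kmer_mappability(genome, k):
    """Return one score per k-mer start position: 1 / (occurrences of that k-mer).

    Assumptions: exact matches only, forward strand only.
    A score of 1.0 means the k-mer is unique in the sequence.
    """
    kmers = [genome[i:i + k] for i in range(len(genome) - k + 1)]
    freq = Counter(kmers)
    return [1.0 / freq[kmer] for kmer in kmers]

# "ACGT" occurs twice in this toy sequence, so positions 0 and 4 score 0.5.
scores = kmer_mappability("ACGTACGTTT", 4)
```

Positions with low scores are those where a read of length k could not be placed unambiguously, which is where multiread handling matters most.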
Affiliation(s)
- Thomas Derrien
- Institut de Génétique et Développement (IGDR), Université Rennes 1, Rennes, France
- Jordi Estellé
- Centro Nacional de Análisis Genómico (CNAG), Barcelona, Spain
- David G. Knowles
- Centre for Genomic Regulation (CRG), Universitat Pompeu Fabra, Barcelona, Spain
- Roderic Guigó
- Centre for Genomic Regulation (CRG), Universitat Pompeu Fabra, Barcelona, Spain
- Paolo Ribeca
- Centro Nacional de Análisis Genómico (CNAG), Barcelona, Spain
|
21
|
Abstract
Background
RNA-Seq is revolutionizing the way transcript abundances are measured. A key challenge in transcript quantification from RNA-Seq data is the handling of reads that map to multiple genes or isoforms. This issue is particularly important for quantification with de novo transcriptome assemblies in the absence of sequenced genomes, as it is difficult to determine which transcripts are isoforms of the same gene. A second significant issue is the design of RNA-Seq experiments, in terms of the number of reads, read length, and whether reads come from one or both ends of cDNA fragments.
Results
We present RSEM, a user-friendly software package for quantifying gene and isoform abundances from single-end or paired-end RNA-Seq data. RSEM outputs abundance estimates, 95% credibility intervals, and visualization files and can also simulate RNA-Seq data. In contrast to other existing tools, the software does not require a reference genome. Thus, in combination with a de novo transcriptome assembler, RSEM enables accurate transcript quantification for species without sequenced genomes. On simulated and real data sets, RSEM has superior or comparable performance to quantification methods that rely on a reference genome. Taking advantage of RSEM's ability to effectively use ambiguously-mapping reads, we show that accurate gene-level abundance estimates are best obtained with large numbers of short single-end reads. On the other hand, estimates of the relative frequencies of isoforms within single genes may be improved through the use of paired-end reads, depending on the number of possible splice forms for each gene.
Conclusions
RSEM is an accurate and user-friendly software tool for quantifying transcript abundances from RNA-Seq data. As it does not rely on the existence of a reference genome, it is particularly useful for quantification with de novo transcriptome assemblies. In addition, RSEM has enabled valuable guidance for cost-efficient design of quantification experiments with RNA-Seq, which is currently relatively expensive.
|
22
|
Esteve-Codina A, Kofler R, Palmieri N, Bussotti G, Notredame C, Pérez-Enciso M. Exploring the gonad transcriptome of two extreme male pigs with RNA-seq. BMC Genomics 2011; 12:552. [PMID: 22067327 PMCID: PMC3221674 DOI: 10.1186/1471-2164-12-552] [Citation(s) in RCA: 76] [Impact Index Per Article: 5.4] [Received: 06/13/2011] [Accepted: 11/08/2011] [Indexed: 12/11/2022] Open
Abstract
BACKGROUND: Although RNA-seq greatly advances our understanding of complex transcriptome landscapes, such as those found in mammals, complete RNA-seq studies in livestock and in particular in the pig are still lacking. Here, we used high-throughput RNA sequencing to gain insight into the characterization of the poly-A RNA fraction expressed in pig male gonads. An expression analysis comparing different mapping approaches and detection of allele specific expression is also discussed in this study. RESULTS: By sequencing testicle mRNA of two phenotypically extreme pigs, one Iberian and one Large White, we identified hundreds of unannotated protein-coding genes (PcGs) in intergenic regions, some of them presenting orthology with closely related species. Interestingly, we also detected 2047 putative long non-coding RNA (lncRNA), including 469 with human homologues. Two methods, DEGseq and Cufflinks, were used for analyzing expression. DEGseq identified 15% fewer expressed genes than Cufflinks, because DEGseq utilizes only unambiguously mapped reads. Moreover, a large fraction of the transcriptome is made up of transposable elements (14500 elements encountered), as has been reported in previous studies. Gene expression results between microarray and RNA-seq technologies were relatively well correlated (r = 0.71 across individuals). Differentially expressed genes between Large White and Iberian showed a significant overrepresentation of gamete production and lipid metabolism gene ontology categories. Finally, allelic imbalance was detected in ~ 4% of heterozygous sites. CONCLUSIONS: RNA-seq is a powerful tool to gain insight into complex transcriptomes. In addition to uncovering many unannotated genes, our study allowed us to determine that a considerable fraction is made up of long non-coding transcripts and transposable elements. Their biological roles remain to be determined in future studies. In terms of differences in expression between Large White and Iberian pigs, these were largest for genes involved in spermatogenesis and lipid metabolism, which is consistent with extreme phenotypic differences in prolificacy and fat deposition between these two breeds.
Affiliation(s)
- Anna Esteve-Codina
- Departament de Ciència Animal i dels Aliments, Universitat Autònoma de Barcelona, 08193 Bellaterra, Spain
- Center for Research in Agricultural Genomics (CRAG), Campus UAB, 08193 Bellaterra, Spain
- Robert Kofler
- Institut für Populationsgenetik, Vetmeduni Vienna, Veterinärplatz 1, 1210 Vienna, Austria
- Nicola Palmieri
- Institut für Populationsgenetik, Vetmeduni Vienna, Veterinärplatz 1, 1210 Vienna, Austria
- Giovanni Bussotti
- Bioinformatics and Genomics, Centre for Genomic Regulation (CRG) and Universitat Pompeu Fabra (UPF), Carrer del Doctor Aiguader 88, Barcelona, Spain
- Cedric Notredame
- Bioinformatics and Genomics, Centre for Genomic Regulation (CRG) and Universitat Pompeu Fabra (UPF), Carrer del Doctor Aiguader 88, Barcelona, Spain
- Miguel Pérez-Enciso
- Departament de Ciència Animal i dels Aliments, Universitat Autònoma de Barcelona, 08193 Bellaterra, Spain
- Center for Research in Agricultural Genomics (CRAG), Campus UAB, 08193 Bellaterra, Spain
- Institut Català de Recerca i Estudis Avançats (ICREA), Carrer de Lluís Companys 23, 08010 Barcelona, Spain
|
23
|
Li B, Dewey CN. RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics 2011; 12:323. [PMID: 21816040 DOI: 10.1007/978-1-4939-0512-63] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 05/10/2011] [Accepted: 08/04/2011] [Indexed: 05/28/2023] Open
Abstract
BACKGROUND: RNA-Seq is revolutionizing the way transcript abundances are measured. A key challenge in transcript quantification from RNA-Seq data is the handling of reads that map to multiple genes or isoforms. This issue is particularly important for quantification with de novo transcriptome assemblies in the absence of sequenced genomes, as it is difficult to determine which transcripts are isoforms of the same gene. A second significant issue is the design of RNA-Seq experiments, in terms of the number of reads, read length, and whether reads come from one or both ends of cDNA fragments. RESULTS: We present RSEM, a user-friendly software package for quantifying gene and isoform abundances from single-end or paired-end RNA-Seq data. RSEM outputs abundance estimates, 95% credibility intervals, and visualization files and can also simulate RNA-Seq data. In contrast to other existing tools, the software does not require a reference genome. Thus, in combination with a de novo transcriptome assembler, RSEM enables accurate transcript quantification for species without sequenced genomes. On simulated and real data sets, RSEM has superior or comparable performance to quantification methods that rely on a reference genome. Taking advantage of RSEM's ability to effectively use ambiguously-mapping reads, we show that accurate gene-level abundance estimates are best obtained with large numbers of short single-end reads. On the other hand, estimates of the relative frequencies of isoforms within single genes may be improved through the use of paired-end reads, depending on the number of possible splice forms for each gene. CONCLUSIONS: RSEM is an accurate and user-friendly software tool for quantifying transcript abundances from RNA-Seq data. As it does not rely on the existence of a reference genome, it is particularly useful for quantification with de novo transcriptome assemblies. In addition, RSEM has enabled valuable guidance for cost-efficient design of quantification experiments with RNA-Seq, which is currently relatively expensive.
Affiliation(s)
- Bo Li
- Department of Computer Sciences, University of Wisconsin-Madison, Madison, WI, USA
|
24
|
Li B, Dewey CN. RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics 2011; 12:323. [PMID: 21816040 PMCID: PMC3163565 DOI: 10.1186/1471-2105-12-323] [Citation(s) in RCA: 14158] [Impact Index Per Article: 1011.3] [Received: 05/10/2011] [Accepted: 08/04/2011] [Indexed: 02/07/2023] Open
Abstract
Background: RNA-Seq is revolutionizing the way transcript abundances are measured. A key challenge in transcript quantification from RNA-Seq data is the handling of reads that map to multiple genes or isoforms. This issue is particularly important for quantification with de novo transcriptome assemblies in the absence of sequenced genomes, as it is difficult to determine which transcripts are isoforms of the same gene. A second significant issue is the design of RNA-Seq experiments, in terms of the number of reads, read length, and whether reads come from one or both ends of cDNA fragments. Results: We present RSEM, a user-friendly software package for quantifying gene and isoform abundances from single-end or paired-end RNA-Seq data. RSEM outputs abundance estimates, 95% credibility intervals, and visualization files and can also simulate RNA-Seq data. In contrast to other existing tools, the software does not require a reference genome. Thus, in combination with a de novo transcriptome assembler, RSEM enables accurate transcript quantification for species without sequenced genomes. On simulated and real data sets, RSEM has superior or comparable performance to quantification methods that rely on a reference genome. Taking advantage of RSEM's ability to effectively use ambiguously-mapping reads, we show that accurate gene-level abundance estimates are best obtained with large numbers of short single-end reads. On the other hand, estimates of the relative frequencies of isoforms within single genes may be improved through the use of paired-end reads, depending on the number of possible splice forms for each gene. Conclusions: RSEM is an accurate and user-friendly software tool for quantifying transcript abundances from RNA-Seq data. As it does not rely on the existence of a reference genome, it is particularly useful for quantification with de novo transcriptome assemblies. In addition, RSEM has enabled valuable guidance for cost-efficient design of quantification experiments with RNA-Seq, which is currently relatively expensive.
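How ambiguously mapping reads can inform abundance estimates is easiest to see in a toy expectation-maximization loop. The sketch below is a deliberately simplified model, not RSEM's actual statistical model or code: it assumes each read is described only by the set of transcripts it is compatible with, and that a read is equally likely to start anywhere along a transcript's length. The function name, the transcript ids, and the read data are hypothetical.

```python
def em_abundance(reads, lengths, n_iter=100):
    """Estimate the fraction of reads originating from each transcript.

    reads: list of tuples of transcript ids each read is compatible with.
    lengths: dict transcript id -> (effective) length, an assumed input.
    """
    ids = sorted(lengths)
    theta = {t: 1.0 / len(ids) for t in ids}  # uniform starting point
    for _ in range(n_iter):
        expected = {t: 0.0 for t in ids}
        # E-step: split each read among its compatible transcripts in
        # proportion to current abundance per unit length.
        for compat in reads:
            weights = {t: theta[t] / lengths[t] for t in compat}
            norm = sum(weights.values())
            for t, w in weights.items():
                expected[t] += w / norm
        # M-step: new abundance is the expected share of reads.
        theta = {t: expected[t] / len(reads) for t in ids}
    return theta

# 8 reads unique to t1, 2 unique to t2, 10 compatible with both.
reads = [("t1",)] * 8 + [("t2",)] * 2 + [("t1", "t2")] * 10
theta = em_abundance(reads, {"t1": 1000, "t2": 1000})
```

With equal lengths the shared reads end up divided in the same 4:1 ratio as the unique reads, so the estimate for `t1` converges to 0.8 rather than the 0.65 that an even split of the shared reads would give.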
Affiliation(s)
- Bo Li
- Department of Computer Sciences, University of Wisconsin-Madison, Madison, WI, USA
|
25
|
Abstract
BACKGROUND: RNA-Seq is revolutionizing the way transcript abundances are measured. A key challenge in transcript quantification from RNA-Seq data is the handling of reads that map to multiple genes or isoforms. This issue is particularly important for quantification with de novo transcriptome assemblies in the absence of sequenced genomes, as it is difficult to determine which transcripts are isoforms of the same gene. A second significant issue is the design of RNA-Seq experiments, in terms of the number of reads, read length, and whether reads come from one or both ends of cDNA fragments. RESULTS: We present RSEM, a user-friendly software package for quantifying gene and isoform abundances from single-end or paired-end RNA-Seq data. RSEM outputs abundance estimates, 95% credibility intervals, and visualization files and can also simulate RNA-Seq data. In contrast to other existing tools, the software does not require a reference genome. Thus, in combination with a de novo transcriptome assembler, RSEM enables accurate transcript quantification for species without sequenced genomes. On simulated and real data sets, RSEM has superior or comparable performance to quantification methods that rely on a reference genome. Taking advantage of RSEM's ability to effectively use ambiguously-mapping reads, we show that accurate gene-level abundance estimates are best obtained with large numbers of short single-end reads. On the other hand, estimates of the relative frequencies of isoforms within single genes may be improved through the use of paired-end reads, depending on the number of possible splice forms for each gene. CONCLUSIONS: RSEM is an accurate and user-friendly software tool for quantifying transcript abundances from RNA-Seq data. As it does not rely on the existence of a reference genome, it is particularly useful for quantification with de novo transcriptome assemblies. In addition, RSEM has enabled valuable guidance for cost-efficient design of quantification experiments with RNA-Seq, which is currently relatively expensive.
Affiliation(s)
- Bo Li
- Department of Computer Sciences, University of Wisconsin-Madison, Madison, WI, USA
|
26
|
Li B, Dewey CN. RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics 2011. [PMID: 21816040 DOI: 10.1186/1471-2105-12-323] [Citation(s) in RCA: 66] [Impact Index Per Article: 4.7] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: RNA-Seq is revolutionizing the way transcript abundances are measured. A key challenge in transcript quantification from RNA-Seq data is the handling of reads that map to multiple genes or isoforms. This issue is particularly important for quantification with de novo transcriptome assemblies in the absence of sequenced genomes, as it is difficult to determine which transcripts are isoforms of the same gene. A second significant issue is the design of RNA-Seq experiments, in terms of the number of reads, read length, and whether reads come from one or both ends of cDNA fragments. RESULTS: We present RSEM, a user-friendly software package for quantifying gene and isoform abundances from single-end or paired-end RNA-Seq data. RSEM outputs abundance estimates, 95% credibility intervals, and visualization files and can also simulate RNA-Seq data. In contrast to other existing tools, the software does not require a reference genome. Thus, in combination with a de novo transcriptome assembler, RSEM enables accurate transcript quantification for species without sequenced genomes. On simulated and real data sets, RSEM has superior or comparable performance to quantification methods that rely on a reference genome. Taking advantage of RSEM's ability to effectively use ambiguously-mapping reads, we show that accurate gene-level abundance estimates are best obtained with large numbers of short single-end reads. On the other hand, estimates of the relative frequencies of isoforms within single genes may be improved through the use of paired-end reads, depending on the number of possible splice forms for each gene. CONCLUSIONS: RSEM is an accurate and user-friendly software tool for quantifying transcript abundances from RNA-Seq data. As it does not rely on the existence of a reference genome, it is particularly useful for quantification with de novo transcriptome assemblies. In addition, RSEM has enabled valuable guidance for cost-efficient design of quantification experiments with RNA-Seq, which is currently relatively expensive.
Affiliation(s)
- Bo Li
- Department of Computer Sciences, University of Wisconsin-Madison, Madison, WI, USA
|
27
|
Chung D, Kuan PF, Li B, Sanalkumar R, Liang K, Bresnick EH, Dewey C, Keleş S. Discovering transcription factor binding sites in highly repetitive regions of genomes with multi-read analysis of ChIP-Seq data. PLoS Comput Biol 2011; 7:e1002111. [PMID: 21779159 PMCID: PMC3136429 DOI: 10.1371/journal.pcbi.1002111] [Citation(s) in RCA: 65] [Impact Index Per Article: 4.6] [Received: 01/20/2011] [Accepted: 05/18/2011] [Indexed: 11/19/2022] Open
Abstract
Chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) is rapidly replacing chromatin immunoprecipitation combined with genome-wide tiling array analysis (ChIP-chip) as the preferred approach for mapping transcription-factor binding sites and chromatin modifications. The state of the art for analyzing ChIP-seq data relies on using only reads that map uniquely to a relevant reference genome (uni-reads). This can lead to the omission of up to 30% of alignable reads. We describe a general approach for utilizing reads that map to multiple locations on the reference genome (multi-reads). Our approach is based on allocating multi-reads as fractional counts using a weighted alignment scheme. Using human STAT1 and mouse GATA1 ChIP-seq datasets, we illustrate that incorporation of multi-reads significantly increases sequencing depths, leads to detection of novel peaks that are not otherwise identifiable with uni-reads, and improves detection of peaks in mappable regions. We investigate various genome-wide characteristics of peaks detected only by utilization of multi-reads via computational experiments. Overall, peaks from multi-read analysis have similar characteristics to peaks that are identified by uni-reads except that the majority of them reside in segmental duplications. We further validate a number of GATA1 multi-read only peaks by independent quantitative real-time ChIP analysis and identify novel target genes of GATA1. These computational and experimental results establish that multi-reads can be of critical importance for studying transcription factor binding in highly repetitive regions of genomes with ChIP-seq experiments.
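The idea of allocating multi-reads as fractional counts can be sketched as follows. The abstract does not spell out the weighting scheme, so this example assumes a simple one: each multi-read is split among its candidate locations in proportion to the number of uni-reads already observed there, plus a pseudocount. The function name, the pseudocount, and the location labels are hypothetical and are not taken from the paper.

```python
from collections import defaultdict

def allocate_multireads(uni_counts, multireads, pseudocount=1.0):
    """Return location -> uni-read count plus fractional multi-read counts.

    uni_counts: dict location -> number of uniquely mapped reads.
    multireads: list of candidate-location lists, one per multi-read.
    Weights use uni-read counts only, so the result does not depend on
    the order in which multi-reads are processed.
    """
    totals = defaultdict(float)
    for loc, n in uni_counts.items():
        totals[loc] += n
    for candidates in multireads:
        weights = [uni_counts.get(loc, 0) + pseudocount for loc in candidates]
        norm = sum(weights)
        # Each multi-read contributes exactly one count in total.
        for loc, w in zip(candidates, weights):
            totals[loc] += w / norm
    return dict(totals)

counts = allocate_multireads({"A": 9, "B": 1}, [["A", "B"], ["A", "B", "C"]])
```

Because every multi-read adds up to one count across its candidates, the total depth rises by the number of multi-reads, and a location such as `C` with no uni-reads can still receive signal.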
Affiliation(s)
- Dongjun Chung
- Department of Statistics, University of Wisconsin, Madison, Wisconsin, United States of America
- Department of Biostatistics and Medical Informatics, University of Wisconsin, Madison, Wisconsin, United States of America
- Pei Fen Kuan
- Department of Biostatistics, University of North Carolina, Chapel Hill, North Carolina, United States of America
- Bo Li
- Department of Computer Sciences, University of Wisconsin, Madison, Wisconsin, United States of America
- Rajendran Sanalkumar
- Wisconsin Institutes for Medical Research, UW Carbone Cancer Center, Department of Cell and Regenerative Biology, University of Wisconsin School of Medicine and Public Health, Madison, Wisconsin, United States of America
- Kun Liang
- Department of Statistics, University of Wisconsin, Madison, Wisconsin, United States of America
- Department of Biostatistics and Medical Informatics, University of Wisconsin, Madison, Wisconsin, United States of America
- Emery H. Bresnick
- Wisconsin Institutes for Medical Research, UW Carbone Cancer Center, Department of Cell and Regenerative Biology, University of Wisconsin School of Medicine and Public Health, Madison, Wisconsin, United States of America
- Colin Dewey
- Department of Biostatistics and Medical Informatics, University of Wisconsin, Madison, Wisconsin, United States of America
- Department of Computer Sciences, University of Wisconsin, Madison, Wisconsin, United States of America
- Sündüz Keleş
- Department of Statistics, University of Wisconsin, Madison, Wisconsin, United States of America
- Department of Biostatistics and Medical Informatics, University of Wisconsin, Madison, Wisconsin, United States of America
|