1. Shabbir M, Mithani A. Roast: a tool for reference-free optimization of supertranscriptome assemblies. BMC Bioinformatics 2024; 25:2. [PMID: 38166712 PMCID: PMC10763045 DOI: 10.1186/s12859-023-05614-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/10/2023] [Accepted: 12/12/2023] [Indexed: 01/05/2024] Open
Abstract
BACKGROUND Transcriptomic studies involving organisms for which reference genomes are not available typically start by generating a de novo transcriptome or supertranscriptome assembly from the raw RNA-seq reads. Assembling a supertranscriptome is, however, a challenging task due to significantly varying abundance of mRNA transcripts, alternative splicing, and sequencing errors. As a result, popular de novo supertranscriptome assembly tools generate assemblies containing contigs that are partially assembled, fragmented, falsely chimeric, or locally mis-assembled, leading to decreased assembly accuracy. Commonly available tools for assembly improvement rely primarily on running BLAST against closely related species, making their accuracy and reliability dependent on the availability of data for closely related organisms. RESULTS We present ROAST, a tool for optimization of supertranscriptome assemblies that uses paired-end RNA-seq data from the Illumina sequencing platform to iteratively identify and fix assembly errors. It relies solely on the error signatures generated by RNA-seq alignment tools, including soft-clips, unexpected expression coverage, and reads with mates unmapped or mapped on a different contig, without performing BLAST searches against other organisms. Evaluation results using simulated as well as real datasets show that ROAST significantly improves assembly quality by identifying and fixing various assembly errors. CONCLUSION ROAST provides a reference-free approach to optimizing supertranscriptome assemblies, highlighting its utility in refining de novo supertranscriptome assemblies of non-model organisms.
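Two of the alignment-derived error signatures named in this abstract (soft-clips, and mates that are unmapped or mapped on a different contig) can be read directly from standard SAM fields. The following is a minimal sketch, not ROAST's implementation: the function names and the `min_clip` threshold are hypothetical, and only the SAM conventions (the CIGAR `S` operator, flag bit 0x8, and the `RNEXT` values `=` and `*`) are taken from the SAM specification.

```python
import re

# One CIGAR element: a length followed by a SAM operator code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
MATE_UNMAPPED = 0x8  # SAM flag bit: the mate (next segment) is unmapped


def soft_clip_lengths(cigar):
    """Return (leading, trailing) soft-clipped base counts for a CIGAR string."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Hard clips (H) can sit outside soft clips, so ignore them at the ends.
    core = [o for o in ops if o[1] != "H"]
    if not core:
        return 0, 0
    lead = core[0][0] if core[0][1] == "S" else 0
    trail = core[-1][0] if len(core) > 1 and core[-1][1] == "S" else 0
    return lead, trail


def error_signals(flag, cigar, contig, mate_contig, min_clip=10):
    """Flag per-read signatures; min_clip is an illustrative threshold."""
    lead, trail = soft_clip_lengths(cigar)
    mate_unmapped = bool(flag & MATE_UNMAPPED)
    return {
        "soft_clipped": max(lead, trail) >= min_clip,
        "mate_unmapped": mate_unmapped,
        # RNEXT "=" means same contig; "*" means no information.
        "mate_on_other_contig": (not mate_unmapped)
        and mate_contig not in ("=", "*", contig),
    }
```

For example, `error_signals(0x1, "100M", "contig1", "contig2")` reports only `mate_on_other_contig`. Coverage-based signatures would need per-base depth and are not sketched here.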
Affiliation(s)
- Madiha Shabbir
  - Department of Life Sciences, Syed Babar Ali School of Science and Engineering, Lahore University of Management Sciences (LUMS), DHA, Lahore, 54792, Pakistan
- Aziz Mithani
  - Department of Life Sciences, Syed Babar Ali School of Science and Engineering, Lahore University of Management Sciences (LUMS), DHA, Lahore, 54792, Pakistan
2. Nanni AV, Martinez N, Graze R, Morse A, Newman JRB, Jain V, Vlaho S, Signor S, Nuzhdin SV, Renne R, McIntyre LM. Sex-Biased Expression Is Associated With Chromatin State in Drosophila melanogaster and Drosophila simulans. Mol Biol Evol 2023; 40:msad078. [PMID: 37116218 PMCID: PMC10162771 DOI: 10.1093/molbev/msad078] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 06/27/2022] [Revised: 02/24/2023] [Accepted: 03/13/2023] [Indexed: 04/30/2023] Open
Abstract
In Drosophila melanogaster and D. simulans head tissue, 60% of orthologous genes show evidence of sex-biased expression in at least one species. Of these, ∼39% (2,192) are conserved in direction. We hypothesize enrichment of open chromatin in the sex where we see expression bias and closed chromatin in the opposite sex. Male-biased orthologs are significantly enriched for H3K4me3 marks in males of both species (∼89% of male-biased orthologs vs. ∼76% of unbiased orthologs). Similarly, female-biased orthologs are significantly enriched for H3K4me3 marks in females of both species (∼90% of female-biased orthologs vs. ∼73% of unbiased orthologs). The sex-bias ratio in female-biased orthologs was similar in magnitude between the two species, regardless of the closed chromatin (H3K27me2me3) marks in males. However, in male-biased orthologs, the presence of H3K27me2me3 in both species significantly reduced the correlation between D. melanogaster sex-bias ratio and the D. simulans sex-bias ratio. Male-biased orthologs are enriched for evidence of positive selection in the D. melanogaster group. There are more male-biased genes than female-biased genes in both species. For orthologs with gains/losses of sex-bias between the two species, there is an excess of male-bias compared to female-bias, but there is no consistent pattern in the relationship between H3K4me3 or H3K27me2me3 chromatin marks and expression. These data suggest chromatin state is a component of the maintenance of sex-biased expression and divergence of sex-bias between species is reflected in the complexity of the chromatin status.
Affiliation(s)
- Adalena V Nanni
  - Department of Molecular Genetics and Microbiology, University of Florida, Gainesville, FL
  - University of Florida Genetics Institute, University of Florida, Gainesville, FL
- Natalie Martinez
  - Department of Molecular Genetics and Microbiology, University of Florida, Gainesville, FL
- Rita Graze
  - Department of Biological Sciences, Auburn University, Auburn, AL
- Alison Morse
  - Department of Molecular Genetics and Microbiology, University of Florida, Gainesville, FL
  - University of Florida Genetics Institute, University of Florida, Gainesville, FL
- Jeremy R B Newman
  - University of Florida Genetics Institute, University of Florida, Gainesville, FL
- Vaibhav Jain
  - Department of Molecular Genetics and Microbiology, University of Florida, Gainesville, FL
- Srna Vlaho
  - Department of Biological Sciences, University of Southern California, Los Angeles, CA
- Sarah Signor
  - Department of Biological Sciences, North Dakota State University, Fargo, ND
- Sergey V Nuzhdin
  - Department of Biological Sciences, University of Southern California, Los Angeles, CA
- Rolf Renne
  - Department of Molecular Genetics and Microbiology, University of Florida, Gainesville, FL
  - University of Florida Genetics Institute, University of Florida, Gainesville, FL
- Lauren M McIntyre
  - Department of Molecular Genetics and Microbiology, University of Florida, Gainesville, FL
  - University of Florida Genetics Institute, University of Florida, Gainesville, FL
3. Nanni AV, Martinez N, Graze R, Morse A, Newman JRB, Jain V, Vlaho S, Signor S, Nuzhdin SV, Renne R, McIntyre LM. Sex-biased expression is associated with chromatin state in D. melanogaster and D. simulans. bioRxiv 2023:2023.01.13.523946. [PMID: 36711631 PMCID: PMC9882225 DOI: 10.1101/2023.01.13.523946] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/15/2023]
Abstract
We propose a new model for the association of chromatin state and sex-bias in expression. We hypothesize enrichment of open chromatin in the sex where we see expression bias (OS) and closed chromatin in the opposite sex (CO). In this study of D. melanogaster and D. simulans head tissue, sex-bias in expression is associated with H3K4me3 (open mark) in males for male-biased genes and in females for female-biased genes in both species. Sex-bias in expression is also largely conserved in direction and magnitude between the two species on the X and autosomes. In male-biased orthologs, the sex-bias ratio is more divergent between species if both species have H3K27me2me3 marks in females compared to when either or neither species has H3K27me2me3 in females. H3K27me2me3 marks in females are associated with male-bias in expression on the autosomes in both species, but on the X only in D. melanogaster. In female-biased orthologs the relationship between the species for the sex-bias ratio is similar regardless of the H3K27me2me3 marks in males. Female-biased orthologs are more similar in the ratio of sex-bias than male-biased orthologs and there is an excess of male-bias in expression in orthologs that gain/lose sex-bias. There is an excess of male-bias in sex-limited expression in both species suggesting excess male-bias is due to rapid evolution between the species. The X chromosome has an enrichment in male-limited H3K4me3 in both species and an enrichment of sex-bias in expression compared to the autosomes.
Affiliation(s)
- Adalena V Nanni
  - Department of Molecular Genetics and Microbiology, University of Florida, Gainesville, FL
  - University of Florida Genetics Institute, University of Florida, Gainesville, FL, USA
- Natalie Martinez
  - Department of Molecular Genetics and Microbiology, University of Florida, Gainesville, FL
- Rita Graze
  - Department of Biological Sciences, Auburn University, Auburn, AL, USA
- Alison Morse
  - Department of Molecular Genetics and Microbiology, University of Florida, Gainesville, FL
  - University of Florida Genetics Institute, University of Florida, Gainesville, FL, USA
- Jeremy R B Newman
  - University of Florida Genetics Institute, University of Florida, Gainesville, FL, USA
- Vaibhav Jain
  - Department of Molecular Genetics and Microbiology, University of Florida, Gainesville, FL
- Srna Vlaho
  - Department of Biological Sciences, University of Southern California, Los Angeles, CA, USA
- Sarah Signor
  - Department of Biological Sciences, North Dakota State University, Fargo, ND, USA
- Sergey V Nuzhdin
  - Department of Biological Sciences, University of Southern California, Los Angeles, CA, USA
- Rolf Renne
  - Department of Molecular Genetics and Microbiology, University of Florida, Gainesville, FL
  - University of Florida Genetics Institute, University of Florida, Gainesville, FL, USA
- Lauren M McIntyre
  - Department of Molecular Genetics and Microbiology, University of Florida, Gainesville, FL
  - University of Florida Genetics Institute, University of Florida, Gainesville, FL, USA
4. Alser M, Rotman J, Deshpande D, Taraszka K, Shi H, Baykal PI, Yang HT, Xue V, Knyazev S, Singer BD, Balliu B, Koslicki D, Skums P, Zelikovsky A, Alkan C, Mutlu O, Mangul S. Technology dictates algorithms: recent developments in read alignment. Genome Biol 2021; 22:249. [PMID: 34446078 PMCID: PMC8390189 DOI: 10.1186/s13059-021-02443-7] [Citation(s) in RCA: 45] [Impact Index Per Article: 11.3] [Received: 10/15/2020] [Accepted: 07/28/2021] [Indexed: 01/08/2023] Open
Abstract
Aligning sequencing reads onto a reference is an essential step of the majority of genomic analysis pipelines. Computational algorithms for read alignment have evolved in accordance with technological advances, leading to today's diverse array of alignment methods. We provide a systematic survey of algorithmic foundations and methodologies across 107 alignment methods, for both short and long reads. We provide a rigorous experimental evaluation of 11 read aligners to demonstrate the effect of these underlying algorithms on speed and efficiency of read alignment. We discuss how general alignment algorithms have been tailored to the specific needs of various domains in biology.
Affiliation(s)
- Mohammed Alser
  - Computer Science Department, ETH Zürich, 8092, Zürich, Switzerland
  - Computer Engineering Department, Bilkent University, 06800 Bilkent, Ankara, Turkey
  - Information Technology and Electrical Engineering Department, ETH Zürich, Zürich, 8092, Switzerland
- Jeremy Rotman
  - Department of Computer Science, University of California Los Angeles, Los Angeles, CA, 90095, USA
- Dhrithi Deshpande
  - Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, Los Angeles, CA, 90089, USA
- Kodi Taraszka
  - Department of Computer Science, University of California Los Angeles, Los Angeles, CA, 90095, USA
- Huwenbo Shi
  - Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, 02115, USA
  - Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, 02142, USA
- Pelin Icer Baykal
  - Department of Computer Science, Georgia State University, Atlanta, GA, 30302, USA
- Harry Taegyun Yang
  - Department of Computer Science, University of California Los Angeles, Los Angeles, CA, 90095, USA
  - Bioinformatics Interdepartmental Ph.D. Program, University of California Los Angeles, Los Angeles, CA, 90095, USA
- Victor Xue
  - Department of Computer Science, University of California Los Angeles, Los Angeles, CA, 90095, USA
- Sergey Knyazev
  - Department of Computer Science, Georgia State University, Atlanta, GA, 30302, USA
- Benjamin D Singer
  - Division of Pulmonary and Critical Care Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL, 60611, USA
  - Department of Biochemistry & Molecular Genetics, Northwestern University Feinberg School of Medicine, Chicago, USA
  - Simpson Querrey Institute for Epigenetics, Northwestern University Feinberg School of Medicine, Chicago, IL, 60611, USA
- Brunilda Balliu
  - Department of Computational Medicine, University of California Los Angeles, Los Angeles, CA, 90095, USA
- David Koslicki
  - Computer Science and Engineering, Pennsylvania State University, University Park, PA, 16801, USA
  - Biology Department, Pennsylvania State University, University Park, PA, 16801, USA
  - The Huck Institutes of the Life Sciences, Pennsylvania State University, University Park, PA, 16801, USA
- Pavel Skums
  - Department of Computer Science, Georgia State University, Atlanta, GA, 30302, USA
- Alex Zelikovsky
  - Department of Computer Science, Georgia State University, Atlanta, GA, 30302, USA
  - The Laboratory of Bioinformatics, I.M. Sechenov First Moscow State Medical University, Moscow, 119991, Russia
- Can Alkan
  - Computer Engineering Department, Bilkent University, 06800 Bilkent, Ankara, Turkey
  - Bilkent-Hacettepe Health Sciences and Technologies Program, Ankara, Turkey
- Onur Mutlu
  - Computer Science Department, ETH Zürich, 8092, Zürich, Switzerland
  - Computer Engineering Department, Bilkent University, 06800 Bilkent, Ankara, Turkey
  - Information Technology and Electrical Engineering Department, ETH Zürich, Zürich, 8092, Switzerland
- Serghei Mangul
  - Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, Los Angeles, CA, 90089, USA
5. Event Analysis: Using Transcript Events To Improve Estimates of Abundance in RNA-seq Data. G3: Genes, Genomes, Genetics 2018; 8:2923-2940. [PMID: 30021829 PMCID: PMC6118309 DOI: 10.1534/g3.118.200373] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Indexed: 02/07/2023]
Abstract
Alternative splicing leverages genomic content by allowing the synthesis of multiple transcripts and, by implication, protein isoforms, from a single gene. However, estimating the abundance of transcripts produced in a given tissue from short sequencing reads is difficult and can result in both the construction of transcripts that do not exist, and the failure to identify true transcripts. An alternative approach is to catalog the events that make up isoforms (splice junctions and exons). We present here the Event Analysis (EA) approach, where we project transcripts onto the genome and identify overlapping/unique regions and junctions. In addition, all possible logical junctions are assembled into a catalog. Transcripts are filtered before quantitation based on simple measures: the proportion of the events detected, and the coverage. We find that mapping to a junction catalog is more efficient at detecting novel junctions than mapping in a splice aware manner. We identify 99.8% of true transcripts while iReckon identifies 82% of the true transcripts and creates more transcripts not included in the simulation than were initially used in the simulation. Using PacBio Iso-seq data from a mouse neural progenitor cell model, EA detects 60% of the novel junctions that are combinations of existing exons while only 43% are detected by STAR. EA further detects ∼5,000 annotated junctions missed by STAR. Filtering transcripts based on the proportion of the transcript detected and the number of reads on average supporting that transcript captures 95% of the PacBio transcriptome. Filtering the reference transcriptome before quantitation results in a more stable estimate of isoform abundance, with improved correlation between replicates.
This was particularly evident when EA is applied to an RNA-seq study of type 1 diabetes (T1D), where the coefficient of variation among subjects (n = 81) in the transcript abundance estimates was substantially reduced compared to the estimation using the full reference. EA focuses on individual transcriptional events. These events can be quantitated and analyzed directly or used to identify the probable set of expressed transcripts. Simple rules based on detected events and coverage used in filtering result in a dramatic improvement in isoform estimation without the use of ancillary data (e.g., ChIP, long reads) that may not be available for many studies.
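The filtering rule described in this abstract (keep a transcript when enough of its events are detected and its coverage is sufficient) and the coefficient of variation used to compare abundance estimates can be sketched as follows. This is an illustration under stated assumptions, not the EA software: the function names and both thresholds (`min_prop`, `min_cov`) are hypothetical, since the abstract gives no cutoff values.

```python
from statistics import mean, stdev


def keep_transcript(events_detected, events_total, mean_coverage,
                    min_prop=0.75, min_cov=1.0):
    """Keep a transcript if enough of its events (exonic regions and
    junctions) are detected and its average coverage is high enough.
    Both thresholds are illustrative placeholders."""
    if events_total == 0:
        return False
    proportion = events_detected / events_total
    return proportion >= min_prop and mean_coverage >= min_cov


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean."""
    return stdev(values) / mean(values)
```

A lower coefficient of variation across subjects for the same transcript indicates a more stable abundance estimate, which is the comparison the abstract reports for the filtered versus the full reference.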
6. Nellore A, Collado-Torres L, Jaffe AE, Alquicira-Hernández J, Wilks C, Pritt J, Morton J, Leek JT, Langmead B. Rail-RNA: scalable analysis of RNA-seq splicing and coverage. Bioinformatics 2017; 33:4033-4040. [PMID: 27592709 PMCID: PMC5860083 DOI: 10.1093/bioinformatics/btw575] [Citation(s) in RCA: 38] [Impact Index Per Article: 4.8] [Received: 03/14/2016] [Revised: 06/29/2016] [Accepted: 08/26/2016] [Indexed: 12/24/2022] Open
Abstract
MOTIVATION RNA sequencing (RNA-seq) experiments now span hundreds to thousands of samples. Current spliced alignment software is designed to analyze each sample separately. Consequently, no information is gained from analyzing multiple samples together, and it requires extra work to obtain analysis products that incorporate data from across samples. RESULTS We describe Rail-RNA, a cloud-enabled spliced aligner that analyzes many samples at once. Rail-RNA eliminates redundant work across samples, making it more efficient as samples are added. For many samples, Rail-RNA is more accurate than annotation-assisted aligners. We use Rail-RNA to align 667 RNA-seq samples from the GEUVADIS project on Amazon Web Services in under 16 h for US$0.91 per sample. Rail-RNA outputs alignments in SAM/BAM format; but it also outputs (i) base-level coverage bigWigs for each sample; (ii) coverage bigWigs encoding normalized mean and median coverages at each base across samples analyzed; and (iii) exon-exon splice junctions and indels (features) in columnar formats that juxtapose coverages in samples in which a given feature is found. Supplementary outputs are ready for use with downstream packages for reproducible statistical analysis. We use Rail-RNA to identify expressed regions in the GEUVADIS samples and show that both annotated and unannotated (novel) expressed regions exhibit consistent patterns of variation across populations and with respect to known confounding variables. AVAILABILITY AND IMPLEMENTATION Rail-RNA is open-source software available at http://rail.bio. CONTACTS anellore@gmail.com or langmea@cs.jhu.edu. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Abhinav Nellore
  - Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
  - Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA
  - Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA
- Leonardo Collado-Torres
  - Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA
  - Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA
  - Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, MD, USA
- Andrew E Jaffe
  - Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA
  - Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA
  - Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, MD, USA
  - Department of Mental Health, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA
- José Alquicira-Hernández
  - Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA
  - Undergraduate Program on Genomic Sciences, National Autonomous University of Mexico, Mexico City, D.F., Mexico
- Christopher Wilks
  - Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
  - Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA
- Jacob Pritt
  - Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
  - Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA
- James Morton
  - Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA, USA
- Jeffrey T Leek
  - Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA
  - Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA
- Ben Langmead
  - Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
  - Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA
  - Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA
7. Rezaeian I, Tavakoli A, Cavallo-Medved D, Porter LA, Rueda L. A novel model used to detect differential splice junctions as biomarkers in prostate cancer from RNA-Seq data. J Biomed Inform 2016; 60:422-30. [PMID: 26992567 DOI: 10.1016/j.jbi.2016.03.010] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 11/20/2015] [Revised: 02/10/2016] [Accepted: 03/15/2016] [Indexed: 11/15/2022]
Abstract
BACKGROUND In cancer, alternative RNA splicing represents one mechanism for flexible gene regulation, whereby protein isoforms can be created to promote cell growth, division and survival. Detecting novel splice junctions in the cancer transcriptome may reveal pathways driving tumorigenic events. In this regard, RNA-Seq, a high-throughput sequencing technology, has expanded the study of cancer transcriptomics in the areas of gene expression, chimeric events and alternative splicing in search of novel biomarkers for the disease. RESULTS In this study, we propose a new two-dimensional peak finding method for detecting differential splice junctions in prostate cancer using RNA-Seq data. We have designed an integrative process that involves a new two-dimensional peak finding algorithm to combine junctions and then remove irrelevant introns across different samples within a population. We have also designed a scoring mechanism to select the most common junctions. CONCLUSIONS Our computational analysis on three independent datasets collected from patients diagnosed with prostate cancer reveals a small subset of junctions that may potentially serve as biomarkers for prostate cancer. AVAILABILITY The pipeline, along with its corresponding algorithms, is available upon request.
Affiliation(s)
- Iman Rezaeian
  - School of Computer Science, University of Windsor, 401 Sunset Ave., Windsor, Ontario N9B 3P4, Canada
- Ahmad Tavakoli
  - School of Computer Science, University of Windsor, 401 Sunset Ave., Windsor, Ontario N9B 3P4, Canada
- Dora Cavallo-Medved
  - Department of Biological Sciences, University of Windsor, 401 Sunset Ave., Windsor, Ontario N9B 3P4, Canada
- Lisa A Porter
  - Department of Biological Sciences, University of Windsor, 401 Sunset Ave., Windsor, Ontario N9B 3P4, Canada
- Luis Rueda
  - School of Computer Science, University of Windsor, 401 Sunset Ave., Windsor, Ontario N9B 3P4, Canada
8. Chu C, Li X, Wu Y. SpliceJumper: a classification-based approach for calling splicing junctions from RNA-seq data. BMC Bioinformatics 2015; 16 Suppl 17:S10. [PMID: 26678515 PMCID: PMC4674845 DOI: 10.1186/1471-2105-16-s17-s10] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Next-generation RNA sequencing technologies have been widely applied in transcriptome profiling. This facilitates further studies of gene structure and expression on a genome-wide scale. Aligning reads to the reference genome and calling splicing junctions is an important step for subsequent analyses, such as the analysis of alternative splicing and isoform construction. However, because of the existence of introns, when RNA-seq reads are aligned to the reference genome, reads cannot be fully mapped at splicing sites. Thus, it is challenging to align reads and call splicing junctions accurately. RESULTS In this paper, we present a classification-based approach for calling splicing junctions from RNA-seq data, which is implemented in the program SpliceJumper. SpliceJumper uses a machine learning approach which combines multiple features extracted from RNA-seq data. We compare SpliceJumper with two existing RNA-seq analysis approaches, TopHat2 and MapSplice2, on both simulated and real data. Our results show that SpliceJumper outperforms TopHat2 and MapSplice2 in accuracy. The program SpliceJumper can be downloaded at https://github.com/Reedwarbler/SpliceJumper.
9. Systematic discovery of complex insertions and deletions in human cancers. Nat Med 2015; 22:97-104. [PMID: 26657142 PMCID: PMC5003782 DOI: 10.1038/nm.4002] [Citation(s) in RCA: 70] [Impact Index Per Article: 7.0] [Received: 06/18/2015] [Accepted: 11/03/2015] [Indexed: 12/25/2022]
Abstract
Complex insertions and deletions (indels) are formed by simultaneously deleting and inserting DNA fragments of different sizes at a common genomic location. Here we present a systematic analysis of somatic complex indels in the coding sequences of samples from over 8,000 cancer cases using Pindel-C. We discovered 285 complex indels in cancer-associated genes (such as PIK3R1, TP53, ARID1A, GATA3 and KMT2D) in approximately 3.5% of cases analyzed; nearly all instances of complex indels were overlooked (81.1%) or misannotated (17.6%) in previous reports of 2,199 samples. In-frame complex indels are enriched in PIK3R1 and EGFR, whereas frameshifts are prevalent in VHL, GATA3, TP53, ARID1A, PTEN and ATRX. Furthermore, complex indels display strong tissue specificity (such as VHL in kidney cancer samples and GATA3 in breast cancer samples). Finally, structural analyses support findings of previously missed, but potentially druggable, mutations in the EGFR, MET and KIT oncogenes. This study indicates the critical importance of improving complex indel discovery and interpretation in medical research.
10. Kroon M, Lameijer EW, Lakenberg N, Hehir-Kwa JY, Thung DT, Slagboom PE, Kok JN, Ye K. Detecting dispersed duplications in high-throughput sequencing data using a database-free approach. Bioinformatics 2015; 32:505-10. [PMID: 26508759 DOI: 10.1093/bioinformatics/btv621] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.0] [Received: 07/13/2015] [Accepted: 10/20/2015] [Indexed: 11/15/2022] Open
Abstract
MOTIVATION Dispersed duplications (DDs) such as transposon element insertions and copy number variations are ubiquitous in the human genome. They have attracted the interest of biologists as well as medical researchers due to their role in both evolution and disease. The efforts of discovering DDs in high-throughput sequencing data are currently dominated by database-oriented approaches that require pre-existing knowledge of the DD elements to be detected. RESULTS We present DD_DETECTION, a database-free approach to finding DD events in high-throughput sequencing data. DD_DETECTION is able to detect DDs purely from paired-end read alignments. We show in a comparative study that this method is able to compete with database-oriented approaches in recovering validated transposon insertion events. We also experimentally validate the predictions of DD_DETECTION on a human DNA sample, showing that it can find not only duplicated elements present in common databases but also DDs of novel type. AVAILABILITY AND IMPLEMENTATION The software presented in this article is open source and available from https://bitbucket.org/mkroon/dd_detection.
Affiliation(s)
- M Kroon
  - Department of Molecular Epidemiology, Leiden University Medical Center, Leiden
- E W Lameijer
  - Department of Molecular Epidemiology, Leiden University Medical Center, Leiden
- N Lakenberg
  - Department of Molecular Epidemiology, Leiden University Medical Center, Leiden
- J Y Hehir-Kwa
  - Department of Human Genetics, Nijmegen Center for Molecular Life Sciences, Institute for Genetic and Metabolic Disease, Radboud University Nijmegen Medical Center, Nijmegen, Donders Centre for Neuroscience, Nijmegen, The Netherlands
- D T Thung
  - Department of Human Genetics, Nijmegen Center for Molecular Life Sciences, Institute for Genetic and Metabolic Disease, Radboud University Nijmegen Medical Center, Nijmegen
- P E Slagboom
  - Department of Molecular Epidemiology, Leiden University Medical Center, Leiden
- J N Kok
  - Department of Molecular Epidemiology, Leiden University Medical Center, Leiden
- K Ye
  - Department of Molecular Epidemiology, Leiden University Medical Center, Leiden, The Genome Institute, Washington University, St Louis, MO 63108, USA
11. Yang C, Wu PY, Tong L, Phan JH, Wang MD. The impact of RNA-seq aligners on gene expression estimation. ACM-BCB ... ... : the ... ACM Conference on Bioinformatics, Computational Biology and Biomedicine 2015; 2015:462-471. [PMID: 27583310 DOI: 10.1145/2808719.2808767] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.8] [Indexed: 02/01/2023]
Abstract
While numerous RNA-seq data analysis pipelines are available, research has shown that the choice of pipeline influences the results of differentially expressed gene detection and gene expression estimation. Gene expression estimation is a key step in RNA-seq data analysis, since the accuracy of gene expression estimates profoundly affects the subsequent analysis. Generally, gene expression estimation involves sequence alignment and quantification, and accurate gene expression estimation requires accurate alignment. However, the impact of aligners on gene expression estimation remains unclear. We address this need by constructing nine pipelines consisting of nine spliced aligners and one quantifier. We then use simulated data to investigate the impact of aligners on gene expression estimation. To evaluate alignment, we introduce three alignment performance metrics, (1) the percentage of reads aligned, (2) the percentage of reads aligned with zero mismatch (ZeroMismatchPercentage), and (3) the percentage of reads aligned with at most one mismatch (ZeroOneMismatchPercentage). We then evaluate the impact of alignment performance on gene expression estimation using three metrics, (1) gene detection accuracy, (2) the number of genes falsely quantified (FalseExpNum), and (3) the number of genes with falsely estimated fold changes (FalseFcNum). We found that among various pipelines, FalseExpNum and FalseFcNum are correlated. Moreover, FalseExpNum is linearly correlated with the percentage of reads aligned and ZeroMismatchPercentage, and FalseFcNum is linearly correlated with ZeroMismatchPercentage. Because of this correlation, the percentage of reads aligned and ZeroMismatchPercentage may be used to assess the performance of gene expression estimation for all RNA-seq datasets.
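The three alignment performance metrics defined in this abstract are simple proportions and can be sketched as below. This is an illustrative sketch, not the authors' code: the function name and input representation are hypothetical, and it assumes that all three percentages use the total number of reads as the denominator, which the abstract does not state explicitly.

```python
def alignment_metrics(mismatches_per_read):
    """Compute the three alignment metrics named in the abstract.

    mismatches_per_read: one entry per input read, holding the number of
    mismatches in its alignment, or None if the read was not aligned.
    Assumption: every percentage is taken over all input reads.
    """
    total = len(mismatches_per_read)
    if total == 0:
        raise ValueError("no reads given")
    aligned = [m for m in mismatches_per_read if m is not None]
    zero = sum(1 for m in aligned if m == 0)
    zero_or_one = sum(1 for m in aligned if m <= 1)
    return {
        "PercentAligned": 100 * len(aligned) / total,
        "ZeroMismatchPercentage": 100 * zero / total,
        "ZeroOneMismatchPercentage": 100 * zero_or_one / total,
    }
```

For five reads with mismatch counts `[0, 0, 1, 2, None]`, four are aligned, two have zero mismatches, and three have at most one.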
Affiliation(s)
- Cheng Yang
- Department of Biomedical Engineering, Georgia Institute of Technology, Emory University, and Peking University, Atlanta, GA 30332, USA
- Po-Yen Wu
- School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA, 30332, USA
- Li Tong
- Department of Biomedical Engineering, Georgia Institute of Technology and Emory University, Atlanta, GA 30332, USA
- John H Phan
- Department of Biomedical Engineering, Georgia Institute of Technology and Emory University, Atlanta, GA 30332, USA
- May D Wang
- Department of Biomedical Engineering, Georgia Institute of Technology and Emory University, Atlanta, GA 30332, USA
|
12
|
Oh S. How are Bayesian and Non-Parametric Methods Doing a Great Job in RNA-Seq Differential Expression Analysis?: A Review. Communications for Statistical Applications and Methods 2015. [DOI: 10.5351/csam.2015.22.2.181] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 12/20/2022]
Affiliation(s)
- Sunghee Oh
- Department of Veterinary Medicine, Jeju National University, Korea
|
13
|
Pulyakhina I, Gazzoli I, 't Hoen PAC, Verwey N, den Dunnen JT, den Dunnen J, Aartsma-Rus A, Laros JFJ. SplicePie: a novel analytical approach for the detection of alternative, non-sequential and recursive splicing. Nucleic Acids Res 2015; 43:e80. [PMID: 25800735 PMCID: PMC4499118 DOI: 10.1093/nar/gkv242] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.1] [Received: 08/29/2014] [Accepted: 03/09/2015] [Indexed: 11/20/2022] Open
Abstract
Alternative splicing is a powerful mechanism present in eukaryotic cells to obtain a wide range of transcripts and protein isoforms from a relatively small number of genes. The mechanisms regulating (alternative) splicing and the paradigm of consecutive splicing have recently been challenged, especially for genes with a large number of introns. RNA-Seq, a powerful technology using deep sequencing in order to determine transcript structure and expression levels, is usually performed on mature mRNA, therefore not allowing detailed analysis of splicing progression. Sequencing pre-mRNA at different stages of splicing potentially provides insight into mRNA maturation. Although the number of tools that analyze total and cytoplasmic RNA in order to elucidate the transcriptome composition is rapidly growing, there are no tools specifically designed for the analysis of nuclear RNA (which contains mixtures of pre- and mature mRNA). We developed dedicated algorithms to investigate the splicing process. In this paper, we present a new classification of RNA-Seq reads based on three major stages of splicing: pre-, intermediate- and post-splicing. Applying this novel classification we demonstrate the possibility to analyze the order of splicing. Furthermore, we uncover the potential to investigate the multi-step nature of splicing, assessing various types of recursive splicing events. We provide the data that gives biological insight into the order of splicing, show that non-sequential splicing of certain introns is reproducible and coinciding in multiple cell lines. We validated our observations with independent experimental technologies and showed the reliability of our method. The pipeline, named SplicePie, is freely available at: https://github.com/pulyakhina/splicing_analysis_pipeline. The example data can be found at: https://barmsijs.lumc.nl/HG/irina/example_data.tar.gz.
Affiliation(s)
- Irina Pulyakhina
- Department of Human Genetics, Leiden University Medical Center, Leiden, The Netherlands
- Isabella Gazzoli
- Department of Human Genetics, Leiden University Medical Center, Leiden, The Netherlands
- Peter A C 't Hoen
- Department of Human Genetics, Leiden University Medical Center, Leiden, The Netherlands
- Nisha Verwey
- Department of Human Genetics, Leiden University Medical Center, Leiden, The Netherlands
- Johan T den Dunnen
- Department of Human Genetics, Leiden University Medical Center, Leiden, The Netherlands; Leiden Genome Technology Center, Leiden University Medical Center, Leiden, The Netherlands
- Johan den Dunnen
- Department of Human Genetics, Leiden University Medical Center, Leiden, The Netherlands; Leiden Genome Technology Center, Leiden University Medical Center, Leiden, The Netherlands
- Annemieke Aartsma-Rus
- Department of Human Genetics, Leiden University Medical Center, Leiden, The Netherlands
- Jeroen F J Laros
- Department of Human Genetics, Leiden University Medical Center, Leiden, The Netherlands; Leiden Genome Technology Center, Leiden University Medical Center, Leiden, The Netherlands
|
14
|
Gatto A, Torroja-Fungairiño C, Mazzarotto F, Cook SA, Barton PJR, Sánchez-Cabo F, Lara-Pezzi E. FineSplice, enhanced splice junction detection and quantification: a novel pipeline based on the assessment of diverse RNA-Seq alignment solutions. Nucleic Acids Res 2014; 42:e71. [PMID: 24574529 PMCID: PMC4005686 DOI: 10.1093/nar/gku166] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.4] [Indexed: 01/26/2023] Open
Abstract
Alternative splicing is the main mechanism governing protein diversity. The recent developments in RNA-Seq technology have enabled the study of the global impact and regulation of this biological process. However, the lack of standardized protocols constitutes a major bottleneck in the analysis of alternative splicing. This is particularly important for the identification of exon–exon junctions, which is a critical step in any analysis workflow. Here we performed a systematic benchmarking of alignment tools to dissect the impact of design and method on the mapping, detection and quantification of splice junctions from multi-exon reads. Accordingly, we devised a novel pipeline based on TopHat2 combined with a splice junction detection algorithm, which we have named FineSplice. FineSplice allows effective elimination of spurious junction hits arising from artefactual alignments, achieving up to 99% precision in both real and simulated data sets and yielding superior F1 scores under most tested conditions. The proposed strategy conjugates an efficient mapping solution with a semi-supervised anomaly detection scheme to filter out false positives and allows reliable estimation of expressed junctions from the alignment output. Ultimately this provides more accurate information to identify meaningful splicing patterns. FineSplice is freely available at https://sourceforge.net/p/finesplice/.
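For readers unfamiliar with the precision and F1 figures quoted in this abstract, the following minimal sketch shows how such scores are conventionally computed for a set of predicted splice junctions against a truth set. It is not FineSplice's own evaluation code; the junction representation (chromosome, donor, acceptor) and the function name are hypothetical.

```python
from typing import Dict, Set, Tuple

# A junction is identified here by (chromosome, donor position, acceptor position).
Junction = Tuple[str, int, int]


def junction_scores(predicted: Set[Junction], truth: Set[Junction]) -> Dict[str, float]:
    """Return precision, recall and F1 for predicted junctions against a truth set."""
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    predicted = {("chr1", 100, 200), ("chr1", 300, 400), ("chr2", 50, 90), ("chr2", 500, 600)}
    truth = {("chr1", 100, 200), ("chr1", 300, 400), ("chr3", 10, 20), ("chr3", 70, 80)}
    print(junction_scores(predicted, truth))
```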
Affiliation(s)
- Alberto Gatto
- Cardiovascular Development and Repair Department, Centro Nacional de Investigaciones Cardiovasculares, Madrid, 28029, Spain, Bioinformatics Unit, Centro Nacional de Investigaciones Cardiovasculares, Madrid, 28029, Spain, National Heart and Lung Institute, Imperial College London, London SW7 2AZ, UK, Cardiovascular Biomedical Research Unit, NIHR Royal Brompton and Harefield NHS Foundation Trust, London SW3 6NP, UK, Department of Cardiology, National Heart Centre Singapore, Singapore 168752, Singapore and Cardiovascular and Metabolic Disorders Program, Duke-NUS Graduate Medical School, Singapore 169857, Singapore
|
15
|
Alamancos GP, Agirre E, Eyras E. Methods to study splicing from high-throughput RNA sequencing data. Methods Mol Biol 2014; 1126:357-97. [PMID: 24549677 DOI: 10.1007/978-1-62703-980-2_26] [Citation(s) in RCA: 54] [Impact Index Per Article: 4.9] [Indexed: 12/02/2022]
Abstract
The development of novel high-throughput sequencing (HTS) methods for RNA (RNA-Seq) has provided a very powerful means to study splicing under multiple conditions at unprecedented depth. However, the complexity of the information to be analyzed has turned this into a challenging task. In the last few years, a plethora of tools have been developed, allowing researchers to process RNA-Seq data to study the expression of isoforms and splicing events, and their relative changes under different conditions. We provide an overview of the methods available to study splicing from short RNA-Seq data, which could serve as an entry point for users who need to decide on a suitable tool for a specific analysis. We also attempt to propose a classification of the tools according to the operations they do, to facilitate the comparison and choice of methods.
Affiliation(s)
- Gael P Alamancos
- Computational Genomics, Universitat Pompeu Fabra, Barcelona, Spain
|
16
|
Burns PD, Li Y, Ma J, Borodovsky M. UnSplicer: mapping spliced RNA-Seq reads in compact genomes and filtering noisy splicing. Nucleic Acids Res 2013; 42:e25. [PMID: 24259430 PMCID: PMC3936741 DOI: 10.1093/nar/gkt1141] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 11/16/2022] Open
Abstract
Accurate mapping of spliced RNA-Seq reads to genomic DNA has been known as a challenging problem. Despite significant efforts invested in developing efficient algorithms, with the human genome as a primary focus, the best solution is still not known. A recently introduced tool, TrueSight, has demonstrated better performance compared with earlier developed algorithms such as TopHat and MapSplice. To improve detection of splice junctions, TrueSight uses information on statistical patterns of nucleotide ordering in intronic and exonic DNA. This line of research led to yet another new algorithm, UnSplicer, designed for eukaryotic species with compact genomes where functional alternative splicing is likely to be dominated by splicing noise. Genome-specific parameters of the new algorithm are generated by GeneMark-ES, an ab initio gene prediction algorithm based on unsupervised training. UnSplicer shares several components with TrueSight; the difference lies in the training strategy and the classification algorithm. We tested UnSplicer on RNA-Seq data sets of Arabidopsis thaliana, Caenorhabditis elegans, Cryptococcus neoformans and Drosophila melanogaster. We have shown that splice junctions inferred by UnSplicer are in better agreement with knowledge accumulated on these well-studied genomes than predictions made by earlier developed tools.
Affiliation(s)
- Paul D Burns
- Joint Georgia Tech and Emory Wallace H. Coulter Department of Biomedical Engineering, Atlanta, GA 30332, USA, Department of Bioengineering, University of Illinois at Urbana-Champaign, IL 61801, USA, Institute for Genomic Biology, University of Illinois at Urbana-Champaign, IL 61801, USA, School of Computational Science & Engineering, Georgia Tech, Atlanta, GA 30332, USA and Department of Bioinformatics, Moscow Institute of Physics and Technology, Moscow, 141700, Russia
|
17
|
Roy B, Haupt LM, Griffiths LR. Review: Alternative Splicing (AS) of Genes As An Approach for Generating Protein Complexity. Curr Genomics 2013; 14:182-94. [PMID: 24179441 PMCID: PMC3664468 DOI: 10.2174/1389202911314030004] [Citation(s) in RCA: 61] [Impact Index Per Article: 5.1] [Received: 10/23/2012] [Revised: 02/08/2013] [Accepted: 02/25/2013] [Indexed: 12/22/2022] Open
Abstract
Prior to the completion of the human genome project, the human genome was thought to have a greater number of genes as it seemed structurally and functionally more complex than other simpler organisms. This along with the belief of “one gene, one protein”, were demonstrated to be incorrect. The inequality in the ratio of gene to protein formation gave rise to the theory of alternative splicing (AS). AS is a mechanism by which one gene gives rise to multiple protein products. Numerous databases and online bioinformatic tools are available for the detection and analysis of AS. Bioinformatics provides an important approach to study mRNA and protein diversity by various tools such as expressed sequence tag (EST) sequences obtained from completely processed mRNA. Microarrays and deep sequencing approaches also aid in the detection of splicing events. Initially it was postulated that AS occurred only in about 5% of all genes but was later found to be more abundant. Using bioinformatic approaches, the level of AS in human genes was found to be fairly high with 35-59% of genes having at least one AS form. Our ability to determine and predict AS is important as disorders in splicing patterns may lead to abnormal splice variants resulting in genetic diseases. In addition, the diversity of proteins produced by AS poses a challenge for successful drug discovery and therefore a greater understanding of AS would be beneficial.
Affiliation(s)
- Bishakha Roy
- Genomics Research Centre, Griffith Health Institute, Griffith University Gold Coast, Queensland 4222, Australia
|
18
|
Wu J, Anczuków O, Krainer AR, Zhang MQ, Zhang C. OLego: fast and sensitive mapping of spliced mRNA-Seq reads using small seeds. Nucleic Acids Res 2013; 41:5149-63. [PMID: 23571760 PMCID: PMC3664805 DOI: 10.1093/nar/gkt216] [Citation(s) in RCA: 102] [Impact Index Per Article: 8.5] [Indexed: 01/21/2023] Open
Abstract
A crucial step in analyzing mRNA-Seq data is to accurately and efficiently map hundreds of millions of reads to the reference genome and exon junctions. Here we present OLego, an algorithm specifically designed for de novo mapping of spliced mRNA-Seq reads. OLego adopts a multiple-seed-and-extend scheme, and does not rely on a separate external aligner. It achieves high sensitivity of junction detection by strategic searches with small seeds (∼14 nt for mammalian genomes). To improve accuracy and resolve ambiguous mapping at junctions, OLego uses a built-in statistical model to score exon junctions by splice-site strength and intron size. Burrows–Wheeler transform is used in multiple steps of the algorithm to efficiently map seeds, locate junctions and identify small exons. OLego is implemented in C++ with fully multithreaded execution, and allows fast processing of large-scale data. We systematically evaluated the performance of OLego in comparison with published tools using both simulated and real data. OLego demonstrated better sensitivity, higher or comparable accuracy and substantially improved speed. OLego also identified hundreds of novel micro-exons (<30 nt) in the mouse transcriptome, many of which are phylogenetically conserved and can be validated experimentally in vivo. OLego is freely available at http://zhanglab.c2b2.columbia.edu/index.php/OLego.
Affiliation(s)
- Jie Wu
- Cold Spring Harbor Laboratory, 1 Bungtown Road, Cold Spring Harbor, NY 11724, USA
|
19
|
Li Y, Li-Byarlay H, Burns P, Borodovsky M, Robinson GE, Ma J. TrueSight: a new algorithm for splice junction detection using RNA-seq. Nucleic Acids Res 2013; 41:e51. [PMID: 23254332 PMCID: PMC3575843 DOI: 10.1093/nar/gks1311] [Citation(s) in RCA: 28] [Impact Index Per Article: 2.3] [Received: 09/19/2012] [Revised: 11/15/2012] [Accepted: 11/16/2012] [Indexed: 01/21/2023] Open
Abstract
RNA-seq has proven to be a powerful technique for transcriptome profiling based on next-generation sequencing (NGS) technologies. However, due to the short length of NGS reads, it is challenging to accurately map RNA-seq reads to splice junctions (SJs), which is a critically important step in the analysis of alternative splicing (AS) and isoform construction. In this article, we describe a new method, called TrueSight, which for the first time combines RNA-seq read mapping quality and coding potential of genomic sequences into a unified model. The model is further utilized in a machine-learning approach to precisely identify SJs. Both simulations and real data evaluations showed that TrueSight achieved higher sensitivity and specificity than other methods. We applied TrueSight to new high coverage honey bee RNA-seq data to discover novel splice forms. We found that 60.3% of honey bee multi-exon genes are alternatively spliced. By utilizing gene models improved by TrueSight, we characterized AS types in honey bee transcriptome. We believe that TrueSight will be highly useful to comprehensively study the biology of alternative splicing.
Affiliation(s)
- Yang Li
- Department of Bioengineering, Institute for Genomic Biology, Department of Entomology, University of Illinois at Urbana-Champaign, IL 61801, USA, Wallace H. Coulter Department of Biomedical Engineering, School of Computational Science & Engineering, Georgia Institute of Technology, Atlanta 30332, GA, USA, Department of Molecular and Biological Physics, Moscow Institute for Physics and Technology, Dolgoprudny, 141700, Moscow Region, Russia and Neuroscience Program, University of Illinois at Urbana-Champaign, IL 61801, USA
- Hongmei Li-Byarlay
- Department of Bioengineering, Institute for Genomic Biology, Department of Entomology, University of Illinois at Urbana-Champaign, IL 61801, USA, Wallace H. Coulter Department of Biomedical Engineering, School of Computational Science & Engineering, Georgia Institute of Technology, Atlanta 30332, GA, USA, Department of Molecular and Biological Physics, Moscow Institute for Physics and Technology, Dolgoprudny, 141700, Moscow Region, Russia and Neuroscience Program, University of Illinois at Urbana-Champaign, IL 61801, USA
- Paul Burns
- Department of Bioengineering, Institute for Genomic Biology, Department of Entomology, University of Illinois at Urbana-Champaign, IL 61801, USA, Wallace H. Coulter Department of Biomedical Engineering, School of Computational Science & Engineering, Georgia Institute of Technology, Atlanta 30332, GA, USA, Department of Molecular and Biological Physics, Moscow Institute for Physics and Technology, Dolgoprudny, 141700, Moscow Region, Russia and Neuroscience Program, University of Illinois at Urbana-Champaign, IL 61801, USA
- Mark Borodovsky
- Department of Bioengineering, Institute for Genomic Biology, Department of Entomology, University of Illinois at Urbana-Champaign, IL 61801, USA, Wallace H. Coulter Department of Biomedical Engineering, School of Computational Science & Engineering, Georgia Institute of Technology, Atlanta 30332, GA, USA, Department of Molecular and Biological Physics, Moscow Institute for Physics and Technology, Dolgoprudny, 141700, Moscow Region, Russia and Neuroscience Program, University of Illinois at Urbana-Champaign, IL 61801, USA
- Gene E. Robinson
- Department of Bioengineering, Institute for Genomic Biology, Department of Entomology, University of Illinois at Urbana-Champaign, IL 61801, USA, Wallace H. Coulter Department of Biomedical Engineering, School of Computational Science & Engineering, Georgia Institute of Technology, Atlanta 30332, GA, USA, Department of Molecular and Biological Physics, Moscow Institute for Physics and Technology, Dolgoprudny, 141700, Moscow Region, Russia and Neuroscience Program, University of Illinois at Urbana-Champaign, IL 61801, USA
- Jian Ma
- Department of Bioengineering, Institute for Genomic Biology, Department of Entomology, University of Illinois at Urbana-Champaign, IL 61801, USA, Wallace H. Coulter Department of Biomedical Engineering, School of Computational Science & Engineering, Georgia Institute of Technology, Atlanta 30332, GA, USA, Department of Molecular and Biological Physics, Moscow Institute for Physics and Technology, Dolgoprudny, 141700, Moscow Region, Russia and Neuroscience Program, University of Illinois at Urbana-Champaign, IL 61801, USA
|
20
|
Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, Batut P, Chaisson M, Gingeras TR. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 2012; 29:15-21. [PMID: 23104886 DOI: 10.1093/bioinformatics/bts635] [Citation(s) in RCA: 32074] [Impact Index Per Article: 2467.2] [Indexed: 02/06/2023]
Abstract
MOTIVATION Accurate alignment of high-throughput RNA-seq data is a challenging and yet unsolved problem because of the non-contiguous transcript structure, relatively short read lengths and constantly increasing throughput of the sequencing technologies. Currently available RNA-seq aligners suffer from high mapping error rates, low mapping speed, read length limitation and mapping biases. RESULTS To align our large (>80 billion reads) ENCODE Transcriptome RNA-seq dataset, we developed the Spliced Transcripts Alignment to a Reference (STAR) software based on a previously undescribed RNA-seq alignment algorithm that uses sequential maximum mappable seed search in uncompressed suffix arrays followed by seed clustering and stitching procedure. STAR outperforms other aligners by a factor of >50 in mapping speed, aligning to the human genome 550 million 2 × 76 bp paired-end reads per hour on a modest 12-core server, while at the same time improving alignment sensitivity and precision. In addition to unbiased de novo detection of canonical junctions, STAR can discover non-canonical splices and chimeric (fusion) transcripts, and is also capable of mapping full-length RNA sequences. Using Roche 454 sequencing of reverse transcription polymerase chain reaction amplicons, we experimentally validated 1960 novel intergenic splice junctions with an 80-90% success rate, corroborating the high precision of the STAR mapping strategy. AVAILABILITY AND IMPLEMENTATION STAR is implemented as a standalone C++ code. STAR is free open source software distributed under GPLv3 license and can be downloaded from http://code.google.com/p/rna-star/.
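The "sequential maximum mappable seed search in uncompressed suffix arrays" mentioned in this abstract can be illustrated with a toy sketch. The code below is not STAR's implementation: it builds a naive suffix array over a short reference string, finds the longest prefix of a read that occurs in the reference, and then repeats the search on the unmapped remainder. All function names are hypothetical, exact matching only is assumed, and the seed clustering and stitching steps are omitted.

```python
import bisect
from typing import List, Tuple


def build_index(ref: str) -> Tuple[List[str], List[int]]:
    """Naive suffix array: sorted suffixes plus their start positions.

    Storing the suffixes explicitly costs quadratic memory and is only
    acceptable for a toy example.
    """
    positions = sorted(range(len(ref)), key=lambda i: ref[i:])
    suffixes = [ref[i:] for i in positions]
    return suffixes, positions


def common_prefix_len(a: str, b: str) -> int:
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def longest_mappable_prefix(suffixes: List[str], positions: List[int],
                            query: str) -> Tuple[int, int]:
    """Return (length, reference position) of the longest prefix of query
    that occurs in the reference, or (0, -1) if no base matches.

    In a sorted suffix list, the suffix sharing the longest prefix with the
    query is adjacent to the query's insertion point, so only two
    neighbours need to be checked.
    """
    k = bisect.bisect_left(suffixes, query)
    best_len, best_pos = 0, -1
    for j in (k - 1, k):
        if 0 <= j < len(suffixes):
            length = common_prefix_len(suffixes[j], query)
            if length > best_len:
                best_len, best_pos = length, positions[j]
    return best_len, best_pos


def sequential_seeds(ref: str, read: str) -> List[Tuple[int, int, int]]:
    """Split a read into seeds as (read offset, reference position, length)."""
    suffixes, positions = build_index(ref)
    seeds = []
    start = 0
    while start < len(read):
        length, pos = longest_mappable_prefix(suffixes, positions, read[start:])
        if length == 0:
            start += 1  # skip a base that matches nowhere
            continue
        seeds.append((start, pos, length))
        start += length  # continue the search from the unmapped remainder
    return seeds


if __name__ == "__main__":
    # Two 7-base "exons" separated by a 7-base "intron".
    reference = "GATTACA" + "GTCCCAG" + "TTGGCGC"
    spliced_read = "TTACA" + "TTGGC"
    print(sequential_seeds(reference, spliced_read))
```

For the spliced read in the usage example, the first seed ends where the read stops matching the reference contiguously, and the second seed picks up on the far side of the skipped region, which is how a splice junction shows up as a seed boundary in this simplified setting.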
|
21
|
Fonseca NA, Rung J, Brazma A, Marioni JC. Tools for mapping high-throughput sequencing data. Bioinformatics 2012; 28:3169-77. [DOI: 10.1093/bioinformatics/bts605] [Citation(s) in RCA: 207] [Impact Index Per Article: 15.9] [Indexed: 12/18/2022] Open
|