1
Ma J, Zhao X, Qi E, Han R, Yu T, Li G. Highly efficient clustering of long-read transcriptomic data with GeLuster. Bioinformatics 2024; 40:btae059. [PMID: 38310330 PMCID: PMC10881092 DOI: 10.1093/bioinformatics/btae059] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/10/2023] [Revised: 01/08/2024] [Accepted: 01/30/2024] [Indexed: 02/05/2024]
Abstract
MOTIVATION The advancement of long-read RNA sequencing technologies promises a bright future for transcriptome analysis, in which clustering long reads according to their gene family of origin is of great importance. However, existing de novo clustering algorithms require substantial computing resources. RESULTS We developed GeLuster, a new algorithm for clustering long RNA-seq reads. In tests on one simulated dataset and nine real datasets, GeLuster exhibited superior performance. On the tested Nanopore datasets, it ran 2.9-17.5 times as fast as the second-fastest method with less than one-seventh of the memory consumption, while achieving higher clustering accuracy. On the PacBio data, GeLuster performed similarly. It sets the stage for large-scale transcriptome studies in the future. AVAILABILITY AND IMPLEMENTATION GeLuster is freely available at https://github.com/yutingsdu/GeLuster.
Affiliation(s)
- Junchi Ma
  - Research Center for Mathematics and Interdisciplinary Sciences (Frontiers Science Center for Nonlinear Expectations), Shandong University, Qingdao 266237, China
  - School of Mathematics, Shandong University, Jinan, Shandong 250100, China
- Xiaoyu Zhao
  - School of Mathematics, Shandong University, Jinan, Shandong 250100, China
- Enfeng Qi
  - School of Mathematics and Statistics, Guangxi Normal University, Guilin 541000, China
- Renmin Han
  - Research Center for Mathematics and Interdisciplinary Sciences (Frontiers Science Center for Nonlinear Expectations), Shandong University, Qingdao 266237, China
- Ting Yu
  - Research Center for Mathematics and Interdisciplinary Sciences (Frontiers Science Center for Nonlinear Expectations), Shandong University, Qingdao 266237, China
- Guojun Li
  - Research Center for Mathematics and Interdisciplinary Sciences (Frontiers Science Center for Nonlinear Expectations), Shandong University, Qingdao 266237, China
2
You Y, Clark MB, Shim H. NanoSplicer: Accurate identification of splice junctions using Oxford Nanopore sequencing. Bioinformatics 2022; 38:3741-3748. [PMID: 35639973 PMCID: PMC9344838 DOI: 10.1093/bioinformatics/btac359] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 10/22/2021] [Revised: 04/02/2022] [Accepted: 05/24/2022] [Indexed: 11/30/2022]
Abstract
Motivation Long-read sequencing methods have considerable advantages for characterizing RNA isoforms. Oxford Nanopore sequencing records changes in electrical current when nucleic acid traverses through a pore. However, basecalling of this raw signal (known as a squiggle) is error prone, making it challenging to accurately identify splice junctions. Existing strategies include utilizing matched short-read data and/or annotated splice junctions to correct nanopore reads but add expense or limit junctions to known (incomplete) annotations. Therefore, a method that could accurately identify splice junctions solely from nanopore data would have numerous advantages. Results We developed ‘NanoSplicer’ to identify splice junctions using raw nanopore signal (squiggles). For each splice junction, the observed squiggle is compared to candidate squiggles representing potential junctions to identify the correct candidate. Measuring squiggle similarity enables us to compute the probability of each candidate junction and find the most likely one. We tested our method using (i) synthetic mRNAs with known splice junctions and (ii) biological mRNAs from a lung-cancer cell-line. The results from both datasets demonstrate NanoSplicer improves splice junction identification, especially when the basecalling error rate near the splice junction is elevated. Availability and implementation NanoSplicer is available at https://github.com/shimlab/NanoSplicer and archived at https://doi.org/10.5281/zenodo.6403849. Data is available from ENA: ERS7273757 and ERS7273453. Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Yupei You
  - School of Mathematics and Statistics/Melbourne Integrative Genomics, The University of Melbourne, Melbourne, VIC, 3010, Australia
- Michael B Clark
  - Centre for Stem Cell Systems, Department of Anatomy and Physiology, The University of Melbourne, Melbourne, VIC, 3010, Australia
- Heejung Shim
  - School of Mathematics and Statistics/Melbourne Integrative Genomics, The University of Melbourne, Melbourne, VIC, 3010, Australia
3
A model for isoform-level differential expression analysis using RNA-seq data without pre-specifying isoform structure. PLoS One 2022; 17:e0266162. [PMID: 35576204 PMCID: PMC9109925 DOI: 10.1371/journal.pone.0266162] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/23/2021] [Accepted: 03/16/2022] [Indexed: 11/19/2022]
Abstract
Motivation
Next-generation sequencing (NGS) technology has been widely used in biomedical research, particularly in genomics-related studies. One NGS application is high-throughput mRNA sequencing (RNA-seq), which is usually applied to evaluate gene expression levels (i.e. copies of isoforms), to identify differentially expressed genes, and to discover potential alternative splicing events. Popular tools for differential expression (DE) analysis using RNA-seq data include edgeR and DESeq. These methods tend to identify DE genes at the gene level, which only allows them to compare the total size of isoforms, that is, the sum over all isoforms of each isoform's copy number times its length. Naturally, these methods may fail to detect DE genes when the total size of isoforms remains similar but isoform-wise expression levels change dramatically. Other tools can perform isoform-level DE analysis only if isoform structures are known, and so still fail for the many non-model species whose isoform information is missing. To overcome these disadvantages, we developed an isoform-free (no need to pre-specify isoform structures) splicing-graph-based negative binomial (SGNB) model for differential expression analysis at the isoform level. Our model detects not only the change in the total size of isoforms but also the change in isoform-wise expression levels, and hence is more powerful.
Results
We performed extensive simulations to compare our method with edgeR and DESeq. Under various scenarios, our method consistently achieved higher detection power while controlling the type I error at the pre-specified level. We also applied our method to a real data set to illustrate its applicability in practice.
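The gene-level blind spot described in this abstract can be made concrete with a small sketch. The isoform lengths and copy numbers below are made-up values, and the code is not the SGNB model itself. It only computes the "total size of isoforms" (sum of copy number times length) for two conditions in which the isoform mix shifts while the gene-level total stays identical.

```python
# Hypothetical gene with two isoforms; lengths in nucleotides (made-up values).
isoform_lengths = {"iso_A": 1000, "iso_B": 2000}

# Made-up isoform copy numbers in two conditions.
copies_cond1 = {"iso_A": 40, "iso_B": 10}
copies_cond2 = {"iso_A": 20, "iso_B": 20}


def total_isoform_size(copies, lengths):
    """Sum of copy number times length over all isoforms of the gene."""
    return sum(copies[name] * lengths[name] for name in copies)


size1 = total_isoform_size(copies_cond1, isoform_lengths)
size2 = total_isoform_size(copies_cond2, isoform_lengths)

# Isoforms whose copy number differs between the two conditions.
changed = sorted(n for n in copies_cond1 if copies_cond1[n] != copies_cond2[n])

print(size1, size2, changed)
```

Both conditions give the same gene-level total even though both isoforms change, which is the situation in which a purely gene-level comparison has nothing to detect.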
4
Zheng H, Ma C, Kingsford C. Deriving Ranges of Optimal Estimated Transcript Expression due to Nonidentifiability. J Comput Biol 2022; 29:121-139. [PMID: 35041494 PMCID: PMC8892959 DOI: 10.1089/cmb.2021.0444] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 11/13/2022]
Abstract
Current expression quantification methods suffer from a fundamental but undercharacterized type of error: the most likely estimates for transcript abundances are not unique. This means multiple estimates of transcript abundances generate the observed RNA-seq reads with equal likelihood, and the underlying true expression cannot be determined. This is called nonidentifiability in probabilistic modeling. It is further exacerbated by incomplete reference transcriptomes, where reads may be sequenced from unannotated transcripts. Graph quantification is a generalization of transcript quantification that accounts for reference incompleteness by allowing exponentially many unannotated transcripts to express reads. We propose methods to calculate a "confidence range of expression" for each transcript, representing its possible abundance across equally optimal estimates for both quantification models. This range indicates both whether a transcript has potential estimation error due to nonidentifiability and the extent of that error. Applying our methods to the Human Body Map data, we observe that 35%-50% of transcripts potentially suffer from inaccurate quantification caused by nonidentifiability. When comparing the expression between isoforms in one sample, we find that the degree of inaccuracy of 20%-47% of transcripts can be so large that the ranking of expression between the transcript and other isoforms from the same gene cannot be determined. When comparing the expression of a transcript between two groups of RNA-seq samples in differential expression analysis, we observe that the majority of detected differentially expressed transcripts are reliable, with a few exceptions, after considering the ranges of the optimal expression estimates.
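A toy case may help illustrate what nonidentifiability means here. The gene, the exon names, and the simplifying assumptions (equal transcript lengths, reads too short to span both alternative sites) are all hypothetical, and the range computed at the end is only a crude stand-in for the paper's "confidence range of expression", not its method.

```python
from fractions import Fraction

# Hypothetical gene with two independent alternative sites: exon a or b at the
# first site, exon c or d at the second. Four transcripts are possible.
transcripts = ["ac", "ad", "bc", "bd"]


def exon_usage(abundance):
    """Expected fraction of reads showing each exon, assuming reads are too
    short to span both sites and all transcripts have equal length."""
    usage = {exon: Fraction(0) for exon in "abcd"}
    for transcript, share in abundance.items():
        for exon in transcript:
            usage[exon] += share
    return usage


half = Fraction(1, 2)
estimate_1 = dict(zip(transcripts, [half, 0, 0, half]))  # only ac and bd
estimate_2 = dict(zip(transcripts, [0, half, half, 0]))  # only ad and bc

# Both estimates predict exactly the same read evidence.
same_evidence = exon_usage(estimate_1) == exon_usage(estimate_2)

# Spread of the "ac" abundance across these two equally good estimates.
ac_range = (min(estimate_1["ac"], estimate_2["ac"]),
            max(estimate_1["ac"], estimate_2["ac"]))
print(same_evidence, ac_range)
```

Because the two abundance vectors explain the reads equally well, the data alone cannot say whether transcript `ac` is absent or accounts for half of the expression.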
Affiliation(s)
- Hongyu Zheng
  - Computational Biology Department, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
- Cong Ma
  - Computer Science Department, Princeton University, Princeton, New Jersey, USA
- Carl Kingsford
  - Computational Biology Department, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
5
Alqassem I, Sonthalia Y, Klitzke-Feser E, Shim H, Canzar S. McSplicer: a probabilistic model for estimating splice site usage from RNA-seq data. Bioinformatics 2021; 37:2004–2011. [PMID: 33515239 PMCID: PMC8337008 DOI: 10.1093/bioinformatics/btab050] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 08/05/2020] [Revised: 01/20/2021] [Accepted: 01/21/2021] [Indexed: 11/23/2022]
Abstract
MOTIVATION Alternative splicing removes intronic sequences from pre-mRNAs in alternative ways to produce different forms (isoforms) of mature mRNA. The composition of expressed transcripts gives specific functionalities to cells in a particular condition or developmental stage. In addition, a large fraction of human disease mutations affect splicing and lead to aberrant mRNA and protein products. Current methods that interrogate the transcriptome based on RNA-seq either suffer from short read length when trying to infer full-length transcripts, or are restricted to predefined units of alternative splicing that they quantify from local read evidence. RESULTS Instead of attempting to quantify individual outcomes of the splicing process such as local splicing events or full-length transcripts, we propose to quantify alternative splicing using a simplified probabilistic model of the underlying splicing process. Our model is based on the usage of individual splice sites and can generate arbitrarily complex types of splicing patterns. In our implementation, McSplicer, we estimate the parameters of our model using all read data at once and we demonstrate in our experiments that this yields more accurate estimates compared to competing methods. Our model is able to describe multiple effects of splicing mutations using few, easy to interpret parameters, as we illustrate in an experiment on RNA-seq data from autism spectrum disorder patients. AVAILABILITY McSplicer source code is available at https://github.com/canzarlab/McSplicer and has been deposited in archived format at https://doi.org/10.5281/zenodo.4449881. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Israa Alqassem
  - Gene Center, Ludwig-Maximilians-Universität München, Munich, 81377, Germany
- Heejung Shim
  - Melbourne Integrative Genomics (MIG), School of Mathematics and Statistics, University of Melbourne, Melbourne, 3010, Australia
- Stefan Canzar
  - Gene Center, Ludwig-Maximilians-Universität München, Munich, 81377, Germany
6
Ma C, Zheng H, Kingsford C. Exact transcript quantification over splice graphs. Algorithms Mol Biol 2021; 16:5. [PMID: 33971903 PMCID: PMC8112020 DOI: 10.1186/s13015-021-00184-7] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 02/10/2021] [Accepted: 04/19/2021] [Indexed: 11/10/2022]
Abstract
Background The probability of sequencing a set of RNA-seq reads can be directly modeled using the abundances of splice junctions in splice graphs instead of the abundances of a list of transcripts. We call this model graph quantification, which was first proposed by Bernard et al. (Bioinformatics 30:2447–55, 2014). The model can be viewed as a generalization of transcript expression quantification where every full path in the splice graph is a possible transcript. However, the previous graph quantification model assumes the length of single-end reads or paired-end fragments is fixed. Results We provide an improvement of this model to handle variable-length reads or fragments and incorporate bias correction. We prove that our model is equivalent to running a transcript quantifier with exactly the set of all compatible transcripts. The key to our method is constructing an extension of the splice graph based on Aho-Corasick automata. The proof of equivalence is based on a novel reparameterization of the read generation model of a state-of-the-art transcript quantification method. Conclusion We propose a new approach for graph quantification, which is useful for modeling scenarios where the reference transcriptome is incomplete or not available and can be further used in transcriptome assembly or alternative splicing analysis. Supplementary Information The online version contains supplementary material available at 10.1186/s13015-021-00184-7.
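The statement that "every full path in the splice graph is a possible transcript" can be sketched on a toy graph. The graph below, its exon names, and the artificial source and sink nodes `S` and `T` are hypothetical. The code only enumerates paths by depth-first search and does not reproduce the Aho-Corasick-based construction or the quantification model of the paper.

```python
# Hypothetical splice graph: nodes are exons, edges are splice junctions.
# "S" and "T" are artificial source and sink nodes.
splice_graph = {
    "S": ["e1"],
    "e1": ["e2", "e3"],  # e2 can be included or skipped
    "e2": ["e3"],
    "e3": ["e4", "e5"],  # two alternative last exons
    "e4": ["T"],
    "e5": ["T"],
    "T": [],
}


def full_paths(graph, node="S", sink="T"):
    """Yield every source-to-sink path by depth-first search.

    Assumes the graph is acyclic, as a splice graph ordered by genomic
    position would be.
    """
    if node == sink:
        yield [node]
        return
    for successor in graph[node]:
        for rest in full_paths(graph, successor, sink):
            yield [node] + rest


# Drop the artificial endpoints to get exon chains (candidate transcripts).
candidate_transcripts = [path[1:-1] for path in full_paths(splice_graph)]
for exon_chain in candidate_transcripts:
    print("-".join(exon_chain))
```

Two binary choices already give four candidate transcripts, and the count multiplies with each further alternative site, which is why quantifying over the graph rather than over an explicit transcript list is attractive.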
7
Wang J, Wang C, Li L, Yang L, Wang S, Ning X, Gao S, Ren L, Chaulagain A, Tang J, Wang T. Alternative splicing: An important regulatory mechanism in colorectal carcinoma. Mol Carcinog 2021; 60:279-293. [PMID: 33629774 DOI: 10.1002/mc.23291] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 11/03/2020] [Revised: 02/01/2021] [Accepted: 02/11/2021] [Indexed: 12/17/2022]
Abstract
Alternative splicing (AS) is a process that produces various mRNA splicing isoforms via different splicing patterns of mRNA precursors (pre-mRNAs). AS is the primary mechanism for increasing the types and quantities of proteins to improve biodiversity and influence multiple biological processes, including chromatin modification, signal transduction, and protein expression. It has been reported that AS is involved in the tumorigenesis and development of colorectal carcinoma (CRC). In this review, we delineate the concept, types, regulatory processes, and technical advances of AS and focus on the role of AS in CRC initiation, progression, treatment, and prognosis. This summary of the current knowledge about AS will contribute to our understanding of CRC initiation and development. This study will help in the discovery of novel biomarkers and therapeutic targets for CRC prognosis and treatment.
Affiliation(s)
- Jianyi Wang
  - Department of Pathology, Harbin Medical University, Harbin, China
- Chuhan Wang
  - Department of Pathology, Harbin Medical University, Harbin, China
- Le Li
  - Department of Pathology, Harbin Medical University, Harbin, China
- Lirui Yang
  - Department of Pathology, Harbin Medical University, Harbin, China
- Shuoshuo Wang
  - Department of Pathology, Harbin Medical University, Harbin, China
- Xuelian Ning
  - Department of Pathology, Harbin Medical University, Harbin, China
- Shuangshu Gao
  - Department of Pathology, Harbin Medical University, Harbin, China
- Lili Ren
  - Department of Pathology, Harbin Medical University, Harbin, China
- Anita Chaulagain
  - Department of Microbiology, Harbin Medical University, Harbin, China
- Jing Tang
  - Department of Pathology, Harbin Medical University, Harbin, China
  - Department of Bioinformatics, School of Basic Medical Sciences, Southern Medical University, Guangzhou, China
- Tianzhen Wang
  - Department of Pathology, Harbin Medical University, Harbin, China
8
Zhang F, Deng CK, Wang M, Deng B, Barber R, Huang G. Identification of novel alternative splicing biomarkers for breast cancer with LC/MS/MS and RNA-Seq. BMC Bioinformatics 2020; 21:541. [PMID: 33272210 PMCID: PMC7713335 DOI: 10.1186/s12859-020-03824-8] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 10/14/2020] [Accepted: 10/19/2020] [Indexed: 01/12/2023]
Abstract
Background Alternative splicing isoforms have been reported as a new and robust class of diagnostic biomarkers. Over 95% of human genes are estimated to be alternatively spliced, a powerful means of producing functionally diverse proteins from a single gene. The emergence of next-generation sequencing technologies, especially RNA-seq, provides novel insights into large-scale detection and analysis of alternative splicing at the transcriptional level. Advances in proteomic technologies, such as liquid chromatography coupled with tandem mass spectrometry (LC–MS/MS), have shown tremendous power for the parallel characterization of large numbers of proteins in biological samples. Although previous qualitative comparative analyses have generally found poor correspondence between proteomics and microarray data, significantly higher degrees of correlation have been observed at the exon level. Combining protein and RNA data by searching LC–MS/MS data against a customized protein database derived from RNA-Seq may produce a subset of alternatively spliced protein isoform candidates with higher confidence. Results We developed a bioinformatics workflow to discover alternative splicing biomarkers from LC–MS/MS using RNA-Seq. First, we retrieved high-confidence, novel alternative splicing biomarkers from the breast cancer RNA-Seq database. Then, we translated these sequences into in silico Isoform Junction Peptides and created a customized alternative splicing database for MS searching. Lastly, we ran the Open Mass spectrometry Search Algorithm against the customized alternative splicing database with the breast cancer plasma proteome. Twenty-six alternative splicing biomarker peptides with one single intron event and one exon skipping event were identified. Further interpretation of biological pathways with our Integrated Pathway Analysis Database showed that these 26 peptides are associated with Cancer, Signaling, Metabolism, Regulation, Immune System and Hemostasis pathways, which are consistent with the 256 alternative splicing biomarkers from the RNA-Seq. Conclusions This paper presents a bioinformatics workflow for using RNA-seq data to discover novel alternative splicing biomarkers from the breast cancer proteome. As a complement to the synthetic alternative splicing database technique for alternative splicing identification, this method combines the advantages of two platforms, mass spectrometry and next-generation sequencing, and can help identify potentially highly sample-specific alternative splicing isoform biomarkers at an early stage of cancer.
Affiliation(s)
- Fan Zhang
  - Vermont Biomedical Research Network and Department of Biology, University of Vermont, Burlington, VT, 05405, USA
  - Institute for Translational Research and Department of Family Medicine, University of North Texas Health Science Center, Fort Worth, TX, 76107, USA
- Chris K Deng
  - School of Molecular and Cellular Biology, University of Illinois at Urbana-Champaign, Champaign, IL, 61801, USA
- Mu Wang
  - Department of Biochemistry and Molecular Biology, IU School of Medicine, Indianapolis, IN, 46202, USA
  - Indiana Center for Systems Biology and Personalized Medicine, Indianapolis, IN, 46202, USA
- Bin Deng
  - Vermont Biomedical Research Network and Department of Biology, University of Vermont, Burlington, VT, 05405, USA
  - Institute for Translational Research and Department of Family Medicine, University of North Texas Health Science Center, Fort Worth, TX, 76107, USA
- Robert Barber
  - Department of Pharmacology and Neuroscience, University of North Texas Health Science Center, Fort Worth, TX, USA
- Gang Huang
  - Shanghai Key Laboratory for Molecular Imaging, Shanghai University of Medicine and Health Sciences, Shanghai, 201318, People's Republic of China
9
Ma C, Kingsford C. Detecting, Categorizing, and Correcting Coverage Anomalies of RNA-Seq Quantification. Cell Syst 2019; 9:589-599.e7. [PMID: 31786209 DOI: 10.1016/j.cels.2019.10.005] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 02/06/2019] [Revised: 07/09/2019] [Accepted: 10/17/2019] [Indexed: 11/13/2022]
Abstract
Because of incomplete reference transcriptomes, incomplete sequencing bias models, or other modeling defects, algorithms to infer isoform expression from RNA sequencing (RNA-seq) sometimes do not accurately model expression. We present a computational method to detect instances where a quantification algorithm could not completely explain the input reads. Our approach identifies regions where the read coverage significantly deviates from expectation. We call these regions "expression anomalies." We further present a method to attribute their cause to either the incompleteness of the reference transcriptome or algorithmic mistakes. We detect anomalies for 30 GEUVADIS and 16 Human Body Map samples. By correcting anomalies when possible, we reduce the number of falsely predicted instances of differential expression. Anomalies that cannot be corrected are suspected to indicate the existence of isoforms unannotated by the reference. We detected 88 common anomalies of this type and find that they tend to have a lower-than-expected coverage toward their 3' ends.
Affiliation(s)
- Cong Ma
  - Computational Biology Department, School of Computer Science, Carnegie Mellon University, 5000 Forbes Ave., Pittsburgh, PA 15213, USA
- Carl Kingsford
  - Computational Biology Department, School of Computer Science, Carnegie Mellon University, 5000 Forbes Ave., Pittsburgh, PA 15213, USA
10
Abstract
Identification of differentially expressed genes has been a high-priority task in downstream analyses aimed at furthering advances in biomedical research. Investigators face an array of issues in dealing with more complicated experiments and metadata, including batch effects, normalization, temporal dynamics (temporally differential expression), and isoform diversity (isoform-level quantification and differential splicing events). To date, there are no standard approaches to precisely and efficiently analyze these moderate or large-scale experimental designs, especially with combined metadata. In this report, we propose comprehensive analytical pipelines to precisely characterize temporal dynamics in differential expression of genes and other genomic features, i.e., the variability of transcripts, isoforms and exons, by controlling batch effects and other nuisance factors that could significantly confound the main effects of interest in comparative models and result in misleading interpretations.
11
Ramanouskaya TV, Grinev VV. The determinants of alternative RNA splicing in human cells. Mol Genet Genomics 2017; 292:1175-1195. [PMID: 28707092 DOI: 10.1007/s00438-017-1350-0] [Citation(s) in RCA: 47] [Impact Index Per Article: 5.9] [Received: 02/02/2017] [Accepted: 07/06/2017] [Indexed: 12/29/2022]
Abstract
Alternative splicing represents an important level of the regulation of gene function in eukaryotic organisms. It plays a critical role in virtually every biological process within an organism, including regulation of cell division and cell death, differentiation of tissues in the embryo and the adult organism, as well as in cellular response to diverse environmental factors. In turn, studies of the last decade have shown that alternative splicing itself is controlled by different mechanisms. Unfortunately, there is no clear understanding of how these diverse mechanisms, or determinants, regulate and constrain the set of alternative RNA species produced from any particular gene in every cell of the human body. Here, we provide a consolidated overview of alternative splicing determinants including RNA-protein interactions, epigenetic regulation via chromatin remodeling, coupling of transcription-to-alternative splicing, effect of secondary structures in pre-RNA, and function of the RNA quality control systems. We also extensively and critically discuss some mechanistic insights on coordinated inclusion/exclusion of exons during the formation of mature RNA molecules. We conclude that the final structure of RNA is pre-determined by a complex interplay between cis- and trans-acting factors. Altogether, currently available empirical data significantly expand our understanding of the functioning of the alternative splicing machinery of cells in normal and pathological conditions. On the other hand, there are still many blind spots that require further deep investigations.
12
Roy CK, Olson S, Graveley BR, Zamore PD, Moore MJ. Assessing long-distance RNA sequence connectivity via RNA-templated DNA-DNA ligation. eLife 2015; 4. [PMID: 25866926 PMCID: PMC4442144 DOI: 10.7554/elife.03700] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.5] [Received: 06/15/2014] [Accepted: 04/12/2015] [Indexed: 02/04/2023]
Abstract
Many RNAs, including pre-mRNAs and long non-coding RNAs, can be thousands of nucleotides long and undergo complex post-transcriptional processing. Multiple sites of alternative splicing within a single gene exponentially increase the number of possible spliced isoforms, with most human genes currently estimated to express at least ten. To understand the mechanisms underlying these complex isoform expression patterns, methods are needed that faithfully maintain long-range exon connectivity information in individual RNA molecules. In this study, we describe SeqZip, a methodology that uses RNA-templated DNA–DNA ligation to retain and compress connectivity between distant sequences within single RNA molecules. Using this assay, we test proposed coordination between distant sites of alternative exon utilization in mouse Fn1, and we characterize the extraordinary exon diversity of Drosophila melanogaster Dscam1.
DOI: http://dx.doi.org/10.7554/eLife.03700.001
A flow chart can show how an outcome can be achieved from a particular start point by breaking down an activity into a list of possible steps. Often, a flow chart contains several alternative steps, not all of which are taken every time the flow chart is used. The same can be said of genes, which are biological instructions that often contain many options within their DNA sequences. Proteins, which perform many roles in cells, are built following the instructions contained in genes. First, the DNA sequence of the gene is copied. This produces a molecule of ribonucleic acid (RNA), which is able to move around the cell to find the machinery that can use the genetic information to make a protein.
Genes and their RNA copies contain instructions with more steps (called exons) than are necessary to make a working protein, so extra exons are removed ('spliced') from the RNA copies. Different combinations of exons can be removed, so splicing can make different versions of the RNA called isoforms. These allow a single gene to build many different proteins. In fruit flies, for example, the different exons of the gene Dscam1 can be spliced into one of 38,016 unique RNA isoforms. Current technology only allows researchers to deduce the sequence of RNA molecules by combining sequences recorded from short fragments of the molecule. However, before splicing, RNA molecules tend to be much longer than this, so this restricts our understanding of the RNA isoforms found in cells.
Here, Roy et al. devised and tested a new method called SeqZip to solve this problem. SeqZip uses short fragments of DNA called ligamers that can only stick to the sections of RNA that will remain after the molecule has been spliced. After splicing, the ligamers can be stuck together to make a DNA replica of the spliced RNA. The end product is at least 49 times shorter than the original RNA, so it is easier to sequence. In addition, the combinations of the ligamers in the DNA replica show which exons of a specific gene are kept and which ones are spliced out. To test the method, Roy et al. studied a mouse gene that has six RNA isoforms. SeqZip reduced the length of the RNA by five times and made it possible to measure how frequently the different isoforms naturally arise. Roy et al. also used SeqZip to work out which isoforms of the Dscam1 gene are used at different stages in the life of fruit fly larvae. SeqZip can provide insights into how complex organisms like flies, mice, and humans have evolved with relatively few (a little over 20,000) genes in their genomes.
DOI: http://dx.doi.org/10.7554/eLife.03700.002
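The 38,016 figure for Dscam1 is a product of the number of mutually exclusive alternatives in each variable exon cluster. The cluster sizes used below (12, 48, 33 and 2 alternatives for exons 4, 6, 9 and 17) are the values commonly reported for Dscam1 in the literature, not numbers stated in this abstract, so they should be read as an assumption of this sketch.

```python
from math import prod

# Commonly reported numbers of mutually exclusive alternatives per variable
# exon cluster of Drosophila Dscam1 (assumption: not given in the text above).
dscam1_exon_clusters = {"exon 4": 12, "exon 6": 48, "exon 9": 33, "exon 17": 2}

# One alternative is chosen from each cluster, so if the choices combine
# freely the number of possible isoforms is the product of the cluster sizes.
n_isoforms = prod(dscam1_exon_clusters.values())
print(n_isoforms)
```

The multiplication is also why long-range connectivity matters: short reads can show which alternative was used at one cluster but not which combination across clusters came from the same molecule.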
Affiliation(s)
- Christian K Roy
  - RNA Therapeutics Institute, Howard Hughes Medical Institute, University of Massachusetts Medical School, Worcester, United States
- Sara Olson
  - Institute for Systems Genomics, Department of Genetics and Developmental Biology, University of Connecticut Health Center, Farmington, United States
- Brenton R Graveley
  - Institute for Systems Genomics, Department of Genetics and Developmental Biology, University of Connecticut Health Center, Farmington, United States
- Phillip D Zamore
  - RNA Therapeutics Institute, Howard Hughes Medical Institute, University of Massachusetts Medical School, Worcester, United States
- Melissa J Moore
  - RNA Therapeutics Institute, Howard Hughes Medical Institute, University of Massachusetts Medical School, Worcester, United States
13
Liu R, Loraine AE, Dickerson JA. Comparisons of computational methods for differential alternative splicing detection using RNA-seq in plant systems. BMC Bioinformatics 2014; 15:364. [PMID: 25511303 PMCID: PMC4271460 DOI: 10.1186/s12859-014-0364-4] [Citation(s) in RCA: 63] [Impact Index Per Article: 5.7] [Received: 05/31/2014] [Accepted: 10/29/2014] [Indexed: 12/29/2022]
Abstract
BACKGROUND Alternative splicing (AS), a post-transcriptional regulation mechanism, is an important application of RNA-seq studies in eukaryotes. A number of software packages and computational methods have been developed for detecting AS. Most of the methods, however, are designed and tested on animal data, such as human and mouse. Plant genes differ from those of animals in many ways, e.g., the average intron size and preferred AS types. These differences may require different computational approaches and raise questions about the effectiveness of existing methods on plant data. The goal of this paper is to benchmark existing computational differential splicing (or transcription) detection methods so that biologists can choose the most suitable tools to accomplish their goals. RESULTS This study compares eight popular, publicly available software packages for differential splicing analysis using both simulated and real Arabidopsis thaliana RNA-seq data. All of the software is freely available. The study examines the effect of varying AS ratio, read depth, dispersion pattern, AS types, sample sizes and the influence of annotation. Using real data, the study looks at the consistency between the packages and verifies a subset of the detected AS events using PCR studies. CONCLUSIONS No single method performs the best in all situations. The accuracy of annotation has a major impact on which method should be chosen for AS analysis. DEXSeq performs well in the simulated data when the AS signal is relatively strong and annotation is accurate. Cufflinks achieves a better tradeoff between precision and recall and turns out to be the best one when incomplete annotation is provided. Some methods perform inconsistently for different AS types. Complex AS events that combine several simple AS events pose problems for most methods, especially for MATS. MATS stands out in the analysis of real RNA-seq data when all the AS events being evaluated are simple AS events.
Affiliation(s)
- Ruolin Liu
- Department of Electrical and Computational Engineering, Iowa State University, Howe Hall, Ames, 50011-3060, USA.
- Ann E Loraine
- Department of Bioinformatics and Genomics, University of North Carolina at Charlotte, North Carolina Research Campus, 600 Laureate Way, Kannapolis, 28081, NC, USA.
- Julie A Dickerson
- Department of Electrical and Computational Engineering, Iowa State University, Howe Hall, Ames, 50011-3060, USA.
|
14
|
Niu L, Huang W, Umbach DM, Li L. IUTA: a tool for effectively detecting differential isoform usage from RNA-Seq data. BMC Genomics 2014; 15:862. [PMID: 25283306 PMCID: PMC4195885 DOI: 10.1186/1471-2164-15-862] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.5] [Received: 06/09/2014] [Accepted: 09/29/2014] [Indexed: 01/10/2023]
Abstract
Background Most genes in mammals generate several transcript isoforms that differ in stability and translational efficiency through alternative splicing. Such alternative splicing can be tissue- and developmental stage-specific, and such specificity is sometimes associated with disease. Thus, detecting differential isoform usage for a gene between tissues or cell lines/types (differences in the fraction of total expression of a gene represented by the expression of each of its isoforms) is potentially important for cell and developmental biology. Results We present a new method, IUTA, that is designed to test each gene in the genome for differential isoform usage between two groups of samples. IUTA also estimates isoform usage for each gene in each sample as well as averaged across samples within each group. IUTA is the first method to formulate the testing problem as testing for equal means of two probability distributions under the Aitchison geometry, which is widely recognized as the most appropriate geometry for compositional data (vectors that contain the relative amount of each component comprising the whole). Evaluation using simulated data showed that IUTA was able to provide test results for many more genes than was Cuffdiff2 (version 2.2.0, released in Mar. 2014), and IUTA performed better than Cuffdiff2 for the limited number of genes that Cuffdiff2 did analyze. When applied to actual mouse RNA-Seq datasets from six tissues, IUTA identified 2,073 significant genes with clear patterns of differential isoform usage between a pair of tissues. IUTA is implemented as an R package and is available at http://www.niehs.nih.gov/research/resources/software/biostatistics/iuta/index.cfm. Conclusions Both simulation and real-data results suggest that IUTA accurately detects differential isoform usage. We believe that our analysis of RNA-seq data from six mouse tissues represents the first comprehensive characterization of isoform usage in these tissues. IUTA will be a valuable resource for those who study the roles of alternative transcripts in cell development and disease.
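As a rough illustration of the compositional view described in this abstract, the sketch below turns per-isoform expression values for one gene into usage fractions and measures how far apart two usage vectors are under the Aitchison geometry, using the centered log-ratio (clr) transform. It is not IUTA's test statistic or its R interface. The function names, the example counts, and the pseudocount used to avoid taking the logarithm of zero are all assumptions made for illustration.

```python
import math


def isoform_usage(counts, pseudocount=0.5):
    """Convert per-isoform expression values for one gene into usage fractions.

    The pseudocount is an illustrative assumption that keeps every
    fraction strictly positive so that logarithms are defined.
    """
    shifted = [c + pseudocount for c in counts]
    total = sum(shifted)
    return [s / total for s in shifted]


def clr(composition):
    """Centered log-ratio transform of a strictly positive composition."""
    logs = [math.log(p) for p in composition]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


def aitchison_distance(x, y):
    """Euclidean distance between the clr transforms of two compositions."""
    cx, cy = clr(x), clr(y)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(cx, cy)))


# Hypothetical expression values for a three-isoform gene in two tissues.
usage_a = isoform_usage([80, 15, 5])
usage_b = isoform_usage([20, 60, 20])
print(aitchison_distance(usage_a, usage_b))
```

Because the clr transform subtracts the mean log value, the distance depends only on the ratios between isoforms, so rescaling a sample's total expression leaves it unchanged. That scale invariance is the property that makes this geometry a natural fit for usage fractions.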
Affiliation(s)
- Leping Li
- Biostatistics Branch, National Institute of Environmental Health Sciences, Research Triangle Park, NC 27709, USA.
|