1
Kara MF, Guo W, Zhang R, Denby K. LsRTDv1, a reference transcript dataset for accurate transcript-specific expression analysis in lettuce. THE PLANT JOURNAL : FOR CELL AND MOLECULAR BIOLOGY 2024; 120:370-386. [PMID: 39145419 DOI: 10.1111/tpj.16978] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/25/2024] [Revised: 06/20/2024] [Accepted: 07/31/2024] [Indexed: 08/16/2024]
Abstract
Accurate quantification of gene and transcript-specific expression, with the underlying knowledge of precise transcript isoforms, is crucial to understanding many biological processes. Analysis of RNA sequencing data has benefited from the development of alignment-free algorithms which enhance the precision and speed of expression analysis. However, such algorithms require a reference transcriptome. Here we generate a reference transcript dataset (LsRTDv1) for lettuce (cv. Saladin), combining long- and short-read sequencing with publicly available transcriptome annotations, and filtering to keep only transcripts with high-confidence splice junctions and transcriptional start and end sites. LsRTDv1 identifies novel genes (mostly long non-coding RNAs) and increases the number of transcript isoforms per gene in the lettuce genome from 1.4 to 2.7. We show that LsRTDv1 significantly increases the mapping rate of RNA-seq data from a lettuce time-series experiment (mock- and Botrytis cinerea-inoculated) and enables detection of genes that are differentially alternatively spliced in response to infection as well as transcript-specific expression changes. LsRTDv1 is a valuable resource for investigation of transcriptional and alternative splicing regulation in lettuce.
Affiliation(s)
- Mehmet Fatih Kara
  - Biology Department, Centre for Novel Agricultural Products (CNAP), University of York, Wentworth Way, York, YO10 5DD, UK
- Wenbin Guo
  - Information and Computational Sciences, James Hutton Institute, Dundee, DD2 5DA, UK
- Runxuan Zhang
  - Information and Computational Sciences, James Hutton Institute, Dundee, DD2 5DA, UK
- Katherine Denby
  - Biology Department, Centre for Novel Agricultural Products (CNAP), University of York, Wentworth Way, York, YO10 5DD, UK
2
Wang D, Gazzara MR, Jewell S, Wales-McGrath B, Brown CD, Choi PS, Barash Y. A Deep Dive into Statistical Modeling of RNA Splicing QTLs Reveals New Variants that Explain Neurodegenerative Disease. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.09.01.610696. [PMID: 39282456 PMCID: PMC11398334 DOI: 10.1101/2024.09.01.610696] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/22/2024]
Abstract
Genome-wide association studies (GWAS) have identified thousands of putative disease-causing variants with unknown regulatory effects. Efforts to connect these variants with splicing quantitative trait loci (sQTLs) have provided functional insights, yet sQTLs reported by existing methods cannot explain many GWAS signals. We show that current sQTL modeling approaches can be improved by considering alternative splicing representation, model calibration, and covariate integration. We then introduce MAJIQTL, a new pipeline for sQTL discovery. MAJIQTL includes two new statistical methods: a weighted multiple testing approach for sGene discovery and a model for sQTL effect size inference to improve variant prioritization. By applying MAJIQTL to GTEx, we find significantly more sGenes harboring sQTLs with functional significance. Notably, our analysis implicates the novel variant rs582283 in Alzheimer's disease. Using antisense oligonucleotides, we validate this variant's effect by blocking the implicated YBX3 binding site, leading to exon skipping in the gene MS4A3.
Affiliation(s)
- David Wang
  - Department of Genetics, Perelman School of Medicine, University of Pennsylvania
  - Graduate Group in Genomics and Computational Biology, Perelman School of Medicine, University of Pennsylvania
- Matthew R. Gazzara
  - Department of Genetics, Perelman School of Medicine, University of Pennsylvania
  - Graduate Group in Genomics and Computational Biology, Perelman School of Medicine, University of Pennsylvania
- San Jewell
  - Department of Genetics, Perelman School of Medicine, University of Pennsylvania
- Peter S. Choi
  - Department of Pathology & Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania
  - Division of Cancer Pathobiology, The Children’s Hospital of Philadelphia
- Yoseph Barash
  - Department of Genetics, Perelman School of Medicine, University of Pennsylvania
  - Department of Computer and Information Sciences, School of Engineering, University of Pennsylvania
3
Hou Y, Li Q, Zhou H, Kafle S, Li W, Tan L, Liang J, Meng L, Xin H. SMRT sequencing of a full-length transcriptome reveals cold induced alternative splicing in Vitis amurensis root. PLANT PHYSIOLOGY AND BIOCHEMISTRY : PPB 2024; 213:108863. [PMID: 38917739 DOI: 10.1016/j.plaphy.2024.108863] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/02/2024] [Revised: 05/31/2024] [Accepted: 06/19/2024] [Indexed: 06/27/2024]
Abstract
Alternative splicing enhances diversity at the transcriptional and protein levels and is widely involved in plant responses to biotic and abiotic stresses. V. amurensis is an extremely cold-tolerant wild grape variety; however, alternative splicing (AS) in Amur grape at low temperatures is currently poorly understood. In this study, we analyzed full-length transcriptome and RNA-seq data at 0, 2, and 24 h after cold stress in V. amurensis roots. Following quality control and correction, 221,170 high-quality full-length non-concatemer (FLNC) reads were identified. A total of 16,181 loci and 30,733 isoforms were identified. These included 22,868 novel isoforms from annotated genes and 2815 isoforms from 2389 novel genes. Among the identified novel isoforms, 673 long non-coding RNAs (lncRNAs) and 18,164 novel isoforms with open reading frame (ORF) regions were found. A total of 2958 genes produced 8797 AS events, of which 189 genes were involved in the low-temperature response. Twelve transcription factors show AS during cold treatment, and VaMYB108 was selected for initial exploration. Two transcripts of VaMYB108, Chr05.63.1 (VaMYB108short) and Chr05.63.2 (VaMYB108normal), display up-regulated expression after cold treatment in Amur grape roots and are both localized in the nucleus. Only VaMYB108normal exhibits transcriptional activation activity. Overexpression of either VaMYB108short or VaMYB108normal in grape roots leads to increased expression of the other transcript, and both increase the chilling resistance of Amur grape roots. The results improve and supplement the genome annotations and provide insights for further investigation into AS mechanisms during cold stress in V. amurensis.
Affiliation(s)
- Yujun Hou
  - State Key Laboratory of Plant Diversity and Specialty Crops, Wuhan Botanical Garden, Chinese Academy of Sciences, Wuhan, 430074, China; University of Chinese Academy of Sciences, Beijing, 100049, China
- Qingyun Li
  - State Key Laboratory of Plant Diversity and Specialty Crops, Wuhan Botanical Garden, Chinese Academy of Sciences, Wuhan, 430074, China; University of Chinese Academy of Sciences, Beijing, 100049, China
- Huimin Zhou
  - State Key Laboratory of Plant Diversity and Specialty Crops, Wuhan Botanical Garden, Chinese Academy of Sciences, Wuhan, 430074, China; University of Chinese Academy of Sciences, Beijing, 100049, China
- Subash Kafle
  - State Key Laboratory of Plant Diversity and Specialty Crops, Wuhan Botanical Garden, Chinese Academy of Sciences, Wuhan, 430074, China; University of Chinese Academy of Sciences, Beijing, 100049, China
- Wenjuan Li
  - State Key Laboratory of Plant Diversity and Specialty Crops, Wuhan Botanical Garden, Chinese Academy of Sciences, Wuhan, 430074, China; University of Chinese Academy of Sciences, Beijing, 100049, China
- Lisha Tan
  - State Key Laboratory of Plant Diversity and Specialty Crops, Wuhan Botanical Garden, Chinese Academy of Sciences, Wuhan, 430074, China
- Ju Liang
  - Turpan Institute of Agricultural Sciences, Xinjiang Academy of Agricultural Sciences, Xinjiang, 830091, China
- Lin Meng
  - State Key Laboratory of Plant Diversity and Specialty Crops, Wuhan Botanical Garden, Chinese Academy of Sciences, Wuhan, 430074, China
- Haiping Xin
  - State Key Laboratory of Plant Diversity and Specialty Crops, Wuhan Botanical Garden, Chinese Academy of Sciences, Wuhan, 430074, China; Key Laboratory of Plant Germplasm Enhancement and Specialty Agriculture, Wuhan Botanical Garden, Chinese Academy of Sciences, Wuhan, 430074, China
4
Lock C, Gabriel MM, Bentlage B. Transcriptomic signatures across a critical sedimentation threshold in a major reef-building coral. Front Physiol 2024; 15:1303681. [PMID: 38919851 PMCID: PMC11196755 DOI: 10.3389/fphys.2024.1303681] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/28/2023] [Accepted: 05/10/2024] [Indexed: 06/27/2024] Open
Abstract
Sedimentation is a major cause of global near-shore coral reef decline. Although the negative impacts of sedimentation on coral reef community composition have been well-documented, the effects of sedimentation on coral metabolism in situ have received comparatively little attention. Using transcriptomics, we identified gene expression patterns changing across a previously defined sedimentation threshold that was deemed critical due to changes in coral cover and community composition. We identified genes, pathways, and molecular processes associated with this transition that may allow corals, such as Porites lobata, to tolerate chronic, severe sedimentation and persist in turbid environments. Alternative energy generation pathways may help P. lobata maintain a persistent stress response to survive when the availability of light and oxygen is diminished. We found evidence for the expression of genes linked to increased environmental sensing and cellular communication that likely allow P. lobata to efficiently respond to sedimentation stress and associated pathogen challenges. Cell damage increases under stress; consequently, we found apoptosis pathways over-represented under severe sedimentation, a likely consequence of damaged cell removal to maintain colony integrity. The results presented here provide a framework for the response of P. lobata to sedimentation stress under field conditions. Testing this framework and its related hypotheses using multi-omics approaches can deepen our understanding of the metabolic plasticity and acclimation potential of corals to sedimentation and their resilience in turbid reef systems.
5
Meng R, Li Z, Kang X, Zhang Y, Wang Y, Ma Y, Wu Y, Dong S, Li X, Gao L, Chu X, Yang G, Yuan X, Wang J. High Overexpression of SiAAP9 Leads to Growth Inhibition and Protein Ectopic Localization in Transgenic Arabidopsis. Int J Mol Sci 2024; 25:5840. [PMID: 38892028 PMCID: PMC11172308 DOI: 10.3390/ijms25115840] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/01/2024] [Revised: 05/24/2024] [Accepted: 05/24/2024] [Indexed: 06/21/2024] Open
Abstract
Amino acid permease (AAP) transporters are crucial for the long-distance transport of amino acids in plants, from source to sink. While Arabidopsis and rice have been extensively studied, research on foxtail millet is limited. This study identified two transcripts of SiAAP9, both of which were induced by NO3- and showed similar expression patterns. The overexpression of SiAAP9L and SiAAP9S in Arabidopsis inhibited plant growth and reduced seed size, although SiAAP9 was found to transport more amino acids into seeds. Furthermore, SiAAP9-OX transgenic Arabidopsis showed increased tolerance to high concentrations of glutamate (Glu) and histidine (His). The high overexpression level of SiAAP9 suggested that its protein was located not only on the plasma membrane but potentially also on other organelles. Interestingly, sequence deletion reduced SiAAP9's sensitivity to Brefeldin A (BFA), and SiAAP9 had ectopic localization on the endoplasmic reticulum (ER). Protoplast amino acid uptake experiments indicated that SiAAP9 enhanced Glu transport into foxtail millet cells. Overall, the two transcripts of SiAAP9 have similar functions, but SiAAP9L shows a higher colocalization with BFA compartments compared to SiAAP9S. Our research identifies a potential candidate gene for enhancing the nutritional quality of foxtail millet through breeding.
Affiliation(s)
- Ru Meng
  - College of Agriculture, Shanxi Agricultural University, Jinzhong 030801, China
- Zhipeng Li
  - College of Agriculture, Shanxi Agricultural University, Jinzhong 030801, China
- Xueting Kang
  - College of Agriculture, Shanxi Agricultural University, Jinzhong 030801, China
- Yujia Zhang
  - College of Agriculture, Shanxi Agricultural University, Jinzhong 030801, China
- Yiru Wang
  - College of Agriculture, Shanxi Agricultural University, Jinzhong 030801, China
- Yuchao Ma
  - College of Agriculture, Shanxi Agricultural University, Jinzhong 030801, China
- Yanfeng Wu
  - College of Agriculture, Shanxi Agricultural University, Jinzhong 030801, China
- Shuqi Dong
  - College of Agriculture, Shanxi Agricultural University, Jinzhong 030801, China
  - State Key Laboratory of Sustainable Dryland Agriculture (in Preparation), Shanxi Agricultural University, Jinzhong 030801, China
- Xiaorui Li
  - College of Agriculture, Shanxi Agricultural University, Jinzhong 030801, China
  - State Key Laboratory of Sustainable Dryland Agriculture (in Preparation), Shanxi Agricultural University, Jinzhong 030801, China
- Lulu Gao
  - College of Agriculture, Shanxi Agricultural University, Jinzhong 030801, China
- Xiaoqian Chu
  - College of Agriculture, Shanxi Agricultural University, Jinzhong 030801, China
- Guanghui Yang
  - College of Agriculture, Shanxi Agricultural University, Jinzhong 030801, China
- Xiangyang Yuan
  - College of Agriculture, Shanxi Agricultural University, Jinzhong 030801, China
  - State Key Laboratory of Sustainable Dryland Agriculture (in Preparation), Shanxi Agricultural University, Jinzhong 030801, China
- Jiagang Wang
  - College of Agriculture, Shanxi Agricultural University, Jinzhong 030801, China
  - Hou Ji Laboratory in Shanxi Province, Shanxi Agricultural University, Jinzhong 030801, China
6
Brooks TG, Lahens NF, Mrčela A, Grant GR. Challenges and best practices in omics benchmarking. Nat Rev Genet 2024; 25:326-339. [PMID: 38216661 DOI: 10.1038/s41576-023-00679-6] [Citation(s) in RCA: 9] [Impact Index Per Article: 9.0] [Accepted: 11/14/2023] [Indexed: 01/14/2024]
Abstract
Technological advances enabling massively parallel measurement of biological features - such as microarrays, high-throughput sequencing and mass spectrometry - have ushered in the omics era, now in its third decade. The resulting complex landscape of analytical methods has naturally fostered the growth of an omics benchmarking industry. Benchmarking refers to the process of objectively comparing and evaluating the performance of different computational or analytical techniques when processing and analysing large-scale biological data sets, such as transcriptomics, proteomics and metabolomics. With thousands of omics benchmarking studies published over the past 25 years, the field has matured to the point where the foundations of benchmarking have been established and well described. However, generating meaningful benchmarking data and properly evaluating performance in this complex domain remains challenging. In this Review, we highlight some common oversights and pitfalls in omics benchmarking. We also establish a methodology to bring the issues that can be addressed into focus and to be transparent about those that cannot: this takes the form of a spreadsheet template of guidelines for comprehensive reporting, intended to accompany publications. In addition, a survey of recent developments in benchmarking is provided as well as specific guidance for commonly encountered difficulties.
Affiliation(s)
- Thomas G Brooks
  - Institute for Translational Medicine and Therapeutics, University of Pennsylvania, Philadelphia, PA, USA
- Nicholas F Lahens
  - Institute for Translational Medicine and Therapeutics, University of Pennsylvania, Philadelphia, PA, USA
- Antonijo Mrčela
  - Institute for Translational Medicine and Therapeutics, University of Pennsylvania, Philadelphia, PA, USA
- Gregory R Grant
  - Institute for Translational Medicine and Therapeutics, University of Pennsylvania, Philadelphia, PA, USA
  - Department of Genetics, University of Pennsylvania, Philadelphia, PA, USA
7
Westrin KJ, Kretzschmar WW, Emanuelsson O. ClusTrast: a short read de novo transcript isoform assembler guided by clustered contigs. BMC Bioinformatics 2024; 25:54. [PMID: 38302873 PMCID: PMC10836024 DOI: 10.1186/s12859-024-05663-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/21/2022] [Accepted: 01/18/2024] [Indexed: 02/03/2024] Open
Abstract
BACKGROUND Transcriptome assembly from RNA-sequencing data in species without a reliable reference genome has to be performed de novo, but studies have shown that de novo methods often have inadequate ability to reconstruct transcript isoforms. We address this issue by constructing an assembly pipeline whose main purpose is to produce a comprehensive set of transcript isoforms. RESULTS We present the de novo transcript isoform assembler ClusTrast, which takes short-read RNA-seq data as input, assembles a primary assembly, clusters a set of guiding contigs, aligns the short reads to the guiding contigs, assembles each clustered set of short reads individually, and merges the primary and clusterwise assemblies into the final assembly. We tested ClusTrast on real datasets from six eukaryotic species, and showed that ClusTrast reconstructed more expressed known isoforms than any of the other tested de novo assemblers, at a moderate reduction in precision. For recall, ClusTrast was on top in the lower end of expression levels (<15th percentile) for all tested datasets, and over the entire range for almost all datasets. Reference transcripts were often (35-69% for the six datasets) reconstructed to at least 95% of their length by ClusTrast, and more than half of reference transcripts (58-81%) were reconstructed with contigs that exhibited polymorphism, as measured on a subset of reliably predicted contigs. ClusTrast recall increased when using a union of assembled transcripts from more than one assembly tool as primary assembly. CONCLUSION We suggest that ClusTrast can be a useful tool for studying isoforms in species without a reliable reference genome, in particular when the goal is to produce a comprehensive transcriptome set with polymorphic variants.
Affiliation(s)
- Karl Johan Westrin
  - Science for Life Laboratory, Department of Gene Technology, KTH Royal Institute of Technology, 171 65, Solna, Sweden
- Warren W Kretzschmar
  - Science for Life Laboratory, Department of Gene Technology, KTH Royal Institute of Technology, 171 65, Solna, Sweden
  - Department of Medicine Huddinge, Center for Hematology and Regenerative Medicine (HERM), Karolinska Institute, 141 52, Flemingsberg, Sweden
- Olof Emanuelsson
  - Science for Life Laboratory, Department of Gene Technology, KTH Royal Institute of Technology, 171 65, Solna, Sweden
8
Fenn A, Tsoy O, Faro T, Rößler FM, Dietrich A, Kersting J, Louadi Z, Lio CT, Völker U, Baumbach J, Kacprowski T, List M. Alternative splicing analysis benchmark with DICAST. NAR Genom Bioinform 2023; 5:lqad044. [PMID: 37260511 PMCID: PMC10227362 DOI: 10.1093/nargab/lqad044] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/04/2022] [Revised: 04/13/2023] [Accepted: 05/05/2023] [Indexed: 06/02/2023] Open
Abstract
Alternative splicing is a major contributor to transcriptome and proteome diversity in health and disease. A plethora of tools have been developed for studying alternative splicing in RNA-seq data. Previous benchmarks focused on isoform quantification and mapping. They neglected event detection tools, which arguably provide the most detailed insights into the alternative splicing process. DICAST offers a modular and extensible framework for analysing alternative splicing integrating eleven splice-aware mapping and eight event detection tools. We benchmark all tools extensively on simulated as well as whole blood RNA-seq data. STAR and HISAT2 demonstrated the best balance between performance and run time. The performance of event detection tools varies widely with no tool outperforming all others. DICAST allows researchers to employ a consensus approach to consider the most successful tools jointly for robust event detection. Furthermore, we propose the first reporting standard to unify existing formats and to guide future tool development.
Affiliation(s)
- Tim Faro
  - Chair of Experimental Bioinformatics, Technical University of Munich, 85354 Freising, Germany
- Fanny L M Rößler
  - Chair of Experimental Bioinformatics, Technical University of Munich, 85354 Freising, Germany
- Alexander Dietrich
  - Chair of Experimental Bioinformatics, Technical University of Munich, 85354 Freising, Germany
- Johannes Kersting
  - Chair of Experimental Bioinformatics, Technical University of Munich, 85354 Freising, Germany
- Zakaria Louadi
  - Chair of Experimental Bioinformatics, Technical University of Munich, 85354 Freising, Germany
  - Institute for Computational Systems Biology, University of Hamburg, Notkestrasse 9, 22607 Hamburg, Germany
- Chit Tong Lio
  - Chair of Experimental Bioinformatics, Technical University of Munich, 85354 Freising, Germany
  - Institute for Computational Systems Biology, University of Hamburg, Notkestrasse 9, 22607 Hamburg, Germany
- Uwe Völker
  - Interfaculty Institute for Genetics and Functional Genomics, University Medicine Greifswald, Felix-Hausdorff-Straße 8, D-17475 Greifswald, Germany
  - DZHK (German Centre for Cardiovascular Research), Partner Site Greifswald, Greifswald, Germany
- Jan Baumbach
  - Institute for Computational Systems Biology, University of Hamburg, Notkestrasse 9, 22607 Hamburg, Germany
  - Institute of Mathematics and Computer Science, University of Southern Denmark, Campusvej 55, 5000 Odense, Denmark
- Markus List
  - To whom correspondence should be addressed. Tel: +49 8161 71 2761;
9
Yang A, Tang JYS, Troup M, Ho JWK. Scavenger: A pipeline for recovery of unaligned reads utilising similarity with aligned reads. F1000Res 2022; 8:1587. [PMID: 32913631 PMCID: PMC7459848 DOI: 10.12688/f1000research.19426.2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 08/08/2022] [Indexed: 11/25/2022] Open
Abstract
Read alignment is an important step in RNA-seq analysis, as the result of alignment forms the basis for downstream analyses. However, recent studies have shown that published alignment tools have variable mapping sensitivity and do not necessarily align all the reads that should have been aligned, a problem we term the false-negative non-alignment problem. Here we present Scavenger, a Python-based bioinformatics pipeline for recovering unaligned reads using a novel mechanism in which a putative alignment location is discovered based on sequence similarity between aligned and unaligned reads. We showed that Scavenger could recover unaligned reads in a range of simulated and real RNA-seq datasets, including single-cell RNA-seq data. We found that recovered reads tend to contain more genetic variants with respect to the reference genome compared to previously aligned reads, indicating that divergence between personal and reference genomes plays a role in the false-negative non-alignment problem. Even when the number of recovered reads is relatively small compared to the total number of reads, the addition of these recovered reads can impact downstream analyses, especially in terms of estimating the expression and differential expression of lowly expressed genes, such as pseudogenes.
Affiliation(s)
- Andrian Yang
  - Victor Chang Cardiac Research Institute, Sydney, NSW, 2010, Australia
  - St. Vincent’s Clinical School, University of New South Wales, Sydney, NSW, 2052, Australia
- Joshua Y. S. Tang
  - Victor Chang Cardiac Research Institute, Sydney, NSW, 2010, Australia
  - St. Vincent’s Clinical School, University of New South Wales, Sydney, NSW, 2052, Australia
- Michael Troup
  - Victor Chang Cardiac Research Institute, Sydney, NSW, 2010, Australia
- Joshua W. K. Ho
  - Victor Chang Cardiac Research Institute, Sydney, NSW, 2010, Australia
  - St. Vincent’s Clinical School, University of New South Wales, Sydney, NSW, 2052, Australia
  - School of Biomedical Sciences, Li Ka Shing Faculty of Medicine, University of Hong Kong, Hong Kong, China
10
Wang Y, Xue H, Aglave M, Lainé A, Gallopin M, Gautheret D. The contribution of uncharted RNA sequences to tumor identity in lung adenocarcinoma. NAR Cancer 2022; 4:zcac001. [PMID: 35118386 PMCID: PMC8807116 DOI: 10.1093/narcan/zcac001] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/31/2021] [Revised: 11/18/2021] [Accepted: 01/10/2022] [Indexed: 11/12/2022] Open
Abstract
The identity of cancer cells is defined by the interplay between genetic, epigenetic, transcriptional and post-transcriptional variation. Much of this variation is present in RNA-seq data and can be captured at once using reference-free, k-mer analysis. An important issue with k-mer analysis, however, is the difficulty of distinguishing signal from noise. Here, we use two independent lung adenocarcinoma datasets to identify all reproducible events at the k-mer level, in a tumor versus normal setting. We find reproducible events in many different locations (introns, intergenic, repeats) and forms (spliced, polyadenylated, chimeric etc.). We systematically analyze events that are ignored in conventional transcriptomics and assess their value as biomarkers and for tumor classification, survival prediction, neoantigen prediction and correlation with the immune microenvironment. We find that unannotated lincRNAs, novel splice variants, endogenous HERV, Line1 and Alu repeats and bacterial RNAs each contribute to different, important aspects of tumor identity. We argue that differential RNA-seq analysis of tumor/normal sample collections would benefit from this type of k-mer analysis to cast a wider net on important cancer-related events. The code is available at https://github.com/Transipedia/dekupl-lung-cancer-inter-cohort.
Affiliation(s)
- Yunfeng Wang
  - Institute for Integrative Biology of the Cell (I2BC), Université Paris-Saclay, CNRS, CEA, 1 avenue de la Terrasse, 91190, Gif-sur-Yvette, France
  - Annoroad Gene Technology Co., Ltd, 100176 Beijing, China
- Haoliang Xue
  - Institute for Integrative Biology of the Cell (I2BC), Université Paris-Saclay, CNRS, CEA, 1 avenue de la Terrasse, 91190, Gif-sur-Yvette, France
- Marine Aglave
  - Institute for Integrative Biology of the Cell (I2BC), Université Paris-Saclay, CNRS, CEA, 1 avenue de la Terrasse, 91190, Gif-sur-Yvette, France
  - Gustave Roussy, 114 rue Edouard Vaillant, 94800, Villejuif, France
- Antoine Lainé
  - Institute for Integrative Biology of the Cell (I2BC), Université Paris-Saclay, CNRS, CEA, 1 avenue de la Terrasse, 91190, Gif-sur-Yvette, France
- Mélina Gallopin
  - Institute for Integrative Biology of the Cell (I2BC), Université Paris-Saclay, CNRS, CEA, 1 avenue de la Terrasse, 91190, Gif-sur-Yvette, France
- Daniel Gautheret
  - Institute for Integrative Biology of the Cell (I2BC), Université Paris-Saclay, CNRS, CEA, 1 avenue de la Terrasse, 91190, Gif-sur-Yvette, France
  - Gustave Roussy, 114 rue Edouard Vaillant, 94800, Villejuif, France
11
Experimental Design for Time-Series RNA-Seq Analysis of Gene Expression and Alternative Splicing. Methods Mol Biol 2021. [PMID: 34674176 DOI: 10.1007/978-1-0716-1912-4_14] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/29/2023]
Abstract
RNA-sequencing (RNA-seq) is currently the method of choice for analysis of differential gene expression. To fully exploit the wealth of data generated from genome-wide transcriptomic approaches, the initial design of the experiment is of paramount importance. Biological rhythms in nature are pervasive and are driven by endogenous gene networks collectively known as circadian clocks. Measuring circadian gene expression requires time-course experiments which take into account time-of-day factors influencing variability in expression levels. We describe here an approach for characterizing diurnal changes in expression and alternative splicing for plants undergoing cooling. The method uses inexpensive everyday laboratory equipment and utilizes an RNA-seq application (3D RNA-seq) that can handle complex experimental designs and requires little or no prior bioinformatics expertise.
|
12
|
Lahens NF, Brooks TG, Sarantopoulou D, Nayak S, Lawrence C, Mrčela A, Srinivasan A, Schug J, Hogenesch JB, Barash Y, Grant GR. CAMPAREE: a robust and configurable RNA expression simulator. BMC Genomics 2021; 22:692. [PMID: 34563123 PMCID: PMC8467241 DOI: 10.1186/s12864-021-07934-2] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 02/03/2021] [Accepted: 08/17/2021] [Indexed: 11/10/2022] Open
Abstract
Background: The accurate interpretation of RNA-Seq data presents a moving target as scientists continue to introduce new experimental techniques and analysis algorithms. Simulated datasets are an invaluable tool to accurately assess the performance of RNA-Seq analysis methods. However, existing RNA-Seq simulators focus on modeling the technical biases and artifacts of sequencing, rather than on simulating the original RNA samples. A first step in simulating RNA-Seq is to simulate RNA. Results: To fill this need, we developed the Configurable And Modular Program Allowing RNA Expression Emulation (CAMPAREE), a simulator using empirical data to simulate diploid RNA samples at the level of individual molecules. We demonstrated CAMPAREE’s use for generating idealized coverage plots from real data, and for adding the ability to generate allele-specific data to existing RNA-Seq simulators that do not natively support this feature. Conclusions: Separating input sample modeling from library preparation/sequencing offers added flexibility for both users and developers to mix-and-match different sample and sequencing simulators to suit their specific needs. Furthermore, the ability to maintain sample and sequencing simulators independently provides greater agility to incorporate new biological findings about transcriptomics and new developments in sequencing technologies. Additionally, by simulating at the level of individual molecules, CAMPAREE has the potential to model molecules transcribed from the same genes as a heterogeneous population of transcripts with different states of degradation and processing (splicing, editing, etc.). CAMPAREE was developed in Python, is open source, and freely available at https://github.com/itmat/CAMPAREE. Supplementary Information: The online version contains supplementary material available at 10.1186/s12864-021-07934-2.
Affiliation(s)
- Nicholas F Lahens
- The Institute for Translational Medicine and Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
| | - Thomas G Brooks
- The Institute for Translational Medicine and Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
| | - Dimitra Sarantopoulou
- The Institute for Translational Medicine and Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA.,Present address: National Institute on Aging, National Institutes of Health, Baltimore, Maryland, USA
| | - Soumyashant Nayak
- Statistics and Mathematics Unit, Indian Statistical Institute, Bengaluru, Karnataka, India
| | - Cris Lawrence
- The Institute for Translational Medicine and Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
| | - Antonijo Mrčela
- The Institute for Translational Medicine and Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
| | - Anand Srinivasan
- Perelman School of Medicine, Enterprise Research Applications and High Performance Computing, Penn Medicine Academic Computing Services, University of Pennsylvania, Philadelphia, Pennsylvania, USA
| | - Jonathan Schug
- The Institute for Diabetes, Obesity and Metabolism, The Department of Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
| | - John B Hogenesch
- Division of Human Genetics, Department of Pediatrics, Center for Chronobiology, Cincinnati Children's Hospital Medical Center, Cincinnati, Ohio, USA
| | - Yoseph Barash
- The Department of Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
| | - Gregory R Grant
- The Institute for Translational Medicine and Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA. .,The Department of Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA.
| |
|
13
|
Alqassem I, Sonthalia Y, Klitzke-Feser E, Shim H, Canzar S. McSplicer: a probabilistic model for estimating splice site usage from RNA-seq data. Bioinformatics 2021; 37:2004–2011. [PMID: 33515239 PMCID: PMC8337008 DOI: 10.1093/bioinformatics/btab050] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 08/05/2020] [Revised: 01/20/2021] [Accepted: 01/21/2021] [Indexed: 11/23/2022] Open
Abstract
Motivation: Alternative splicing removes intronic sequences from pre-mRNAs in alternative ways to produce different forms (isoforms) of mature mRNA. The composition of expressed transcripts gives specific functionalities to cells in a particular condition or developmental stage. In addition, a large fraction of human disease mutations affect splicing and lead to aberrant mRNA and protein products. Current methods that interrogate the transcriptome based on RNA-seq either suffer from short read length when trying to infer full-length transcripts, or are restricted to predefined units of alternative splicing that they quantify from local read evidence. Results: Instead of attempting to quantify individual outcomes of the splicing process such as local splicing events or full-length transcripts, we propose to quantify alternative splicing using a simplified probabilistic model of the underlying splicing process. Our model is based on the usage of individual splice sites and can generate arbitrarily complex types of splicing patterns. In our implementation, McSplicer, we estimate the parameters of our model using all read data at once and we demonstrate in our experiments that this yields more accurate estimates compared to competing methods. Our model is able to describe multiple effects of splicing mutations using few, easy to interpret parameters, as we illustrate in an experiment on RNA-seq data from autism spectrum disorder patients. Availability: McSplicer source code is available at https://github.com/canzarlab/McSplicer and has been deposited in archived format at https://doi.org/10.5281/zenodo.4449881. Supplementary Information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Israa Alqassem
- Gene Center, Ludwig-Maximilians-Universität München, Munich, 81377, Germany
| | | | | | - Heejung Shim
- Melbourne Integrative Genomics (MIG), School of Mathematics and Statistics, University of Melbourne, Melbourne, 3010, Australia
| | - Stefan Canzar
- Gene Center, Ludwig-Maximilians-Universität München, Munich, 81377, Germany
| |
|
14
|
Nanoribbon-Based Electronic Detection of a Glioma-Associated Circular miRNA. BIOSENSORS-BASEL 2021; 11:bios11070237. [PMID: 34356707 PMCID: PMC8301916 DOI: 10.3390/bios11070237] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 06/02/2021] [Revised: 07/08/2021] [Accepted: 07/09/2021] [Indexed: 12/29/2022]
Abstract
Nanoribbon chips based on “silicon-on-insulator” structures (SOI-NR chips) have been fabricated. These SOI-NR chips, whose surface was sensitized with covalently immobilized oligonucleotide molecular probes (oDNA probes), have been employed for the nanoribbon biosensor-based detection of a circular ribonucleic acid (circRNA) molecular marker of glioma in humans. The nucleotide sequence of the oDNA probes was complementary to the sequence of the target oDNA. The latter represents a synthetic analogue of a glioma marker, NFIX circular RNA. In this way, the detection of target oDNA molecules in a pure buffer has been performed. The lowest concentration of the target biomolecules detectable in our experiments was on the order of 10⁻¹⁷ M. The SOI-NR sensor chips proposed herein have allowed us to reveal an elevated level of the NFIX circular RNA in the blood of a glioma patient.
|
15
|
Melsted P, Booeshaghi AS, Liu L, Gao F, Lu L, Min KHJ, da Veiga Beltrame E, Hjörleifsson KE, Gehring J, Pachter L. Modular, efficient and constant-memory single-cell RNA-seq preprocessing. Nat Biotechnol 2021; 39:813-818. [PMID: 33795888 DOI: 10.1038/s41587-021-00870-2] [Citation(s) in RCA: 227] [Impact Index Per Article: 56.8] [Received: 08/07/2019] [Accepted: 02/09/2021] [Indexed: 11/08/2022]
Abstract
We describe a workflow for preprocessing of single-cell RNA-sequencing data that balances efficiency and accuracy. Our workflow is based on the kallisto and bustools programs, and is near optimal in speed with a constant memory requirement providing scalability for arbitrarily large datasets. The workflow is modular, and we demonstrate its flexibility by showing how it can be used for RNA velocity analyses.
Affiliation(s)
- Páll Melsted
- Faculty of Industrial Engineering, Mechanical Engineering and Computer Science, University of Iceland, Reykjavik, Iceland
| | - A Sina Booeshaghi
- Department of Mechanical Engineering, California Institute of Technology, Pasadena, CA, USA
| | - Lauren Liu
- Department of Computing and Mathematical Sciences, California Institute of Technology, Pasadena, CA, USA
| | - Fan Gao
- Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, USA
- Bioinformatics Resource Center, Beckman Institute, California Institute of Technology, Pasadena, CA, USA
| | - Lambda Lu
- Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, USA
| | - Kyung Hoi Joseph Min
- Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA, USA
| | - Eduardo da Veiga Beltrame
- Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, USA
| | | | - Jase Gehring
- Department of Genome Science, University of Washington, Seattle, WA, USA
| | - Lior Pachter
- Department of Computing and Mathematical Sciences, California Institute of Technology, Pasadena, CA, USA.
- Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, USA.
| |
|
16
|
Hu Y, Fang L, Chen X, Zhong JF, Li M, Wang K. LIQA: long-read isoform quantification and analysis. Genome Biol 2021; 22:182. [PMID: 34140043 PMCID: PMC8212471 DOI: 10.1186/s13059-021-02399-8] [Citation(s) in RCA: 46] [Impact Index Per Article: 11.5] [Received: 09/18/2020] [Accepted: 06/04/2021] [Indexed: 11/10/2022] Open
Abstract
Long-read RNA sequencing (RNA-seq) technologies can sequence full-length transcripts, facilitating the exploration of isoform-specific gene expression over short-read RNA-seq. We present LIQA to quantify isoform expression and detect differential alternative splicing (DAS) events using long-read direct mRNA sequencing or cDNA sequencing data. LIQA incorporates base pair quality score and isoform-specific read length information in a survival model to assign different weights across reads, and uses an expectation-maximization algorithm for parameter estimation. We apply LIQA to long-read RNA-seq data from the Universal Human Reference, acute myeloid leukemia, and esophageal squamous epithelial cells and demonstrate its high accuracy in profiling alternative splicing events.
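The expectation-maximization step mentioned in this abstract can be illustrated with a generic sketch. This is not LIQA's model: it omits the quality-score and read-length weighting (the survival model) that the abstract describes, and the function name `em_isoform_abundance` and the read-compatibility input format are assumptions made for illustration only. It shows plain EM for relative isoform abundance when some reads are compatible with more than one isoform.

```python
def em_isoform_abundance(compat, n_iter=200, tol=1e-9):
    """Estimate relative isoform abundances by plain EM.

    compat[r] is the non-empty list of isoform indices that read r is
    compatible with (a hypothetical input format for this sketch).
    Every read carries equal weight here; a weighted method would
    scale each read's contribution instead.
    """
    n_iso = 1 + max(i for row in compat for i in row)
    theta = [1.0 / n_iso] * n_iso  # start from uniform abundances
    for _ in range(n_iter):
        counts = [0.0] * n_iso
        # E-step: split each read across its compatible isoforms
        # in proportion to the current abundance estimates.
        for row in compat:
            total = sum(theta[i] for i in row)
            for i in row:
                counts[i] += theta[i] / total
        # M-step: abundances are the normalised expected read counts.
        new_theta = [c / len(compat) for c in counts]
        converged = max(abs(a - b) for a, b in zip(theta, new_theta)) < tol
        theta = new_theta
        if converged:
            break
    return theta


# Six reads unique to isoform 0, two unique to isoform 1, four ambiguous.
reads = [[0]] * 6 + [[1]] * 2 + [[0, 1]] * 4
abundances = em_isoform_abundance(reads)
```

In this toy input the ambiguous reads are shared according to the ratio set by the uniquely assigned reads, so the estimate settles near 0.75 for isoform 0 and 0.25 for isoform 1.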
Affiliation(s)
- Yu Hu
- Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children's Hospital of Philadelphia, Philadelphia, PA, 19104, USA
| | - Li Fang
- Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children's Hospital of Philadelphia, Philadelphia, PA, 19104, USA
| | - Xuelian Chen
- Department of Otolaryngology, Keck School of Medicine, University of Southern California, Los Angeles, CA, 90033, USA
| | - Jiang F Zhong
- Department of Otolaryngology, Keck School of Medicine, University of Southern California, Los Angeles, CA, 90033, USA
| | - Mingyao Li
- Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, 19104, USA
| | - Kai Wang
- Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children's Hospital of Philadelphia, Philadelphia, PA, 19104, USA.
- Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, 19104, USA.
| |
|
17
|
Sarantopoulou D, Brooks TG, Nayak S, Mrčela A, Lahens NF, Grant GR. Comparative evaluation of full-length isoform quantification from RNA-Seq. BMC Bioinformatics 2021; 22:266. [PMID: 34034652 PMCID: PMC8145802 DOI: 10.1186/s12859-021-04198-1] [Citation(s) in RCA: 24] [Impact Index Per Article: 6.0] [Received: 05/29/2020] [Accepted: 05/16/2021] [Indexed: 11/18/2022] Open
Abstract
Background: Full-length isoform quantification from RNA-Seq is a key goal in transcriptomics analyses and has been an area of active development since the beginning. The fundamental difficulty stems from the fact that RNA transcripts are long, while RNA-Seq reads are short. Results: Here we use simulated benchmarking data that reflects many properties of real data, including polymorphisms, intron signal and non-uniform coverage, allowing for systematic comparative analyses of isoform quantification accuracy and its impact on differential expression analysis. Genome-, transcriptome- and pseudoalignment-based methods are included, and a simple approach is included as a baseline control. Conclusions: Salmon, kallisto, RSEM, and Cufflinks exhibit the highest accuracy on idealized data, while on more realistic data they do not perform dramatically better than the simple approach. We determine the structural parameters with the greatest impact on quantification accuracy to be length and sequence compression complexity, rather than the number of isoforms. The effect of incomplete annotation on performance is also investigated. Overall, the tested methods show sufficient divergence from the truth to suggest that full-length isoform quantification and isoform-level differential expression (DE) analysis should still be employed selectively. Supplementary Information: The online version contains supplementary material available at 10.1186/s12859-021-04198-1.
Affiliation(s)
- Dimitra Sarantopoulou
- Institute for Translational Medicine and Therapeutics, University of Pennsylvania, Philadelphia, PA, USA.,National Institute on Aging, National Institutes of Health, Baltimore, MD, USA
| | - Thomas G Brooks
- Institute for Translational Medicine and Therapeutics, University of Pennsylvania, Philadelphia, PA, USA
| | - Soumyashant Nayak
- Institute for Translational Medicine and Therapeutics, University of Pennsylvania, Philadelphia, PA, USA
| | - Antonijo Mrčela
- Institute for Translational Medicine and Therapeutics, University of Pennsylvania, Philadelphia, PA, USA
| | - Nicholas F Lahens
- Institute for Translational Medicine and Therapeutics, University of Pennsylvania, Philadelphia, PA, USA
| | - Gregory R Grant
- Institute for Translational Medicine and Therapeutics, University of Pennsylvania, Philadelphia, PA, USA. .,Department of Genetics, University of Pennsylvania, Philadelphia, PA, USA.
| |
|
18
|
Liu Q, Hu Y, Stucky A, Fang L, Zhong JF, Wang K. LongGF: computational algorithm and software tool for fast and accurate detection of gene fusions by long-read transcriptome sequencing. BMC Genomics 2020; 21:793. [PMID: 33372596 PMCID: PMC7771079 DOI: 10.1186/s12864-020-07207-4] [Citation(s) in RCA: 23] [Impact Index Per Article: 4.6] [Received: 10/21/2020] [Accepted: 10/29/2020] [Indexed: 12/17/2022] Open
Abstract
Background: Long-read RNA-Seq techniques can generate reads that encompass a large proportion of, or the entire, mRNA/cDNA molecule, so they are expected to address inherent limitations of short-read RNA-Seq techniques, which typically generate reads shorter than 150 bp. However, there is a general lack of software tools for gene fusion detection from long-read RNA-seq data that take into account the high basecalling error rates and the presence of alignment errors. Results: In this study, we developed a fast computational tool, LongGF, to efficiently detect candidate gene fusions from long-read RNA-seq data, including cDNA sequencing data and direct mRNA sequencing data. We evaluated LongGF on tens of simulated long-read RNA-seq datasets, and demonstrated its superior performance in gene fusion detection. We also tested LongGF on a Nanopore direct mRNA sequencing dataset and a PacBio sequencing dataset generated on a mixture of 10 cancer cell lines, and found that LongGF outperformed existing computational tools in detecting known gene fusions. Furthermore, we tested LongGF on a Nanopore cDNA sequencing dataset on acute myeloid leukemia, and pinpointed at base resolution the exact location of a translocation previously known only at cytogenetic resolution, which was further validated by Sanger sequencing. Conclusions: In summary, LongGF will greatly facilitate the discovery of candidate gene fusion events from long-read RNA-Seq data, especially in cancer samples. LongGF is implemented in C++ and is available at https://github.com/WGLab/LongGF.
Affiliation(s)
- Qian Liu
- Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children's Hospital of Philadelphia, Philadelphia, PA, 19104, USA
| | - Yu Hu
- Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children's Hospital of Philadelphia, Philadelphia, PA, 19104, USA
| | - Andres Stucky
- Department of Otolaryngology, Keck School of Medicine, University of Southern California, Los Angeles, CA, 90033, USA
| | - Li Fang
- Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children's Hospital of Philadelphia, Philadelphia, PA, 19104, USA
| | - Jiang F Zhong
- Department of Otolaryngology, Keck School of Medicine, University of Southern California, Los Angeles, CA, 90033, USA
| | - Kai Wang
- Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children's Hospital of Philadelphia, Philadelphia, PA, 19104, USA.
- Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, 19104, USA.
| |
|
19
|
Yu R, Yang W, Wang S. Performance evaluation of lossy quality compression algorithms for RNA-seq data. BMC Bioinformatics 2020; 21:321. [PMID: 32689929 PMCID: PMC7372835 DOI: 10.1186/s12859-020-03658-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/21/2020] [Accepted: 07/13/2020] [Indexed: 11/29/2022] Open
Abstract
Background: Recent advancements in high-throughput sequencing technologies have generated an unprecedented amount of genomic data that must be stored, processed, and transmitted over the network for sharing. Lossy genomic data compression, especially of the base quality values of sequencing data, is emerging as an efficient way to handle this challenge due to its superior compression performance compared to lossless compression methods. Many lossy compression algorithms have been developed for and evaluated using DNA sequencing data. However, whether these algorithms can be used on RNA sequencing (RNA-seq) data remains unclear. Results: In this study, we evaluated the impact of lossy quality value compression on common RNA-seq data analysis pipelines, including expression quantification, transcriptome assembly, and short variant detection, using RNA-seq data from different species and sequencing platforms. Our study shows that lossy quality value compression could effectively improve RNA-seq data compression. In some cases, lossy algorithms achieved up to 1.2-3 times further reduction in the overall RNA-seq data size compared to existing lossless algorithms. However, lossy quality value compression could affect the results of some RNA-seq data processing pipelines, and hence its impact on RNA-seq studies cannot be ignored in some cases. Pipelines using HISAT2 for alignment were most significantly affected by lossy quality value compression, while no effects were observed on pipelines that do not depend on quality values, e.g., STAR-based expression quantification and transcriptome assembly pipelines. Moreover, regardless of whether STAR or HISAT2 was used as the aligner, variant detection results were affected by lossy quality value compression, albeit to a lesser extent when a STAR-based pipeline was used. Our results also show that the impact of lossy quality value compression depends on the compression algorithm being used and, if the algorithm supports multiple compression levels, on the level chosen. Conclusions: Lossy quality value compression can be incorporated into existing RNA-seq analysis pipelines to alleviate the data storage and transmission burdens. However, care should be taken in the selection of compression tools and levels based on the requirements of the downstream analysis pipelines to avoid introducing undesirable adverse effects on the analysis results.
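As a rough illustration of what lossy quality value compression means, the sketch below bins Phred quality scores so that a FASTQ quality string uses fewer distinct symbols and therefore compresses better downstream. It is not one of the algorithms evaluated in the study, which the abstract does not name; the bin boundaries, representative values, and the function name `quantize_qualities` are hypothetical choices for this example, and Phred+33 encoding is assumed.

```python
# Hypothetical bins: (lowest score, highest score, representative score).
EXAMPLE_BINS = ((0, 9, 6), (10, 19, 15), (20, 29, 25), (30, 93, 37))


def quantize_qualities(qual, bins=EXAMPLE_BINS, offset=33):
    """Map each quality character to the representative of its bin.

    The read length is preserved; only the resolution of the scores
    is reduced, which is what makes the transformation lossy.
    """
    out = []
    for ch in qual:
        score = ord(ch) - offset
        for low, high, representative in bins:
            if low <= score <= high:
                out.append(chr(representative + offset))
                break
        else:
            raise ValueError(f"quality score {score} is outside all bins")
    return "".join(out)


original = "IIIHGF5542#"
binned = quantize_qualities(original)
```

Tools that read quality values downstream (for example some aligners or variant callers) see the binned scores rather than the originals, which is the mechanism by which such compression can change analysis results.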
|
20
|
An improved de novo assembling and polishing of Solea senegalensis transcriptome shed light on retinoic acid signalling in larvae. Sci Rep 2020; 10:20654. [PMID: 33244091 PMCID: PMC7691524 DOI: 10.1038/s41598-020-77201-z] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 06/19/2020] [Accepted: 11/06/2020] [Indexed: 12/17/2022] Open
Abstract
Senegalese sole is an economically important flatfish species in aquaculture and an attractive model to decipher the molecular mechanisms governing the severe transformations occurring during metamorphosis, where retinoic acid seems to play a key role in tissue remodeling. In this study, a robust sole transcriptome was envisaged by reducing the number of assembled libraries (27 out of 111 available), fine-tuning a new automated and reproducible set of workflows for de novo assembling based on several assemblers, and removing low confidence transcripts after mapping onto a sole female genome draft. From a total of 96 resulting assemblies, two "raw" transcriptomes, one containing only Illumina reads and another with Illumina and GS-FLX reads, were selected to provide SOLSEv5.0, the most informative transcriptome with low redundancy and devoid of most single-exon transcripts. It included both Illumina and GS-FLX reads and consisted of 51,348 transcripts of which 22,684 code for 17,429 different proteins described in databases, where 9527 were predicted as complete proteins. SOLSEv5.0 was used as reference for the study of retinoic acid (RA) signalling in sole larvae using drug treatments (DEAB, a RA synthesis blocker, and TTNPB, a RA-receptor agonist) for 24 and 48 h. Differential expression and functional interpretation were facilitated by an updated version of DEGenes Hunter. Acute exposure of both drugs triggered an intense, specific and transient response at 24 h but with hardly observable differences after 48 h at least in the DEAB treatments. Activation of RA signalling by TTNPB specifically increased the expression of genes in pathways related to RA degradation, retinol storage, carotenoid metabolism, homeostatic response and visual cycle, and also modified the expression of transcripts related to morphogenesis and collagen fibril organisation. 
In contrast, DEAB mainly decreased genes related to retinal production, impairing phototransduction signalling in the retina. A total of 755 transcripts mainly related to lipid metabolism, lipid transport and lipid homeostasis were altered in response to both treatments, indicating non-specific drug responses associated with intestinal absorption. These results indicate that a new assembling and transcript sieving were both necessary to provide a reliable transcriptome to identify the many aspects of RA action during sole development that are of relevance for sole aquaculture.
|
21
|
Rodriguez JM, Pozo F, di Domenico T, Vazquez J, Tress ML. An analysis of tissue-specific alternative splicing at the protein level. PLoS Comput Biol 2020; 16:e1008287. [PMID: 33017396 PMCID: PMC7561204 DOI: 10.1371/journal.pcbi.1008287] [Citation(s) in RCA: 51] [Impact Index Per Article: 10.2] [Received: 02/10/2020] [Revised: 10/15/2020] [Accepted: 08/25/2020] [Indexed: 01/09/2023] Open
Abstract
The role of alternative splicing is one of the great unanswered questions in cellular biology. There is strong evidence for alternative splicing at the transcript level, and transcriptomics experiments show that many splice events are tissue specific. It has been suggested that alternative splicing evolved in order to remodel tissue-specific protein-protein networks. Here we investigated the evidence for tissue-specific splicing among splice isoforms detected in a large-scale proteomics analysis. Although the data supporting alternative splicing is limited at the protein level, clear patterns emerged among the small numbers of alternative splice events that we could detect in the proteomics data. More than a third of these splice events were tissue-specific and most were ancient: over 95% of splice events that were tissue-specific in both proteomics and RNAseq analyses evolved prior to the ancestors of lobe-finned fish, at least 400 million years ago. By way of contrast, three in four alternative exons in the human gene set arose in the primate lineage, so our results cannot be extrapolated to the whole genome. Tissue-specific alternative protein forms in the proteomics analysis were particularly abundant in nervous and muscle tissues and their genes had roles related to the cytoskeleton and either the structure of muscle fibres or cell-cell connections. Our results suggest that this conserved tissue-specific alternative splicing may have played a role in the development of the vertebrate brain and heart. We manually curated a set of 255 splice events detected in a large-scale tissue-based proteomics experiment and found that more than a third had evidence of significant tissue-specific differences. Events that were significantly tissue-specific at the protein level were highly conserved; almost 75% evolved over 400 million years ago. The tissues in which we found most evidence for tissue-specific splicing were nervous tissues and cardiac tissues. 
Genes with tissue-specific events in these two tissues had functions related to important cellular structures in brain and heart tissues. These splice events may have been essential for the development of vertebrate heart and muscle. However, our data set may not be representative of alternative exons as a whole. We found that most tissue specific splicing was strongly conserved, but just 5% of annotated alternative exons in the human gene set are ancient. More than three quarters of alternative exons are primate-derived. Although the analysis does not provide a definitive answer to the question of the functional role of alternative splicing, our results do indicate that alternative splice variants may have played a significant part in the evolution of brain and heart tissues in vertebrates.
Affiliation(s)
- Jose Manuel Rodriguez
- Cardiovascular Proteomics Laboratory, Centro Nacional de Investigaciones Cardiovasculares (CNIC), Calle Melchor Fernandez, Madrid, Spain
| | - Fernando Pozo
- Bioinformatics Unit, Spanish National Cancer Research Centre (CNIO), Calle Melchor Fernandez, Madrid, Spain
| | - Tomas di Domenico
- Bioinformatics Unit, Spanish National Cancer Research Centre (CNIO), Calle Melchor Fernandez, Madrid, Spain
| | - Jesus Vazquez
- Cardiovascular Proteomics Laboratory, Centro Nacional de Investigaciones Cardiovasculares (CNIC), Calle Melchor Fernandez, Madrid, Spain
- Centro de Investigación Biomédica en Red de Enfermedades Cardiovasculares (CIBERCV), Madrid, Spain
| | - Michael L. Tress
- Bioinformatics Unit, Spanish National Cancer Research Centre (CNIO), Calle Melchor Fernandez, Madrid, Spain
| |
|
22
|
Transcriptome Profiling Analyses in Psoriasis: A Dynamic Contribution of Keratinocytes to the Pathogenesis. Genes (Basel) 2020; 11:genes11101155. [PMID: 33007857 PMCID: PMC7600703 DOI: 10.3390/genes11101155] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.8] [Received: 09/09/2020] [Revised: 09/28/2020] [Accepted: 09/29/2020] [Indexed: 02/08/2023] Open
Abstract
Psoriasis is an immune-mediated inflammatory skin disease with a complex etiology involving environmental and genetic factors. A better insight into related genomic alteration helps design precise therapies leading to better treatment outcome. Gene expression in psoriasis can provide relevant information about the altered expression of mRNA transcripts, thus giving new insights into the disease onset. Techniques for transcriptome analyses, such as microarray and RNA sequencing (RNA-seq), are relevant tools for the discovery of new biomarkers as well as new therapeutic targets. This review summarizes the findings related to the contribution of keratinocytes in the pathogenesis of psoriasis by an in-depth review of studies that have examined psoriatic transcriptomes in the past years. It also provides valuable information on reconstructed 3D psoriatic skin models using cells isolated from psoriatic patients for transcriptomic studies.
|
23
|
Mao S, Pachter L, Tse D, Kannan S. RefShannon: A genome-guided transcriptome assembler using sparse flow decomposition. PLoS One 2020; 15:e0232946. [PMID: 32484809 PMCID: PMC7266320 DOI: 10.1371/journal.pone.0232946] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 10/18/2019] [Accepted: 04/24/2020] [Indexed: 12/12/2022] Open
Abstract
High throughput sequencing of RNA (RNA-Seq) has become a staple in modern molecular biology, with applications not only in quantifying gene expression but also in isoform-level analysis of the RNA transcripts. To enable such an isoform-level analysis, a transcriptome assembly algorithm is utilized to stitch together the observed short reads into the corresponding transcripts. This task is complicated due to the complexity of alternative splicing - a mechanism by which the same gene may generate multiple distinct RNA transcripts. We develop a novel genome-guided transcriptome assembler, RefShannon, that exploits the varying abundances of the different transcripts, in enabling an accurate reconstruction of the transcripts. Our evaluation shows RefShannon is able to improve sensitivity effectively (up to 22%) at a given specificity in comparison with other state-of-the-art assemblers. RefShannon is written in Python and is available from Github (https://github.com/shunfumao/RefShannon).
Affiliation(s)
- Shunfu Mao
- Department of Electrical and Computer Engineering, University of Washington, Seattle, WA, United States of America
| | - Lior Pachter
- Division of Biology and Biological Engineering, Caltech, Pasadena, CA, United States of America
| | - David Tse
- Department of Electrical Engineering, Stanford University, Stanford, CA, United States of America
| | - Sreeram Kannan
- Department of Electrical and Computer Engineering, University of Washington, Seattle, WA, United States of America
| |
|
24
|
Sheynkman GM, Tuttle KS, Laval F, Tseng E, Underwood JG, Yu L, Dong D, Smith ML, Sebra R, Willems L, Hao T, Calderwood MA, Hill DE, Vidal M. ORF Capture-Seq as a versatile method for targeted identification of full-length isoforms. Nat Commun 2020; 11:2326. [PMID: 32393825 PMCID: PMC7214433 DOI: 10.1038/s41467-020-16174-z] [Citation(s) in RCA: 27] [Impact Index Per Article: 5.4] [Received: 04/29/2019] [Accepted: 04/16/2020] [Indexed: 01/02/2023]
Abstract
Most human protein-coding genes are expressed as multiple isoforms, which greatly expands the functional repertoire of the encoded proteome. While at least one reliable open reading frame (ORF) model has been assigned for every coding gene, the majority of alternative isoforms remains uncharacterized due to (i) vast differences of overall levels between different isoforms expressed from common genes, and (ii) the difficulty of obtaining full-length transcript sequences. Here, we present ORF Capture-Seq (OCS), a flexible method that addresses both challenges for targeted full-length isoform sequencing applications using collections of cloned ORFs as probes. As a proof-of-concept, we show that an OCS pipeline focused on genes coding for transcription factors increases isoform detection by an order of magnitude when compared to unenriched samples. In short, OCS enables rapid discovery of isoforms from custom-selected genes and will accelerate mapping of the human transcriptome.
Affiliation(s)
- Gloria M Sheynkman
- Center for Cancer Systems Biology (CCSB), Dana-Farber Cancer Institute, Boston, MA, 02215, USA.,Department of Genetics, Blavatnik Institute, Harvard Medical School, Boston, MA, 02115, USA.,Department of Cancer Biology, Dana-Farber Cancer Institute, Boston, MA, 02215, USA
| | - Katharine S Tuttle
- Center for Cancer Systems Biology (CCSB), Dana-Farber Cancer Institute, Boston, MA, 02215, USA.,Department of Genetics, Blavatnik Institute, Harvard Medical School, Boston, MA, 02115, USA.,Department of Cancer Biology, Dana-Farber Cancer Institute, Boston, MA, 02215, USA.,Department of Biochemistry, Northeastern University, Boston, MA, 02115, USA.,Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, 10029, USA.,Icahn Institute of Data Science and Genomic Technology, New York, NY, 10029, USA
| | - Florent Laval
- Center for Cancer Systems Biology (CCSB), Dana-Farber Cancer Institute, Boston, MA, 02215, USA.,Department of Genetics, Blavatnik Institute, Harvard Medical School, Boston, MA, 02115, USA.,Department of Cancer Biology, Dana-Farber Cancer Institute, Boston, MA, 02215, USA.,Laboratory of Molecular Biology, TERRA Teaching and Research Centre, Gembloux Agro-Bio Tech, University of Liège, Gembloux, 5030, Belgium.,Laboratory of Molecular and Cellular Epigenetics, GIGA-Cancer, University of Liège, 4000, Liège, Belgium
| | | | | | - Liang Yu
- School of Computer Science and Technology, Xidian University, Xi'an, 710071, China
| | - Da Dong
- School of Computer Science and Technology, Xidian University, Xi'an, 710071, China
| | - Melissa L Smith
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, 10029, USA.,Icahn Institute of Data Science and Genomic Technology, New York, NY, 10029, USA
| | - Robert Sebra
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, 10029, USA.,Icahn Institute of Data Science and Genomic Technology, New York, NY, 10029, USA
| | - Luc Willems
- Laboratory of Molecular Biology, TERRA Teaching and Research Centre, Gembloux Agro-Bio Tech, University of Liège, Gembloux, 5030, Belgium.,Laboratory of Molecular and Cellular Epigenetics, GIGA-Cancer, University of Liège, 4000, Liège, Belgium
| | - Tong Hao
- Center for Cancer Systems Biology (CCSB), Dana-Farber Cancer Institute, Boston, MA, 02215, USA.,Department of Genetics, Blavatnik Institute, Harvard Medical School, Boston, MA, 02115, USA.,Department of Cancer Biology, Dana-Farber Cancer Institute, Boston, MA, 02215, USA
| | - Michael A Calderwood
- Center for Cancer Systems Biology (CCSB), Dana-Farber Cancer Institute, Boston, MA, 02215, USA.,Department of Genetics, Blavatnik Institute, Harvard Medical School, Boston, MA, 02115, USA.,Department of Cancer Biology, Dana-Farber Cancer Institute, Boston, MA, 02215, USA
| | - David E Hill
- Center for Cancer Systems Biology (CCSB), Dana-Farber Cancer Institute, Boston, MA, 02215, USA.,Department of Genetics, Blavatnik Institute, Harvard Medical School, Boston, MA, 02115, USA.,Department of Cancer Biology, Dana-Farber Cancer Institute, Boston, MA, 02215, USA
| | - Marc Vidal
- Center for Cancer Systems Biology (CCSB), Dana-Farber Cancer Institute, Boston, MA, 02215, USA.,Department of Genetics, Blavatnik Institute, Harvard Medical School, Boston, MA, 02115, USA
| |
|
25
|
Martín-Vide C, Vega-Rodríguez MA, Wheeler T. BOAssembler: A Bayesian Optimization Framework to Improve RNA-Seq Assembly Performance. ALGORITHMS FOR COMPUTATIONAL BIOLOGY 2020. [PMCID: PMC7197064 DOI: 10.1007/978-3-030-42266-0_15] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 11/26/2022]
|
27
|
Rapazote-Flores P, Bayer M, Milne L, Mayer CD, Fuller J, Guo W, Hedley PE, Morris J, Halpin C, Kam J, McKim SM, Zwirek M, Casao MC, Barakate A, Schreiber M, Stephen G, Zhang R, Brown JWS, Waugh R, Simpson CG. BaRTv1.0: an improved barley reference transcript dataset to determine accurate changes in the barley transcriptome using RNA-seq. BMC Genomics 2019; 20:968. [PMID: 31829136 PMCID: PMC6907147 DOI: 10.1186/s12864-019-6243-7] [Citation(s) in RCA: 39] [Impact Index Per Article: 6.5] [Received: 06/06/2019] [Accepted: 10/29/2019] [Indexed: 12/15/2022]
Abstract
BACKGROUND The time required to analyse RNA-seq data varies considerably, due to discrete steps for computational assembly, quantification of gene expression and splicing analysis. Recent fast non-alignment tools such as Kallisto and Salmon overcome these problems, but these tools require a high quality, comprehensive reference transcripts dataset (RTD), which are rarely available in plants. RESULTS A high-quality, non-redundant barley gene RTD and database (Barley Reference Transcripts - BaRTv1.0) has been generated. BaRTv1.0, was constructed from a range of tissues, cultivars and abiotic treatments and transcripts assembled and aligned to the barley cv. Morex reference genome (Mascher et al. Nature; 544: 427-433, 2017). Full-length cDNAs from the barley variety Haruna nijo (Matsumoto et al. Plant Physiol; 156: 20-28, 2011) determined transcript coverage, and high-resolution RT-PCR validated alternatively spliced (AS) transcripts of 86 genes in five different organs and tissue. These methods were used as benchmarks to select an optimal barley RTD. BaRTv1.0-Quantification of Alternatively Spliced Isoforms (QUASI) was also made to overcome inaccurate quantification due to variation in 5' and 3' UTR ends of transcripts. BaRTv1.0-QUASI was used for accurate transcript quantification of RNA-seq data of five barley organs/tissues. This analysis identified 20,972 significant differentially expressed genes, 2791 differentially alternatively spliced genes and 2768 transcripts with differential transcript usage. CONCLUSION A high confidence barley reference transcript dataset consisting of 60,444 genes with 177,240 transcripts has been generated. Compared to current barley transcripts, BaRTv1.0 transcripts are generally longer, have less fragmentation and improved gene models that are well supported by splice junction reads. Precise transcript quantification using BaRTv1.0 allows routine analysis of gene expression and AS.
Affiliation(s)
- Paulo Rapazote-Flores
- Information and Computational Sciences, James Hutton Institute, Invergowrie, Dundee, DD2 5DA, UK
| | - Micha Bayer
- Information and Computational Sciences, James Hutton Institute, Invergowrie, Dundee, DD2 5DA, UK
| | - Linda Milne
- Information and Computational Sciences, James Hutton Institute, Invergowrie, Dundee, DD2 5DA, UK
| | | | - John Fuller
- Cell and Molecular Sciences, The James Hutton Institute, Invergowrie, Dundee, DD2 5DA, UK
| | - Wenbin Guo
- Division of Plant Sciences, School of Life Sciences, University of Dundee at the James Hutton Institute, Dundee, DD2 5DA, UK
| | - Pete E Hedley
- Cell and Molecular Sciences, The James Hutton Institute, Invergowrie, Dundee, DD2 5DA, UK
| | - Jenny Morris
- Cell and Molecular Sciences, The James Hutton Institute, Invergowrie, Dundee, DD2 5DA, UK
| | - Claire Halpin
- Division of Plant Sciences, School of Life Sciences, University of Dundee at the James Hutton Institute, Dundee, DD2 5DA, UK
| | - Jason Kam
- Division of Plant Sciences, School of Life Sciences, University of Dundee at the James Hutton Institute, Dundee, DD2 5DA, UK
- Present address: Institute of Biological, Environmental and Rural Sciences, Aberystwyth University, Gogerddan, Aberystwyth, Ceredigion, SY23 3EB, UK
| | - Sarah M McKim
- Division of Plant Sciences, School of Life Sciences, University of Dundee at the James Hutton Institute, Dundee, DD2 5DA, UK
| | - Monika Zwirek
- Division of Plant Sciences, School of Life Sciences, University of Dundee at the James Hutton Institute, Dundee, DD2 5DA, UK
- Present Address: MRC Protein Phosphorylation and Ubiquitylation Unit, Sir James Black Centre, School of Life Sciences, University of Dundee, Dundee, DD1 5EH, UK
| | - M Cristina Casao
- Division of Plant Sciences, School of Life Sciences, University of Dundee at the James Hutton Institute, Dundee, DD2 5DA, UK
| | - Abdellah Barakate
- Cell and Molecular Sciences, The James Hutton Institute, Invergowrie, Dundee, DD2 5DA, UK
| | - Miriam Schreiber
- Cell and Molecular Sciences, The James Hutton Institute, Invergowrie, Dundee, DD2 5DA, UK
| | - Gordon Stephen
- Information and Computational Sciences, James Hutton Institute, Invergowrie, Dundee, DD2 5DA, UK
| | - Runxuan Zhang
- Information and Computational Sciences, James Hutton Institute, Invergowrie, Dundee, DD2 5DA, UK
| | - John W S Brown
- Cell and Molecular Sciences, The James Hutton Institute, Invergowrie, Dundee, DD2 5DA, UK
- Division of Plant Sciences, School of Life Sciences, University of Dundee at the James Hutton Institute, Dundee, DD2 5DA, UK
| | - Robbie Waugh
- Cell and Molecular Sciences, The James Hutton Institute, Invergowrie, Dundee, DD2 5DA, UK
- Division of Plant Sciences, School of Life Sciences, University of Dundee at the James Hutton Institute, Dundee, DD2 5DA, UK
| | - Craig G Simpson
- Cell and Molecular Sciences, The James Hutton Institute, Invergowrie, Dundee, DD2 5DA, UK.
| |
|
28
|
Pinskaya M, Saci Z, Gallopin M, Gabriel M, Nguyen HT, Firlej V, Descrimes M, Rapinat A, Gentien D, Taille ADL, Londoño-Vallejo A, Allory Y, Gautheret D, Morillon A. Reference-free transcriptome exploration reveals novel RNAs for prostate cancer diagnosis. Life Sci Alliance 2019; 2:2/6/e201900449. [PMID: 31732695 PMCID: PMC6858606 DOI: 10.26508/lsa.201900449] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Received: 06/04/2019] [Revised: 11/05/2019] [Accepted: 11/05/2019] [Indexed: 12/24/2022]
Abstract
The use of RNA-sequencing technologies held a promise of improved diagnostic tools based on comprehensive transcript sets. However, mining human transcriptome data for disease biomarkers in clinical specimens is restricted by the limited power of conventional reference-based protocols relying on unique and annotated transcripts. Here, we implemented a blind reference-free computational protocol, DE-kupl, to infer yet unreferenced RNA variations from total stranded RNA-sequencing datasets of tissue origin. As a bench test, this protocol was powered for detection of RNA subsequences embedded into putative long noncoding (lnc)RNAs expressed in prostate cancer. Through filtering of 1,179 candidates, we defined 21 lncRNAs that were further validated by NanoString for robust tumor-specific expression in 144 tissue specimens. Predictive modeling yielded a restricted probe panel enabling more than 90% of true-positive detections of cancer in an independent The Cancer Genome Atlas cohort. Remarkably, this clinical signature made of only nine unannotated lncRNAs largely outperformed PCA3, the only used prostate cancer lncRNA biomarker, in detection of high-risk tumors. This modular workflow is highly sensitive and can be applied to any pathology or clinical application.
Affiliation(s)
- Marina Pinskaya
- ncRNA, Epigenetic and Genome Fluidity, Université Paris Sciences & Lettres (PSL), Sorbonne Université, Centre National de la Recherche Scientifique (CNRS), Institut Curie, Research Center, Paris, France
| | - Zohra Saci
- ncRNA, Epigenetic and Genome Fluidity, Université Paris Sciences & Lettres (PSL), Sorbonne Université, Centre National de la Recherche Scientifique (CNRS), Institut Curie, Research Center, Paris, France
| | - Mélina Gallopin
- Institute for Integrative Biology of the Cell, Commissariat à l'Energie Atomique, CNRS, Université Paris-Sud, Université Paris-Saclay, Gif sur Yvette, France
| | - Marc Gabriel
- ncRNA, Epigenetic and Genome Fluidity, Université Paris Sciences & Lettres (PSL), Sorbonne Université, Centre National de la Recherche Scientifique (CNRS), Institut Curie, Research Center, Paris, France
| | - Ha Tn Nguyen
- Institute for Integrative Biology of the Cell, Commissariat à l'Energie Atomique, CNRS, Université Paris-Sud, Université Paris-Saclay, Gif sur Yvette, France.,Thuyloi University, Hanoi, Vietnam
| | - Virginie Firlej
- Université Paris-Est Créteil, Créteil, France.,Institut National de la Santé et de la Recherche Médicale, U955, Equipe 7, Créteil, France
| | - Marc Descrimes
- ncRNA, Epigenetic and Genome Fluidity, Université Paris Sciences & Lettres (PSL), Sorbonne Université, Centre National de la Recherche Scientifique (CNRS), Institut Curie, Research Center, Paris, France
| | - Audrey Rapinat
- Translational Research Department, Genomics Platform, Institut Curie, Université PSL, Paris, France
| | - David Gentien
- Translational Research Department, Genomics Platform, Institut Curie, Université PSL, Paris, France
| | - Alexandre de la Taille
- Université Paris-Est Créteil, Créteil, France.,Institut National de la Santé et de la Recherche Médicale, U955, Equipe 7, Créteil, France.,Assistance Publique - Hôpitaux de Paris, Hôpital Henri Mondor, Département d'Urologie, Créteil, France
| | - Arturo Londoño-Vallejo
- Telomeres and Cancer, Université PSL, Sorbonne Université, CNRS, Institut Curie, Research Center, Paris, France
| | - Yves Allory
- Compartimentation et Dynamique Cellulaire, Université PSL, Sorbonne Université, CNRS, Institut Curie, Research Center, Paris, France
| | - Daniel Gautheret
- Institute for Integrative Biology of the Cell, Commissariat à l'Energie Atomique, CNRS, Université Paris-Sud, Université Paris-Saclay, Gif sur Yvette, France
| | - Antonin Morillon
- ncRNA, Epigenetic and Genome Fluidity, Université Paris Sciences & Lettres (PSL), Sorbonne Université, Centre National de la Recherche Scientifique (CNRS), Institut Curie, Research Center, Paris, France
| |
|
29
|
Doose G, Bernhart SH, Wagener R, Hoffmann S. DIEGO: detection of differential alternative splicing using Aitchison's geometry. Bioinformatics 2019; 34:1066-1068. [PMID: 29088309 PMCID: PMC5860559 DOI: 10.1093/bioinformatics/btx690] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 05/23/2017] [Accepted: 10/26/2017] [Indexed: 11/13/2022]
Abstract
Motivation Alternative splicing is a biological process of fundamental importance in most eukaryotes. It plays a pivotal role in cell differentiation and gene regulation and has been associated with a number of different diseases. The widespread availability of RNA-Sequencing capacities allows an ever closer investigation of differentially expressed isoforms. However, most tools for differential alternative splicing (DAS) analysis do not take split reads, i.e. the most direct evidence for a splice event, into account. Here, we present DIEGO, a compositional data analysis method able to detect DAS between two sets of RNA-Seq samples based on split reads. Results The python tool DIEGO works without isoform annotations and is fast enough to analyze large experiments while being robust and accurate. We provide python and perl parsers for common formats. Availability and implementation The software is available at: www.bioinf.uni-leipzig.de/Software/DIEGO. Supplementary information Supplementary data are available at Bioinformatics online.
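To make the compositional view concrete, the sketch below applies a centered log-ratio (CLR) transform, a standard operation in Aitchison geometry, to the split-read counts of one sample. It is an illustration only and not DIEGO's implementation: the function name `clr`, the pseudocount of 0.5, and the example counts are assumptions, and the abstract does not state which statistic DIEGO computes.

```python
import math


def clr(counts, pseudocount=0.5):
    """Centered log-ratio transform of one sample's split-read counts.

    Each count is expressed relative to the geometric mean of all counts
    in the sample, so only the relative usage of the junctions matters.
    The pseudocount (an assumption of this sketch) avoids log(0).
    """
    logs = [math.log(count + pseudocount) for count in counts]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


# Hypothetical split-read counts for three junctions of one gene.
sample = [10, 20, 30]
print(clr(sample))

# Without a pseudocount the transform ignores sequencing depth:
# scaling every count by the same factor leaves the result unchanged.
print(clr([100, 200, 300], pseudocount=0))
```

CLR values always sum to zero within a sample, which is why two samples with different read depth but the same junction proportions map to the same point.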
Affiliation(s)
- Gero Doose
- Transcriptome Bioinformatics Group, Interdisciplinary Center for Bioinformatics, Leipzig University, 04107 Leipzig.,Chair of Bioinformatics, Faculty of Mathematics and Computer Science, Leipzig University, 04107 Leipzig, Germany.,ecSeq Bioinformatics, 04103 Leipzig, Germany
| | - Stephan H Bernhart
- Transcriptome Bioinformatics Group, Interdisciplinary Center for Bioinformatics, Leipzig University, 04107 Leipzig.,Chair of Bioinformatics, Faculty of Mathematics and Computer Science, Leipzig University, 04107 Leipzig, Germany
| | - Rabea Wagener
- Institute of Human Genetics, University of Ulm and University of Ulm Medical Center, 89081 Ulm, Germany
| | - Steve Hoffmann
- Transcriptome Bioinformatics Group, Interdisciplinary Center for Bioinformatics, Leipzig University, 04107 Leipzig.,Chair of Bioinformatics, Faculty of Mathematics and Computer Science, Leipzig University, 04107 Leipzig, Germany.,Computational Biology Group, Leibniz Institute on Ageing - Fritz Lipmann Institute (FLI) and Friedrich-Schiller-University Jena, 07745 Jena, Germany
| |
|
30
|
Abstract
Genetic, transcriptional, and post-transcriptional variations shape the transcriptome of individual cells, rendering establishing an exhaustive set of reference RNAs a complicated matter. Current reference transcriptomes, which are based on carefully curated transcripts, are lagging behind the extensive RNA variation revealed by massively parallel sequencing. Much may be missed by ignoring this unreferenced RNA diversity. There is plentiful evidence for non-reference transcripts with important phenotypic effects. Although reference transcriptomes are inestimable for gene expression analysis, they may turn limiting in important medical applications. We discuss computational strategies for retrieving hidden transcript diversity.
Affiliation(s)
- Antonin Morillon
- ncRNA, Epigenetic and Genome Fluidity, CNRS UMR 3244, Sorbonne Université, PSL University, Institut Curie, Centre de Recherche, 26 rue d'Ulm, 75248, Paris, France
| | - Daniel Gautheret
- Institute for Integrative Biology of the Cell, CEA, CNRS, Université Paris-Sud, Université Paris Saclay, Gif sur Yvette, France.
| |
|
31
|
Gatter T, Stadler PF. Ryūtō: network-flow based transcriptome reconstruction. BMC Bioinformatics 2019; 20:190. [PMID: 30991937 PMCID: PMC6469118 DOI: 10.1186/s12859-019-2786-5] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.5] [Received: 01/28/2019] [Accepted: 04/01/2019] [Indexed: 01/22/2023]
Abstract
BACKGROUND The rapid increase in High-throughput sequencing of RNA (RNA-seq) has led to tremendous improvements in the detection and reconstruction of both expressed coding and non-coding RNA transcripts. Yet, the complete and accurate annotation of the complex transcriptional output of not only the human genome has remained elusive. One of the critical bottlenecks in this endeavor is the computational reconstruction of transcript structures, due to high noise levels, technological limits, and other biases in the raw data. RESULTS We introduce several new and improved algorithms in a novel workflow for transcript assembly and quantification. We propose an extension of the common splice graph framework that combines aspects of overlap and bin graphs and makes it possible to efficiently use both multi-splice and paired-end information to the fullest extent. Phasing information of reads is used to further resolve loci. The decomposition of read coverage patterns is modeled as a minimum-cost flow problem to account for the unavoidable non-uniformities of RNA-seq data. CONCLUSION Its performance compares favorably with state of the art methods on both simulated and real-life datasets. Ryūtō calls 1-4% more true transcripts, while calling 5-35% less false predictions compared to the next best competitor.
Affiliation(s)
- Thomas Gatter
- Bioinformatics Group, Department of Computer Science & Interdisciplinary Center for Bioinformatics, Universität Leipzig, Härtelstraße 16-18, Leipzig, 04107, Germany.
| | - Peter F Stadler
- Bioinformatics Group, Department of Computer Science & Interdisciplinary Center for Bioinformatics, Universität Leipzig, Härtelstraße 16-18, Leipzig, 04107, Germany.,Max Planck Institute for Mathematics in the Sciences, Inselstraße 22, Leipzig, 04103, Germany.,Institute for Theoretical Chemistry, University of Vienna, Währingerstrasse 17, Wien, 1090, Austria.,Facultad de Ciencias, Universidad Nacional de Colombia, Sede Bogotá, Ciudad Universitaria, Bogotá, D.C., COL-111321, Colombia.,Santa Fe Institute, 1399 Hyde Park Rd., Santa Fe, NM 87501, USA
| |
|
32
|
Shao M, Kingsford C. Theory and A Heuristic for the Minimum Path Flow Decomposition Problem. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2019; 16:658-670. [PMID: 29990201 DOI: 10.1109/tcbb.2017.2779509] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Indexed: 06/08/2023]
Abstract
Motivated by multiple genome assembly problems and other applications, we study the following minimum path flow decomposition problem: Given a directed acyclic graph $G=(V,E)$ with source $s$ and sink $t$ and a flow $f$, compute a set of $s$-$t$ paths $P$ and assign weight $w(p)$ for $p\in P$ such that $f(e) = \sum_{p\in P: e\in p} w(p)$, $\forall e\in E$, and $|P|$ is minimized. We develop some fundamental theory for this problem, upon which we design an efficient heuristic. Specifically, we prove that the gap between the optimal number of paths and a known upper bound is determined by the nontrivial equations within the flow values. This result gives rise to the framework of our heuristic: to iteratively reduce the gap through identifying such equations. We also define an operation on certain independent substructures of the graph, and prove that this operation does not affect the optimality but can transform the graph into one with desired property that facilitates reducing the gap. We apply and test our algorithm on both simulated random instances and perfect splice graph instances, and also compare it with the existing state-of-art algorithm for flow decomposition. The results illustrate that our algorithm can achieve very high accuracy on these instances, and also that our algorithm significantly improves on the previous algorithms. An implementation of our algorithm is freely available at https://github.com/Kingsford-Group/catfish.
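The problem statement above can be illustrated with a small sketch. The code below is not the authors' heuristic (Catfish); it is the simpler greedy-width baseline, which repeatedly removes the source-sink path with the largest bottleneck flow. All function names and the example graph are hypothetical, and the sketch assumes integer flows that satisfy conservation on a DAG with distinct source and sink.

```python
from collections import defaultdict


def widest_path(residual, source, sink):
    """Return the source-sink path with the largest bottleneck, or None."""
    successors = defaultdict(list)
    for (u, v), value in residual.items():
        if value > 0:
            successors[u].append(v)

    best = {sink: (float("inf"), None)}  # node -> (bottleneck, next node)

    def visit(u):
        # Memoised search; valid because the graph is acyclic.
        if u not in best:
            choice = (0, None)
            for v in successors[u]:
                width = min(residual[(u, v)], visit(v)[0])
                if width > choice[0]:
                    choice = (width, v)
            best[u] = choice
        return best[u]

    if visit(source)[0] == 0:
        return None
    path = [source]
    while path[-1] != sink:
        path.append(best[path[-1]][1])
    return path


def greedy_width_decomposition(flow, source, sink):
    """Decompose a flow into weighted s-t paths; |P| is not guaranteed minimal."""
    residual = dict(flow)
    paths = []
    while (path := widest_path(residual, source, sink)) is not None:
        edges = list(zip(path, path[1:]))
        weight = min(residual[edge] for edge in edges)
        for edge in edges:
            residual[edge] -= weight
        paths.append((path, weight))
    return paths


# Hypothetical splice-graph-like flow: f(e) for each edge e.
flow = {("s", "a"): 5, ("s", "b"): 3, ("a", "t"): 2, ("a", "b"): 3, ("b", "t"): 6}
for path, weight in greedy_width_decomposition(flow, "s", "t"):
    print("-".join(path), weight)
```

Each iteration drives at least one edge to zero, so the loop terminates, and summing $w(p)$ over the returned paths that use an edge reproduces $f(e)$. The number of paths is only an upper bound on the optimum that the paper targets.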
|
33
|
Rao MS, Van Vleet TR, Ciurlionis R, Buck WR, Mittelstadt SW, Blomme EAG, Liguori MJ. Comparison of RNA-Seq and Microarray Gene Expression Platforms for the Toxicogenomic Evaluation of Liver From Short-Term Rat Toxicity Studies. Front Genet 2019; 9:636. [PMID: 30723492 PMCID: PMC6349826 DOI: 10.3389/fgene.2018.00636] [Citation(s) in RCA: 141] [Impact Index Per Article: 23.5] [Received: 09/15/2018] [Accepted: 11/27/2018] [Indexed: 12/12/2022]
Abstract
Gene expression profiling is a useful tool to predict and interrogate mechanisms of toxicity. RNA-Seq technology has emerged as an attractive alternative to traditional microarray platforms for conducting transcriptional profiling. The objective of this work was to compare both transcriptomic platforms to determine whether RNA-Seq offered significant advantages over microarrays for toxicogenomic studies. RNA samples from the livers of rats treated for 5 days with five tool hepatotoxicants (α-naphthylisothiocyanate/ANIT, carbon tetrachloride/CCl4, methylenedianiline/MDA, acetaminophen/APAP, and diclofenac/DCLF) were analyzed with both gene expression platforms (RNA-Seq and microarray). Data were compared to determine any potential added scientific (i.e., better biological or toxicological insight) value offered by RNA-Seq compared to microarrays. RNA-Seq identified more differentially expressed protein-coding genes and provided a wider quantitative range of expression level changes when compared to microarrays. Both platforms identified a larger number of differentially expressed genes (DEGs) in livers of rats treated with ANIT, MDA, and CCl4 compared to APAP and DCLF, in agreement with the severity of histopathological findings. Approximately 78% of DEGs identified with microarrays overlapped with RNA-Seq data, with a Spearman’s correlation of 0.7 to 0.83. Consistent with the mechanisms of toxicity of ANIT, APAP, MDA and CCl4, both platforms identified dysregulation of liver relevant pathways such as Nrf2, cholesterol biosynthesis, eiF2, hepatic cholestasis, glutathione and LPS/IL-1 mediated RXR inhibition. RNA-Seq data showed additional DEGs that not only significantly enriched these pathways, but also suggested modulation of additional liver relevant pathways. In addition, RNA-Seq enabled the identification of non-coding DEGs that offer a potential for improved mechanistic clarity. 
Overall, these results indicate that RNA-Seq is an acceptable alternative platform to microarrays for rat toxicogenomic studies with several advantages. Because of its wider dynamic range as well as its ability to identify a larger number of DEGs, RNA-Seq may generate more insight into mechanisms of toxicity. However, more extensive reference data will be necessary to fully leverage these additional RNA-Seq data, especially for non-coding sequences.
Affiliation(s)
- Mohan S Rao
- Investigative Toxicology and Pathology, Global Preclinical Safety, AbbVie, North Chicago, IL, United States
| | - Terry R Van Vleet
- Investigative Toxicology and Pathology, Global Preclinical Safety, AbbVie, North Chicago, IL, United States
| | - Rita Ciurlionis
- Investigative Toxicology and Pathology, Global Preclinical Safety, AbbVie, North Chicago, IL, United States
| | - Wayne R Buck
- Investigative Toxicology and Pathology, Global Preclinical Safety, AbbVie, North Chicago, IL, United States
| | - Scott W Mittelstadt
- Investigative Toxicology and Pathology, Global Preclinical Safety, AbbVie, North Chicago, IL, United States
| | - Eric A G Blomme
- Investigative Toxicology and Pathology, Global Preclinical Safety, AbbVie, North Chicago, IL, United States
| | - Michael J Liguori
- Investigative Toxicology and Pathology, Global Preclinical Safety, AbbVie, North Chicago, IL, United States
| |
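The platform comparison reported in the Rao et al. abstract above (share of microarray DEGs also found by RNA-Seq, plus a Spearman correlation) can be illustrated with a minimal Python sketch. It assumes each platform yields a mapping from gene identifier to log2 fold change for its DEGs, and that the correlation is computed on fold changes of the shared genes; the abstract does not state this, and the function names `average_ranks`, `spearman` and `platform_agreement` are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with >= 2 values")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def platform_agreement(array_lfc, rnaseq_lfc):
    """Return (fraction of microarray DEGs also in RNA-Seq, Spearman rho).

    Both arguments are dicts of gene id -> log2 fold change (assumed input).
    """
    shared = sorted(set(array_lfc) & set(rnaseq_lfc))
    overlap = len(shared) / len(array_lfc)
    rho = spearman([array_lfc[g] for g in shared],
                   [rnaseq_lfc[g] for g in shared])
    return overlap, rho
```

The overlap is expressed relative to the microarray DEG list, matching the direction of the comparison in the abstract; a real analysis would more likely rely on an established statistics library than on this hand-written ranking.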
|
34
|
Gruber AJ, Gypas F, Riba A, Schmidt R, Zavolan M. Terminal exon characterization with TECtool reveals an abundance of cell-specific isoforms. Nat Methods 2018; 15:832-836. [PMID: 30202060 PMCID: PMC7611301 DOI: 10.1038/s41592-018-0114-z] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.9] [Received: 11/06/2017] [Accepted: 07/10/2018] [Indexed: 11/23/2022]
Abstract
Sequencing of RNA 3' ends has uncovered numerous sites that do not correspond to the termination sites of known transcripts. Through their 3' untranslated regions, protein-coding RNAs interact with RNA-binding proteins and microRNAs, which regulate many properties, including RNA stability and subcellular localization. We developed the terminal exon characterization (TEC) tool ( http://tectool.unibas.ch ), which can be used with RNA-sequencing data from any species for which a genome annotation that includes sites of RNA cleavage and polyadenylation is available. We discovered hundreds of previously unknown isoforms and cell-type-specific terminal exons in human cells. Ribosome profiling data revealed that many of these isoforms were translated. By applying TECtool to single-cell sequencing data, we found that the newly identified isoforms were expressed in subpopulations of cells. Thus, TECtool enables the identification of previously unknown isoforms in well-studied cell systems and in rare cell types.
Affiliation(s)
- Andreas J Gruber
- Oxford Big Data Institute, Nuffield Department of Medicine, University of Oxford, Oxford, UK.
| | - Foivos Gypas
- Computational and Systems Biology, Biozentrum, University of Basel, Basel, Switzerland
| | - Andrea Riba
- Institut de Génétique et de Biologie Moléculaire et Cellulaire, Illkirch, France
| | - Ralf Schmidt
- Computational and Systems Biology, Biozentrum, University of Basel, Basel, Switzerland
| | - Mihaela Zavolan
- Computational and Systems Biology, Biozentrum, University of Basel, Basel, Switzerland.
| |
|
35
|
Event Analysis: Using Transcript Events To Improve Estimates of Abundance in RNA-seq Data. G3-GENES GENOMES GENETICS 2018; 8:2923-2940. [PMID: 30021829 PMCID: PMC6118309 DOI: 10.1534/g3.118.200373] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Indexed: 02/07/2023]
Abstract
Alternative splicing leverages genomic content by allowing the synthesis of multiple transcripts and, by implication, protein isoforms, from a single gene. However, estimating the abundance of transcripts produced in a given tissue from short sequencing reads is difficult and can result in both the construction of transcripts that do not exist, and the failure to identify true transcripts. An alternative approach is to catalog the events that make up isoforms (splice junctions and exons). We present here the Event Analysis (EA) approach, where we project transcripts onto the genome and identify overlapping/unique regions and junctions. In addition, all possible logical junctions are assembled into a catalog. Transcripts are filtered before quantitation based on simple measures: the proportion of the events detected, and the coverage. We find that mapping to a junction catalog is more efficient at detecting novel junctions than mapping in a splice aware manner. We identify 99.8% of true transcripts while iReckon identifies 82% of the true transcripts and creates more transcripts not included in the simulation than were initially used in the simulation. Using PacBio Iso-seq data from a mouse neural progenitor cell model, EA detects 60% of the novel junctions that are combinations of existing exons while only 43% are detected by STAR. EA further detects ∼5,000 annotated junctions missed by STAR. Filtering transcripts based on the proportion of the transcript detected and the number of reads on average supporting that transcript captures 95% of the PacBio transcriptome. Filtering the reference transcriptome before quantitation results in a more stable estimate of isoform abundance, with improved correlation between replicates.
This was particularly evident when EA was applied to an RNA-seq study of type 1 diabetes (T1D), where the coefficient of variation among subjects (n = 81) in the transcript abundance estimates was substantially reduced compared to the estimation using the full reference. EA focuses on individual transcriptional events. These events can be quantitated and analyzed directly or used to identify the probable set of expressed transcripts. Simple rules based on detected events and coverage used in filtering result in a dramatic improvement in isoform estimation without the use of ancillary data (e.g., ChIP, long reads) that may not be available for many studies.
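The filtering rule described in the Event Analysis abstract above (keep a transcript only if enough of its events are detected and its average read support is sufficient) can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the `TranscriptEvidence` record, the function names, and the thresholds `min_proportion=0.75` and `min_mean_reads=5.0` are all hypothetical, since the abstract gives no cut-off values.

```python
from dataclasses import dataclass


@dataclass
class TranscriptEvidence:
    """Per-transcript summary of event-level support (hypothetical layout)."""
    transcript_id: str
    events_total: int            # exons/junctions that make up the transcript
    events_detected: int         # events with at least one supporting read
    mean_reads_per_event: float  # average read support across its events


def keep_transcript(t, min_proportion=0.75, min_mean_reads=5.0):
    """Apply the two simple measures: proportion of events detected, coverage."""
    if t.events_total == 0:
        return False  # nothing to support the transcript
    proportion = t.events_detected / t.events_total
    return proportion >= min_proportion and t.mean_reads_per_event >= min_mean_reads


def filter_reference(transcripts, min_proportion=0.75, min_mean_reads=5.0):
    """Return the ids of transcripts retained for quantitation."""
    return [
        t.transcript_id
        for t in transcripts
        if keep_transcript(t, min_proportion, min_mean_reads)
    ]
```

The reduced transcript list would then serve as the reference for abundance estimation in place of the full annotation, which is the step the abstract credits with more stable isoform estimates.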
|
36
|
Calixto CPG, Guo W, James AB, Tzioutziou NA, Entizne JC, Panter PE, Knight H, Nimmo HG, Zhang R, Brown JWS. Rapid and Dynamic Alternative Splicing Impacts the Arabidopsis Cold Response Transcriptome. THE PLANT CELL 2018; 30:1424-1444. [PMID: 29764987 PMCID: PMC6096597 DOI: 10.1105/tpc.18.00177] [Citation(s) in RCA: 201] [Impact Index Per Article: 28.7] [Received: 02/26/2018] [Revised: 04/20/2018] [Accepted: 05/10/2018] [Indexed: 05/18/2023]
Abstract
Plants have adapted to tolerate and survive constantly changing environmental conditions by reprogramming gene expression. The dynamics of the contribution of alternative splicing (AS) to stress responses are unknown. RNA-sequencing of a time-series of Arabidopsis thaliana plants exposed to cold determines the timing of significant AS changes. This shows a massive and rapid AS response with coincident waves of transcriptional and AS activity occurring in the first few hours of temperature reduction and further AS throughout the cold. In particular, hundreds of genes showed changes in expression due to rapidly occurring AS in response to cold ("early AS" genes); these included numerous novel cold-responsive transcription factors and splicing factors/RNA binding proteins regulated only by AS. The speed and sensitivity to small temperature changes of AS of some of these genes suggest that fine-tuning expression via AS pathways contributes to the thermo-plasticity of expression. Four early AS splicing regulatory genes have been shown previously to be required for freezing tolerance and acclimation; we provide evidence of a fifth gene, U2B"-LIKE. Such factors likely drive cascades of AS of downstream genes that, alongside transcription, modulate transcriptome reprogramming that together govern the physiological and survival responses of plants to low temperature.
Affiliation(s)
- Cristiane P G Calixto
- Plant Sciences Division, School of Life Sciences, University of Dundee, Dundee DD2 5DA, United Kingdom
| | - Wenbin Guo
- Plant Sciences Division, School of Life Sciences, University of Dundee, Dundee DD2 5DA, United Kingdom
- Information and Computational Sciences, The James Hutton Institute, Dundee DD2 5DA, United Kingdom
| | - Allan B James
- Institute of Molecular, Cell, and Systems Biology, College of Medical, Veterinary and Life Sciences, University of Glasgow, Glasgow G12 8QQ, United Kingdom
| | - Nikoleta A Tzioutziou
- Plant Sciences Division, School of Life Sciences, University of Dundee, Dundee DD2 5DA, United Kingdom
| | - Juan Carlos Entizne
- Plant Sciences Division, School of Life Sciences, University of Dundee, Dundee DD2 5DA, United Kingdom
- Cell and Molecular Sciences, The James Hutton Institute, Dundee DD2 5DA, United Kingdom
| | - Paige E Panter
- Department of Biosciences, Durham University, Durham DH1 3LE, United Kingdom
| | - Heather Knight
- Department of Biosciences, Durham University, Durham DH1 3LE, United Kingdom
| | - Hugh G Nimmo
- Institute of Molecular, Cell, and Systems Biology, College of Medical, Veterinary and Life Sciences, University of Glasgow, Glasgow G12 8QQ, United Kingdom
| | - Runxuan Zhang
- Information and Computational Sciences, The James Hutton Institute, Dundee DD2 5DA, United Kingdom
| | - John W S Brown
- Plant Sciences Division, School of Life Sciences, University of Dundee, Dundee DD2 5DA, United Kingdom
- Cell and Molecular Sciences, The James Hutton Institute, Dundee DD2 5DA, United Kingdom
| |
|
37
|
Abstract
Single-cell RNA sequencing (scRNA-seq) is currently transforming our understanding of biology, as it is a powerful tool to resolve cellular heterogeneity and molecular networks. Over 50 protocols have been developed in recent years, and data processing and analysis tools are also evolving fast. Here, we review the basic principles underlying the different experimental protocols and how to benchmark them. We also review and compare the essential methods to process scRNA-seq data from mapping, filtering, normalization and batch corrections to basic differential expression analysis. We hope that this helps to choose appropriate experimental and computational methods for the research question at hand.
Affiliation(s)
- Christoph Ziegenhain
- Anthropology and Human Genomics, Department of Biology II, Ludwig-Maximilians University, Großhaderner Str. 2, Martinsried, Germany
| | - Beate Vieth
- Anthropology and Human Genomics, Department of Biology II, Ludwig-Maximilians University, Großhaderner Str. 2, Martinsried, Germany
| | - Swati Parekh
- Anthropology and Human Genomics, Department of Biology II, Ludwig-Maximilians University, Großhaderner Str. 2, Martinsried, Germany
| | - Ines Hellmann
- Anthropology and Human Genomics, Department of Biology II, Ludwig-Maximilians University, Großhaderner Str. 2, Martinsried, Germany
| | - Wolfgang Enard
- Anthropology and Human Genomics, Department of Biology II, Ludwig-Maximilians University, Großhaderner Str. 2, Martinsried, Germany
| |
|
38
|
Calixto CPG, Guo W, James AB, Tzioutziou NA, Entizne JC, Panter PE, Knight H, Nimmo HG, Zhang R, Brown JWS. Rapid and Dynamic Alternative Splicing Impacts the Arabidopsis Cold Response Transcriptome. THE PLANT CELL 2018; 30:1424-1444. [PMID: 29764987 DOI: 10.1101/251876] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 02/26/2018] [Revised: 04/20/2018] [Accepted: 05/10/2018] [Indexed: 05/20/2023]
|
39
|
Schou MF, Bechsgaard J, Muñoz J, Kristensen TN. Genome-wide regulatory deterioration impedes adaptive responses to stress in inbred populations of Drosophila melanogaster. Evolution 2018; 72:1614-1628. [PMID: 29738620 DOI: 10.1111/evo.13497] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Received: 11/15/2017] [Revised: 04/21/2018] [Accepted: 05/01/2018] [Indexed: 02/28/2024]
Abstract
Inbreeding depression is often intensified under environmental stress (i.e., inbreeding-stress interaction). Although the fitness consequences of this phenomenon are well-described, underlying mechanisms such as an increased expression of deleterious alleles under stress, or a lower capacity for adaptive responses to stress with inbreeding, have rarely been investigated. We investigated a fitness component (egg-to-adult viability) and gene-expression patterns using RNA-seq analyses in noninbred control lines and in inbred lines of Drosophila melanogaster exposed to benign temperature or heat stress. We find little support for an increase in the cumulative expression of deleterious alleles under stress. Instead, inbred individuals had a reduced ability to induce an adaptive gene regulatory stress response compared to controls. The decrease in egg-to-adult viability due to stress was most pronounced in the lines with the largest deviation in the adaptive stress response (R² = 0.48). Thus, we find strong evidence for a lower capacity of inbred individuals to respond by gene regulation to stress and that this is the main driver of inbreeding-stress interactions. In comparison, the altered gene expression due to inbreeding at benign temperature showed no correlation with fitness and was pronounced in genomic regions experiencing the highest increase in homozygosity.
Affiliation(s)
- Mads F Schou
- Department of Bioscience, Aarhus University, DK-8000 Aarhus C, Denmark
| | - Jesper Bechsgaard
- Department of Bioscience, Aarhus University, DK-8000 Aarhus C, Denmark
| | - Joaquin Muñoz
- Department of Chemistry and Bioscience, Aalborg University, DK-9220 Aalborg East, Denmark
| | - Torsten N Kristensen
- Department of Bioscience, Aarhus University, DK-8000 Aarhus C, Denmark
- Department of Chemistry and Bioscience, Aalborg University, DK-9220 Aalborg East, Denmark
| |
|
40
|
Yi L, Pimentel H, Bray NL, Pachter L. Gene-level differential analysis at transcript-level resolution. Genome Biol 2018; 19:53. [PMID: 29650040 PMCID: PMC5896116 DOI: 10.1186/s13059-018-1419-z] [Citation(s) in RCA: 80] [Impact Index Per Article: 11.4] [Received: 10/05/2017] [Accepted: 03/08/2018] [Indexed: 11/23/2022]
Abstract
Compared to RNA-sequencing transcript differential analysis, gene-level differential expression analysis is more robust and experimentally actionable. However, the use of gene counts for statistical analysis can mask transcript-level dynamics. We demonstrate that ‘analysis first, aggregation second,’ where the p values derived from transcript analysis are aggregated to obtain gene-level results, increases sensitivity and accuracy. The method we propose can also be applied to transcript compatibility counts obtained from pseudoalignment of reads, which circumvents the need for quantification and is fast, accurate, and model-free. The method generalizes to various levels of biology and we showcase an application to gene ontologies.
Affiliation(s)
- Lynn Yi
- UCLA-Caltech Medical Science Training Program, Los Angeles, CA, USA; Division of Biology and Biological Engineering, Caltech, Pasadena, CA, USA
| | - Harold Pimentel
- Department of Genetics, Stanford University, Palo Alto, CA, USA
| | | | - Lior Pachter
- Division of Biology and Biological Engineering, Caltech, Pasadena, CA, USA; Department of Computing and Mathematical Sciences, Caltech, Pasadena, CA, USA
| |
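The "analysis first, aggregation second" idea in the Yi et al. abstract above, where transcript-level p values are combined into one gene-level p value, can be sketched in Python. The abstract does not name the aggregation rule, so this sketch substitutes Fisher's method as a stand-in; the function names and the input layout (a dict of transcript p values plus a transcript-to-gene map) are assumptions made for illustration.

```python
import math


def chi2_sf_even_df(x, df):
    """Chi-square survival function for even df, using its closed form:
    sf(x) = exp(-x/2) * sum_{i=0}^{df/2 - 1} (x/2)^i / i!"""
    if df <= 0 or df % 2 != 0:
        raise ValueError("df must be a positive even integer")
    half = x / 2.0
    term = 1.0
    total = 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


def fisher_combine(p_values):
    """Combine independent p values with Fisher's method (assumed stand-in)."""
    if not p_values:
        raise ValueError("need at least one p value")
    # Clamp to avoid log(0) for p values reported as exactly zero.
    stat = -2.0 * sum(math.log(max(p, 1e-300)) for p in p_values)
    return min(1.0, chi2_sf_even_df(stat, 2 * len(p_values)))


def aggregate_by_gene(transcript_p, tx_to_gene):
    """Group transcript-level p values by gene and combine each group."""
    by_gene = {}
    for tx, p in transcript_p.items():
        by_gene.setdefault(tx_to_gene[tx], []).append(p)
    return {gene: fisher_combine(ps) for gene, ps in by_gene.items()}
```

A gene with a single transcript keeps its transcript p value unchanged, while a gene with several moderately significant transcripts can reach a smaller gene-level p value than any one of them. Fisher's method assumes independent tests, which transcripts of the same gene may not satisfy, so the result should be read as illustrative only.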
|
41
|
Bessière C, Taha M, Petitprez F, Vandel J, Marin JM, Bréhélin L, Lèbre S, Lecellier CH. Probing instructions for expression regulation in gene nucleotide compositions. PLoS Comput Biol 2018; 14:e1005921. [PMID: 29293496 PMCID: PMC5766238 DOI: 10.1371/journal.pcbi.1005921] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Received: 07/11/2017] [Revised: 01/12/2018] [Accepted: 12/10/2017] [Indexed: 01/22/2023]
Abstract
Gene expression is orchestrated by distinct regulatory regions to ensure a wide variety of cell types and functions. A challenge is to identify which regulatory regions are active, what their associated features are and how they work together in each cell type. Several approaches have tackled this problem by modeling gene expression based on epigenetic marks, with the ultimate goal of identifying driving regions and associated genomic variations that are clinically relevant, in particular in precision medicine. However, these models rely on experimental data, which are limited to specific samples (often even to cell lines) and cannot be generated for all regulators and all patients. In addition, we show here that, although these approaches are accurate in predicting gene expression, inference of TF combinations from this type of model is not straightforward. Furthermore, these methods are not designed to capture regulation instructions present at the sequence level, before the binding of regulators or the opening of the chromatin. Here, we probe sequence-level instructions for gene expression and develop a method to explain mRNA levels based solely on nucleotide features. Our method positions nucleotide composition as a critical component of gene expression. Moreover, our approach, able to rank regulatory regions according to their contribution, unveils a strong influence of the gene body sequence, in particular introns. We further provide evidence that the contribution of nucleotide content can be linked to co-regulations associated with genome 3D architecture and to associations of genes within topologically associated domains.
Affiliation(s)
- Chloé Bessière
- IBC, Univ. Montpellier, CNRS, Montpellier, France
- Institut de Génétique Moléculaire de Montpellier, University of Montpellier, CNRS, Montpellier, France
| | - May Taha
- IBC, Univ. Montpellier, CNRS, Montpellier, France
- Institut de Génétique Moléculaire de Montpellier, University of Montpellier, CNRS, Montpellier, France
- IMAG, Univ. Montpellier, CNRS, Montpellier, France
| | - Florent Petitprez
- IBC, Univ. Montpellier, CNRS, Montpellier, France
- Institut de Génétique Moléculaire de Montpellier, University of Montpellier, CNRS, Montpellier, France
| | - Jimmy Vandel
- IBC, Univ. Montpellier, CNRS, Montpellier, France
- LIRMM, Univ. Montpellier, CNRS, Montpellier, France
| | - Jean-Michel Marin
- IBC, Univ. Montpellier, CNRS, Montpellier, France
- IMAG, Univ. Montpellier, CNRS, Montpellier, France
| | - Laurent Bréhélin
- IBC, Univ. Montpellier, CNRS, Montpellier, France
- LIRMM, Univ. Montpellier, CNRS, Montpellier, France
| | - Sophie Lèbre
- IBC, Univ. Montpellier, CNRS, Montpellier, France
- IMAG, Univ. Montpellier, CNRS, Montpellier, France
- Univ. Paul-Valéry-Montpellier 3, Montpellier, France
| | - Charles-Henri Lecellier
- IBC, Univ. Montpellier, CNRS, Montpellier, France
- Institut de Génétique Moléculaire de Montpellier, University of Montpellier, CNRS, Montpellier, France
| |
|
42
|
Accurate assembly of transcripts through phase-preserving graph decomposition. Nat Biotechnol 2017; 35:1167-1169. [PMID: 29131147 PMCID: PMC5722698 DOI: 10.1038/nbt.4020] [Citation(s) in RCA: 136] [Impact Index Per Article: 17.0] [Received: 04/11/2017] [Accepted: 10/20/2017] [Indexed: 01/06/2023]
Abstract
We introduce Scallop, an accurate reference-based transcript assembler that improves reconstruction of multi-exon and lowly expressed transcripts. Scallop preserves long-range phasing paths extracted from reads, while producing a parsimonious set of transcripts and minimizing coverage deviation. On 10 human RNA-seq samples, Scallop produces 34.5% and 36.3% more correct multi-exon transcripts than StringTie and TransComb, and respectively identifies 67.5% and 52.3% more lowly expressed transcripts. Scallop achieves higher sensitivity and precision than previous approaches over a wide range of coverage thresholds.
|
43
|
Zhang R, Calixto CPG, Marquez Y, Venhuizen P, Tzioutziou NA, Guo W, Spensley M, Entizne JC, Lewandowska D, Ten Have S, Frei Dit Frey N, Hirt H, James AB, Nimmo HG, Barta A, Kalyna M, Brown JWS. A high quality Arabidopsis transcriptome for accurate transcript-level analysis of alternative splicing. Nucleic Acids Res 2017; 45:5061-5073. [PMID: 28402429 PMCID: PMC5435985 DOI: 10.1093/nar/gkx267] [Citation(s) in RCA: 183] [Impact Index Per Article: 22.9] [Received: 10/21/2016] [Accepted: 04/04/2017] [Indexed: 12/30/2022]
Abstract
Alternative splicing generates multiple transcript and protein isoforms from the same gene and thus is important in gene expression regulation. To date, RNA-sequencing (RNA-seq) is the standard method for quantifying changes in alternative splicing on a genome-wide scale. Understanding the current limitations of RNA-seq is crucial for reliable analysis and the lack of high quality, comprehensive transcriptomes for most species, including model organisms such as Arabidopsis, is a major constraint in accurate quantification of transcript isoforms. To address this, we designed a novel pipeline with stringent filters and assembled a comprehensive Reference Transcript Dataset for Arabidopsis (AtRTD2) containing 82,190 non-redundant transcripts from 34,212 genes. Extensive experimental validation showed that AtRTD2 and its modified version, AtRTD2-QUASI, for use in Quantification of Alternatively Spliced Isoforms, outperform other available transcriptomes in RNA-seq analysis. This strategy can be implemented in other species to build a pipeline for transcript-level expression and alternative splicing analyses.
Affiliation(s)
- Runxuan Zhang
- Informatics and Computational Sciences, The James Hutton Institute, Invergowrie, Dundee DD2 5DA, UK
| | - Cristiane P G Calixto
- Plant Sciences Division, College of Life Sciences, University of Dundee, Invergowrie, Dundee DD2 5DA, UK
| | - Yamile Marquez
- Max F. Perutz Laboratories, Medical University of Vienna, Dr. Bohrgasse 9/3, 1030 Vienna, Austria
| | - Peter Venhuizen
- Max F. Perutz Laboratories, Medical University of Vienna, Dr. Bohrgasse 9/3, 1030 Vienna, Austria
| | - Nikoleta A Tzioutziou
- Plant Sciences Division, College of Life Sciences, University of Dundee, Invergowrie, Dundee DD2 5DA, UK
| | - Wenbin Guo
- Informatics and Computational Sciences, The James Hutton Institute, Invergowrie, Dundee DD2 5DA, UK; Plant Sciences Division, College of Life Sciences, University of Dundee, Invergowrie, Dundee DD2 5DA, UK
| | - Mark Spensley
- The Donnelly Centre, University of Toronto, 160 College Street, Toronto, Ontario, Canada
| | - Juan Carlos Entizne
- Plant Sciences Division, College of Life Sciences, University of Dundee, Invergowrie, Dundee DD2 5DA, UK
| | - Dominika Lewandowska
- Cell and Molecular Sciences, The James Hutton Institute, Invergowrie, Dundee DD2 5DA, UK
| | - Sara Ten Have
- Centre for Gene Regulation and Expression, School of Life Sciences, University of Dundee, Dundee, UK
| | | | - Heribert Hirt
- Institute of Plant Sciences Paris Saclay, INRA-CNRS-UEVE, Orsay 91405, France
| | - Allan B James
- Institute of Molecular, Cell and Systems Biology, College of Medical, Veterinary and Life Sciences, University of Glasgow, Glasgow G12 8QQ, UK
| | - Hugh G Nimmo
- Institute of Molecular, Cell and Systems Biology, College of Medical, Veterinary and Life Sciences, University of Glasgow, Glasgow G12 8QQ, UK
| | - Andrea Barta
- Max F. Perutz Laboratories, Medical University of Vienna, Dr. Bohrgasse 9/3, 1030 Vienna, Austria
| | - Maria Kalyna
- Department of Applied Genetics and Cell Biology, University of Natural Resources and Life Sciences - BOKU, Muthgasse 18, 1190 Vienna, Austria
| | - John W S Brown
- Plant Sciences Division, College of Life Sciences, University of Dundee, Invergowrie, Dundee DD2 5DA, UK; Cell and Molecular Sciences, The James Hutton Institute, Invergowrie, Dundee DD2 5DA, UK
| |
|
44
|
Wesolowski S, Vera D, Wu W. SRSF shape analysis for sequencing data reveal new differentiating patterns. Comput Biol Chem 2017; 70:56-64. [PMID: 28803038 DOI: 10.1016/j.compbiolchem.2017.07.004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/03/2017] [Revised: 05/14/2017] [Accepted: 07/20/2017] [Indexed: 10/19/2022]
Abstract
MOTIVATION: Sequencing-based methods to examine fundamental features of the genome, such as gene expression and chromatin structure, rely on inferences from the abundance and distribution of reads derived from Illumina sequencing. Drawing sound inferences from such experiments relies on appropriate mathematical methods to model the distribution of reads along the genome, which has been challenging due to the scale and nature of these data. RESULTS: We propose a new framework (SRSFseq) based on square root slope functions shape analysis to analyse Illumina sequencing data. In the new approach the basic unit of information is the density of mapped reads over a region of interest located on the known reference genome. The densities are interpreted as shapes and a new shape analysis model is proposed. An equivalent of a Fisher test is used to quantify the significance of shape differences in read distribution patterns between groups of density functions in different experimental conditions. We evaluated the performance of this new framework to analyze RNA-seq data at the exon level, which enabled the detection of variation in read distributions and abundances between experimental conditions not detected by other methods. Thus, the method is a suitable supplement to the state-of-the-art count based techniques. The variety of density representations and flexibility of mathematical design allow the model to be easily adapted to other data types or problems in which the distribution of reads is to be tested. The functional interpretation and SRSF phase-amplitude separation technique give an efficient noise reduction procedure improving the sensitivity and specificity of the method.
Affiliation(s)
| | - Daniel Vera
- Center of Genomics and Personalized Medicine, Florida State University, United States
| | - Wei Wu
- Department of Statistics, Florida State University, United States
| |
|
45
|
Sahraeian SME, Mohiyuddin M, Sebra R, Tilgner H, Afshar PT, Au KF, Bani Asadi N, Gerstein MB, Wong WH, Snyder MP, Schadt E, Lam HYK. Gaining comprehensive biological insight into the transcriptome by performing a broad-spectrum RNA-seq analysis. Nat Commun 2017; 8:59. [PMID: 28680106 PMCID: PMC5498581 DOI: 10.1038/s41467-017-00050-4] [Citation(s) in RCA: 209] [Impact Index Per Article: 26.1] [Received: 10/07/2016] [Accepted: 05/02/2017] [Indexed: 12/30/2022]
Abstract
RNA-sequencing (RNA-seq) is an essential technique for transcriptome studies, and hundreds of analysis tools have been developed since its debut. Although recent efforts have attempted to assess the latest available tools, they have not evaluated the analysis workflows comprehensively to unleash the power within RNA-seq. Here we conduct an extensive study analysing a broad spectrum of RNA-seq workflows. Surpassing the expression analysis scope, our work also includes assessment of RNA variant-calling, RNA editing and RNA fusion detection techniques. Specifically, we examine both short- and long-read RNA-seq technologies, 39 analysis tools resulting in ~120 combinations, and ~490 analyses involving 15 samples with a variety of germline, cancer and stem cell data sets. We report the performance and propose a comprehensive RNA-seq analysis protocol, named RNACocktail, along with a computational pipeline achieving high accuracy. Validation on different samples reveals that our proposed protocol could help researchers extract more biologically relevant predictions by broad analysis of the transcriptome. RNA-seq is widely used for transcriptome analysis. Here, the authors analyse a wide spectrum of RNA-seq workflows and present a comprehensive analysis protocol named RNACocktail as well as a computational pipeline leveraging the widely used tools for accurate RNA-seq analysis.
Affiliation(s)
- Robert Sebra
  - Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, 10029, USA
- Hagen Tilgner
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA, 94305, USA
- Pegah T Afshar
  - Department of Electrical Engineering, Stanford University, Stanford, CA, 94305, USA
- Kin Fai Au
  - Department of Internal Medicine, University of Iowa, Iowa City, IA, 52242, USA
- Mark B Gerstein
  - Computational Biology and Bioinformatics, Yale University, New Haven, CT, 06520, USA
- Wing Hung Wong
  - Statistics; Health Research and Policy, Stanford University, Stanford, CA, 94305, USA
- Michael P Snyder
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA, 94305, USA
- Eric Schadt
  - Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, 10029, USA
- Hugo Y K Lam
  - Roche Sequencing Solutions, Belmont, CA, 94002, USA.
|
46
|
Nazarov PV, Muller A, Kaoma T, Nicot N, Maximo C, Birembaut P, Tran NL, Dittmar G, Vallar L. RNA sequencing and transcriptome arrays analyses show opposing results for alternative splicing in patient derived samples. BMC Genomics 2017; 18:443. [PMID: 28587590 PMCID: PMC5461714 DOI: 10.1186/s12864-017-3819-y] [Citation(s) in RCA: 57] [Impact Index Per Article: 7.1] [Received: 01/05/2017] [Accepted: 05/25/2017] [Indexed: 01/29/2023]
Abstract
Background: RNA sequencing (RNA-seq) and microarrays are two transcriptomics techniques aimed at the quantification of transcribed genes and their isoforms. Here we compare the latest Affymetrix HTA 2.0 microarray with Illumina 2000 RNA-seq for the analysis of patient samples - normal lung epithelium tissue and squamous cell carcinoma lung tumours. Protein coding mRNAs and long non-coding RNAs (lncRNAs) were included in the study. Results: Both platforms performed equally well for protein-coding RNAs, however the stochastic variability was higher for the sequencing data than for microarrays. This reduced the number of differentially expressed genes and genes with predictive potential for RNA-seq compared to microarray data. Analysis of this variability revealed a lack of reads for short and low abundant genes; lncRNAs, being shorter and less abundant RNAs, were found especially susceptible to this issue. A major difference between the two platforms was uncovered by analysis of alternatively spliced genes. Investigation of differential exon abundance showed insufficient reads for many exons and exon junctions in RNA-seq while the detection on the array platform was more stable. Nevertheless, we identified 207 genes which undergo alternative splicing and were consistently detected by both techniques. Conclusions: Despite the fact that the results of gene expression analysis were highly consistent between Human Transcriptome Arrays and RNA-seq platforms, the analysis of alternative splicing produced discordant results. We concluded that modern microarrays can still outperform sequencing for standard analysis of gene expression in terms of reproducibility and cost.
Affiliation(s)
- Petr V Nazarov
  - Proteome and Genome Research Unit, Department of Oncology, Luxembourg Institute of Health, Luxembourg, Luxembourg.
- Arnaud Muller
  - Proteome and Genome Research Unit, Department of Oncology, Luxembourg Institute of Health, Luxembourg, Luxembourg
- Tony Kaoma
  - Proteome and Genome Research Unit, Department of Oncology, Luxembourg Institute of Health, Luxembourg, Luxembourg
- Nathalie Nicot
  - Proteome and Genome Research Unit, Department of Oncology, Luxembourg Institute of Health, Luxembourg, Luxembourg
- Cristina Maximo
  - Proteome and Genome Research Unit, Department of Oncology, Luxembourg Institute of Health, Luxembourg, Luxembourg
- Nhan L Tran
  - Departments of Cancer Biology and Neurosurgery, Mayo Clinic Arizona, Phoenix, USA
- Gunnar Dittmar
  - Proteome and Genome Research Unit, Department of Oncology, Luxembourg Institute of Health, Luxembourg, Luxembourg
- Laurent Vallar
  - Proteome and Genome Research Unit, Department of Oncology, Luxembourg Institute of Health, Luxembourg, Luxembourg
|
47
|
Kaisers W, Ptok J, Schwender H, Schaal H. Validation of Splicing Events in Transcriptome Sequencing Data. Int J Mol Sci 2017; 18:1110. [PMID: 28545234 PMCID: PMC5485934 DOI: 10.3390/ijms18061110] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Received: 04/05/2017] [Revised: 04/26/2017] [Accepted: 04/28/2017] [Indexed: 11/16/2022]
Abstract
Genomic alignments of sequenced cellular messenger RNA contain gapped alignments which are interpreted as a consequence of intron removal. The resulting gap-sites, genomic locations of alignment gaps, are landmarks representing potential splice-sites. As alignment algorithms report gap-sites with a considerable false discovery rate, validations are required. We describe two quality scores, gap quality score (gqs) and weighted gap information score (wgis), developed for validation of putative splicing events: while gqs relies solely on alignment data, wgis additionally considers information from the genomic sequence. FASTQ files obtained from 54 human dermal fibroblast samples were aligned against the human genome (GRCh38) using the TopHat and STAR aligners. Statistical properties of gap-sites validated by gqs and wgis were evaluated by their sequence similarity to known exon-intron borders. Within the 54 samples, TopHat identifies 1,000,380 and STAR reports 6,487,577 gap-sites. Due to the lack of strand information, however, the percentage of identified GT-AG gap-sites is rather low. While gap-sites from TopHat contain ≈89% GT-AG, gap-sites from STAR only contain ≈42% GT-AG dinucleotide pairs in merged data from 54 fibroblast samples. Validation with gqs yields 156,251 gap-sites from TopHat alignments and 166,294 from STAR alignments. Validation with wgis yields 770,327 gap-sites from TopHat alignments and 1,065,596 from STAR alignments. Both alignment algorithms, TopHat and STAR, report gap-sites with a considerable false discovery rate, which can be drastically reduced by validation with gqs and wgis.
Affiliation(s)
- Wolfgang Kaisers
  - Department for Anaesthesiology, University Hospital Düsseldorf, Heinrich Heine University, 40225 Düsseldorf, Germany.
  - BMFZ (Biologisch-Medizinisches Forschungszentrum), Heinrich Heine University, 40225 Düsseldorf, Germany.
- Johannes Ptok
  - Institute of Virology, University Hospital Düsseldorf, Heinrich Heine University, 40225 Düsseldorf, Germany.
- Holger Schwender
  - BMFZ (Biologisch-Medizinisches Forschungszentrum), Heinrich Heine University, 40225 Düsseldorf, Germany.
  - Mathematical Institute, Heinrich Heine University, 40225 Düsseldorf, Germany.
- Heiner Schaal
  - BMFZ (Biologisch-Medizinisches Forschungszentrum), Heinrich Heine University, 40225 Düsseldorf, Germany.
  - Institute of Virology, University Hospital Düsseldorf, Heinrich Heine University, 40225 Düsseldorf, Germany.
|
48
|
Majoros WH, Campbell MS, Holt C, DeNardo EK, Ware D, Allen AS, Yandell M, Reddy TE. High-throughput interpretation of gene structure changes in human and nonhuman resequencing data, using ACE. Bioinformatics 2017; 33:1437-1446. [PMID: 28011790 DOI: 10.1093/bioinformatics/btw799] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 09/02/2016] [Accepted: 12/13/2016] [Indexed: 11/12/2022]
Abstract
Motivation: The accurate interpretation of genetic variants is critical for characterizing genotype-phenotype associations. Because the effects of genetic variants can depend strongly on their local genomic context, accurate genome annotations are essential. Furthermore, as some variants have the potential to disrupt or alter gene structure, variant interpretation efforts stand to gain from the use of individualized annotations that account for differences in gene structure between individuals or strains. Results: We describe a suite of software tools for identifying possible functional changes in gene structure that may result from sequence variants. ACE ('Assessing Changes to Exons') converts phased genotype calls to a collection of explicit haplotype sequences, maps transcript annotations onto them, detects gene-structure changes and their possible repercussions, and identifies several classes of possible loss of function. Novel transcripts predicted by ACE are commonly supported by spliced RNA-seq reads, and can be used to improve read alignment and transcript quantification when an individual-specific genome sequence is available. Using publicly available RNA-seq data, we show that ACE predictions confirm earlier results regarding the quantitative effects of nonsense-mediated decay, and we show that predicted loss-of-function events are highly concordant with patterns of intolerance to mutations across the human population. ACE can be readily applied to diverse species including animals and plants, making it a broadly useful tool for use in eukaryotic population-based resequencing projects, particularly for assessing the joint impact of all variants at a locus. Availability and Implementation: ACE is written in open-source C++ and Perl and is available from geneprediction.org/ACE. Contact: myandell@genetics.utah.edu or tim.reddy@duke.edu. Supplementary information: Supplementary information is available at Bioinformatics online.
Affiliation(s)
- William H Majoros
  - Program in Computational Biology and Bioinformatics, Duke University, Durham, NC, USA
  - Center for Genomic and Computational Biology, Duke University Medical School, Durham, NC, USA
- Carson Holt
  - Department of Human Genetics, Eccles Institute of Human Genetics, University of Utah and School of Medicine, Salt Lake City, UT, USA
- Erin K DeNardo
  - Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA
- Doreen Ware
  - Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA
  - USDA ARS NEA Robert W. Holley Center for Agriculture and Health, Cornell University, Ithaca, NY, USA
- Andrew S Allen
  - Program in Computational Biology and Bioinformatics, Duke University, Durham, NC, USA
  - Department of Biostatistics and Bioinformatics, Duke University Medical School, Durham, NC, USA
- Mark Yandell
  - Department of Human Genetics, Eccles Institute of Human Genetics, University of Utah and School of Medicine, Salt Lake City, UT, USA
  - USTAR Center for Genetic Discovery, University of Utah, Salt Lake City, UT, USA
- Timothy E Reddy
  - Program in Computational Biology and Bioinformatics, Duke University, Durham, NC, USA
  - Center for Genomic and Computational Biology, Duke University Medical School, Durham, NC, USA
  - Department of Biostatistics and Bioinformatics, Duke University Medical School, Durham, NC, USA
|
49
|
Abstract
The pervasive expression of circular RNAs (circRNAs) is a recently discovered feature of gene expression in highly diverged eukaryotes. Numerous algorithms that are used to detect genome-wide circRNA expression from RNA sequencing (RNA-seq) data have been developed in the past few years, but there is little overlap in their predictions and no clear gold-standard method to assess the accuracy of these algorithms. We review sources of experimental and bioinformatic biases that complicate the accurate discovery of circRNAs and discuss statistical approaches to address these biases. We conclude with a discussion of the current experimental progress on the topic.
|
50
|
Tress ML, Abascal F, Valencia A. Most Alternative Isoforms Are Not Functionally Important. Trends Biochem Sci 2017; 42:408-410. [PMID: 28483377 DOI: 10.1016/j.tibs.2017.04.002] [Citation(s) in RCA: 54] [Impact Index Per Article: 6.8] [Received: 04/03/2017] [Accepted: 04/03/2017] [Indexed: 12/11/2022]
Affiliation(s)
- Michael L Tress
  - Department of Structural and Computational Biology, Spanish National Cancer Research Centre (CNIO), 28029 Madrid, Spain.
- Alfonso Valencia
  - Current address: Life Sciences Department, Barcelona Supercomputing Center (BSC-CNS), 08034 Barcelona, Spain
|