1
|
Li Y, Huang J, Li LF, Guo P, Wang Y, Cushman SA, Shang FD. Roles and regulatory patterns of protein isoforms in plant adaptation and development. THE NEW PHYTOLOGIST 2025; 245:1887-1896. [PMID: 39645578 DOI: 10.1111/nph.20327] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 07/09/2024] [Accepted: 11/20/2024] [Indexed: 12/09/2024]
Abstract
Protein isoforms (PIs) play pivotal roles in regulating plant growth and development that confer adaptability to diverse environmental conditions. PIs are widely present in plants and generated through alternative splicing (AS), alternative polyadenylation (APA), alternative initiation (AI), and ribosomal frameshifting (RF) events. The widespread presence of PIs not only significantly increases the complexity of genomic information but also greatly enriches regulatory networks and enhances their flexibility. PIs may also play important roles in phenotypic diversity, ecological niche differentiation, and speciation, thereby increasing the dimensions of research in molecular ecology. However, PIs pose new challenges for the quantitative analysis, annotation, and identification of genetic regulatory mechanisms. Thus, focus on PIs make genomic and epigenomic studies both more powerful and more challenging. This review summarizes the origins, functions, regulatory patterns of isoforms, and the challenges they present for future research in molecular ecology and molecular biology.
Collapse
Affiliation(s)
- Yong Li
- College of Life Science and Technology, Inner Mongolia Normal University, Hohhot, 010020, China
- Key Laboratory of Biodiversity Conservation and Sustainable Utilization in Mongolian Plateau for College and University of Inner Mongolia Autonomous Region, Hohhot, 010022, China
| | - Jinling Huang
- Department of Biology, East Carolina University, Greenville, NC, 27858, USA
- State Key Laboratory of Crop Stress Adaptation and Improvement, School of Life Science, Henan University, Kaifeng, 475004, China
| | - Lin-Feng Li
- College of Life Science and Technology, Sun Yat-sen University, Guangzhou, 510275, China
| | - Peng Guo
- College of Life Science, Henan Agricultural University, Zhengzhou, 450002, China
- Henan Engineering Research Center for Osmanthus Germplasm Innovation and Resource Utilization, Henan Agricultural University, Zhengzhou, 450002, China
| | - Yihan Wang
- College of Life Science, Henan Agricultural University, Zhengzhou, 450002, China
- Henan Engineering Research Center for Osmanthus Germplasm Innovation and Resource Utilization, Henan Agricultural University, Zhengzhou, 450002, China
| | - Samuel A Cushman
- Department of Biology, University of Oxford, 11a Mansfield Road, Oxford, OX5QL, UK
| | - Fu-De Shang
- College of Life Science, Henan Agricultural University, Zhengzhou, 450002, China
- Henan Engineering Research Center for Osmanthus Germplasm Innovation and Resource Utilization, Henan Agricultural University, Zhengzhou, 450002, China
| |
Collapse
|
2
|
Deshpande D, Chhugani K, Chang Y, Karlsberg A, Loeffler C, Zhang J, Muszyńska A, Munteanu V, Yang H, Rotman J, Tao L, Balliu B, Tseng E, Eskin E, Zhao F, Mohammadi P, P. Łabaj P, Mangul S. RNA-seq data science: From raw data to effective interpretation. Front Genet 2023; 14:997383. [PMID: 36999049 PMCID: PMC10043755 DOI: 10.3389/fgene.2023.997383] [Citation(s) in RCA: 52] [Impact Index Per Article: 26.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/18/2022] [Accepted: 02/24/2023] [Indexed: 03/14/2023] Open
Abstract
RNA sequencing (RNA-seq) has become an exemplary technology in modern biology and clinical science. Its immense popularity is due in large part to the continuous efforts of the bioinformatics community to develop accurate and scalable computational tools to analyze the enormous amounts of transcriptomic data that it produces. RNA-seq analysis enables genes and their corresponding transcripts to be probed for a variety of purposes, such as detecting novel exons or whole transcripts, assessing expression of genes and alternative transcripts, and studying alternative splicing structure. It can be a challenge, however, to obtain meaningful biological signals from raw RNA-seq data because of the enormous scale of the data as well as the inherent limitations of different sequencing technologies, such as amplification bias or biases of library preparation. The need to overcome these technical challenges has pushed the rapid development of novel computational tools, which have evolved and diversified in accordance with technological advancements, leading to the current myriad of RNA-seq tools. These tools, combined with the diverse computational skill sets of biomedical researchers, help to unlock the full potential of RNA-seq. The purpose of this review is to explain basic concepts in the computational analysis of RNA-seq data and define discipline-specific jargon.
Collapse
Affiliation(s)
- Dhrithi Deshpande
- Department of Pharmacology and Pharmaceutical Sciences, USC Alfred E. Mann School of Pharmacy and Pharmaceutical Sciences, Los Angeles, CA, United States
| | - Karishma Chhugani
- Department of Pharmacology and Pharmaceutical Sciences, USC Alfred E. Mann School of Pharmacy and Pharmaceutical Sciences, Los Angeles, CA, United States
| | - Yutong Chang
- Department of Pharmacology and Pharmaceutical Sciences, USC Alfred E. Mann School of Pharmacy and Pharmaceutical Sciences, Los Angeles, CA, United States
| | - Aaron Karlsberg
- Department of Clinical Pharmacy, USC Alfred E. Mann School of Pharmacy and Pharmaceutical Sciences, Los Angeles, CA, United States
| | - Caitlin Loeffler
- Department of Computer Science, University of California, Los Angeles, CA, United States
| | - Jinyang Zhang
- Beijing Institutes of Life Science, Chinese Academy of Sciences, Beijing, China
| | - Agata Muszyńska
- Małopolska Centre of Biotechnology, Jagiellonian University, Krakow, Poland
- Institute of Automatic Control, Electronics and Computer Science, Silesian University of Technology, Gliwice, Poland
| | - Viorel Munteanu
- Department of Computers, Informatics and Microelectronics, Technical University of Moldova, Chisinau, Moldova
| | - Harry Yang
- Department of Microbiology, Immunology and Molecular Genetics, University of California Los Angeles, Los Angeles, CA, United States
| | - Jeremy Rotman
- Department of Clinical Pharmacy, USC Alfred E. Mann School of Pharmacy and Pharmaceutical Sciences, Los Angeles, CA, United States
| | - Laura Tao
- Department of Computational Medicine, David Geffen School of Medicine at UCLA, CHS, Los Angeles, CA, United States
| | - Brunilda Balliu
- Department of Computational Medicine, David Geffen School of Medicine at UCLA, CHS, Los Angeles, CA, United States
| | | | - Eleazar Eskin
- Department of Computer Science, University of California, Los Angeles, CA, United States
- Department of Computational Medicine, David Geffen School of Medicine at UCLA, CHS, Los Angeles, CA, United States
- Department of Human Genetics, David Geffen School of Medicine at UCLA, Los Angeles, CA, United States
| | - Fangqing Zhao
- Beijing Institutes of Life Science, Chinese Academy of Sciences, Beijing, China
- Key Laboratory of Systems Biology, Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences, Hangzhou, China
| | - Pejman Mohammadi
- Department of Integrative Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA, United States
| | - Paweł P. Łabaj
- Małopolska Centre of Biotechnology, Jagiellonian University, Krakow, Poland
- Department of Biotechnology, Boku University Vienna, Vienna, Austria
| | - Serghei Mangul
- Department of Clinical Pharmacy, USC Alfred E. Mann School of Pharmacy and Pharmaceutical Sciences, Los Angeles, CA, United States
- Department of Quantitative and Computational Biology, USC Dornsife College of Letters, Arts and Sciences, Los Angeles, CA, United States
- *Correspondence: Serghei Mangul,
| |
Collapse
|
3
|
Fan J, Chan S, Patro R. Perplexity: evaluating transcript abundance estimation in the absence of ground truth. Algorithms Mol Biol 2022; 17:6. [PMID: 35331283 PMCID: PMC8951746 DOI: 10.1186/s13015-022-00214-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/16/2021] [Accepted: 03/01/2022] [Indexed: 11/20/2022] Open
Abstract
Background There has been rapid development of probabilistic models and inference methods for transcript abundance estimation from RNA-seq data. These models aim to accurately estimate transcript-level abundances, to account for different biases in the measurement process, and even to assess uncertainty in resulting estimates that can be propagated to subsequent analyses. The assumed accuracy of the estimates inferred by such methods underpin gene expression based analysis routinely carried out in the lab. Although hyperparameter selection is known to affect the distributions of inferred abundances (e.g. producing smooth versus sparse estimates), strategies for performing model selection in experimental data have been addressed informally at best. Results We derive perplexity for evaluating abundance estimates on fragment sets directly. We adapt perplexity from the analogous metric used to evaluate language and topic models and extend the metric to carefully account for corner cases unique to RNA-seq. In experimental data, estimates with the best perplexity also best correlate with qPCR measurements. In simulated data, perplexity is well behaved and concordant with genome-wide measurements against ground truth and differential expression analysis. Furthermore, we demonstrate theoretically and experimentally that perplexity can be computed for arbitrary transcript abundance estimation models. Conclusions Alongside the derivation and implementation of perplexity for transcript abundance estimation, our study is the first to make possible model selection for transcript abundance estimation on experimental data in the absence of ground truth.
Collapse
|
4
|
Srivastava A, Malik L, Sarkar H, Patro R. A Bayesian framework for inter-cellular information sharing improves dscRNA-seq quantification. Bioinformatics 2020; 36:i292-i299. [PMID: 32657394 PMCID: PMC7355277 DOI: 10.1093/bioinformatics/btaa450] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/30/2022] Open
Abstract
Motivation Droplet-based single-cell RNA-seq (dscRNA-seq) data are being generated at an unprecedented pace, and the accurate estimation of gene-level abundances for each cell is a crucial first step in most dscRNA-seq analyses. When pre-processing the raw dscRNA-seq data to generate a count matrix, care must be taken to account for the potentially large number of multi-mapping locations per read. The sparsity of dscRNA-seq data, and the strong 3’ sampling bias, makes it difficult to disambiguate cases where there is no uniquely mapping read to any of the candidate target genes. Results We introduce a Bayesian framework for information sharing across cells within a sample, or across multiple modalities of data using the same sample, to improve gene quantification estimates for dscRNA-seq data. We use an anchor-based approach to connect cells with similar gene-expression patterns, and learn informative, empirical priors which we provide to alevin’s gene multi-mapping resolution algorithm. This improves the quantification estimates for genes with no uniquely mapping reads (i.e. when there is no unique intra-cellular information). We show our new model improves the per cell gene-level estimates and provides a principled framework for information sharing across multiple modalities. We test our method on a combination of simulated and real datasets under various setups. Availability and implementation The information sharing model is included in alevin and is implemented in C++14. It is available as open-source software, under GPL v3, at https://github.com/COMBINE-lab/salmon as of version 1.1.0.
Collapse
Affiliation(s)
- Avi Srivastava
- Department of Computer Science, Stony Brook University, Stony Brook 11794, NY, USA
| | - Laraib Malik
- Department of Computer Science, Stony Brook University, Stony Brook 11794, NY, USA
| | - Hirak Sarkar
- Computer Science Department, University of Maryland, College Park 20742, MD, USA
| | - Rob Patro
- Computer Science Department, University of Maryland, College Park 20742, MD, USA
| |
Collapse
|
5
|
Deng W, Mou T, Kalari KR, Niu N, Wang L, Pawitan Y, Vu TN. Alternating EM algorithm for a bilinear model in isoform quantification from RNA-seq data. Bioinformatics 2019; 36:805-812. [PMID: 31400221 PMCID: PMC9883676 DOI: 10.1093/bioinformatics/btz640] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/15/2018] [Revised: 06/13/2019] [Accepted: 08/09/2019] [Indexed: 02/02/2023] Open
Abstract
MOTIVATION Estimation of isoform-level gene expression from RNA-seq data depends on simplifying assumptions, such as uniform read distribution, that are easily violated in real data. Such violations typically lead to biased estimates. Most existing methods provide bias correction step(s), which is based on biological considerations-such as GC content-and applied in single samples separately. The main problem is that not all biases are known. RESULTS We have developed a novel method called XAEM based on a more flexible and robust statistical model. Existing methods are essentially based on a linear model Xβ, where the design matrix X is known and is computed based on the simplifying assumptions. In contrast XAEM considers Xβ as a bilinear model with both X and β unknown. Joint estimation of X and β is made possible by a simultaneous analysis of multi-sample RNA-seq data. Compared to existing methods, XAEM automatically performs empirical correction of potentially unknown biases. We use an alternating expectation-maximization (AEM) algorithm, alternating between estimation of X and β. For speed XAEM utilizes quasi-mapping for read alignment, thus leading to a fast algorithm. Overall XAEM performs favorably compared to recent advanced methods. For simulated datasets, XAEM obtains higher accuracy for multiple-isoform genes. In a differential-expression analysis of a real single-cell RNA-seq dataset, XAEM achieves substantially better rediscovery rates in independent validation sets. AVAILABILITY AND IMPLEMENTATION The method and pipeline are implemented as a tool and freely available for use at http://fafner.meb.ki.se/biostatwiki/xaem/. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Wenjiang Deng
- Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm 17177, Sweden
| | - Tian Mou
- Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm 17177, Sweden
| | | | - Nifang Niu
- Department of Molecular Pharmacology and Experimental Therapeutics, Mayo Clinic, Rochester, MN 55905, USA
| | - Liewei Wang
- Department of Molecular Pharmacology and Experimental Therapeutics, Mayo Clinic, Rochester, MN 55905, USA
| | | | | |
Collapse
|
6
|
Van den Berge K, Hembach KM, Soneson C, Tiberi S, Clement L, Love MI, Patro R, Robinson MD. RNA Sequencing Data: Hitchhiker's Guide to Expression Analysis. Annu Rev Biomed Data Sci 2019. [DOI: 10.1146/annurev-biodatasci-072018-021255] [Citation(s) in RCA: 71] [Impact Index Per Article: 11.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/09/2022]
Abstract
Gene expression is the fundamental level at which the results of various genetic and regulatory programs are observable. The measurement of transcriptome-wide gene expression has convincingly switched from microarrays to sequencing in a matter of years. RNA sequencing (RNA-seq) provides a quantitative and open system for profiling transcriptional outcomes on a large scale and therefore facilitates a large diversity of applications, including basic science studies, but also agricultural or clinical situations. In the past 10 years or so, much has been learned about the characteristics of the RNA-seq data sets, as well as the performance of the myriad of methods developed. In this review, we give an overview of the developments in RNA-seq data analysis, including experimental design, with an explicit focus on the quantification of gene expression and statistical approachesfor differential expression. We also highlight emerging data types, such as single-cell RNA-seq and gene expression profiling using long-read technologies.
Collapse
Affiliation(s)
- Koen Van den Berge
- Bioinformatics Institute Ghent and Department of Applied Mathematics, Computer Science and Statistics, Ghent University, 9000 Ghent, Belgium
| | - Katharina M. Hembach
- Institute of Molecular Life Sciences and SIB Swiss Institute of Bioinformatics, University of Zurich, 8057 Zurich, Switzerland
| | - Charlotte Soneson
- Institute of Molecular Life Sciences and SIB Swiss Institute of Bioinformatics, University of Zurich, 8057 Zurich, Switzerland
| | - Simone Tiberi
- Institute of Molecular Life Sciences and SIB Swiss Institute of Bioinformatics, University of Zurich, 8057 Zurich, Switzerland
| | - Lieven Clement
- Bioinformatics Institute Ghent and Department of Applied Mathematics, Computer Science and Statistics, Ghent University, 9000 Ghent, Belgium
| | - Michael I. Love
- Department of Biostatistics and Department of Genetics, University of North Carolina, Chapel Hill, North Carolina 27514, USA
| | - Rob Patro
- Department of Computer Science, Stony Brook University, Stony Brook, New York 11794, USA
| | - Mark D. Robinson
- Institute of Molecular Life Sciences and SIB Swiss Institute of Bioinformatics, University of Zurich, 8057 Zurich, Switzerland
| |
Collapse
|
7
|
HLA-VBSeq v2: improved HLA calling accuracy with full-length Japanese class-I panel. Hum Genome Var 2019; 6:29. [PMID: 31240105 PMCID: PMC6584547 DOI: 10.1038/s41439-019-0061-y] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/25/2019] [Revised: 05/26/2019] [Accepted: 05/27/2019] [Indexed: 12/22/2022] Open
Abstract
HLA-VBSeq is an HLA calling tool developed to infer the most likely HLA types from high-throughput sequencing data. However, there is still room for improvement in specific genetic groups because of the diversity of HLA alleles in human populations. Here, we present HLA-VBSeq v2, a software application that makes use of a new Japanese HLA reference panel to enhance calling accuracy for Japanese HLA class-I genes. Our analysis showed significant improvements in calling accuracy in all HLA regions, with prediction accuracies achieving over 99.0, 97.8, and 99.8% in HLA-A, B and C, respectively.
Collapse
|
8
|
Abstract
Identification of differentially expressed genes has been a high priority task of downstream analyses to further advances in biomedical research. Investigators have been faced with an array of issues in dealing with more complicated experiments and metadata, including batch effects, normalization, temporal dynamics (temporally differential expression), and isoform diversity (isoform-level quantification and differential splicing events). To date, there are currently no standard approaches to precisely and efficiently analyze these moderate or large-scale experimental designs, especially with combined metadata. In this report, we propose comprehensive analytical pipelines to precisely characterize temporal dynamics in differential expression of genes and other genomic features, i.e., the variability of transcripts, isoforms and exons, by controlling batch effects and other nuisance factors that could have significant confounding effects on the main effects of interest in comparative models and may result in misleading interpretations.
Collapse
|
9
|
Lee H, Kingsford C. Kourami: graph-guided assembly for novel human leukocyte antigen allele discovery. Genome Biol 2018; 19:16. [PMID: 29415772 PMCID: PMC5804087 DOI: 10.1186/s13059-018-1388-2] [Citation(s) in RCA: 56] [Impact Index Per Article: 8.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/17/2017] [Accepted: 01/08/2018] [Indexed: 01/07/2023] Open
Abstract
Accurate typing of human leukocyte antigen (HLA) is important because HLA genes play important roles in immune responses and disease genesis. Previously available computational methods are database-matching approaches and their outputs are inherently limited by the completeness of already known types, making them unsuitable for discovery of novel alleles. We have developed a graph-guided assembly technique for classical HLA genes, which can construct allele sequences given high-coverage whole-genome sequencing data. Our method delivers highly accurate HLA typing, comparable to the current state-of-the-art methods. Using various data, we also demonstrate that our method can type novel alleles.
Collapse
Affiliation(s)
- Heewook Lee
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
| | - Carl Kingsford
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA.
| |
Collapse
|
10
|
Srivastava A, Sarkar H, Gupta N, Patro R. RapMap: a rapid, sensitive and accurate tool for mapping RNA-seq reads to transcriptomes. Bioinformatics 2017; 32:i192-i200. [PMID: 27307617 PMCID: PMC4908361 DOI: 10.1093/bioinformatics/btw277] [Citation(s) in RCA: 72] [Impact Index Per Article: 9.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/15/2022] Open
Abstract
Motivation: The alignment of sequencing reads to a transcriptome is a common and important step in many RNA-seq analysis tasks. When aligning RNA-seq reads directly to a transcriptome (as is common in the de novo setting or when a trusted reference annotation is available), care must be taken to report the potentially large number of multi-mapping locations per read. This can pose a substantial computational burden for existing aligners, and can considerably slow downstream analysis. Results: We introduce a novel concept, quasi-mapping, and an efficient algorithm implementing this approach for mapping sequencing reads to a transcriptome. By attempting only to report the potential loci of origin of a sequencing read, and not the base-to-base alignment by which it derives from the reference, RapMap—our tool implementing quasi-mapping—is capable of mapping sequencing reads to a target transcriptome substantially faster than existing alignment tools. The algorithm we use to implement quasi-mapping uses several efficient data structures and takes advantage of the special structure of shared sequence prevalent in transcriptomes to rapidly provide highly-accurate mapping information. We demonstrate how quasi-mapping can be successfully applied to the problems of transcript-level quantification from RNA-seq reads and the clustering of contigs from de novo assembled transcriptomes into biologically meaningful groups. Availability and implementation: RapMap is implemented in C ++11 and is available as open-source software, under GPL v3, at https://github.com/COMBINE-lab/RapMap. Contact:rob.patro@cs.stonybrook.edu Supplementary information:Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Avi Srivastava
- Department of Computer Science, Stony Brook University Stony Brook, New York, NY 11794-2424, USA
| | - Hirak Sarkar
- Department of Computer Science, Stony Brook University Stony Brook, New York, NY 11794-2424, USA
| | - Nitish Gupta
- Department of Computer Science, Stony Brook University Stony Brook, New York, NY 11794-2424, USA
| | - Rob Patro
- Department of Computer Science, Stony Brook University Stony Brook, New York, NY 11794-2424, USA
| |
Collapse
|
11
|
Zakeri M, Srivastava A, Almodaresi F, Patro R. Improved data-driven likelihood factorizations for transcript abundance estimation. Bioinformatics 2017; 33:i142-i151. [PMID: 28881996 PMCID: PMC5870700 DOI: 10.1093/bioinformatics/btx262] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/18/2022] Open
Abstract
MOTIVATION Many methods for transcript-level abundance estimation reduce the computational burden associated with the iterative algorithms they use by adopting an approximate factorization of the likelihood function they optimize. This leads to considerably faster convergence of the optimization procedure, since each round of e.g. the EM algorithm, can execute much more quickly. However, these approximate factorizations of the likelihood function simplify calculations at the expense of discarding certain information that can be useful for accurate transcript abundance estimation. RESULTS We demonstrate that model simplifications (i.e. factorizations of the likelihood function) adopted by certain abundance estimation methods can lead to a diminished ability to accurately estimate the abundances of highly related transcripts. In particular, considering factorizations based on transcript-fragment compatibility alone can result in a loss of accuracy compared to the per-fragment, unsimplified model. However, we show that such shortcomings are not an inherent limitation of approximately factorizing the underlying likelihood function. By considering the appropriate conditional fragment probabilities, and adopting improved, data-driven factorizations of this likelihood, we demonstrate that such approaches can achieve accuracy nearly indistinguishable from methods that consider the complete (i.e. per-fragment) likelihood, while retaining the computational efficiently of the compatibility-based factorizations. AVAILABILITY AND IMPLEMENTATION Our data-driven factorizations are incorporated into a branch of the Salmon transcript quantification tool: https://github.com/COMBINE-lab/salmon/tree/factorizations . CONTACT rob.patro@cs.stonybrook.edu. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Mohsen Zakeri
- Department of Computer Science, Stony Brook University, Stony Brook, NY, USA
| | - Avi Srivastava
- Department of Computer Science, Stony Brook University, Stony Brook, NY, USA
| | - Fatemeh Almodaresi
- Department of Computer Science, Stony Brook University, Stony Brook, NY, USA
| | - Rob Patro
- Department of Computer Science, Stony Brook University, Stony Brook, NY, USA
| |
Collapse
|
12
|
Song Y, Botvinnik OB, Lovci MT, Kakaradov B, Liu P, Xu JL, Yeo GW. Single-Cell Alternative Splicing Analysis with Expedition Reveals Splicing Dynamics during Neuron Differentiation. Mol Cell 2017; 67:148-161.e5. [PMID: 28673540 DOI: 10.1016/j.molcel.2017.06.003] [Citation(s) in RCA: 118] [Impact Index Per Article: 14.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/02/2016] [Revised: 03/29/2017] [Accepted: 06/02/2017] [Indexed: 12/28/2022]
Abstract
Alternative splicing (AS) generates isoform diversity for cellular identity and homeostasis in multicellular life. Although AS variation has been observed among single cells, little is known about the biological or evolutionary significance of such variation. We developed Expedition, a computational framework consisting of outrigger, a de novo splice graph transversal algorithm to detect AS; anchor, a Bayesian approach to assign modalities; and bonvoyage, a visualization tool using non-negative matrix factorization to display modality changes. Applying Expedition to single pluripotent stem cells undergoing neuronal differentiation, we discover that up to 20% of AS exons exhibit bimodality. Bimodal exons are flanked by more conserved intronic sequences harboring distinct cis-regulatory motifs, constitute much of cell-type-specific splicing, are highly dynamic during cellular transitions, preserve reading frame, and reveal intricacy of cell states invisible to conventional gene expression analysis. Systematic AS characterization in single cells redefines our understanding of AS complexity in cell biology.
Collapse
Affiliation(s)
- Yan Song
- Department of Cellular and Molecular Medicine, Stem Cell Program and Institute for Genomic Medicine, University of California, San Diego, La Jolla, CA 92093, USA
| | - Olga B Botvinnik
- Department of Cellular and Molecular Medicine, Stem Cell Program and Institute for Genomic Medicine, University of California, San Diego, La Jolla, CA 92093, USA; Bioinformatics and Systems Biology Graduate Program, University of California, San Diego, La Jolla, CA 92093, USA
| | - Michael T Lovci
- Department of Cellular and Molecular Medicine, Stem Cell Program and Institute for Genomic Medicine, University of California, San Diego, La Jolla, CA 92093, USA; Biomedical Sciences Graduate Program, University of California, San Diego, La Jolla, CA 92093, USA
| | - Boyko Kakaradov
- Department of Cellular and Molecular Medicine, Stem Cell Program and Institute for Genomic Medicine, University of California, San Diego, La Jolla, CA 92093, USA; Bioinformatics and Systems Biology Graduate Program, University of California, San Diego, La Jolla, CA 92093, USA
| | - Patrick Liu
- Department of Cellular and Molecular Medicine, Stem Cell Program and Institute for Genomic Medicine, University of California, San Diego, La Jolla, CA 92093, USA
| | - Jia L Xu
- Department of Cellular and Molecular Medicine, Stem Cell Program and Institute for Genomic Medicine, University of California, San Diego, La Jolla, CA 92093, USA
| | - Gene W Yeo
- Department of Cellular and Molecular Medicine, Stem Cell Program and Institute for Genomic Medicine, University of California, San Diego, La Jolla, CA 92093, USA; Bioinformatics and Systems Biology Graduate Program, University of California, San Diego, La Jolla, CA 92093, USA; Biomedical Sciences Graduate Program, University of California, San Diego, La Jolla, CA 92093, USA; Department of Physiology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore 119077, Singapore; Molecular Engineering Laboratory, A(∗)STAR, Singapore 138632, Singapore.
| |
Collapse
|
13
|
Papastamoulis P, Rattray M. A Bayesian model selection approach for identifying differentially expressed transcripts from RNA sequencing data. J R Stat Soc Ser C Appl Stat 2017; 67:3-23. [PMID: 29353941 PMCID: PMC5763373 DOI: 10.1111/rssc.12213] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/10/2023]
Abstract
Recent advances in molecular biology allow the quantification of the transcriptome and scoring transcripts as differentially or equally expressed between two biological conditions. Although these two tasks are closely linked, the available inference methods treat them separately: a primary model is used to estimate expression and its output is post processed by using a differential expression model. In the paper, both issues are simultaneously addressed by proposing the joint estimation of expression levels and differential expression: the unknown relative abundance of each transcript can either be equal or not between two conditions. A hierarchical Bayesian model builds on the BitSeq framework and the posterior distribution of transcript expression and differential expression is inferred by using Markov chain Monte Carlo sampling. It is shown that the model proposed enjoys conjugacy for fixed dimension variables; thus the full conditional distributions are analytically derived. Two samplers are constructed, a reversible jump Markov chain Monte Carlo sampler and a collapsed Gibbs sampler, and the latter is found to perform better. A cluster representation of the aligned reads to the transcriptome is introduced, allowing parallel estimation of the marginal posterior distribution of subsets of transcripts under reasonable computing time. Under a fixed prior probability of differential expression the clusterwise sampler has the same marginal posterior distributions as the raw sampler, but a more general prior structure is also employed. The algorithm proposed is benchmarked against alternative methods by using synthetic data sets and applied to real RNA sequencing data. Source code is available on line from https://github.com/mqbssppe/cjBitSeq.
Collapse
|
14
|
Sankar A, Malone B, Bayliss SC, Pascoe B, Méric G, Hitchings MD, Sheppard SK, Feil EJ, Corander J, Honkela A. Bayesian identification of bacterial strains from sequencing data. Microb Genom 2016; 2:e000075. [PMID: 28348870 PMCID: PMC5320594 DOI: 10.1099/mgen.0.000075] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/03/2015] [Accepted: 06/20/2016] [Indexed: 11/23/2022] Open
Abstract
Rapidly assaying the diversity of a bacterial species present in a sample obtained from a hospital patient or an environmental source has become possible after recent technological advances in DNA sequencing. For several applications it is important to accurately identify the presence and estimate relative abundances of the target organisms from short sequence reads obtained from a sample. This task is particularly challenging when the set of interest includes very closely related organisms, such as different strains of pathogenic bacteria, which can vary considerably in terms of virulence, resistance and spread. Using advanced Bayesian statistical modelling and computation techniques we introduce a novel pipeline for bacterial identification that is shown to outperform the currently leading pipeline for this purpose. Our approach enables fast and accurate sequence-based identification of bacterial strains while using only modest computational resources. Hence it provides a useful tool for a wide spectrum of applications, including rapid clinical diagnostics to distinguish among closely related strains causing nosocomial infections. The software implementation is available at https://github.com/PROBIC/BIB.
Collapse
Affiliation(s)
- Aravind Sankar
- Helsinki Institute for Information Technology, Department of Computer Science, University of Helsinki, Helsinki, Finland
| | - Brandon Malone
- Helsinki Institute for Information Technology, Department of Computer Science, University of Helsinki, Helsinki, Finland
- German Centre for Cardiovascular Research DZHK, Klaus Tschira Institute for Integrative Computational Cardiology and Department of Internal Medicine III, University of Heidelberg, Germany
| | - Sion C. Bayliss
- Department of Biology and Biochemistry, University of Bath, UK
| | - Ben Pascoe
- Department of Biology and Biochemistry, University of Bath, UK
| | - Guillaume Méric
- Department of Biology and Biochemistry, University of Bath, UK
| | | | | | - Edward J. Feil
- Department of Biology and Biochemistry, University of Bath, UK
| | - Jukka Corander
- Helsinki Institute for Information Technology, Department of Mathematics and Statistics, University of Helsinki, Helsinki, Finland
- Department of Biostatistics, University of Oslo, Norway
| | - Antti Honkela
- Helsinki Institute for Information Technology, Department of Computer Science, University of Helsinki, Helsinki, Finland
| |
Collapse
|
15
|
Schuierer S, Roma G. The exon quantification pipeline (EQP): a comprehensive approach to the quantification of gene, exon and junction expression from RNA-seq data. Nucleic Acids Res 2016; 44:e132. [PMID: 27302131 PMCID: PMC5027495 DOI: 10.1093/nar/gkw538] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/15/2015] [Accepted: 06/04/2016] [Indexed: 01/24/2023] Open
Abstract
The quantification of transcriptomic features is the basis of the analysis of RNA-seq data. We present an integrated alignment workflow and a simple counting-based approach to derive estimates for gene, exon and exon–exon junction expression. In contrast to previous counting-based approaches, EQP takes into account only reads whose alignment pattern agrees with the splicing pattern of the features of interest. This leads to improved gene expression estimates as well as to the generation of exon counts that allow disambiguating reads between overlapping exons. Unlike other methods that quantify skipped introns, EQP offers a novel way to compute junction counts based on the agreement of the read alignments with the exons on both sides of the junction, thus providing a uniformly derived set of counts. We evaluated the performance of EQP on both simulated and real Illumina RNA-seq data and compared it with other quantification tools. Our results suggest that EQP provides superior gene expression estimates and we illustrate the advantages of EQP's exon and junction counts. The provision of uniformly derived high-quality counts makes EQP an ideal quantification tool for differential expression and differential splicing studies. EQP is freely available for download at https://github.com/Novartis/EQP-cluster.
Collapse
Affiliation(s)
- Sven Schuierer
- Novartis Institutes for Biomedical Research, CH-4056 Basel, Switzerland
| | - Guglielmo Roma
- Novartis Institutes for Biomedical Research, CH-4056 Basel, Switzerland
| |
Collapse
|
16
|
Lin Z, Li M, Sestan N, Zhao H. A Markov random field-based approach for joint estimation of differentially expressed genes in mouse transcriptome data. Stat Appl Genet Mol Biol 2016; 15:139-50. [PMID: 26926866 PMCID: PMC5587217 DOI: 10.1515/sagmb-2015-0070] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/15/2022]
Abstract
The statistical methodology developed in this study was motivated by our interest in studying neurodevelopment using the mouse brain RNA-Seq data set, where gene expression levels were measured in multiple layers in the somatosensory cortex across time in both female and male samples. We aim to identify differentially expressed genes between adjacent time points, which may provide insights on the dynamics of brain development. Because of the extremely small sample size (one male and female at each time point), simple marginal analysis may be underpowered. We propose a Markov random field (MRF)-based approach to capitalizing on the between layers similarity, temporal dependency and the similarity between sex. The model parameters are estimated by an efficient EM algorithm with mean field-like approximation. Simulation results and real data analysis suggest that the proposed model improves the power to detect differentially expressed genes than simple marginal analysis. Our method also reveals biologically interesting results in the mouse brain RNA-Seq data set.
Collapse
Affiliation(s)
- Zhixiang Lin
- Department of Statistics, Stanford University, Stanford, CA 94305, USA
- Computational Biology and Bioinformatics, Yale University, New Haven, CT 06511, USA
| | - Mingfeng Li
- Department of Neurobiology, Kavli Institute for Neuroscience, Yale School of Medicine, 06510 New Haven, CT, USA
| | - Nenad Sestan
- Department of Neurobiology, Kavli Institute for Neuroscience, Yale School of Medicine, 06510 New Haven, CT, USA
| | - Hongyu Zhao
- Department of Biostatistics, Yale School of Public Health, New Haven, Connecticut 06520, USA
- Department of Genetics, Yale School of Medicine, New Haven, Connecticut 06520, USA
| |
Collapse
|
17
|
PBSeq: Modeling base-level bias to estimate gene and isoform expression for RNA-seq data. INT J MACH LEARN CYB 2016. [DOI: 10.1007/s13042-016-0497-z] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/27/2022]
|
18
|
Nariai N, Kojima K, Mimori T, Kawai Y, Nagasaki M. A Bayesian approach for estimating allele-specific expression from RNA-Seq data with diploid genomes. BMC Genomics 2016; 17 Suppl 1:2. [PMID: 26818838 PMCID: PMC4895278 DOI: 10.1186/s12864-015-2295-5] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022] Open
Abstract
Background RNA-sequencing (RNA-Seq) has become a popular tool for transcriptome profiling in mammals. However, accurate estimation of allele-specific expression (ASE) based on alignments of reads to the reference genome is challenging, because it contains only one allele on a mosaic haploid genome. Even with the information of diploid genome sequences, precise alignment of reads to the correct allele is difficult because of the high-similarity between the corresponding allele sequences. Results We propose a Bayesian approach to estimate ASE from RNA-Seq data with diploid genome sequences. In the statistical framework, the haploid choice is modeled as a hidden variable and estimated simultaneously with isoform expression levels by variational Bayesian inference. Through the simulation data analysis, we demonstrate the effectiveness of the proposed approach in terms of identifying ASE compared to the existing approach. We also show that our approach enables better quantification of isoform expression levels compared to the existing methods, TIGAR2, RSEM and Cufflinks. In the real data analysis of the human reference lymphoblastoid cell line GM12878, some autosomal genes were identified as ASE genes, and skewed paternal X-chromosome inactivation in GM12878 was identified. Conclusions The proposed method, called ASE-TIGAR, enables accurate estimation of gene expression from RNA-Seq data in an allele-specific manner. Our results show the effectiveness of utilizing personal genomic information for accurate estimation of ASE. An implementation of our method is available at http://nagasakilab.csml.org/ase-tigar.
Collapse
Affiliation(s)
- Naoki Nariai
- Present address: Institute for Genomic Medicine, University of California, San Diego, 9500 Gilman Drive, La Jolla, 92093, California, USA. .,Department of Integrative Genomics, Tohoku Medical Megabank Organization, Tohoku University, 2-1 Seiryo-machi, Aoba-ku, Sendai, Miyagi, 980-8575, Japan.
| | - Kaname Kojima
- Department of Integrative Genomics, Tohoku Medical Megabank Organization, Tohoku University, 2-1 Seiryo-machi, Aoba-ku, Sendai, Miyagi, 980-8575, Japan.
| | - Takahiro Mimori
- Department of Integrative Genomics, Tohoku Medical Megabank Organization, Tohoku University, 2-1 Seiryo-machi, Aoba-ku, Sendai, Miyagi, 980-8575, Japan.
| | - Yosuke Kawai
- Department of Integrative Genomics, Tohoku Medical Megabank Organization, Tohoku University, 2-1 Seiryo-machi, Aoba-ku, Sendai, Miyagi, 980-8575, Japan.
| | - Masao Nagasaki
- Department of Integrative Genomics, Tohoku Medical Megabank Organization, Tohoku University, 2-1 Seiryo-machi, Aoba-ku, Sendai, Miyagi, 980-8575, Japan.
| |
Collapse
|
19
|
Hensman J, Papastamoulis P, Glaus P, Honkela A, Rattray M. Fast and accurate approximate inference of transcript expression from RNA-seq data. Bioinformatics 2015; 31:3881-9. [PMID: 26315907 PMCID: PMC4673974 DOI: 10.1093/bioinformatics/btv483] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/23/2015] [Accepted: 08/07/2015] [Indexed: 11/25/2022] Open
Abstract
Motivation: Assigning RNA-seq reads to their transcript of origin is a fundamental task in transcript expression estimation. Where ambiguities in assignments exist due to transcripts sharing sequence, e.g. alternative isoforms or alleles, the problem can be solved through probabilistic inference. Bayesian methods have been shown to provide accurate transcript abundance estimates compared with competing methods. However, exact Bayesian inference is intractable and approximate methods such as Markov chain Monte Carlo and Variational Bayes (VB) are typically used. While providing a high degree of accuracy and modelling flexibility, standard implementations can be prohibitively slow for large datasets and complex transcriptome annotations. Results: We propose a novel approximate inference scheme based on VB and apply it to an existing model of transcript expression inference from RNA-seq data. Recent advances in VB algorithmics are used to improve the convergence of the algorithm beyond the standard Variational Bayes Expectation Maximization algorithm. We apply our algorithm to simulated and biological datasets, demonstrating a significant increase in speed with only very small loss in accuracy of expression level estimation. We carry out a comparative study against seven popular alternative methods and demonstrate that our new algorithm provides excellent accuracy and inter-replicate consistency while remaining competitive in computation time. Availability and implementation: The methods were implemented in R and C++, and are available as part of the BitSeq project at github.com/BitSeq. The method is also available through the BitSeq Bioconductor package. The source code to reproduce all simulation results can be accessed via github.com/BitSeq/BitSeqVB_benchmarking. Contact:james.hensman@sheffield.ac.uk or panagiotis.papastamoulis@manchester.ac.uk or Magnus.Rattray@manchester.ac.uk Supplementary information:Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- James Hensman
- Sheffield Institute for Translational Neuroscience (SITraN), Sheffield, UK
| | | | - Peter Glaus
- School of Computer Science, The University of Manchester, Manchester, UK and
| | - Antti Honkela
- Helsinki Institute for Information Technology (HIIT), Department of Computer Science, University of Helsinki, Helsinki, Finland
| | | |
Collapse
|
20
|
Kanitz A, Gypas F, Gruber AJ, Gruber AR, Martin G, Zavolan M. Comparative assessment of methods for the computational inference of transcript isoform abundance from RNA-seq data. Genome Biol 2015. [PMID: 26201343 PMCID: PMC4511015 DOI: 10.1186/s13059-015-0702-5] [Citation(s) in RCA: 99] [Impact Index Per Article: 9.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open
Abstract
Background Understanding the regulation of gene expression, including transcription start site usage, alternative splicing, and polyadenylation, requires accurate quantification of expression levels down to the level of individual transcript isoforms. To comparatively evaluate the accuracy of the many methods that have been proposed for estimating transcript isoform abundance from RNA sequencing data, we have used both synthetic data as well as an independent experimental method for quantifying the abundance of transcript ends at the genome-wide level. Results We found that many tools have good accuracy and yield better estimates of gene-level expression compared to commonly used count-based approaches, but they vary widely in memory and runtime requirements. Nucleotide composition and intron/exon structure have comparatively little influence on the accuracy of expression estimates, which correlates most strongly with transcript/gene expression levels. To facilitate the reproduction and further extension of our study, we provide datasets, source code, and an online analysis tool on a companion website, where developers can upload expression estimates obtained with their own tool to compare them to those inferred by the methods assessed here. Conclusions As many methods for quantifying isoform abundance with comparable accuracy are available, a user’s choice will likely be determined by factors such as the memory and runtime requirements, as well as the availability of methods for downstream analyses. Sequencing-based methods to quantify the abundance of specific transcript regions could complement validation schemes based on synthetic data and quantitative PCR in future or ongoing assessments of RNA-seq analysis methods. Electronic supplementary material The online version of this article (doi:10.1186/s13059-015-0702-5) contains supplementary material, which is available to authorized users.
Collapse
Affiliation(s)
- Alexander Kanitz
- Biozentrum, University of Basel and Swiss Institute of Bioinformatics, Basel, Switzerland.
| | - Foivos Gypas
- Biozentrum, University of Basel and Swiss Institute of Bioinformatics, Basel, Switzerland.
| | - Andreas J Gruber
- Biozentrum, University of Basel and Swiss Institute of Bioinformatics, Basel, Switzerland.
| | - Andreas R Gruber
- Biozentrum, University of Basel and Swiss Institute of Bioinformatics, Basel, Switzerland.
| | - Georges Martin
- Biozentrum, University of Basel and Swiss Institute of Bioinformatics, Basel, Switzerland.
| | - Mihaela Zavolan
- Biozentrum, University of Basel and Swiss Institute of Bioinformatics, Basel, Switzerland.
| |
Collapse
|
21
|
Oh S. How are Bayesian and Non-Parametric Methods Doing a Great Job in RNA-Seq Differential Expression Analysis? : A Review. COMMUNICATIONS FOR STATISTICAL APPLICATIONS AND METHODS 2015. [DOI: 10.5351/csam.2015.22.2.181] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/20/2022]
Affiliation(s)
- Sunghee Oh
- Department of Veterinary Medicine, Jeju National University, Korea
| |
Collapse
|
22
|
Estimating copy numbers of alleles from population-scale high-throughput sequencing data. BMC Bioinformatics 2015; 16 Suppl 1:S4. [PMID: 25707811 PMCID: PMC4331703 DOI: 10.1186/1471-2105-16-s1-s4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/30/2022] Open
Abstract
Background With the recent development of microarray and high-throughput sequencing (HTS) technologies, a number of studies have revealed catalogs of copy number variants (CNVs) and their association with phenotypes and complex traits. In parallel, a number of approaches to predict CNV regions and genotypes are proposed for both microarray and HTS data. However, only a few approaches focus on haplotyping of CNV loci. Results We propose a novel approach to infer copy unit alleles and their numbers in each sample simultaneously from population-scale HTS data by variational Bayesian inference on a generative probabilistic model inspired by latent Dirichlet allocation, which is a well studied model for document classification problems. In simulation studies, we evaluated concordance between inferred and true copy unit alleles for lower-, middle-, and higher-copy number dataset, in which precision and recall were ≥ 0.9 for data with mean coverage ≥ 10× per copy unit. We also applied the approach to HTS data of 1123 samples at highly variable salivary amylase gene locus and a pseudogene locus, and confirmed consistency of the estimated alleles within samples belonging to a trio of CEPH/Utah pedigree 1463 with 11 offspring. Conclusions Our proposed approach enables detailed analysis of copy number variations, such as association study between copy unit alleles and phenotypes or biological features including human diseases.
Collapse
|
23
|
Nariai N, Kojima K, Saito S, Mimori T, Sato Y, Kawai Y, Yamaguchi-Kabata Y, Yasuda J, Nagasaki M. HLA-VBSeq: accurate HLA typing at full resolution from whole-genome sequencing data. BMC Genomics 2015; 16 Suppl 2:S7. [PMID: 25708870 PMCID: PMC4331721 DOI: 10.1186/1471-2164-16-s2-s7] [Citation(s) in RCA: 68] [Impact Index Per Article: 6.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/12/2022] Open
Abstract
Background Human leucocyte antigen (HLA) genes play an important role in determining the outcome of organ transplantation and are linked to many human diseases. Because of the diversity and polymorphisms of HLA loci, HLA typing at high resolution is challenging even with whole-genome sequencing data. Results We have developed a computational tool, HLA-VBSeq, to estimate the most probable HLA alleles at full (8-digit) resolution from whole-genome sequence data. HLA-VBSeq simultaneously optimizes read alignments to HLA allele sequences and abundance of reads on HLA alleles by variational Bayesian inference. We show the effectiveness of the proposed method over other methods through the analysis of predicting HLA types for HLA class I (HLA-A, -B and -C) and class II (HLA-DQA1,-DQB1 and -DRB1) loci from the simulation data of various depth of coverage, and real sequencing data of human trio samples. Conclusions HLA-VBSeq is an efficient and accurate HLA typing method using high-throughput sequencing data without the need of primer design for HLA loci. Moreover, it does not assume any prior knowledge about HLA allele frequencies, and hence HLA-VBSeq is broadly applicable to human samples obtained from a genetically diverse population.
Collapse
|
24
|
Nariai N, Kojima K, Mimori T, Sato Y, Kawai Y, Yamaguchi-Kabata Y, Nagasaki M. TIGAR2: sensitive and accurate estimation of transcript isoform expression with longer RNA-Seq reads. BMC Genomics 2014; 15 Suppl 10:S5. [PMID: 25560536 PMCID: PMC4304212 DOI: 10.1186/1471-2164-15-s10-s5] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/24/2022] Open
Abstract
BACKGROUND High-throughput RNA sequencing (RNA-Seq) enables quantification and identification of transcripts at single-base resolution. Recently, longer sequence reads become available thanks to the development of new types of sequencing technologies as well as improvements in chemical reagents for the Next Generation Sequencers. Although several computational methods have been proposed for quantifying gene expression levels from RNA-Seq data, they are not sufficiently optimized for longer reads (e.g. >250 bp). RESULTS We propose TIGAR2, a statistical method for quantifying transcript isoforms from fixed and variable length RNA-Seq data. Our method models substitution, deletion, and insertion errors of sequencers based on gapped-alignments of reads to the reference cDNA sequences so that sensitive read-aligners such as Bowtie2 and BWA-MEM are effectively incorporated in our pipeline. Also, a heuristic algorithm is implemented in variational Bayesian inference for faster computation. We apply TIGAR2 to both simulation data and real data of human samples and evaluate performance of transcript quantification with TIGAR2 in comparison to existing methods. CONCLUSIONS TIGAR2 is a sensitive and accurate tool for quantifying transcript isoform abundances from RNA-Seq data. Our method performs better than existing methods for the fixed-length reads (100 bp, 250 bp, 500 bp, and 1000 bp of both single-end and paired-end) and variable-length reads, especially for reads longer than 250 bp.
Collapse
Affiliation(s)
- Naoki Nariai
- Department of Integrative Genomics, Tohoku Medical Megabank Organization, Tohoku University, 2-1 Seiryo-machi, Aoba-ku, Sendai, Miyagi, 980-8573, Japan
| | - Kaname Kojima
- Department of Integrative Genomics, Tohoku Medical Megabank Organization, Tohoku University, 2-1 Seiryo-machi, Aoba-ku, Sendai, Miyagi, 980-8573, Japan
| | - Takahiro Mimori
- Department of Integrative Genomics, Tohoku Medical Megabank Organization, Tohoku University, 2-1 Seiryo-machi, Aoba-ku, Sendai, Miyagi, 980-8573, Japan
| | - Yukuto Sato
- Department of Integrative Genomics, Tohoku Medical Megabank Organization, Tohoku University, 2-1 Seiryo-machi, Aoba-ku, Sendai, Miyagi, 980-8573, Japan
| | - Yosuke Kawai
- Department of Integrative Genomics, Tohoku Medical Megabank Organization, Tohoku University, 2-1 Seiryo-machi, Aoba-ku, Sendai, Miyagi, 980-8573, Japan
| | - Yumi Yamaguchi-Kabata
- Department of Integrative Genomics, Tohoku Medical Megabank Organization, Tohoku University, 2-1 Seiryo-machi, Aoba-ku, Sendai, Miyagi, 980-8573, Japan
| | - Masao Nagasaki
- Department of Integrative Genomics, Tohoku Medical Megabank Organization, Tohoku University, 2-1 Seiryo-machi, Aoba-ku, Sendai, Miyagi, 980-8573, Japan
| |
Collapse
|
25
|
Papastamoulis P, Hensman J, Glaus P, Rattray M. Improved variational Bayes inference for transcript expression estimation. Stat Appl Genet Mol Biol 2014; 13:203-16. [PMID: 24413218 DOI: 10.1515/sagmb-2013-0054] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/15/2022]
Abstract
RNA-seq studies allow for the quantification of transcript expression by aligning millions of short reads to a reference genome. However, transcripts share much of their sequence, so that many reads map to more than one place and their origin remains uncertain. This problem can be dealt using mixtures of distributions and transcript expression reduces to estimating the weights of the mixture. In this paper, variational Bayesian (VB) techniques are used in order to approximate the posterior distribution of transcript expression. VB has previously been shown to be more computationally efficient for this problem than Markov chain Monte Carlo. VB methodology can precisely estimate the posterior means, but leads to variance underestimation. For this reason, a novel approach is introduced which integrates the latent allocation variables out of the VB approximation. It is shown that this modification leads to a better marginal likelihood bound and improved estimate of the posterior variance. A set of simulation studies and application to real RNA-seq datasets highlight the improved performance of the proposed method.
Collapse
|
26
|
Williams AG, Thomas S, Wyman SK, Holloway AK. RNA-seq Data: Challenges in and Recommendations for Experimental Design and Analysis. ACTA ACUST UNITED AC 2014; 83:11.13.1-20. [PMID: 25271838 DOI: 10.1002/0471142905.hg1113s83] [Citation(s) in RCA: 59] [Impact Index Per Article: 5.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/09/2022]
Abstract
RNA-seq is widely used to determine differential expression of genes or transcripts as well as identify novel transcripts, identify allele-specific expression, and precisely measure translation of transcripts. Thoughtful experimental design and choice of analysis tools are critical to ensure high-quality data and interpretable results. Important considerations for experimental design include number of replicates, whether to collect paired-end or single-end reads, sequence length, and sequencing depth. Common analysis steps in all RNA-seq experiments include quality control, read alignment, assigning reads to genes or transcripts, and estimating gene or transcript abundance. Our aims are two-fold: to make recommendations for common components of experimental design and assess tool capabilities for each of these steps. We also test tools designed to detect differential expression, since this is the most widespread application of RNA-seq. We hope that these analyses will help guide those who are new to RNA-seq and will generate discussion about remaining needs for tool improvement and development.
Collapse
|
27
|
Fonseca NA, Marioni J, Brazma A. RNA-Seq gene profiling--a systematic empirical comparison. PLoS One 2014; 9:e107026. [PMID: 25268973 PMCID: PMC4182317 DOI: 10.1371/journal.pone.0107026] [Citation(s) in RCA: 58] [Impact Index Per Article: 5.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/17/2014] [Accepted: 08/06/2014] [Indexed: 11/18/2022] Open
Abstract
Accurately quantifying gene expression levels is a key goal of experiments using RNA-sequencing to assay the transcriptome. This typically requires aligning the short reads generated to the genome or transcriptome before quantifying expression of pre-defined sets of genes. Differences in the alignment/quantification tools can have a major effect upon the expression levels found with important consequences for biological interpretation. Here we address two main issues: do different analysis pipelines affect the gene expression levels inferred from RNA-seq data? And, how close are the expression levels inferred to the "true" expression levels? We evaluate fifty gene profiling pipelines in experimental and simulated data sets with different characteristics (e.g, read length and sequencing depth). In the absence of knowledge of the 'ground truth' in real RNAseq data sets, we used simulated data to assess the differences between the "true" expression and those reconstructed by the analysis pipelines. Even though this approach does not take into account all known biases present in RNAseq data, it still allows to estimate the accuracy of the gene expression values inferred by different analysis pipelines. The results show that i) overall there is a high correlation between the expression levels inferred by the best pipelines and the true quantification values; ii) the error in the estimated gene expression values can vary considerably across genes; and iii) a small set of genes have expression estimates with consistently high error (across data sets and methods). Finally, although the mapping software is important, the quantification method makes a greater difference to the results.
Collapse
Affiliation(s)
- Nuno A. Fonseca
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, United Kingdom
- * E-mail: (NF); (JM); (AB)
| | - John Marioni
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, United Kingdom
- * E-mail: (NF); (JM); (AB)
| | - Alvis Brazma
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, United Kingdom
- * E-mail: (NF); (JM); (AB)
| |
Collapse
|
28
|
Cho H, Davis J, Li X, Smith KS, Battle A, Montgomery SB. High-resolution transcriptome analysis with long-read RNA sequencing. PLoS One 2014; 9:e108095. [PMID: 25251678 PMCID: PMC4176000 DOI: 10.1371/journal.pone.0108095] [Citation(s) in RCA: 33] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/28/2014] [Accepted: 08/18/2014] [Indexed: 11/18/2022] Open
Abstract
RNA sequencing (RNA-seq) enables characterization and quantification of individual transcriptomes as well as detection of patterns of allelic expression and alternative splicing. Current RNA-seq protocols depend on high-throughput short-read sequencing of cDNA. However, as ongoing advances are rapidly yielding increasing read lengths, a technical hurdle remains in identifying the degree to which differences in read length influence various transcriptome analyses. In this study, we generated two paired-end RNA-seq datasets of differing read lengths (2×75 bp and 2×262 bp) for lymphoblastoid cell line GM12878 and compared the effect of read length on transcriptome analyses, including read-mapping performance, gene and transcript quantification, and detection of allele-specific expression (ASE) and allele-specific alternative splicing (ASAS) patterns. Our results indicate that, while the current long-read protocol is considerably more expensive than short-read sequencing, there are important benefits that can only be achieved with longer read length, including lower mapping bias and reduced ambiguity in assigning reads to genomic elements, such as mRNA transcript. We show that these benefits ultimately lead to improved detection of cis-acting regulatory and splicing variation effects within individuals.
Collapse
Affiliation(s)
- Hyunghoon Cho
- Department of Computer Science, Stanford University, Stanford, California, United States of America
| | - Joe Davis
- Department of Genetics, Stanford University, Stanford, California, United States of America
| | - Xin Li
- Department of Pathology, Stanford University, Stanford, California, United States of America
| | - Kevin S. Smith
- Department of Pathology, Stanford University, Stanford, California, United States of America
| | - Alexis Battle
- Department of Computer Science, Stanford University, Stanford, California, United States of America
- Department of Genetics, Stanford University, Stanford, California, United States of America
- * E-mail: (SBM); (AB)
| | - Stephen B. Montgomery
- Department of Computer Science, Stanford University, Stanford, California, United States of America
- Department of Genetics, Stanford University, Stanford, California, United States of America
- Department of Pathology, Stanford University, Stanford, California, United States of America
- * E-mail: (SBM); (AB)
| |
Collapse
|