1
Cuevas-Diaz Duran R, Wei H, Wu J. Data normalization for addressing the challenges in the analysis of single-cell transcriptomic datasets. BMC Genomics 2024; 25:444. [PMID: 38711017 PMCID: PMC11073985 DOI: 10.1186/s12864-024-10364-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/02/2023] [Accepted: 04/29/2024] [Indexed: 05/08/2024] Open
Abstract
BACKGROUND Normalization is a critical step in the analysis of single-cell RNA-sequencing (scRNA-seq) datasets. Its main goal is to make gene counts comparable within and between cells. To do so, normalization methods must account for technical and biological variability. Numerous normalization methods have been developed addressing different sources of dispersion and making specific assumptions about the count data. MAIN BODY The selection of a normalization method has a direct impact on downstream analysis, for example differential gene expression and cluster identification. Thus, the objective of this review is to guide the reader in making an informed decision on the most appropriate normalization method to use. To this end, we first give an overview of the different single-cell sequencing platforms and methods commonly used, including isolation and library preparation protocols. Next, we discuss the inherent sources of variability of scRNA-seq datasets. We describe the categories of normalization methods and include examples of each. We also delineate imputation and batch-effect correction methods. Furthermore, we describe data-driven metrics commonly used to evaluate the performance of normalization methods. We also discuss common scRNA-seq methods and toolkits used for integrated data analysis. CONCLUSIONS According to the correction performed, normalization methods can be broadly classified as within-sample and between-sample algorithms. Moreover, with respect to the mathematical model used, normalization methods can further be classified into global scaling methods, generalized linear models, mixed methods, and machine learning-based methods. Each of these methods has pros and cons and makes different statistical assumptions. However, no single normalization method performs best. Instead, metrics such as silhouette width, K-nearest neighbor batch-effect test, or Highly Variable Genes are recommended to assess the performance of normalization methods.
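As a rough illustration of the "global scaling" category named in this abstract, the sketch below scales every cell to the same total count and then log-transforms. It is not taken from the review: the function name, the target total of 10,000 and the toy matrix are assumptions chosen for illustration only.

```python
import numpy as np

def global_scaling_normalize(counts, scale=1e4):
    """Scale each cell to the same total count, then apply log(1 + x).

    counts: array of shape (n_cells, n_genes) holding raw counts.
    scale:  target total per cell (1e4 is an assumed, illustrative value).
    """
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    # Guard against empty cells so we never divide by zero.
    totals[totals == 0] = 1.0
    return np.log1p(counts / totals * scale)

# Two toy cells with the same composition but a 10x difference in depth.
raw = np.array([[10, 0, 90],
                [1, 0, 9]])
norm = global_scaling_normalize(raw)
```

After scaling, the two toy cells have identical profiles, which is the sense in which counts become comparable between cells. Methods in the other categories listed in the abstract make different modelling assumptions and are not covered by this sketch.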
Affiliation(s)
- Raquel Cuevas-Diaz Duran
  - Tecnologico de Monterrey, Escuela de Medicina y Ciencias de la Salud, Monterrey, Nuevo Leon, 64710, Mexico
- Haichao Wei
  - The Vivian L. Smith Department of Neurosurgery, McGovern Medical School, The University of Texas Health Science Center at Houston, Houston, TX, 77030, USA
  - Center for Stem Cell and Regenerative Medicine, UT Brown Foundation Institute of Molecular Medicine, Houston, TX, 77030, USA
- Jiaqian Wu
  - The Vivian L. Smith Department of Neurosurgery, McGovern Medical School, The University of Texas Health Science Center at Houston, Houston, TX, 77030, USA
  - Center for Stem Cell and Regenerative Medicine, UT Brown Foundation Institute of Molecular Medicine, Houston, TX, 77030, USA
  - MD Anderson Cancer Center UTHealth Graduate School of Biomedical Sciences, Houston, TX, 77030, USA
2
Jones EF, Haldar A, Oza VH, Lasseigne BN. Quantifying transcriptome diversity: a review. Brief Funct Genomics 2024; 23:83-94. [PMID: 37225889 PMCID: PMC11484519 DOI: 10.1093/bfgp/elad019] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 01/18/2023] [Revised: 04/14/2023] [Accepted: 05/05/2023] [Indexed: 05/26/2023] Open
Abstract
Following the central dogma of molecular biology, gene expression heterogeneity can aid in predicting and explaining the wide variety of protein products, functions and, ultimately, heterogeneity in phenotypes. There is currently overlapping terminology used to describe the types of diversity in gene expression profiles, and overlooking these nuances can misrepresent important biological information. Here, we describe transcriptome diversity as a measure of the heterogeneity in (1) the expression of all genes within a sample or a single gene across samples in a population (gene-level diversity) or (2) the isoform-specific expression of a given gene (isoform-level diversity). We first overview modulators and quantification of transcriptome diversity at the gene level. Then, we discuss the role alternative splicing plays in driving transcript isoform-level diversity and how it can be quantified. Additionally, we overview computational resources for calculating gene-level and isoform-level diversity for high-throughput sequencing data. Finally, we discuss future applications of transcriptome diversity. This review provides a comprehensive overview of how gene expression diversity arises, and how measuring it provides a more complete picture of heterogeneity across proteins, cells, tissues, organisms and species.
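One way to make "diversity" concrete is Shannon entropy over relative abundances. The abstract does not say which measures the review covers, so the choice of entropy, the function name and the toy values below are assumptions for illustration. The same function can be applied to all genes in a sample (gene-level) or to the isoforms of one gene (isoform-level).

```python
import numpy as np

def shannon_diversity(expression):
    """Shannon entropy (in bits) of a vector of non-negative abundances."""
    expr = np.asarray(expression, dtype=float)
    total = expr.sum()
    if total <= 0:
        return 0.0
    # Zero entries contribute nothing to the entropy, so drop them.
    p = expr[expr > 0] / total
    return float(-(p * np.log2(p)).sum())

# Four isoforms expressed equally: maximal diversity for four categories.
even = shannon_diversity([5, 5, 5, 5])
# One dominant isoform: no diversity.
single = shannon_diversity([20, 0, 0, 0])
```

Higher values mean expression is spread more evenly across genes or isoforms; a value of zero means a single gene or isoform accounts for all of the signal.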
Affiliation(s)
- Emma F Jones
  - The Department of Cell, Developmental and Integrative Biology, Heersink School of Medicine, The University of Alabama at Birmingham, Birmingham, AL, USA
- Anisha Haldar
  - The Department of Cell, Developmental and Integrative Biology, Heersink School of Medicine, The University of Alabama at Birmingham, Birmingham, AL, USA
- Vishal H Oza
  - The Department of Cell, Developmental and Integrative Biology, Heersink School of Medicine, The University of Alabama at Birmingham, Birmingham, AL, USA
- Brittany N Lasseigne
  - The Department of Cell, Developmental and Integrative Biology, Heersink School of Medicine, The University of Alabama at Birmingham, Birmingham, AL, USA
3
Zong L, Zhu Y, Jiang Y, Xia Y, Liu Q, Wang J, Gao S, Luo B, Yuan Y, Zhou J, Jiang S. An optimized workflow of full-length transcriptome sequencing for accurate fusion transcript identification. RNA Biol 2024; 21:122-131. [PMID: 39540613 PMCID: PMC11572239 DOI: 10.1080/15476286.2024.2425527] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Revised: 10/23/2024] [Accepted: 10/25/2024] [Indexed: 11/16/2024] Open
Abstract
Next-generation sequencing has revolutionized cancer genomics by enabling high-throughput mutation screening, yet detecting fusion genes reliably remains challenging. Long-read sequencing offers potential for accurate fusion transcript identification, though challenges persist. In this study, we present an optimized workflow using nanopore sequencing technology to precisely identify fusion transcripts. Our approach encompasses a tailored library preparation protocol, data processing, and a fusion gene analysis pipeline. We evaluated the performance using Universal Human Reference RNA and human adenocarcinoma cell lines. Our optimized nanopore sequencing workflow generated high-quality full-length transcriptome data characterized by an extended length distribution and comprehensive transcript coverage. Validation experiments confirmed novel fusion events with potential clinical relevance. Our protocol aims to mitigate biases and enhance accuracy, facilitating increased adoption in clinical diagnostics. Continued advancements in long-read sequencing promise deeper insights into fusion gene biology and improved cancer diagnostics.
Affiliation(s)
- Liang Zong
  - Department of Biology and Genetics, College of Life Sciences and Health, Wuhan University of Science and Technology, Wuhan, China
  - Wuhan BGI Technology Service Co. Ltd., BGI-Wuhan, Wuhan, China
- Yabing Zhu
  - BGI Tech Solutions Co. Ltd., BGI-Shenzhen, Shenzhen, China
- Yuan Jiang
  - Wuhan BGI Technology Service Co. Ltd., BGI-Wuhan, Wuhan, China
- Ying Xia
  - Wuhan BGI Technology Service Co. Ltd., BGI-Wuhan, Wuhan, China
- Qun Liu
  - Wuhan BGI Technology Service Co. Ltd., BGI-Wuhan, Wuhan, China
- Jing Wang
  - Wuhan BGI Technology Service Co. Ltd., BGI-Wuhan, Wuhan, China
- Song Gao
  - Wuhan BGI Technology Service Co. Ltd., BGI-Wuhan, Wuhan, China
- Bei Luo
  - Wuhan BGI Technology Service Co. Ltd., BGI-Wuhan, Wuhan, China
- Yongxian Yuan
  - BGI Tech Solutions Co. Ltd., BGI-Shenzhen, Shenzhen, China
- Jingjiao Zhou
  - Department of Biology and Genetics, College of Life Sciences and Health, Wuhan University of Science and Technology, Wuhan, China
- Sanjie Jiang
  - BGI Tech Solutions Co. Ltd., BGI-Shenzhen, Shenzhen, China
4
Davies P, Jones M, Liu J, Hebenstreit D. Anti-bias training for (sc)RNA-seq: experimental and computational approaches to improve precision. Brief Bioinform 2021; 22:6265204. [PMID: 33959753 PMCID: PMC8574610 DOI: 10.1093/bib/bbab148] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 01/08/2021] [Revised: 03/10/2021] [Accepted: 03/26/2021] [Indexed: 12/29/2022] Open
Abstract
RNA-seq, including single-cell RNA-seq (scRNA-seq), is plagued by insufficient sensitivity and lack of precision. As a result, the full potential of (sc)RNA-seq is limited. A major factor in this respect is the presence of global bias in most datasets, which affects detection and quantitation of RNA in a length-dependent fashion. In particular, scRNA-seq is affected by technical noise and a high rate of dropouts, where the vast majority of original transcripts are not converted into sequencing reads. We discuss the origins and implications of these biases, bioinformatics approaches to correct for them, and how biases can be exploited to infer characteristics of the sample preparation process, which in turn can be used to improve library preparation.
Affiliation(s)
- Philip Davies
  - Daniel Hebenstreit's Research Group, University of Warwick, CV4 7AL Coventry, UK
- Matt Jones
  - Daniel Hebenstreit's Research Group, University of Warwick, CV4 7AL Coventry, UK
- Juntai Liu
  - Physics Department, University of Warwick, CV4 7AL Coventry, UK
6
Archer N, Egan SA, Coffey TJ, Emes RD, Addis MF, Ward PN, Blanchard AM, Leigh JA. A Paradox in Bacterial Pathogenesis: Activation of the Local Macrophage Inflammasome Is Required for Virulence of Streptococcus uberis. Pathogens 2020; 9:997. [PMID: 33260788 PMCID: PMC7768481 DOI: 10.3390/pathogens9120997] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Received: 10/09/2020] [Revised: 11/12/2020] [Accepted: 11/26/2020] [Indexed: 12/12/2022] Open
Abstract
Streptococcus uberis is a common cause of intramammary infection and mastitis in dairy cattle. Unlike other mammary pathogens, S. uberis evades detection by mammary epithelial cells, and the host–pathogen interactions during early colonisation are poorly understood. Intramammary challenge of dairy cows with S. uberis (strain 0140J) or isogenic mutants lacking the surface-anchored serine protease, SUB1154, demonstrated that virulence was dependent on the presence and correct location of this protein. Unlike the wild-type strain, the mutant lacking SUB1154 failed to elicit IL-1β from ex vivo CD14+ cells obtained from milk (bovine mammary macrophages, BMM), but this response was reinstated by complementation with recombinant SUB1154; the protein in isolation elicited no response. Production of IL-1β was ablated in the presence of various inhibitors, indicating dependency on internalisation and activation of NLRP3 and caspase-1, consistent with inflammasome activation. Similar transcriptomic changes were detected in ex vivo BMM in response to the wild-type or the SUB1154 deletion mutant, consistent with S. uberis priming BMM, enabling the SUB1154 protein to activate inflammasome maturation in a transcriptionally independent manner. These data can be reconciled in a novel model of pathogenesis in which, paradoxically, early colonisation is dependent on the innate response to the initial infection.
Affiliation(s)
- Nathan Archer
  - School of Veterinary Medicine and Sciences, Sutton Bonington Campus, University of Nottingham, Loughborough LE12 5RD, UK
- Sharon A. Egan
  - School of Veterinary Medicine and Sciences, Sutton Bonington Campus, University of Nottingham, Loughborough LE12 5RD, UK
- Tracey J. Coffey
  - School of Veterinary Medicine and Sciences, Sutton Bonington Campus, University of Nottingham, Loughborough LE12 5RD, UK
- Richard D. Emes
  - School of Veterinary Medicine and Sciences, Sutton Bonington Campus, University of Nottingham, Loughborough LE12 5RD, UK
  - Advanced Data Analysis Centre, Sutton Bonington Campus, University of Nottingham, Loughborough LE12 5RD, UK
- M. Filippa Addis
  - Porto Conte Ricerche, 07041 Alghero, Italy
  - Dipartimento di Medicina Veterinaria, Università degli Studi di Milano, 20133 Milan, Italy
- Philip N. Ward
  - Division of Structural Biology, Nuffield Department of Medicine, University of Oxford, Oxford OX3 7BN, UK
- Adam M. Blanchard
  - School of Veterinary Medicine and Sciences, Sutton Bonington Campus, University of Nottingham, Loughborough LE12 5RD, UK
- James A. Leigh
  - School of Veterinary Medicine and Sciences, Sutton Bonington Campus, University of Nottingham, Loughborough LE12 5RD, UK
7
Gupta RK, Kuznicki J. Biological and Medical Importance of Cellular Heterogeneity Deciphered by Single-Cell RNA Sequencing. Cells 2020; 9:E1751. [PMID: 32707839 PMCID: PMC7463515 DOI: 10.3390/cells9081751] [Citation(s) in RCA: 32] [Impact Index Per Article: 6.4] [Received: 05/31/2020] [Revised: 07/15/2020] [Accepted: 07/20/2020] [Indexed: 01/01/2023] Open
Abstract
The present review discusses recent progress in single-cell RNA sequencing (scRNA-seq), which can describe cellular heterogeneity in various organs, bodily fluids, and pathologies (e.g., cancer and Alzheimer's disease). We outline scRNA-seq techniques that are suitable for investigating cellular heterogeneity that is present in cell populations with very high resolution of the transcriptomic landscape. We summarize scRNA-seq findings and applications of this technology to identify cell types, activity, and other features that are important for the function of different bodily organs. We discuss future directions for scRNA-seq techniques that can link gene expression, protein expression, cellular function, and their roles in pathology. We speculate on how the field could develop beyond its present limitations (e.g., performing scRNA-seq in situ and in vivo). Finally, we discuss the integration of machine learning and artificial intelligence with cutting-edge scRNA-seq technology, which could provide a strong basis for designing precision medicine and targeted therapy in the future.
Affiliation(s)
- Rishikesh Kumar Gupta
  - International Institute of Molecular and Cell Biology in Warsaw, Trojdena 4, 02-109 Warsaw, Poland
  - Postgraduate School of Molecular Medicine, Warsaw Medical University, 61 Żwirki i Wigury St., 02-091 Warsaw, Poland
- Jacek Kuznicki
  - International Institute of Molecular and Cell Biology in Warsaw, Trojdena 4, 02-109 Warsaw, Poland
8
Wu S, Zhang H, Fouladdel S, Li H, Keller E, Wicha MS, Omenn GS, Azizi E, Guan Y. Cellular, transcriptomic and isoform heterogeneity of breast cancer cell line revealed by full-length single-cell RNA sequencing. Comput Struct Biotechnol J 2020; 18:676-685. [PMID: 32257051 PMCID: PMC7114460 DOI: 10.1016/j.csbj.2020.03.005] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Received: 08/16/2019] [Revised: 01/28/2020] [Accepted: 03/11/2020] [Indexed: 12/13/2022] Open
Abstract
Tumor heterogeneity is generated through a combination of genetic and epigenetic mechanisms, the latter of which plays an important role in the generation of stem-like cells responsible for tumor formation and metastasis. Although the development of single-cell transcriptomic technologies holds promise to deconvolute this complexity, a number of these techniques have limitations, including drop-out and uneven coverage, which challenge the further delineation of tumor heterogeneity. We adopted deep and full-length single-cell RNA sequencing on Fluidigm's Polaris platform to reveal the cellular, transcriptomic, and isoform heterogeneity of SUM149, a triple-negative breast cancer (TNBC) cell line. We first validated the quality of the TNBC sequencing data, using sequencing data from the erythroleukemia K562 cell line as a control. We next scrutinized well-defined marker genes for cancer stem-like cells to identify different cell populations. We then profiled the isoform expression data to investigate the heterogeneity of alternative splicing patterns. Though classified as triple-negative breast cancer, the SUM149 stem cells show heterogeneous expression of marker receptors (ER, PR, and HER2) across the cells. We identified three cell populations that express patterns of stemness: epithelial-mesenchymal transition (EMT) cancer stem cells (CSCs), mesenchymal-epithelial transition (MET) CSCs and Dual-EMT-MET CSCs. These cells also manifested a high level of heterogeneity in alternative splicing patterns. For example, CSCs showed different expression patterns of the CD44v6 exon, as well as different levels of truncated EGFR transcripts, which may suggest different potentials for proliferation and invasion among cancer stem cells. Our study identified features of the landscape of previously underestimated cellular, transcriptomic, and isoform heterogeneity of cancer stem cells in triple-negative breast cancers.
Affiliation(s)
- Shaocheng Wu
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor 48109, MI, United States
  - Bioinformatics Graduate Program, University of British Columbia, 570 West 7th Avenue, V5Z 4S6 Vancouver, BC, Canada
  - Department of Molecular Oncology, British Columbia Cancer Research Centre, Vancouver, BC, Canada
- Hongjiu Zhang
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor 48109, MI, United States
  - Microsoft, Inc., Bellevue, WA, United States
- Shamileh Fouladdel
  - Comprehensive Cancer Center, University of Michigan, Ann Arbor 48109, MI, United States
- Hongyang Li
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor 48109, MI, United States
- Evan Keller
  - Comprehensive Cancer Center, University of Michigan, Ann Arbor 48109, MI, United States
  - Department of Urology, Biointerfaces Institute and Single Cell Spatial Analysis Program, University of Michigan, Ann Arbor 48109, MI, United States
- Max S. Wicha
  - Comprehensive Cancer Center, University of Michigan, Ann Arbor 48109, MI, United States
- Gilbert S. Omenn
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor 48109, MI, United States
- Ebrahim Azizi
  - Comprehensive Cancer Center, University of Michigan, Ann Arbor 48109, MI, United States
- Yuanfang Guan
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor 48109, MI, United States
9
Ozaki H, Hayashi T, Umeda M, Nikaido I. Millefy: visualizing cell-to-cell heterogeneity in read coverage of single-cell RNA sequencing datasets. BMC Genomics 2020; 21:177. [PMID: 32122302 PMCID: PMC7053140 DOI: 10.1186/s12864-020-6542-z] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 02/02/2019] [Accepted: 01/29/2020] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Read coverage of RNA sequencing data reflects gene expression and RNA processing events. Single-cell RNA sequencing (scRNA-seq) methods, particularly "full-length" ones, provide read coverage of many individual cells and have the potential to reveal cellular heterogeneity in RNA transcription and processing. However, visualization tools suited to highlighting cell-to-cell heterogeneity in read coverage are still lacking. RESULTS Here, we have developed Millefy, a tool for visualizing read coverage of scRNA-seq data in genomic contexts. Millefy is designed to show read coverage of all individual cells at once in genomic contexts and to highlight cell-to-cell heterogeneity in read coverage. By visualizing read coverage of all cells as a heat map and dynamically reordering cells based on diffusion maps, Millefy facilitates discovery of "local" region-specific, cell-to-cell heterogeneity in read coverage. We applied Millefy to scRNA-seq data sets of mouse embryonic stem cells and triple-negative breast cancers and showed variability of transcribed regions including antisense RNAs, 3′ UTR lengths, and enhancer RNA transcription. CONCLUSIONS Millefy simplifies the examination of cellular heterogeneity in RNA transcription and processing events using scRNA-seq data. Millefy is available as an R package (https://github.com/yuifu/millefy) and as a Docker image for use with Jupyter Notebook (https://hub.docker.com/r/yuifu/datascience-notebook-millefy).
Affiliation(s)
- Haruka Ozaki
  - Bioinformatics Laboratory, Faculty of Medicine, University of Tsukuba, 1-1-1 Tennodai, Tsukuba, Ibaraki, 305-8575 Japan
  - Center for Artificial Intelligence Research, University of Tsukuba, 1-1-1 Tennodai, Tsukuba, Ibaraki, 305-8577 Japan
  - Laboratory for Bioinformatics Research, RIKEN Center for Biosystems Dynamics Research, 2-1 Hirosawa, Wako, Saitama, 351-0198 Japan
- Tetsutaro Hayashi
  - Laboratory for Bioinformatics Research, RIKEN Center for Biosystems Dynamics Research, 2-1 Hirosawa, Wako, Saitama, 351-0198 Japan
- Mana Umeda
  - Laboratory for Bioinformatics Research, RIKEN Center for Biosystems Dynamics Research, 2-1 Hirosawa, Wako, Saitama, 351-0198 Japan
- Itoshi Nikaido
  - Laboratory for Bioinformatics Research, RIKEN Center for Biosystems Dynamics Research, 2-1 Hirosawa, Wako, Saitama, 351-0198 Japan
  - Bioinformatics Course, Master’s/Doctoral Program in Life Science Innovation, School of Integrative and Global Majors, University of Tsukuba, 1-1-1 Tennodai, Tsukuba, Ibaraki, 305-8575 Japan
10
Bai YL, Baddoo M, Flemington EK, Nakhoul HN, Liu YZ. Screen technical noise in single cell RNA sequencing data. Genomics 2020; 112:346-355. [PMID: 30802598 DOI: 10.1016/j.ygeno.2019.02.014] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 10/15/2018] [Revised: 01/20/2019] [Accepted: 02/20/2019] [Indexed: 12/12/2022]
Abstract
We proposed a data cleaning pipeline for single-cell (SC) RNA-seq data, in which we first screen genes (gene-wise screening) and then screen cell libraries (library-wise screening). Gene-wise screening is based on the expectation that, for a gene with low technical noise, the gene's count in a library will tend to increase as library size increases; this was tested using negative binomial regression of gene count (as dependent variable) against library size (as independent variable). Library-wise screening is based on the expectation that, in libraries with low technical noise, across-library correlations for housekeeping (HK) genes are higher than the correlations for non-housekeeping (NHK) genes. We removed those libraries whose mean pairwise correlation for HK genes is not significantly higher than that for NHK genes. We successfully applied the pipeline to two large SC RNA-seq datasets. The pipeline was also developed into an R package.
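The library-wise screening idea in this abstract can be sketched as follows. This is a simplified illustration and not the authors' pipeline or their R package: the function names are hypothetical, and the significance test described in the abstract is replaced here by a plain comparison of mean correlations, which is an assumption made to keep the example short.

```python
import numpy as np

def mean_pairwise_correlation(counts):
    """Mean Pearson correlation of each library with every other library.

    counts: array of shape (n_genes, n_libraries).
    """
    corr = np.corrcoef(np.asarray(counts, dtype=float), rowvar=False)
    n = corr.shape[0]
    # Drop the diagonal (self-correlation of 1) from each library's mean.
    return (corr.sum(axis=1) - 1.0) / (n - 1)

def flag_noisy_libraries(hk_counts, nhk_counts):
    """Flag libraries whose HK-gene correlation does not exceed the NHK one.

    Simplification: a direct comparison stands in for the significance
    test mentioned in the abstract.
    """
    hk = mean_pairwise_correlation(hk_counts)
    nhk = mean_pairwise_correlation(nhk_counts)
    return hk <= nhk

# Toy data: three libraries (columns), five genes (rows).
consistent = np.array([[1, 2, 3], [2, 4, 6], [3, 6, 9], [4, 8, 12], [5, 10, 15]])
discordant = np.array([[1, 2, 5], [2, 4, 4], [3, 6, 3], [4, 8, 2], [5, 10, 1]])
kept = flag_noisy_libraries(consistent, discordant)     # HK genes agree well
flagged = flag_noisy_libraries(discordant, consistent)  # HK genes agree poorly
```

In the toy data, libraries whose housekeeping genes correlate strongly across libraries are kept, and libraries where they do not are flagged for removal.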
Affiliation(s)
- Yu-Long Bai
  - Dept. of Global Biostatistics and Data Science, Tulane University School of Public Health and Tropical Medicine, United States
- Melody Baddoo
  - Dept. of Pathology, Tulane Cancer Center, Tulane University Health Sciences Center, United States
- Erik K Flemington
  - Dept. of Pathology, Tulane Cancer Center, Tulane University Health Sciences Center, United States
- Hani N Nakhoul
  - Dept. of Pathology, Tulane Cancer Center, Tulane University Health Sciences Center, United States
- Yao-Zhong Liu
  - Dept. of Global Biostatistics and Data Science, Tulane University School of Public Health and Tropical Medicine, United States
11
Dyer NP, Shahrezaei V, Hebenstreit D. LiBiNorm: an htseq-count analogue with improved normalisation of Smart-seq2 data and library preparation diagnostics. PeerJ 2019; 7:e6222. [PMID: 30740268 PMCID: PMC6366399 DOI: 10.7717/peerj.6222] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.8] [Received: 09/11/2018] [Accepted: 12/05/2018] [Indexed: 12/02/2022] Open
Abstract
Protocols for preparing RNA sequencing (RNA-seq) libraries, most prominently “Smart-seq” variations, introduce global biases that can have a significant impact on the quantification of gene expression levels. This global bias can lead to drastic over- or under-representation of RNA in a non-linear, length-dependent fashion due to enzymatic reactions during cDNA production. It is currently not corrected by any RNA-seq software, which mostly focuses on local bias in coverage along RNAs. This paper describes LiBiNorm, a simple command line program that mimics the popular htseq-count software and allows diagnostics, quantification, and global bias removal. LiBiNorm outputs gene expression data that has been normalized to correct for global bias introduced by the Smart-seq2 protocol. In addition, it produces data and several plots that allow insights into the experimental history underlying library preparation. The LiBiNorm package includes an R script that allows visualization of the main results. LiBiNorm is the first software application to correct for the global bias that is introduced by the Smart-seq2 protocol. It is freely downloadable at http://www2.warwick.ac.uk/fac/sci/lifesci/research/libinorm.
Affiliation(s)
- Nigel P Dyer
  - School of Life Sciences, University of Warwick, Coventry, UK
12
Sasagawa Y, Hayashi T, Nikaido I. Strategies for Converting RNA to Amplifiable cDNA for Single-Cell RNA Sequencing Methods. Adv Exp Med Biol 2019; 1129:1-17. [PMID: 30968357 DOI: 10.1007/978-981-13-6037-4_1] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Indexed: 02/01/2023]
Abstract
This review describes the features of molecular biology techniques for single-cell RNA sequencing (scRNA-seq), including methods developed in our laboratory. Existing scRNA-seq methods require the conversion of first-strand cDNA to amplifiable cDNA followed by whole-transcript amplification. There are three primary strategies for this conversion: poly-A tagging, template switching, and RNase H-DNA polymerase I-mediated second-strand cDNA synthesis for in vitro transcription. We discuss the merits and limitations of these strategies and describe our Reverse Transcription with Random Displacement Amplification technology that allows for direct first-strand cDNA amplification from RNA without the need for conversion to an amplifiable cDNA. We believe that this review provides all users of single-cell transcriptome technologies with an understanding of the relationship between the quantitative performance of various methods and their molecular features.
Affiliation(s)
- Yohei Sasagawa
  - Laboratory for Bioinformatics Research, RIKEN Center for Biosystems Dynamics Research, Wako, Saitama, Japan
- Tetsutaro Hayashi
  - Laboratory for Bioinformatics Research, RIKEN Center for Biosystems Dynamics Research, Wako, Saitama, Japan
- Itoshi Nikaido
  - Laboratory for Bioinformatics Research, RIKEN Center for Biosystems Dynamics Research, Wako, Saitama, Japan
13
Fu Y, Wu PH, Beane T, Zamore PD, Weng Z. Elimination of PCR duplicates in RNA-seq and small RNA-seq using unique molecular identifiers. BMC Genomics 2018; 19:531. [PMID: 30001700 PMCID: PMC6044086 DOI: 10.1186/s12864-018-4933-1] [Citation(s) in RCA: 103] [Impact Index Per Article: 14.7] [Received: 03/09/2018] [Accepted: 07/08/2018] [Indexed: 12/12/2022] Open
Abstract
BACKGROUND RNA-seq and small RNA-seq are powerful, quantitative tools to study gene regulation and function. Common high-throughput sequencing methods rely on polymerase chain reaction (PCR) to expand the starting material, but not every molecule amplifies equally, causing some to be overrepresented. Unique molecular identifiers (UMIs) can be used to distinguish undesirable PCR duplicates derived from a single molecule and identical but biologically meaningful reads from different molecules. RESULTS We have incorporated UMIs into RNA-seq and small RNA-seq protocols and developed tools to analyze the resulting data. Our UMIs contain stretches of random nucleotides whose lengths sufficiently capture diverse molecule species in both RNA-seq and small RNA-seq libraries generated from mouse testis. Our approach yields high-quality data while allowing unique tagging of all molecules in high-depth libraries. CONCLUSIONS Using simulated and real datasets, we demonstrate that our methods increase the reproducibility of RNA-seq and small RNA-seq data. Notably, we find that the amount of starting material and sequencing depth, but not the number of PCR cycles, determine PCR duplicate frequency. Finally, we show that computational removal of PCR duplicates based only on their mapping coordinates introduces substantial bias into data analysis.
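The contrast this abstract draws between UMI-aware and coordinate-only duplicate removal can be illustrated with a toy example. This is a minimal sketch and not the authors' tools: the read tuple layout, the function name and the example reads are hypothetical, and real pipelines also have to handle sequencing errors within the UMI, which is ignored here.

```python
def count_unique_molecules(reads, use_umi=True):
    """Count distinct molecules among aligned reads.

    reads: iterable of (chrom, position, strand, umi) tuples.
    With use_umi=False, reads are collapsed on mapping coordinates alone.
    """
    if use_umi:
        keys = {(chrom, pos, strand, umi) for chrom, pos, strand, umi in reads}
    else:
        keys = {(chrom, pos, strand) for chrom, pos, strand, _ in reads}
    return len(keys)

reads = [
    ("chr1", 100, "+", "AACG"),
    ("chr1", 100, "+", "AACG"),  # PCR duplicate: same coordinates, same UMI
    ("chr1", 100, "+", "TTGC"),  # distinct molecule at the same coordinates
    ("chr2", 500, "-", "AACG"),
]
with_umi = count_unique_molecules(reads, use_umi=True)
coords_only = count_unique_molecules(reads, use_umi=False)
```

Here the UMI-aware count keeps three molecules, while collapsing on coordinates alone keeps two, discarding a genuine molecule that happens to share its mapping position with another. That loss is the kind of bias the abstract attributes to coordinate-only removal.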
Collapse
Affiliation(s)
- Yu Fu
- Bioinformatics Program, Boston University, 44 Cummington Mall, Boston, MA, 02215, USA.,Program in Bioinformatics and Integrative Biology, University of Massachusetts Medical School, 368 Plantation Street, Worcester, MA, 01605, USA
| | - Pei-Hsuan Wu
- RNA Therapeutics Institute and Howard Hughes Medical Institute, University of Massachusetts Medical School, 368 Plantation Street, Worcester, MA, 01605, USA
| | - Timothy Beane
- RNA Therapeutics Institute and Howard Hughes Medical Institute, University of Massachusetts Medical School, 368 Plantation Street, Worcester, MA, 01605, USA
| | - Phillip D Zamore
- RNA Therapeutics Institute and Howard Hughes Medical Institute, University of Massachusetts Medical School, 368 Plantation Street, Worcester, MA, 01605, USA.
| | - Zhiping Weng
- Program in Bioinformatics and Integrative Biology, University of Massachusetts Medical School, 368 Plantation Street, Worcester, MA, 01605, USA
- Department of Biochemistry and Molecular Pharmacology, University of Massachusetts Medical School, 368 Plantation Street, Worcester, MA, 01605, USA.
| |
|
14
|
Hayashi T, Ozaki H, Sasagawa Y, Umeda M, Danno H, Nikaido I. Single-cell full-length total RNA sequencing uncovers dynamics of recursive splicing and enhancer RNAs. Nat Commun 2018; 9:619. [PMID: 29434199 PMCID: PMC5809388 DOI: 10.1038/s41467-018-02866-0] [Citation(s) in RCA: 170] [Impact Index Per Article: 24.3] [Received: 06/06/2017] [Accepted: 01/05/2018] [Indexed: 01/06/2023] Open
Abstract
Total RNA sequencing has been used to reveal poly(A) and non-poly(A) RNA expression, RNA processing and enhancer activity. To date, no method for full-length total RNA sequencing of single cells has been developed despite the potential of this technology for single-cell biology. Here we describe random displacement amplification sequencing (RamDA-seq), the first full-length total RNA-sequencing method for single cells. Compared with other methods, RamDA-seq shows high sensitivity to non-poly(A) RNA and near-complete full-length transcript coverage. Using RamDA-seq with differentiation time course samples of mouse embryonic stem cells, we reveal hundreds of dynamically regulated non-poly(A) transcripts, including histone transcripts and long noncoding RNA Neat1. Moreover, RamDA-seq profiles recursive splicing in >300-kb introns. RamDA-seq also detects enhancer RNAs and their cell type-specific activity in single cells. Taken together, we demonstrate that RamDA-seq could help investigate the dynamics of gene expression, RNA-processing events and transcriptional regulation in single cells.
Affiliation(s)
- Tetsutaro Hayashi
- Bioinformatics Research Unit, Advanced Center for Computing and Communication, RIKEN, 2-1 Hirosawa Wako, Saitama, 351-0198, Japan
| | - Haruka Ozaki
- Bioinformatics Research Unit, Advanced Center for Computing and Communication, RIKEN, 2-1 Hirosawa Wako, Saitama, 351-0198, Japan
| | - Yohei Sasagawa
- Bioinformatics Research Unit, Advanced Center for Computing and Communication, RIKEN, 2-1 Hirosawa Wako, Saitama, 351-0198, Japan
| | - Mana Umeda
- Bioinformatics Research Unit, Advanced Center for Computing and Communication, RIKEN, 2-1 Hirosawa Wako, Saitama, 351-0198, Japan
| | - Hiroki Danno
- Bioinformatics Research Unit, Advanced Center for Computing and Communication, RIKEN, 2-1 Hirosawa Wako, Saitama, 351-0198, Japan
| | - Itoshi Nikaido
- Bioinformatics Research Unit, Advanced Center for Computing and Communication, RIKEN, 2-1 Hirosawa Wako, Saitama, 351-0198, Japan.
- Single-cell Omics Research Unit, RIKEN Center for Developmental Biology, 2-1 Hirosawa Wako, Saitama, 351-0198, Japan.
| |
|
15
|
Wang L, Wang Y, Zang D, Sun Z, Yang C. Optimization of Poplar mRNA purification for transcriptome library construction. Acta Biochim Biophys Sin (Shanghai) 2018; 50:224-226. [PMID: 29206897 DOI: 10.1093/abbs/gmx130] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 08/17/2017] [Accepted: 11/15/2017] [Indexed: 11/14/2022] Open
Affiliation(s)
- Lina Wang
- State Key Laboratory of Tree Genetics and Breeding, Northeast Forestry University, Harbin 150040, China
| | - Yucheng Wang
- State Key Laboratory of Tree Genetics and Breeding, Northeast Forestry University, Harbin 150040, China
| | - Dandan Zang
- State Key Laboratory of Tree Genetics and Breeding, Northeast Forestry University, Harbin 150040, China
| | - Zhibo Sun
- State Key Laboratory of Tree Genetics and Breeding, Northeast Forestry University, Harbin 150040, China
| | - Chuanping Yang
- State Key Laboratory of Tree Genetics and Breeding, Northeast Forestry University, Harbin 150040, China
| |
|
16
|
Zhao C, Liu F, Pyle AM. An ultraprocessive, accurate reverse transcriptase encoded by a metazoan group II intron. RNA (New York, N.Y.) 2018; 24:183-195. [PMID: 29109157 PMCID: PMC5769746 DOI: 10.1261/rna.063479.117] [Citation(s) in RCA: 59] [Impact Index Per Article: 8.4] [Received: 08/10/2017] [Accepted: 10/31/2017] [Indexed: 05/24/2023]
Abstract
Group II introns and non-LTR retrotransposons encode a phylogenetically related family of highly processive reverse transcriptases (RTs) that are essential for mobility and persistence of these retroelements. Recent crystallographic studies on members of this RT family have revealed that they are structurally distinct from the retroviral RTs that are typically used in biotechnology. However, quantitative, structure-guided analysis of processivity, efficiency, and accuracy of this alternate RT family has been lacking. Here, we characterize the processivity of a group II intron maturase RT from Eubacterium rectale (E.r.), for which high-resolution structural information is available. We find that the E.r. maturase RT (MarathonRT) efficiently copies transcripts at least 10 kb in length and displays superior intrinsic RT processivity compared to commercial enzymes such as Superscript IV (SSIV). The elevated processivity of MarathonRT is at least partly mediated by a loop structure in the finger subdomain that acts as a steric guard (the α-loop). Additionally, we find that a positively charged secondary RNA binding site on the surface of the RT diminishes the primer utilization efficiency of the enzyme, and that reengineering of this surface enhances capabilities of the MarathonRT. Finally, using single-molecule sequencing, we show that the error frequency of MarathonRT is comparable to that of other high-performance RTs, such as SSIV, which were tested in parallel. Our results provide a structural framework for understanding the enhanced processivity of retroelement RTs, and they demonstrate the potential for engineering a powerful new generation of RT tools for application in biotechnology and research.
Affiliation(s)
- Chen Zhao
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520, USA
| | - Fei Liu
- Department of Molecular, Cellular, and Developmental Biology, Yale University, New Haven, Connecticut 06520, USA
- Howard Hughes Medical Institute, Chevy Chase, Maryland 20815, USA
| | - Anna Marie Pyle
- Department of Molecular, Cellular, and Developmental Biology, Yale University, New Haven, Connecticut 06520, USA
- Howard Hughes Medical Institute, Chevy Chase, Maryland 20815, USA
- Department of Chemistry, Yale University, New Haven, Connecticut 06520, USA
| |
|
17
|
Kietrys AM, Velema WA, Kool ET. Fingerprints of Modified RNA Bases from Deep Sequencing Profiles. J Am Chem Soc 2017; 139:17074-17081. [PMID: 29111692 PMCID: PMC5819333 DOI: 10.1021/jacs.7b07914] [Citation(s) in RCA: 30] [Impact Index Per Article: 3.8] [Indexed: 01/18/2023]
Abstract
Posttranscriptional modifications of RNA bases are not only found in many noncoding RNAs but have also recently been identified in coding (messenger) RNAs. They require complex and laborious methods to locate, and many still lack methods for localized detection. Here we test the ability of next-generation sequencing (NGS) to detect and distinguish between ten modified bases in synthetic RNAs. We compare ultradeep sequencing patterns of modified bases, including miscoding, insertions and deletions (indels), and truncations, to unmodified bases in the same contexts. The data show widely varied responses to modification, ranging from no response to high levels of mutations, insertions, deletions, and truncations. The patterns are distinct for several of the modifications, and suggest the future use of ultradeep sequencing as a fingerprinting strategy for locating and identifying modifications in cellular RNAs.
Affiliation(s)
- Anna M. Kietrys
- Department of Chemistry, Stanford University, Stanford, California 94305, United States
| | - Willem A. Velema
- Department of Chemistry, Stanford University, Stanford, California 94305, United States
| | - Eric T. Kool
- Department of Chemistry, Stanford University, Stanford, California 94305, United States
| |
|
18
|
Szkop KJ, Nobeli I. Untranslated Parts of Genes Interpreted: Making Heads or Tails of High-Throughput Transcriptomic Data via Computational Methods: Computational methods to discover and quantify isoforms with alternative untranslated regions. Bioessays 2017; 39. [PMID: 29052251 DOI: 10.1002/bies.201700090] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.3] [Received: 05/28/2017] [Revised: 09/12/2017] [Indexed: 01/07/2023]
Abstract
In this review we highlight the importance of defining the untranslated parts of transcripts, and present a number of computational approaches for the discovery and quantification of alternative transcription start and poly-adenylation events in high-throughput transcriptomic data. The fate of eukaryotic transcripts is closely linked to their untranslated regions, which are determined by the position at which transcription starts and ends at a genomic locus. Although the extent of alternative transcription starts and alternative poly-adenylation sites has been revealed by sequencing methods focused on the ends of transcripts, the application of these methods is not yet widely adopted by the community. We suggest that computational methods applied to standard high-throughput technologies are a useful, albeit less accurate, alternative to the expertise-demanding 5' and 3' sequencing and they are the only option for analysing legacy transcriptomic data. We review these methods here, focusing on technical challenges and arguing for the need to include better normalization of the data and more appropriate statistical models of the expected variation in the signal.
Affiliation(s)
- Krzysztof J Szkop
- Institute of Structural and Molecular Biology, Department of Biological Sciences Birkbeck, University of London, Malet Street, London WC1E 7HX, UK
| | - Irene Nobeli
- Institute of Structural and Molecular Biology, Department of Biological Sciences Birkbeck, University of London, Malet Street, London WC1E 7HX, UK
| |
|
19
|
Andrews TS, Hemberg M. Identifying cell populations with scRNASeq. Mol Aspects Med 2017; 59:114-122. [PMID: 28712804 DOI: 10.1016/j.mam.2017.07.002] [Citation(s) in RCA: 149] [Impact Index Per Article: 18.6] [Received: 05/12/2017] [Revised: 06/22/2017] [Accepted: 07/12/2017] [Indexed: 01/06/2023]
Abstract
Single-cell RNASeq (scRNASeq) has emerged as a powerful method for quantifying the transcriptome of individual cells. However, the data from scRNASeq experiments are often both noisy and high dimensional, making the computational analysis non-trivial. Here we provide an overview of different experimental protocols and the most popular methods for facilitating the computational analysis. We focus on approaches for identifying biologically important genes, projecting data into lower dimensions and clustering data into putative cell populations. Finally, we discuss approaches to validation and biological interpretation of the identified cell types or cell states.
Affiliation(s)
| | - Martin Hemberg
- Wellcome Trust Sanger Institute, Hinxton, Cambridgeshire, UK.
| |
|