1. Sullivan D, Hjörleifsson K, Swarna N, Oakes C, Holley G, Melsted P, Pachter L. Accurate quantification of nascent and mature RNAs from single-cell and single-nucleus RNA-seq. Nucleic Acids Res 2025; 53:gkae1137. [PMID: 39657125 PMCID: PMC11724275 DOI: 10.1093/nar/gkae1137] [Received: 02/06/2024] [Revised: 10/28/2024] [Accepted: 12/05/2024] [Indexed: 12/14/2024]
Abstract
In single-cell and single-nucleus RNA sequencing (RNA-seq), the coexistence of nascent (unprocessed) and mature (processed) messenger RNA (mRNA) poses challenges in accurate read mapping and the interpretation of count matrices. The traditional transcriptome reference, defining the "region of interest" in bulk RNA-seq, restricts its focus to mature mRNA transcripts. This restriction leads to two problems: reads originating outside of the "region of interest" are prone to mismapping within this region, and additionally, such external reads cannot be matched to specific transcript targets. Expanding the "region of interest" to encompass both nascent and mature mRNA transcript targets provides a more comprehensive framework for RNA-seq analysis. Here, we introduce the concept of distinguishing flanking k-mers (DFKs) to improve mapping of sequencing reads. We have developed an algorithm to identify DFKs, which serve as a sophisticated "background filter", enhancing the accuracy of mRNA quantification. This dual strategy of an expanded region of interest coupled with the use of DFKs enhances the precision in quantifying both mature and nascent mRNA molecules, as well as in delineating reads of ambiguous status.
Affiliation(s)
- Delaney K Sullivan
  - Division of Biology and Biological Engineering, California Institute of Technology, 1200 E California Blvd, Pasadena, CA 91125, USA
  - UCLA-Caltech Medical Scientist Training Program, David Geffen School of Medicine, University of California, Los Angeles, 885 Tiverton Drive, Los Angeles, CA 90095, USA
- Kristján Eldjárn Hjörleifsson
  - Department of Computing and Mathematical Sciences, California Institute of Technology, 1200 E California Blvd, Pasadena, CA 91125, USA
- Nikhila P Swarna
  - Division of Biology and Biological Engineering, California Institute of Technology, 1200 E California Blvd, Pasadena, CA 91125, USA
- Conrad Oakes
  - Division of Biology and Biological Engineering, California Institute of Technology, 1200 E California Blvd, Pasadena, CA 91125, USA
- Guillaume Holley
  - deCODE Genetics/Amgen Inc., Sturlugata 8, 101 Reykjavík, Iceland
- Páll Melsted
  - deCODE Genetics/Amgen Inc., Sturlugata 8, 101 Reykjavík, Iceland
  - School of Engineering and Natural Sciences, University of Iceland, Sæmundargata 2, 102 Reykjavík, Iceland
- Lior Pachter
  - Division of Biology and Biological Engineering, California Institute of Technology, 1200 E California Blvd, Pasadena, CA 91125, USA
  - Department of Computing and Mathematical Sciences, California Institute of Technology, 1200 E California Blvd, Pasadena, CA 91125, USA
2. Rich JM, Moses L, Einarsson PH, Jackson K, Luebbert L, Booeshaghi AS, Antonsson S, Sullivan DK, Bray N, Melsted P, Pachter L. The impact of package selection and versioning on single-cell RNA-seq analysis. bioRxiv 2024:2024.04.04.588111. [PMID: 38617255 PMCID: PMC11014608 DOI: 10.1101/2024.04.04.588111] [Indexed: 04/16/2024]
Abstract
Standard single-cell RNA-sequencing analysis (scRNA-seq) workflows consist of converting raw read data into cell-gene count matrices through sequence alignment, followed by analyses including filtering, highly variable gene selection, dimensionality reduction, clustering, and differential expression analysis. Seurat and Scanpy are the most widely-used packages implementing such workflows, and are generally thought to implement individual steps similarly. We investigate in detail the algorithms and methods underlying Seurat and Scanpy and find that there are, in fact, considerable differences in the outputs of Seurat and Scanpy. The extent of differences between the programs is approximately equivalent to the variability that would be introduced in benchmarking scRNA-seq datasets by sequencing less than 5% of the reads or analyzing less than 20% of the cell population. Additionally, distinct versions of Seurat and Scanpy can produce very different results, especially during parts of differential expression analysis. Our analysis highlights the need for users of scRNA-seq to carefully assess the tools on which they rely, and the importance of developers of scientific software to prioritize transparency, consistency, and reproducibility for their tools.
Affiliation(s)
- Joseph M Rich
  - Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, 91125, USA
  - USC-Caltech MD/PhD Program, Keck School of Medicine, Los Angeles, CA, 90033, USA
- Lambda Moses
  - Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, 91125, USA
- Pétur Helgi Einarsson
  - Faculty of Industrial Engineering, Mechanical Engineering and Computer Science, Reykjavík, Iceland
- Kayla Jackson
  - Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, 91125, USA
  - USC-Caltech MD/PhD Program, Keck School of Medicine, Los Angeles, CA, 90033, USA
- Laura Luebbert
  - Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, 91125, USA
- A. Sina Booeshaghi
  - Department of Bioengineering, University of California Berkeley, Berkeley, CA, USA
- Sindri Antonsson
  - Faculty of Industrial Engineering, Mechanical Engineering and Computer Science, Reykjavík, Iceland
- Delaney K. Sullivan
  - Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, 91125, USA
  - UCLA-Caltech Medical Scientist Training Program, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA, 90095, USA
- Páll Melsted
  - Faculty of Industrial Engineering, Mechanical Engineering and Computer Science, Reykjavík, Iceland
- Lior Pachter
  - Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, 91125, USA
  - Computing and Mathematical Sciences, California Institute of Technology, Pasadena, CA, 91125, USA
  - Lead Contact
3. Chamberlin JT, Lee Y, Marth GT, Quinlan AR. Differences in molecular sampling and data processing explain variation among single-cell and single-nucleus RNA-seq experiments. Genome Res 2024; 34:179-188. [PMID: 38355308 PMCID: PMC10984380 DOI: 10.1101/gr.278253.123] [Received: 07/07/2023] [Accepted: 02/01/2024] [Indexed: 02/16/2024]
Abstract
A mechanistic understanding of the biological and technical factors that impact transcript measurements is essential to designing and analyzing single-cell and single-nucleus RNA sequencing experiments. Nuclei contain the same pre-mRNA population as cells, but they contain a small subset of the mRNAs. Nonetheless, early studies argued that single-nucleus analysis yielded results comparable to cellular samples if pre-mRNA measurements were included. However, typical workflows do not distinguish between pre-mRNA and mRNA when estimating gene expression, and variation in their relative abundances across cell types has received limited attention. These gaps are especially important given that incorporating pre-mRNA has become commonplace for both assays, despite known gene length bias in pre-mRNA capture. Here, we reanalyze public data sets from mouse and human to describe the mechanisms and contrasting effects of mRNA and pre-mRNA sampling on gene expression and marker gene selection in single-cell and single-nucleus RNA-seq. We show that pre-mRNA levels vary considerably among cell types, which mediates the degree of gene length bias and limits the generalizability of a recently published normalization method intended to correct for this bias. As an alternative, we repurpose an existing post hoc gene length-based correction method from conventional RNA-seq gene set enrichment analysis. Finally, we show that inclusion of pre-mRNA in bioinformatic processing can impart a larger effect than assay choice itself, which is pivotal to the effective reuse of existing data. These analyses advance our understanding of the sources of variation in single-cell and single-nucleus RNA-seq experiments and provide useful guidance for future studies.
Affiliation(s)
- John T Chamberlin
  - Department of Biomedical Informatics, University of Utah, Salt Lake City, Utah 84108, USA
- Younghee Lee
  - Department of Biomedical Informatics, University of Utah, Salt Lake City, Utah 84108, USA
  - Seoul National University, College of Veterinary Medicine, Seoul, 08826, South Korea
- Gabor T Marth
  - Department of Human Genetics, Utah Center for Genetic Discovery, University of Utah, Salt Lake City, Utah 84112, USA
- Aaron R Quinlan
  - Department of Biomedical Informatics, University of Utah, Salt Lake City, Utah 84108, USA
  - Department of Human Genetics, Utah Center for Genetic Discovery, University of Utah, Salt Lake City, Utah 84112, USA
4. Maden SK, Kwon SH, Huuki-Myers LA, Collado-Torres L, Hicks SC, Maynard KR. Challenges and opportunities to computationally deconvolve heterogeneous tissue with varying cell sizes using single cell RNA-sequencing datasets. arXiv 2023:arXiv:2305.06501v1. [PMID: 37214135 PMCID: PMC10197733] [Indexed: 05/24/2023]
Abstract
Deconvolution of cell mixtures in "bulk" transcriptomic samples from homogenate human tissue is important for understanding the pathologies of diseases. However, several experimental and computational challenges remain in developing and implementing transcriptomics-based deconvolution approaches, especially those using a single cell/nuclei RNA-seq reference atlas, which are becoming rapidly available across many tissues. Notably, deconvolution algorithms are frequently developed using samples from tissues with similar cell sizes. However, brain tissue or immune cell populations have cell types with substantially different cell sizes, total mRNA expression, and transcriptional activity. When existing deconvolution approaches are applied to these tissues, these systematic differences in cell sizes and transcriptomic activity confound accurate cell proportion estimates and instead may quantify total mRNA content. Furthermore, there is a lack of standard reference atlases and computational approaches to facilitate integrative analyses, including not only bulk and single cell/nuclei RNA-seq data, but also new data modalities from spatial -omic or imaging approaches. New multi-assay datasets need to be collected with orthogonal data types generated from the same tissue block and the same individual, to serve as a "gold standard" for evaluating new and existing deconvolution methods. Below, we discuss these key challenges and how they can be addressed with the acquisition of new datasets and approaches to analysis.
Affiliation(s)
- Sean K Maden
  - Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA
- Sang Ho Kwon
  - Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, MD, USA
  - The Solomon H. Snyder Department of Neuroscience, Johns Hopkins School of Medicine, Baltimore, MD, USA
- Louise A Huuki-Myers
  - Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, MD, USA
- Stephanie C Hicks
  - Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA
  - Malone Center for Engineering in Healthcare, Johns Hopkins University, Baltimore, MD, USA
- Kristen R Maynard
  - Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, MD, USA
  - The Solomon H. Snyder Department of Neuroscience, Johns Hopkins School of Medicine, Baltimore, MD, USA
  - Department of Psychiatry and Behavioral Sciences, Johns Hopkins School of Medicine, Baltimore, MD, USA