1. Hozumi Y, Wei GW. Analyzing Single Cell RNA Sequencing with Topological Nonnegative Matrix Factorization. J Comput Appl Math 2024; 445:115842. [PMID: 38464901 PMCID: PMC10919214 DOI: 10.1016/j.cam.2024.115842] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 03/12/2024]
Abstract
Single-cell RNA sequencing (scRNA-seq) is a relatively new technology that has stimulated enormous interest in statistics, data science, and computational biology due to the high dimensionality, complexity, and large scale associated with scRNA-seq data. Nonnegative matrix factorization (NMF) offers a unique approach due to its meta-gene interpretation of resulting low-dimensional components. However, NMF approaches suffer from the lack of multiscale analysis. This work introduces two persistent Laplacian regularized NMF methods, namely, topological NMF (TNMF) and robust topological NMF (rTNMF). By employing a total of 12 datasets, we demonstrate that the proposed TNMF and rTNMF significantly outperform all other NMF-based methods. We have also utilized TNMF and rTNMF for the visualization of popular Uniform Manifold Approximation and Projection (UMAP) and t-distributed stochastic neighbor embedding (t-SNE).
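To make the idea of Laplacian-regularized NMF concrete, the following is a minimal sketch of classical graph-regularized NMF with multiplicative updates. It is not the authors' TNMF or rTNMF: it uses a single ordinary graph Laplacian rather than a persistent (multiscale) Laplacian, and the function name, the Gaussian affinity matrix, and all parameter values are assumptions chosen for illustration.

```python
import numpy as np

def graph_regularized_nmf(X, A, rank, lam=1.0, n_iter=200, seed=0, eps=1e-10):
    """Factor X (genes x cells) as U @ V.T with a graph penalty on V.

    A is a symmetric, nonnegative cell-cell affinity matrix (hypothetical input).
    Objective: ||X - U V^T||_F^2 + lam * trace(V^T (D - A) V), with D = diag(A 1).
    """
    rng = np.random.default_rng(seed)
    n_genes, n_cells = X.shape
    U = rng.random((n_genes, rank)) + eps
    V = rng.random((n_cells, rank)) + eps
    D = np.diag(A.sum(axis=1))
    history = []
    for _ in range(n_iter):
        # Multiplicative updates keep U and V nonnegative.
        U *= (X @ V) / (U @ (V.T @ V) + eps)
        V *= (X.T @ U + lam * (A @ V)) / (V @ (U.T @ U) + lam * (D @ V) + eps)
        resid = X - U @ V.T
        history.append(np.sum(resid ** 2) + lam * np.trace(V.T @ (D - A) @ V))
    return U, V, history

# Toy data: 30 genes x 20 cells, with a Gaussian affinity between cells.
rng = np.random.default_rng(1)
X = rng.random((30, 20))
dist2 = ((X.T[:, None, :] - X.T[None, :, :]) ** 2).sum(axis=2)
A = np.exp(-dist2 / dist2.mean())
np.fill_diagonal(A, 0.0)

U, V, history = graph_regularized_nmf(X, A, rank=3)
```

Here the columns of `U` play the role of meta-genes and the rows of `V` are low-dimensional cell embeddings; a multiscale variant would replace the single affinity matrix with a combination built at several scales.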
Affiliation(s)
- Yuta Hozumi
  - Department of Mathematics, Michigan State University, East Lansing, MI 48824, USA
- Guo-Wei Wei
  - Department of Mathematics, Michigan State University, East Lansing, MI 48824, USA
  - Department of Electrical and Computer Engineering, Michigan State University, East Lansing, MI 48824, USA
  - Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI 48824, USA
2. de Langen P, Ballester B. MUFFIN: a suite of tools for the analysis of functional sequencing data. NAR Genom Bioinform 2024; 6:lqae051. [PMID: 38745992 PMCID: PMC11091926 DOI: 10.1093/nargab/lqae051] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/12/2023] [Revised: 04/10/2024] [Accepted: 04/27/2024] [Indexed: 05/16/2024]
Abstract
The large diversity of functional genomic assays allows for the characterization of non-coding and coding events at the tissue level or at a single-cell resolution. However, this diversity also leads to protocol differences, widely varying sequencing depths, substantial disparities in sample sizes, and number of features. In this work, we have built a Python package, MUFFIN, which offers a wide variety of tools suitable for a broad range of genomic assays and brings many tools that were missing from the Python ecosystem. First, MUFFIN has specialized tools for the exploration of the non-coding regions of genomes, such as a function to identify consensus peaks in peak-called assays, as well as linking genomic regions to genes and performing Gene Set Enrichment Analyses. MUFFIN also possesses a robust and flexible count table processing pipeline, comprising normalization, count transformation, dimensionality reduction, Differential Expression, and clustering. Our tools were tested on three widely different scRNA-seq, ChIP-seq and ATAC-seq datasets. MUFFIN integrates with the popular Scanpy ecosystem and is available on Conda and at https://github.com/pdelangen/Muffin.
3. Huuki-Myers LA, Spangler A, Eagles NJ, Montgomery KD, Kwon SH, Guo B, Grant-Peters M, Divecha HR, Tippani M, Sriworarat C, Nguyen AB, Ravichandran P, Tran MN, Seyedian A, Hyde TM, Kleinman JE, Battle A, Page SC, Ryten M, Hicks SC, Martinowich K, Collado-Torres L, Maynard KR. A data-driven single-cell and spatial transcriptomic map of the human prefrontal cortex. Science 2024; 384:eadh1938. [PMID: 38781370 DOI: 10.1126/science.adh1938] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/17/2023] [Accepted: 12/06/2023] [Indexed: 05/25/2024]
Abstract
The molecular organization of the human neocortex historically has been studied in the context of its histological layers. However, emerging spatial transcriptomic technologies have enabled unbiased identification of transcriptionally defined spatial domains that move beyond classic cytoarchitecture. We used the Visium spatial gene expression platform to generate a data-driven molecular neuroanatomical atlas across the anterior-posterior axis of the human dorsolateral prefrontal cortex. Integration with paired single-nucleus RNA-sequencing data revealed distinct cell type compositions and cell-cell interactions across spatial domains. Using PsychENCODE and publicly available data, we mapped the enrichment of cell types and genes associated with neuropsychiatric disorders to discrete spatial domains.
Affiliation(s)
- Louise A Huuki-Myers
  - Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, MD 21205, USA
- Abby Spangler
  - Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, MD 21205, USA
- Nicholas J Eagles
  - Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, MD 21205, USA
- Kelsey D Montgomery
  - Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, MD 21205, USA
- Sang Ho Kwon
  - Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, MD 21205, USA
  - The Solomon H. Snyder Department of Neuroscience, Johns Hopkins School of Medicine, Baltimore, MD 21205, USA
- Boyi Guo
  - Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD 21205, USA
- Melissa Grant-Peters
  - Genetics and Genomic Medicine, Great Ormond Street Institute of Child Health, University College London, London WC1N 1EH, UK
  - Aligning Science Across Parkinson's (ASAP) Collaborative Research Network, Chevy Chase, MD 20815, USA
- Heena R Divecha
  - Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, MD 21205, USA
- Madhavi Tippani
  - Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, MD 21205, USA
- Chaichontat Sriworarat
  - Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, MD 21205, USA
  - The Solomon H. Snyder Department of Neuroscience, Johns Hopkins School of Medicine, Baltimore, MD 21205, USA
- Annie B Nguyen
  - Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, MD 21205, USA
- Prashanthi Ravichandran
  - Department of Biomedical Engineering, Johns Hopkins School of Medicine, Baltimore, MD 21218, USA
- Matthew N Tran
  - Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, MD 21205, USA
- Arta Seyedian
  - Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, MD 21205, USA
- Thomas M Hyde
  - Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, MD 21205, USA
  - Department of Psychiatry and Behavioral Sciences, Johns Hopkins School of Medicine, Baltimore, MD 21205, USA
  - Department of Neurology, Johns Hopkins School of Medicine, Baltimore, MD 21205, USA
- Joel E Kleinman
  - Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, MD 21205, USA
  - Department of Psychiatry and Behavioral Sciences, Johns Hopkins School of Medicine, Baltimore, MD 21205, USA
- Alexis Battle
  - Department of Biomedical Engineering, Johns Hopkins School of Medicine, Baltimore, MD 21218, USA
  - Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218, USA
  - Department of Genetic Medicine, Johns Hopkins School of Medicine, Baltimore, MD 21205, USA
  - Malone Center for Engineering in Healthcare, Johns Hopkins University, Baltimore, MD 21218, USA
- Stephanie C Page
  - Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, MD 21205, USA
  - Department of Psychiatry and Behavioral Sciences, Johns Hopkins School of Medicine, Baltimore, MD 21205, USA
- Mina Ryten
  - Genetics and Genomic Medicine, Great Ormond Street Institute of Child Health, University College London, London WC1N 1EH, UK
  - Aligning Science Across Parkinson's (ASAP) Collaborative Research Network, Chevy Chase, MD 20815, USA
  - NIHR Great Ormond Street Hospital Biomedical Research Centre, University College London, London WC1N 1EH, UK
- Stephanie C Hicks
  - Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD 21205, USA
  - Department of Biomedical Engineering, Johns Hopkins School of Medicine, Baltimore, MD 21218, USA
  - Malone Center for Engineering in Healthcare, Johns Hopkins University, Baltimore, MD 21218, USA
  - Center for Computational Biology, Johns Hopkins University, Baltimore, MD 21205, USA
- Keri Martinowich
  - Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, MD 21205, USA
  - The Solomon H. Snyder Department of Neuroscience, Johns Hopkins School of Medicine, Baltimore, MD 21205, USA
  - Department of Psychiatry and Behavioral Sciences, Johns Hopkins School of Medicine, Baltimore, MD 21205, USA
  - Johns Hopkins Kavli Neuroscience Discovery Institute, Johns Hopkins University, Baltimore, MD 21218, USA
- Leonardo Collado-Torres
  - Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, MD 21205, USA
  - Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD 21205, USA
  - Center for Computational Biology, Johns Hopkins University, Baltimore, MD 21205, USA
- Kristen R Maynard
  - Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, MD 21205, USA
  - The Solomon H. Snyder Department of Neuroscience, Johns Hopkins School of Medicine, Baltimore, MD 21205, USA
  - Department of Psychiatry and Behavioral Sciences, Johns Hopkins School of Medicine, Baltimore, MD 21205, USA
4. Gao Q, Ji Z, Wang L, Owzar K, Li QJ, Chan C, Xie J. SifiNet: a robust and accurate method to identify feature gene sets and annotate cells. Nucleic Acids Res 2024; 52:e46. [PMID: 38647069 PMCID: PMC11109959 DOI: 10.1093/nar/gkae307] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/10/2023] [Revised: 03/25/2024] [Accepted: 04/14/2024] [Indexed: 04/25/2024]
Abstract
SifiNet is a robust and accurate computational pipeline for identifying distinct gene sets, extracting and annotating cellular subpopulations, and elucidating intrinsic relationships among these subpopulations. Uniquely, SifiNet bypasses the cell clustering stage, commonly integrated into other cellular annotation pipelines, thereby circumventing potential inaccuracies in clustering that may compromise subsequent analyses. Consequently, SifiNet has demonstrated superior performance in multiple experimental datasets compared with other state-of-the-art methods. SifiNet can analyze both single-cell RNA and ATAC sequencing data, thereby rendering comprehensive multi-omic cellular profiles. It is conveniently available as an open-source R package.
Affiliation(s)
- Qi Gao
  - Department of Biostatistics and Bioinformatics, Duke University, USA
- Zhicheng Ji
  - Department of Biostatistics and Bioinformatics, Duke University, USA
- Liuyang Wang
  - Department of Molecular Genetics and Microbiology, Duke University, USA
- Kouros Owzar
  - Department of Biostatistics and Bioinformatics, Duke University, USA
- Qi-Jing Li
  - Institute of Molecular and Cell Biology, Agency for Science, Technology and Research, Singapore
  - Singapore Immunology Network, Agency for Science, Technology and Research, Singapore
- Cliburn Chan
  - Department of Biostatistics and Bioinformatics, Duke University, USA
- Jichun Xie
  - Department of Biostatistics and Bioinformatics, Duke University, USA
  - Department of Mathematics, Duke University, USA
5. Pilcher WC, Yao L, Gonzalez-Kozlova E, Pita-Juarez Y, Karagkouni D, Acharya CR, Michaud ME, Hamilton M, Nanda S, Song Y, Sato K, Wang JT, Satpathy S, Ma Y, Schulman J, D'Souza D, Jayasinghe RG, Cheloni G, Bakhtiari M, Pabustan N, Nie K, Foltz JA, Saldarriaga I, Alaaeldin R, Lepisto E, Chen R, Fiala MA, Thomas BE, Cook A, Dos Santos JV, Chiang IL, Figueiredo I, Fortier J, Slade M, Oh ST, Rettig MP, Anderson E, Li Y, Dasari S, Strausbauch MA, Simon VA, Rahman AH, Chen Z, Lagana A, DiPersio JF, Rosenblatt J, Kim-Schulze S, Dhodapkar MV, Lonial S, Kumar S, Bhasin SS, Kourelis T, Vij R, Avigan D, Cho HJ, Mulligan G, Ding L, Gnjatic S, Vlachos IS, Bhasin M. A single-cell atlas characterizes dysregulation of the bone marrow immune microenvironment associated with outcomes in multiple myeloma. bioRxiv 2024:2024.05.15.593193. [PMID: 38798338 PMCID: PMC11118283 DOI: 10.1101/2024.05.15.593193] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/29/2024]
Abstract
Multiple Myeloma (MM) remains incurable despite advances in treatment options. Although tumor subtypes and specific DNA abnormalities are linked to worse prognosis, the impact of immune dysfunction on disease emergence and/or treatment sensitivity remains unclear. We established a harmonized consortium to generate an Immune Atlas of MM aimed at informing disease etiology, risk stratification, and potential therapeutic strategies. We generated a transcriptome profile of 1,149,344 single cells from the bone marrow of 263 newly diagnosed patients enrolled in the CoMMpass study and characterized immune and hematopoietic cell populations. Associating cell abundances and gene expression with disease progression revealed the presence of a proinflammatory immune senescence-associated secretory phenotype in rapidly progressing patients. Furthermore, signaling analyses suggested active intercellular communication involving APRIL-BCMA, potentially promoting tumor growth and survival. Finally, we demonstrate that integrating immune cell levels with genetic information can significantly improve patient stratification.
6. Singh A, Khiabanian H. Feature selection followed by a novel residuals-based normalization simplifies and improves single-cell gene expression analysis. bioRxiv 2024:2023.03.02.530891. [PMID: 38328133 PMCID: PMC10849523 DOI: 10.1101/2023.03.02.530891] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/09/2024]
Abstract
Normalization is a crucial step in the analysis of single-cell RNA-sequencing (scRNA-seq) counts data. Its principal objectives are to reduce the systematic biases primarily introduced through technical sources and to transform the data to make it more amenable for application of established statistical frameworks. In the standard workflows, normalization is followed by feature selection to identify highly variable genes (HVGs) that capture most of the biologically meaningful variation across the cells. Here, we make the case for a revised workflow by proposing a simple feature selection method and showing that we can perform feature selection before normalization by relying on observed counts. We highlight that the feature selection step can be used to not only select HVGs but to also identify stable genes. We further propose a novel variance stabilization transformation inclusive residuals-based normalization method that in fact relies on the stable genes to inform the reduction of systematic biases. We demonstrate significant improvements in downstream clustering analyses through the application of our proposed methods on biological truth-known as well as simulated counts datasets. We have implemented this novel workflow for analyzing high-throughput scRNA-seq data in an R package called Piccolo.
7. Cuevas-Diaz Duran R, Wei H, Wu J. Data normalization for addressing the challenges in the analysis of single-cell transcriptomic datasets. BMC Genomics 2024; 25:444. [PMID: 38711017 DOI: 10.1186/s12864-024-10364-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/02/2023] [Accepted: 04/29/2024] [Indexed: 05/08/2024]
Abstract
BACKGROUND: Normalization is a critical step in the analysis of single-cell RNA-sequencing (scRNA-seq) datasets. Its main goal is to make gene counts comparable within and between cells. To do so, normalization methods must account for technical and biological variability. Numerous normalization methods have been developed addressing different sources of dispersion and making specific assumptions about the count data. MAIN BODY: The selection of a normalization method has a direct impact on downstream analysis, for example differential gene expression and cluster identification. Thus, the objective of this review is to guide the reader in making an informed decision on the most appropriate normalization method to use. To this aim, we first give an overview of the different single cell sequencing platforms and methods commonly used including isolation and library preparation protocols. Next, we discuss the inherent sources of variability of scRNA-seq datasets. We describe the categories of normalization methods and include examples of each. We also delineate imputation and batch-effect correction methods. Furthermore, we describe data-driven metrics commonly used to evaluate the performance of normalization methods. We also discuss common scRNA-seq methods and toolkits used for integrated data analysis. CONCLUSIONS: According to the correction performed, normalization methods can be broadly classified as within-sample and between-sample algorithms. Moreover, with respect to the mathematical model used, normalization methods can further be classified into global scaling methods, generalized linear models, mixed methods, and machine learning-based methods. Each of these methods has pros and cons and makes different statistical assumptions. However, no single normalization method performs best. Instead, metrics such as silhouette width, K-nearest neighbor batch-effect test, or Highly Variable Genes are recommended to assess the performance of normalization methods.
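As an illustration of the simplest category named above, global scaling, the sketch below divides each cell's counts by its sequencing depth, rescales to a common total, and applies a log transform. The function name and the target total of 10,000 are assumptions for illustration; the review does not prescribe a specific implementation.

```python
import numpy as np

def global_scaling_normalize(counts, target_sum=1e4):
    """Scale each cell (row) to the same total count, then apply log1p.

    counts: cells x genes matrix of raw counts.
    target_sum: common per-cell total after scaling (an illustrative default).
    """
    counts = np.asarray(counts, dtype=float)
    depth = counts.sum(axis=1, keepdims=True)
    depth[depth == 0] = 1.0  # leave empty cells as all zeros instead of dividing by zero
    return np.log1p(counts / depth * target_sum)

# Two cells with the same composition but a 10x difference in depth, plus an empty cell.
counts = np.array([[10, 0, 90],
                   [1, 0, 9],
                   [0, 0, 0]])
normalized = global_scaling_normalize(counts)
```

After scaling, the first two cells become identical, which is the sense in which global scaling makes counts comparable between cells that differ only in sequencing depth.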
Affiliation(s)
- Raquel Cuevas-Diaz Duran
  - Tecnologico de Monterrey, Escuela de Medicina y Ciencias de la Salud, Monterrey, Nuevo Leon, 64710, Mexico
- Haichao Wei
  - The Vivian L. Smith Department of Neurosurgery, McGovern Medical School, The University of Texas Health Science Center at Houston, Houston, TX, 77030, USA
  - Center for Stem Cell and Regenerative Medicine, UT Brown Foundation Institute of Molecular Medicine, Houston, TX, 77030, USA
- Jiaqian Wu
  - The Vivian L. Smith Department of Neurosurgery, McGovern Medical School, The University of Texas Health Science Center at Houston, Houston, TX, 77030, USA
  - Center for Stem Cell and Regenerative Medicine, UT Brown Foundation Institute of Molecular Medicine, Houston, TX, 77030, USA
  - MD Anderson Cancer Center UTHealth Graduate School of Biomedical Sciences, Houston, TX, 77030, USA
8. Church SH, Mah JL, Dunn CW. Integrating phylogenies into single-cell RNA sequencing analysis allows comparisons across species, genes, and cells. PLoS Biol 2024; 22:e3002633. [PMID: 38787797 PMCID: PMC11125556 DOI: 10.1371/journal.pbio.3002633] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/26/2024]
Abstract
Comparisons of single-cell RNA sequencing (scRNA-seq) data across species can reveal links between cellular gene expression and the evolution of cell functions, features, and phenotypes. These comparisons evoke evolutionary histories, as depicted by phylogenetic trees, that define relationships between species, genes, and cells. This Essay considers each of these in turn, laying out challenges and solutions derived from a phylogenetic comparative approach and relating these solutions to previously proposed methods for the pairwise alignment of cellular dimensional maps. This Essay contends that species trees, gene trees, cell phylogenies, and cell lineages can all be reconciled as descriptions of the same concept: the tree of cellular life. By integrating phylogenetic approaches into scRNA-seq analyses, challenges for building informed comparisons across species can be overcome, and hypotheses about gene and cell evolution can be robustly tested.
Affiliation(s)
- Samuel H. Church
  - Department of Ecology and Evolutionary Biology, Yale University, New Haven, Connecticut, United States of America
- Jasmine L. Mah
  - Department of Ecology and Evolutionary Biology, Yale University, New Haven, Connecticut, United States of America
- Casey W. Dunn
  - Department of Ecology and Evolutionary Biology, Yale University, New Haven, Connecticut, United States of America
9. Zhang W, Yu R, Xu Z, Li J, Gao W, Jiang M, Dai Q. scCompressSA: dual-channel self-attention based deep autoencoder model for single-cell clustering by compressing gene-gene interactions. BMC Genomics 2024; 25:423. [PMID: 38684946 PMCID: PMC11059774 DOI: 10.1186/s12864-024-10286-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/28/2023] [Accepted: 04/04/2024] [Indexed: 05/02/2024]
Abstract
BACKGROUND: Single-cell clustering has played an important role in exploring the molecular mechanisms of cell differentiation and human diseases. Because transcriptomics data are highly stochastic, accurate detection of cell types remains challenging, especially for RNA-sequencing data from human beings. In this case, deep neural networks have been increasingly employed to mine cell type specific patterns and have outperformed statistical approaches in cell clustering. RESULTS: Using cross-correlation to capture gene-gene interactions, this study proposes the scCompressSA method to integrate topological patterns from scRNA-seq data, with the support of a self-attention (SA) based coefficient compression (CC) block. This SA-based CC block is able to extract and employ static gene-gene interactions from scRNA-seq data. The proposed scCompressSA method has enhanced clustering accuracy in multiple benchmark scRNA-seq datasets by integrating topological and temporal features. CONCLUSION: Static gene-gene interactions have been extracted as temporal features to boost clustering performance in single-cell clustering. For the scCompressSA method, the dual-channel SA based CC block is able to integrate topological features and has exhibited extraordinary detection accuracy compared with previous clustering approaches that only employ temporal patterns.
Affiliation(s)
- Wei Zhang
  - Zhejiang Sci-Tech University, Second Street 928, Hangzhou, Zhejiang, 310018, China
- Ruochen Yu
  - Zhejiang Sci-Tech University, Second Street 928, Hangzhou, Zhejiang, 310018, China
- Zeqi Xu
  - Zhejiang Sci-Tech University, Second Street 928, Hangzhou, Zhejiang, 310018, China
- Junnan Li
  - Zhejiang Sci-Tech University, Second Street 928, Hangzhou, Zhejiang, 310018, China
- Wenhao Gao
  - Zhejiang Sci-Tech University, Second Street 928, Hangzhou, Zhejiang, 310018, China
- Mingfeng Jiang
  - Zhejiang Sci-Tech University, Second Street 928, Hangzhou, Zhejiang, 310018, China
- Qi Dai
  - Zhejiang Sci-Tech University, Second Street 928, Hangzhou, Zhejiang, 310018, China
10. Nelson ED, Tippani M, Ramnauth AD, Divecha HR, Miller RA, Eagles NJ, Pattie EA, Kwon SH, Bach SV, Kaipa UM, Yao J, Kleinman JE, Collado-Torres L, Han S, Maynard KR, Hyde TM, Martinowich K, Page SC, Hicks SC. An integrated single-nucleus and spatial transcriptomics atlas reveals the molecular landscape of the human hippocampus. bioRxiv 2024:2024.04.26.590643. [PMID: 38712198 PMCID: PMC11071618 DOI: 10.1101/2024.04.26.590643] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/08/2024]
Abstract
The hippocampus contains many unique cell types, which serve the structure's specialized functions, including learning, memory and cognition. These cells have distinct spatial topography, morphology, physiology, and connectivity, highlighting the need for transcriptome-wide profiling strategies that retain cytoarchitectural organization. Here, we generated spatially-resolved transcriptomics (SRT) and single-nucleus RNA-sequencing (snRNA-seq) data from adjacent tissue sections of the anterior human hippocampus across ten adult neurotypical donors. We defined molecular profiles for hippocampal cell types and spatial domains. Using non-negative matrix factorization and transfer learning, we integrated these data to define gene expression patterns within the snRNA-seq data and infer the expression of these patterns in the SRT data. With this approach, we leveraged existing rodent datasets that feature information on circuit connectivity and neural activity induction to make predictions about axonal projection targets and likelihood of ensemble recruitment in spatially-defined cellular populations of the human hippocampus. Finally, we integrated genome-wide association studies with transcriptomic data to identify enrichment of genetic components for neurodevelopmental, neuropsychiatric, and neurodegenerative disorders across cell types, spatial domains, and gene expression patterns of the human hippocampus. To make this comprehensive molecular atlas accessible to the scientific community, both raw and processed data are freely available, including through interactive web applications.
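The combination of non-negative matrix factorization and transfer learning described above can be sketched as two steps: learn gene expression patterns on a reference dataset, then hold those patterns fixed and estimate their weights in a second dataset. The code below is a generic illustration with plain multiplicative updates on synthetic data; the function names, the rank, and the data are assumptions, and it does not reproduce the authors' pipeline or software.

```python
import numpy as np

def nmf(X, rank, n_iter=300, seed=0, eps=1e-10):
    """Plain Frobenius-norm NMF: X (cells x genes) ~ W (cells x rank) @ H (rank x genes)."""
    rng = np.random.default_rng(seed)
    W = rng.random((X.shape[0], rank)) + eps
    H = rng.random((rank, X.shape[1])) + eps
    for _ in range(n_iter):
        W *= (X @ H.T) / (W @ (H @ H.T) + eps)
        H *= (W.T @ X) / ((W.T @ W) @ H + eps)
    return W, H

def project(X_new, H, n_iter=300, seed=1, eps=1e-10):
    """Transfer step: keep the gene patterns H fixed and fit only the weights for new data."""
    rng = np.random.default_rng(seed)
    W = rng.random((X_new.shape[0], H.shape[0])) + eps
    for _ in range(n_iter):
        W *= (X_new @ H.T) / (W @ (H @ H.T) + eps)
    return W

# Synthetic stand-ins: a reference dataset and a second dataset sharing the same two patterns.
rng = np.random.default_rng(0)
H_true = rng.random((2, 12))
X_ref = rng.random((40, 2)) @ H_true   # plays the role of the snRNA-seq data
X_new = rng.random((15, 2)) @ H_true   # plays the role of the spatial data

_, H = nmf(X_ref, rank=2)
W_new = project(X_new, H)
```

Each row of `W_new` gives the inferred usage of the learned patterns in one observation of the second dataset, which is the quantity that would be mapped back onto spatial locations.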
11. Kim H, Chang W, Chae SJ, Park JE, Seo M, Kim JK. scLENS: data-driven signal detection for unbiased scRNA-seq data analysis. Nat Commun 2024; 15:3575. [PMID: 38678050 DOI: 10.1038/s41467-024-47884-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/18/2023] [Accepted: 04/14/2024] [Indexed: 04/29/2024]
Abstract
High dimensionality and noise have limited the new biological insights that can be discovered in scRNA-seq data. While dimensionality reduction tools have been developed to extract biological signals from the data, they often require manual determination of signal dimension, introducing user bias. Furthermore, a common data preprocessing method, log normalization, can unintentionally distort signals in the data. Here, we develop scLENS, a dimensionality reduction tool that circumvents the long-standing issues of signal distortion and manual input. Specifically, we identify the primary cause of signal distortion during log normalization and effectively address it by uniformizing cell vector lengths with L2 normalization. Furthermore, we utilize random matrix theory-based noise filtering and a signal robustness test to enable data-driven determination of the threshold for signal dimensions. Our method outperforms 11 widely used dimensionality reduction tools and performs particularly well for challenging scRNA-seq datasets with high sparsity and variability. To facilitate the use of scLENS, we provide a user-friendly package that automates accurate signal detection of scRNA-seq data without manual time-consuming tuning.
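The two ingredients named above can be illustrated in a few lines: L2 normalization that gives every log-normalized cell vector unit length, and the textbook Marchenko-Pastur upper edge that random matrix theory provides as a noise threshold for covariance eigenvalues. This is a simplified sketch under stated assumptions, not the scLENS implementation: the function names, the scaling target, and the exact order of preprocessing steps are illustrative, and the actual package may filter noise differently.

```python
import numpy as np

def log_then_l2_normalize(counts, target_sum=1e4):
    """Log-normalize each cell (row), then rescale every cell vector to unit L2 length."""
    counts = np.asarray(counts, dtype=float)
    depth = counts.sum(axis=1, keepdims=True)
    depth[depth == 0] = 1.0
    logged = np.log1p(counts / depth * target_sum)
    norms = np.linalg.norm(logged, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # keep all-zero cells at zero
    return logged / norms

def mp_upper_edge(n_cells, n_genes, sigma2=1.0):
    """Marchenko-Pastur upper edge for eigenvalues of X.T @ X / n_cells,
    where X is an n_cells x n_genes matrix of i.i.d. noise with variance sigma2."""
    return sigma2 * (1.0 + np.sqrt(n_genes / n_cells)) ** 2

# A sparse cell and a dense cell end up with the same vector length.
counts = np.array([[5, 0, 0, 5],
                   [2, 3, 1, 4]])
unit = log_then_l2_normalize(counts)
edge = mp_upper_edge(n_cells=100, n_genes=25)
```

Eigenvalues of the sample covariance that exceed `edge` are candidates for signal under this pure-noise model; the abstract's additional signal robustness test is not sketched here.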
Affiliation(s)
- Hyun Kim
  - Biomedical Mathematics Group, Pioneer Research Center for Mathematical and Computational Sciences, Institute for Basic Science, Daejeon, 34126, Republic of Korea
- Won Chang
  - Division of Statistics and Data Science, University of Cincinnati, Cincinnati, OH, 45221, USA
- Seok Joo Chae
  - Biomedical Mathematics Group, Pioneer Research Center for Mathematical and Computational Sciences, Institute for Basic Science, Daejeon, 34126, Republic of Korea
  - Department of Mathematical Sciences, KAIST, Daejeon, 34141, Republic of Korea
- Jong-Eun Park
  - Graduate School of Medical Science and Engineering, KAIST, Daejeon, 34141, Republic of Korea
- Minseok Seo
  - Department of Computer and Information Science, Korea University, Sejong, 30019, Republic of Korea
- Jae Kyoung Kim
  - Biomedical Mathematics Group, Pioneer Research Center for Mathematical and Computational Sciences, Institute for Basic Science, Daejeon, 34126, Republic of Korea
  - Department of Mathematical Sciences, KAIST, Daejeon, 34141, Republic of Korea
12. Phillips RA, Oh S, Bach SV, Du Y, Miller RA, Kleinman JE, Hyde TM, Hicks SC, Page SC, Martinowich K. Transcriptomic characterization of human lateral septum neurons reveals conserved and divergent marker genes across species. bioRxiv 2024:2024.04.22.590602. [PMID: 38712125 PMCID: PMC11071425 DOI: 10.1101/2024.04.22.590602] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/08/2024]
Abstract
The lateral septum (LS) is a midline, subcortical structure, which regulates social behaviors that are frequently impaired in neurodevelopmental disorders including schizophrenia and autism spectrum disorder. Mouse studies have identified neuronal populations within the LS that express a variety of molecular markers, including vasopressin receptor, oxytocin receptor, and corticotropin releasing hormone receptor, that control specific facets of social behavior. Despite its critical role in the regulation of social behavior and notable gene expression patterns, comprehensive molecular profiling of the human LS has not been performed. Here, we conducted single nucleus RNA-sequencing (snRNA-seq) to generate the first transcriptomic profiles of the human LS using postmortem human brain tissue samples from 3 neurotypical donors. Our analysis identified 4 transcriptionally distinct neuronal cell types within the human LS that are enriched for TRPC4, the gene encoding Trp-related protein 4. Differential expression analysis revealed a distinct LS neuronal cell type that is enriched for OPRM1, the gene encoding the μ-opioid receptor. Leveraging recently collected mouse LS snRNA-seq datasets, we also conducted a cross-species analysis. Our results demonstrate that TRPC4 enrichment in the LS is highly conserved between human and mouse, while FREM2, which encodes FRAS1 related extracellular matrix protein 2, is enriched only in the human LS. Together, these results highlight transcriptional heterogeneity of the human LS, and identify robust marker genes for the human LS.
13. Barry T, Roeder K, Katsevich E. Exponential family measurement error models for single-cell CRISPR screens. Biostatistics 2024:kxae010. [PMID: 38649751 DOI: 10.1093/biostatistics/kxae010] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/24/2022] [Revised: 01/10/2024] [Accepted: 03/11/2024] [Indexed: 04/25/2024]
Abstract
CRISPR genome engineering and single-cell RNA sequencing have accelerated biological discovery. Single-cell CRISPR screens unite these two technologies, linking genetic perturbations in individual cells to changes in gene expression and illuminating regulatory networks underlying diseases. Despite their promise, single-cell CRISPR screens present considerable statistical challenges. We demonstrate through theoretical and real data analyses that a standard method for estimation and inference in single-cell CRISPR screens, "thresholded regression", exhibits attenuation bias and a bias-variance tradeoff as a function of an intrinsic, challenging-to-select tuning parameter. To overcome these difficulties, we introduce GLM-EIV ("GLM-based errors-in-variables"), a new method for single-cell CRISPR screen analysis. GLM-EIV extends the classical errors-in-variables model to responses and noisy predictors that are exponential family-distributed and potentially impacted by the same set of confounding variables. We develop a computational infrastructure to deploy GLM-EIV across hundreds of processors on clouds (e.g. Microsoft Azure) and high-performance clusters. Leveraging this infrastructure, we apply GLM-EIV to analyze two recent, large-scale, single-cell CRISPR screen datasets, yielding several new insights.
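The attenuation bias mentioned above can be seen in a small simulation: when the true perturbation status is replaced by a thresholded, noisy gRNA count, some cells are misclassified and the estimated effect shrinks toward zero. The sketch below uses a linear model with Gaussian noise purely for illustration; the sample size, rates, threshold, and effect size are assumptions, and it is not the GLM-EIV method or the exponential-family setting of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000

# Latent truth: about 30% of cells carry the perturbation (illustrative rate).
perturbed = rng.random(n) < 0.3
# Observed gRNA count: background noise plus extra signal in perturbed cells.
grna = rng.poisson(1.0 + 4.0 * perturbed)
# Gene expression response with a true perturbation effect of 2.0.
expr = 1.0 + 2.0 * perturbed + rng.normal(size=n)

def ols_slope(x, y):
    """Slope of a simple least-squares regression of y on x."""
    x = x.astype(float)
    return np.cov(x, y, bias=True)[0, 1] / np.var(x)

oracle = ols_slope(perturbed, expr)        # uses the (normally unobservable) truth
thresholded = ols_slope(grna >= 3, expr)   # uses the thresholded gRNA count instead
```

In this setup the oracle slope is close to the true effect of 2.0, while the thresholded slope is noticeably smaller; moving the threshold changes how many cells are misclassified in each direction, which is the tuning-parameter dependence the abstract refers to.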
Affiliation(s)
- Timothy Barry
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Building 2 435, 655 Huntington Ave, Boston, MA 02115, United States
| | - Kathryn Roeder
- Department of Statistics and Data Science, Carnegie Mellon University, Baker Hall 228B, 4909 Frew St, Pittsburgh, PA 15213, United States
| | - Eugene Katsevich
- Department of Statistics and Data Science, University of Pennsylvania, Academic Research Building 311, 265 South 37th Street Philadelphia, PA 19104, United States
| |
|
14
|
Tian J, Bai X, Quek C. Single-Cell Informatics for Tumor Microenvironment and Immunotherapy. Int J Mol Sci 2024; 25:4485. [PMID: 38674070 PMCID: PMC11050520 DOI: 10.3390/ijms25084485] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/08/2024] [Revised: 04/12/2024] [Accepted: 04/16/2024] [Indexed: 04/28/2024] Open
Abstract
Cancer comprises malignant cells surrounded by the tumor microenvironment (TME), a dynamic ecosystem composed of heterogeneous cell populations that exert unique influences on tumor development. The immune community within the TME plays a substantial role in tumorigenesis and tumor evolution. The innate and adaptive immune cells "talk" to the tumor through ligand-receptor interactions and signaling molecules, forming a complex communication network to influence the cellular and molecular basis of cancer. Such intricate intratumoral immune composition and interactions foster the application of immunotherapies, which empower the immune system against cancer to elicit durable long-term responses in cancer patients. Single-cell technologies have allowed for the dissection and characterization of the TME to an unprecedented level, while recent advancements in bioinformatics tools have expanded the horizon and depth of high-dimensional single-cell data analysis. This review will unravel the intertwined networks between malignancy and immunity, explore the utilization of computational tools for a deeper understanding of tumor-immune communications, and discuss the application of these approaches to aid in diagnosis or treatment decision making in the clinical setting, as well as the current challenges researchers face and potential future improvements.
Affiliation(s)
| | | | - Camelia Quek
- Faculty of Medicine and Health, The University of Sydney, Sydney, NSW 2006, Australia; (J.T.); (X.B.)
| |
|
15
|
Baharav TZ, Tse D, Salzman J. OASIS: An interpretable, finite-sample valid alternative to Pearson's X² for scientific discovery. Proc Natl Acad Sci U S A 2024; 121:e2304671121. [PMID: 38564640 PMCID: PMC11009617 DOI: 10.1073/pnas.2304671121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/05/2023] [Accepted: 02/08/2024] [Indexed: 04/04/2024] Open
Abstract
Contingency tables, data represented as counts matrices, are ubiquitous across quantitative research and data-science applications. Existing statistical tests are insufficient, however, as none are simultaneously computationally efficient and statistically valid for a finite number of observations. In this work, motivated by a recent application in reference-free genomic inference [K. Chaung et al., Cell 186, 5440-5456 (2023)], we develop Optimized Adaptive Statistic for Inferring Structure (OASIS), a family of statistical tests for contingency tables. OASIS constructs a test statistic which is linear in the normalized data matrix, providing closed-form P-value bounds through classical concentration inequalities. In the process, OASIS provides a decomposition of the table, lending interpretability to its rejection of the null. We derive the asymptotic distribution of the OASIS test statistic, showing that these finite-sample bounds correctly characterize the test statistic's P-value up to a variance term. Experiments on genomic sequencing data highlight the power and interpretability of OASIS. Using OASIS, we develop a method that can detect SARS-CoV-2 and Mycobacterium tuberculosis strains de novo, which existing approaches cannot achieve. We demonstrate in simulations that OASIS is robust to overdispersion, a common feature in genomic data like single-cell RNA sequencing, where under accepted noise models OASIS provides good control of the false discovery rate, while Pearson's X² consistently rejects the null. Additionally, we show in simulations that OASIS is more powerful than Pearson's X² in certain regimes, including for some important two group alternatives, which we corroborate with approximate power calculations.
Affiliation(s)
- Tavor Z. Baharav
- Eric and Wendy Schmidt Center, Broad Institute, Cambridge, MA 02142
- Department of Data Science, Dana-Farber Cancer Institute, Boston, MA 02115
| | - David Tse
- Department of Electrical Engineering, Stanford University, Stanford, CA 94305
| | - Julia Salzman
- Department of Biomedical Data Science, Stanford University, Stanford, CA 94305
- Department of Biochemistry, Stanford University, Stanford, CA 94305
- Department of Statistics (by courtesy), Stanford University, Stanford, CA 94305
| |
|
16
|
Hozumi Y, Tanemura KA, Wei GW. Preprocessing of Single Cell RNA Sequencing Data Using Correlated Clustering and Projection. J Chem Inf Model 2024; 64:2829-2838. [PMID: 37402705 PMCID: PMC11009150 DOI: 10.1021/acs.jcim.3c00674] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Indexed: 07/06/2023]
Abstract
Single-cell RNA sequencing (scRNA-seq) is widely used to reveal heterogeneity in cells, which has given us insights into cell-cell communication, cell differentiation, and differential gene expression. However, analyzing scRNA-seq data is a challenge due to sparsity and the large number of genes involved. Therefore, dimensionality reduction and feature selection are important for removing spurious signals and enhancing the downstream analysis. We present Correlated Clustering and Projection (CCP), a new data-domain dimensionality reduction method, for the first time. CCP projects each cluster of similar genes into a supergene defined as the accumulated pairwise nonlinear gene-gene correlations among all cells. Using 14 benchmark data sets, we demonstrate that CCP has significant advantages over classical principal component analysis (PCA) for clustering and/or classification problems with intrinsically high dimensionality. In addition, we introduce the Residue-Similarity index (RSI) as a novel metric for clustering and classification and the R-S plot as a new visualization tool. We show that the RSI correlates with accuracy without requiring the knowledge of the true labels. The R-S plot provides a unique alternative to the uniform manifold approximation and projection (UMAP) and t-distributed stochastic neighbor embedding (t-SNE) for data with a large number of cell types.
Affiliation(s)
- Yuta Hozumi
- Department of Mathematics, Michigan State University, East Lansing, Michigan 48824, United States
| | - Kiyoto Aramis Tanemura
- Department of Mathematics, Michigan State University, East Lansing, Michigan 48824, United States
| | - Guo-Wei Wei
- Department of Mathematics, Michigan State University, East Lansing, Michigan 48824, United States
- Department of Electrical and Computer Engineering, Michigan State University, East Lansing, Michigan 48824, United States
- Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, Michigan 48824, United States
| |
|
17
|
Gao Q, Ji Z, Wang L, Owzar K, Li QJ, Chan C, Xie J. SifiNet: A robust and accurate method to identify feature gene sets and annotate cells. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2023.05.24.541352. [PMID: 37577619 PMCID: PMC10418061 DOI: 10.1101/2023.05.24.541352] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 08/15/2023]
Abstract
SifiNet is a robust and accurate computational pipeline for identifying distinct gene sets, extracting and annotating cellular subpopulations, and elucidating intrinsic relationships among these subpopulations. Uniquely, SifiNet bypasses the cell clustering stage, commonly integrated into other cellular annotation pipelines, thereby circumventing potential inaccuracies in clustering that may compromise subsequent analyses. Consequently, SifiNet has demonstrated superior performance in multiple experimental datasets compared with other state-of-the-art methods. SifiNet can analyze both single-cell RNA and ATAC sequencing data, thereby rendering comprehensive multiomic cellular profiles. It is conveniently available as an open-source R package.
|
18
|
Sakaue S, Weinand K, Isaac S, Dey KK, Jagadeesh K, Kanai M, Watts GFM, Zhu Z, Brenner MB, McDavid A, Donlin LT, Wei K, Price AL, Raychaudhuri S. Tissue-specific enhancer-gene maps from multimodal single-cell data identify causal disease alleles. Nat Genet 2024; 56:615-626. [PMID: 38594305 DOI: 10.1038/s41588-024-01682-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/07/2023] [Accepted: 02/07/2024] [Indexed: 04/11/2024]
Abstract
Translating genome-wide association study (GWAS) loci into causal variants and genes requires accurate cell-type-specific enhancer-gene maps from disease-relevant tissues. Building enhancer-gene maps is essential but challenging with current experimental methods in primary human tissues. Here we developed a nonparametric statistical method, SCENT (single-cell enhancer target gene mapping), that models association between enhancer chromatin accessibility and gene expression in single-cell or nucleus multimodal RNA sequencing and ATAC sequencing data. We applied SCENT to 9 multimodal datasets including >120,000 single cells or nuclei and created 23 cell-type-specific enhancer-gene maps. These maps were highly enriched for causal variants in expression quantitative trait loci and GWAS for 1,143 diseases and traits. We identified likely causal genes for both common and rare diseases and linked somatic mutation hotspots to target genes. We demonstrate that application of SCENT to multimodal data from disease-relevant human tissue enables the scalable construction of accurate cell-type-specific enhancer-gene maps, essential for defining noncoding variant function.
Affiliation(s)
- Saori Sakaue
- Center for Data Sciences, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
- Divisions of Genetics and Rheumatology, Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA
| | - Kathryn Weinand
- Center for Data Sciences, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
- Divisions of Genetics and Rheumatology, Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
| | - Shakson Isaac
- Center for Data Sciences, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
- Divisions of Genetics and Rheumatology, Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
| | - Kushal K Dey
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Department of Epidemiology, Harvard T. H. Chan School of Public Health, Boston, MA, USA
| | - Karthik Jagadeesh
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Department of Epidemiology, Harvard T. H. Chan School of Public Health, Boston, MA, USA
| | - Masahiro Kanai
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA, USA
- Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Center for Computational and Integrative Biology, Massachusetts General Hospital, Boston, MA, USA
| | - Gerald F M Watts
- Division of Rheumatology, Inflammation, and Immunity, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA, USA
| | - Zhu Zhu
- Division of Rheumatology, Inflammation, and Immunity, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA, USA
| | - Michael B Brenner
- Division of Rheumatology, Inflammation, and Immunity, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA, USA
| | - Andrew McDavid
- Department of Biostatistics and Computational Biology, University of Rochester Medical Center, Rochester, NY, USA
| | - Laura T Donlin
- Hospital for Special Surgery, New York, NY, USA
- Weill Cornell Medicine, New York, NY, USA
| | - Kevin Wei
- Division of Rheumatology, Inflammation, and Immunity, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA, USA
| | - Alkes L Price
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Department of Epidemiology, Harvard T. H. Chan School of Public Health, Boston, MA, USA
- Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, MA, USA
| | - Soumya Raychaudhuri
- Center for Data Sciences, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA.
- Divisions of Genetics and Rheumatology, Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA.
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA.
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA.
| |
|
19
|
Weine E, Carbonetto P, Stephens M. Accelerated dimensionality reduction of single-cell RNA sequencing data with fastglmpca. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.03.23.586420. [PMID: 38585920 PMCID: PMC10996495 DOI: 10.1101/2024.03.23.586420] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/09/2024]
Abstract
Motivated by theoretical and practical issues that arise when applying Principal Components Analysis (PCA) to count data, Townes et al. introduced "Poisson GLM-PCA", a variation of PCA adapted to count data, as a tool for dimensionality reduction of single-cell RNA sequencing (RNA-seq) data. However, fitting GLM-PCA is computationally challenging. Here we study this problem, and show that a simple algorithm, which we call "Alternating Poisson Regression" (APR), produces better quality fits, and in less time, than existing algorithms. APR is also memory-efficient, and lends itself to parallel implementation on multi-core processors, both of which are helpful for handling large single-cell RNA-seq data sets. We illustrate the benefits of this approach in two published single-cell RNA-seq data sets. The new algorithms are implemented in an R package, fastglmpca.
Affiliation(s)
- Eric Weine
- Laboratory for Information and Decision Systems, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
- Department of Data Science, Dana Farber Cancer Institute, Boston, MA 02215, USA
| | - Peter Carbonetto
- Department of Human Genetics, University of Chicago, Chicago, IL 60637, USA
| | - Matthew Stephens
- Department of Human Genetics, University of Chicago, Chicago, IL 60637, USA
- Department of Statistics, University of Chicago, Chicago, IL 60637, USA
| |
|
20
|
Lin KZ, Qiu Y, Roeder K. eSVD-DE: cohort-wide differential expression in single-cell RNA-seq data using exponential-family embeddings. BMC Bioinformatics 2024; 25:113. [PMID: 38486150 PMCID: PMC10941434 DOI: 10.1186/s12859-024-05724-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/06/2023] [Accepted: 02/28/2024] [Indexed: 03/17/2024] Open
Abstract
BACKGROUND: Single-cell RNA-sequencing (scRNA) datasets are becoming increasingly popular in clinical and cohort studies, but there is a lack of methods to investigate differentially expressed (DE) genes among such datasets with numerous individuals. While numerous methods exist to find DE genes for scRNA data from limited individuals, differential-expression testing for large cohorts of case and control individuals using scRNA data poses unique challenges due to substantial effects of human variation, i.e., individual-level confounding covariates that are difficult to account for in the presence of sparsely-observed genes. RESULTS: We develop eSVD-DE, a matrix factorization that pools information across genes and removes confounding covariate effects, followed by a novel two-sample test in mean expression between case and control individuals. In general, differential testing after dimension reduction yields an inflation of Type-1 errors. However, we overcome this by testing for differences between the case and control individuals' posterior mean distributions via a hierarchical model. In previously published datasets of various biological systems, eSVD-DE has more accuracy and power compared to other DE methods typically repurposed for analyzing cohort-wide differential expression. CONCLUSIONS: eSVD-DE proposes a novel and powerful way to test for DE genes among cohorts after performing a dimension reduction. Accurate identification of differential expression on the individual level, instead of the cell level, is important for linking scRNA-seq studies to our understanding of the human population.
Affiliation(s)
- Kevin Z Lin
- Department of Biostatistics, University of Washington, Seattle, WA, USA.
| | - Yixuan Qiu
- School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai, People's Republic of China
| | - Kathryn Roeder
- Department of Statistics and Data Science, Carnegie Mellon University, Pittsburgh, PA, USA
| |
|
21
|
Truchi M, Lacoux C, Gille C, Fassy J, Magnone V, Lopes Goncalves R, Girard-Riboulleau C, Manosalva-Pena I, Gautier-Isola M, Lebrigand K, Barbry P, Spicuglia S, Vassaux G, Rezzonico R, Barlaud M, Mari B. Detecting subtle transcriptomic perturbations induced by lncRNAs knock-down in single-cell CRISPRi screening using a new sparse supervised autoencoder neural network. FRONTIERS IN BIOINFORMATICS 2024; 4:1340339. [PMID: 38501112 PMCID: PMC10945021 DOI: 10.3389/fbinf.2024.1340339] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/17/2023] [Accepted: 02/14/2024] [Indexed: 03/20/2024] Open
Abstract
Single-cell CRISPR-based transcriptome screens are potent genetic tools for concomitantly assessing the expression profiles of cells targeted by a set of guide RNAs (gRNAs), and inferring target gene functions from the observed perturbations. However, due to various limitations, this approach lacks sensitivity in detecting weak perturbations and is reliable mainly when studying master regulators such as transcription factors. To overcome the challenge of detecting subtle gRNA-induced transcriptomic perturbations and classifying the most responsive cells, we developed a new supervised autoencoder neural network method. Our sparse supervised autoencoder (SSAE) neural network provides selection of both relevant features (genes) and actual perturbed cells. We applied this method to an in-house single-cell CRISPR-interference-based (CRISPRi) transcriptome screen (CROP-seq) focusing on a subset of long non-coding RNAs (lncRNAs) regulated by hypoxia, a condition that promotes tumor aggressiveness and drug resistance, in the context of lung adenocarcinoma (LUAD). The CROP-seq library of validated gRNAs against a subset of lncRNAs and, as positive controls, HIF1A and HIF2A, the 2 main transcription factors of the hypoxic response, was transduced into A549 LUAD cells cultured in normoxia or exposed to hypoxic conditions for 3, 6 or 24 h. We first validated the SSAE approach on HIF1A and HIF2A by confirming the specific effect of their knock-down during the temporal switch of the hypoxic response. Next, the SSAE method was able to detect stable short hypoxia-dependent transcriptomic signatures induced by the knock-down of some lncRNA candidates, outperforming previously published machine learning approaches. This proof of concept demonstrates the relevance of the SSAE approach for deciphering weak perturbations in single-cell transcriptomic data readout as part of CRISPR-based screening.
Affiliation(s)
- Marin Truchi
- Université Côte d’Azur, IPMC, UMR CNRS 7275 Inserm 1323, IHU RespiERA, Valbonne, France
| | - Caroline Lacoux
- Université Côte d’Azur, IPMC, UMR CNRS 7275 Inserm 1323, IHU RespiERA, Valbonne, France
| | - Cyprien Gille
- Université Côte d’Azur, I3S, CNRS UMR7271, Nice, France
| | - Julien Fassy
- Université Côte d’Azur, IPMC, UMR CNRS 7275 Inserm 1323, IHU RespiERA, Valbonne, France
| | - Virginie Magnone
- Université Côte d’Azur, IPMC, UMR CNRS 7275 Inserm 1323, IHU RespiERA, Valbonne, France
| | | | | | - Iris Manosalva-Pena
- Aix-Marseille University, Inserm, TAGC, UMR1090, Equipe Labélisée Ligue Contre le Cancer, Marseille, France
| | - Marine Gautier-Isola
- Université Côte d’Azur, IPMC, UMR CNRS 7275 Inserm 1323, IHU RespiERA, Valbonne, France
| | - Kevin Lebrigand
- Université Côte d’Azur, IPMC, UMR CNRS 7275 Inserm 1323, IHU RespiERA, Valbonne, France
| | - Pascal Barbry
- Université Côte d’Azur, IPMC, UMR CNRS 7275 Inserm 1323, IHU RespiERA, Valbonne, France
| | - Salvatore Spicuglia
- Aix-Marseille University, Inserm, TAGC, UMR1090, Equipe Labélisée Ligue Contre le Cancer, Marseille, France
| | - Georges Vassaux
- Université Côte d’Azur, IPMC, UMR CNRS 7275 Inserm 1323, IHU RespiERA, Valbonne, France
| | - Roger Rezzonico
- Université Côte d’Azur, IPMC, UMR CNRS 7275 Inserm 1323, IHU RespiERA, Valbonne, France
| | | | - Bernard Mari
- Université Côte d’Azur, IPMC, UMR CNRS 7275 Inserm 1323, IHU RespiERA, Valbonne, France
| |
|
22
|
Xia L, Lee C, Li JJ. Statistical method scDEED for detecting dubious 2D single-cell embeddings and optimizing t-SNE and UMAP hyperparameters. Nat Commun 2024; 15:1753. [PMID: 38409103 PMCID: PMC10897166 DOI: 10.1038/s41467-024-45891-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/10/2023] [Accepted: 02/06/2024] [Indexed: 02/28/2024] Open
Abstract
Two-dimensional (2D) embedding methods are crucial for single-cell data visualization. Popular methods such as t-distributed stochastic neighbor embedding (t-SNE) and uniform manifold approximation and projection (UMAP) are commonly used for visualizing cell clusters; however, it is well known that t-SNE and UMAP's 2D embeddings might not reliably inform the similarities among cell clusters. Motivated by this challenge, we present a statistical method, scDEED, for detecting dubious cell embeddings output by a 2D-embedding method. By calculating a reliability score for every cell embedding based on the similarity between the cell's 2D-embedding neighbors and pre-embedding neighbors, scDEED identifies the cell embeddings with low reliability scores as dubious and those with high reliability scores as trustworthy. Moreover, by minimizing the number of dubious cell embeddings, scDEED provides intuitive guidance for optimizing the hyperparameters of an embedding method. We show the effectiveness of scDEED on multiple datasets for detecting dubious cell embeddings and optimizing the hyperparameters of t-SNE and UMAP.
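The idea of comparing a cell's neighbors before and after embedding can be sketched as follows. This is an illustrative neighbor-overlap score, not scDEED's actual reliability score or its procedure for deciding which cells are dubious; the function names `knn_indices` and `neighbor_overlap`, the use of Euclidean distance, and k = 10 are assumptions made for the example.

```python
import numpy as np


def knn_indices(points, k):
    """Indices of the k nearest neighbors of each row (Euclidean, brute force)."""
    diffs = points[:, None, :] - points[None, :, :]
    dist = np.linalg.norm(diffs, axis=-1)
    np.fill_diagonal(dist, np.inf)  # a cell is not its own neighbor
    return np.argsort(dist, axis=1)[:, :k]


def neighbor_overlap(pre_embedding, embedding_2d, k=10):
    """Fraction of each cell's pre-embedding neighbors kept in the 2D embedding."""
    before = knn_indices(pre_embedding, k)
    after = knn_indices(embedding_2d, k)
    return np.array(
        [len(set(b) & set(a)) / k for b, a in zip(before, after)]
    )


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    cells = rng.normal(size=(60, 5))       # stand-in for pre-embedding space
    unrelated = rng.normal(size=(60, 2))   # a 2D layout unrelated to the data
    scores = neighbor_overlap(cells, unrelated)
    print("mean overlap for an unrelated layout:", scores.mean())
```

Cells with a low overlap are the ones whose 2D position says little about their original neighborhood. A hyperparameter search in this spirit would rerun the embedding under different settings and prefer those that leave fewer low-scoring cells.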
Affiliation(s)
- Lucy Xia
- Department of ISOM, School of Business and Management, Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong, China
| | - Christy Lee
- Department of Statistics and Data Science, University of California, Los Angeles, Los Angeles, CA, USA
| | - Jingyi Jessica Li
- Department of Statistics and Data Science, University of California, Los Angeles, Los Angeles, CA, USA.
- Department of Biostatistics, University of California, Los Angeles, Los Angeles, CA, USA.
- Department of Computational Medicine, University of California, Los Angeles, Los Angeles, CA, USA.
- Department of Human Genetics, University of California, Los Angeles, Los Angeles, CA, USA.
- Radcliffe Institute of Advanced Study, Harvard University, Cambridge, MA, USA.
| |
|
23
|
Gregory W, Sarwar N, Kevrekidis G, Villar S, Dumitrascu B. MarkerMap: nonlinear marker selection for single-cell studies. NPJ Syst Biol Appl 2024; 10:17. [PMID: 38351188 PMCID: PMC10864304 DOI: 10.1038/s41540-024-00339-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/26/2023] [Accepted: 01/17/2024] [Indexed: 02/16/2024] Open
Abstract
Single-cell RNA-seq data allow the quantification of cell type differences across a growing set of biological contexts. However, pinpointing a small subset of genomic features explaining this variability can be ill-defined and computationally intractable. Here we introduce MarkerMap, a generative model for selecting minimal gene sets which are maximally informative of cell type origin and enable whole transcriptome reconstruction. MarkerMap provides a scalable framework for both supervised marker selection, aimed at identifying specific cell type populations, and unsupervised marker selection, aimed at gene expression imputation and reconstruction. We benchmark MarkerMap's competitive performance against previously published approaches on real single-cell gene expression data sets. MarkerMap is available as a pip-installable package, as a community resource aimed at developing explainable machine learning techniques for enhancing interpretability in single-cell studies.
Affiliation(s)
- Wilson Gregory
- Department of Applied Mathematics and Statistics, Johns Hopkins University, Baltimore, MD, 21218, USA
| | - Nabeel Sarwar
- Center for Data Science, New York University, New York, NY, 10012, USA
| | - George Kevrekidis
- Department of Applied Mathematics and Statistics, Johns Hopkins University, Baltimore, MD, 21218, USA
| | - Soledad Villar
- Department of Applied Mathematics and Statistics, Johns Hopkins University, Baltimore, MD, 21218, USA.
- Mathematical Institute for Data Science, Johns Hopkins University, Baltimore, MD, 21218, USA.
| | - Bianca Dumitrascu
- Department of Statistics, Columbia University, New York, NY, 10027, USA.
- Irving Institute for Cancer Dynamics, Columbia University, New York, NY, 10027, USA.
| |
|
24
|
Church SH, Mah JL, Wagner G, Dunn CW. Normalizing need not be the norm: count-based math for analyzing single-cell data. Theory Biosci 2024; 143:45-62. [PMID: 37947999 DOI: 10.1007/s12064-023-00408-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/05/2023] [Accepted: 10/13/2023] [Indexed: 11/12/2023]
Abstract
Counting transcripts of mRNA is a key method of observation in modern biology. With advances in counting transcripts in single cells (single-cell RNA sequencing or scRNA-seq), these data are routinely used to identify cells by their transcriptional profile, and to identify genes with differential cellular expression. Because the total number of transcripts counted per cell can vary for technical reasons, the first step of many commonly used scRNA-seq workflows is to normalize by sequencing depth, transforming counts into proportional abundances. The primary objective of this step is to reshape the data such that cells with similar biological proportions of transcripts end up with similar transformed measurements. But there is growing concern that normalization and other transformations result in unintended distortions that hinder both analyses and the interpretation of results. This has led to an intense focus on optimizing methods for normalization and transformation of scRNA-seq data. Here, we take an alternative approach, by avoiding normalization and transformation altogether. We abandon the use of distances to compare cells, and instead use a restricted algebra, motivated by measurement theory and abstract algebra, that preserves the count nature of the data. We demonstrate that this restricted algebra is sufficient to draw meaningful and practical comparisons of gene expression through the use of the dot product and other elementary operations. This approach sidesteps many of the problems with common transformations, and has the added benefit of being simpler and more intuitive. We implement our approach in the package countland, available in Python and R.
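The contrast between depth normalization and count-preserving comparison can be made concrete with a toy matrix. This is a minimal sketch using plain NumPy; it does not use or reproduce the countland API, and the count values and variable names are invented for illustration.

```python
import numpy as np

# Toy count matrix (hypothetical values): rows are cells, columns are genes.
counts = np.array(
    [
        [5, 0, 2, 0],
        [4, 0, 3, 0],
        [0, 6, 0, 1],
    ]
)

# Common first step in many workflows: divide by per-cell sequencing depth,
# turning integer counts into proportional abundances.
depth = counts.sum(axis=1, keepdims=True)
proportions = counts / depth

# Count-preserving alternative: pairwise dot products of the raw counts.
# Cells that share transcripts of the same genes get a large product;
# cells with no genes in common get zero.
dots = counts @ counts.T

if __name__ == "__main__":
    print(proportions)
    print(dots)
```

In this toy example the first two cells express the same genes and have a large dot product, while the third cell shares no expressed genes with them and has a dot product of zero with each. The result stays an integer, so the count nature of the data is kept.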
Affiliation(s)
- Samuel H Church
- Department of Ecology and Evolutionary Biology, Yale University, New Haven, CT, USA.
| | - Jasmine L Mah
- Department of Ecology and Evolutionary Biology, Yale University, New Haven, CT, USA
| | - Günter Wagner
- Department of Ecology and Evolutionary Biology, Yale University, New Haven, CT, USA
- Yale Systems Biology Institute, Yale University, New Haven, CT, USA
- Department of Obstetrics, Gynecology and Reproductive Sciences, Yale Medical School, New Haven, CT, USA
- Department of Obstetrics and Gynecology, Wayne State University, Detroit, MI, USA
| | - Casey W Dunn
- Department of Ecology and Evolutionary Biology, Yale University, New Haven, CT, USA
| |
|
25
|
Tyler SR, Lozano-Ojalvo D, Guccione E, Schadt EE. Anti-correlated feature selection prevents false discovery of subpopulations in scRNAseq. Nat Commun 2024; 15:699. [PMID: 38267438 PMCID: PMC10808220 DOI: 10.1038/s41467-023-43406-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/13/2022] [Accepted: 11/07/2023] [Indexed: 01/26/2024] Open
Abstract
While sub-clustering cell populations has become popular in single-cell omics, negative controls for this process are lacking. Popular feature-selection/clustering algorithms fail the null-dataset problem, allowing erroneous subdivisions of homogeneous clusters until nearly each cell is called its own cluster. Using real and synthetic datasets, we find that anti-correlated gene selection reduces or eliminates erroneous subdivisions, increases marker-gene selection efficacy, and efficiently scales to millions of cells.
Affiliation(s)
- Scott R Tyler
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA.
- Department of Oncological Sciences, Tisch Cancer Institute, Icahn School of Medicine at Mount Sinai, New York, NY, USA.
| | - Daniel Lozano-Ojalvo
- Department of Dermatology, Icahn School of Medicine at Mount Sinai, New York, NY, USA
| | - Ernesto Guccione
- Department of Oncological Sciences, Tisch Cancer Institute, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Center for Therapeutics Discovery, Department of Oncological Sciences and Pharmacological Sciences, Tisch Cancer Institute, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Bioinformatics for Next Generation Sequencing (BiNGS) Shared Resource Facility, Icahn School of Medicine at Mount Sinai, New York, NY, USA
| | - Eric E Schadt
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA.
| |
Collapse
|
26
|
Rodriguez LA, Tran MN, Garcia-Flores R, Oh S, Phillips RA, Pattie EA, Divecha HR, Kim SH, Shin JH, Lee YK, Montoya C, Jaffe AE, Collado-Torres L, Page SC, Martinowich K. TrkB-dependent regulation of molecular signaling across septal cell types. Transl Psychiatry 2024; 14:52. [PMID: 38263132 PMCID: PMC10805920 DOI: 10.1038/s41398-024-02758-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 12/20/2023] [Revised: 01/04/2024] [Accepted: 01/08/2024] [Indexed: 01/25/2024] Open
Abstract
The lateral septum (LS), a GABAergic structure located in the basal forebrain, is implicated in social behavior, learning, and memory. We previously demonstrated that expression of tropomyosin kinase receptor B (TrkB) in LS neurons is required for social novelty recognition. To better understand molecular mechanisms by which TrkB signaling controls behavior, we locally knocked down TrkB in LS and used bulk RNA-sequencing to identify changes in gene expression downstream of TrkB. TrkB knockdown induces upregulation of genes associated with inflammation and immune responses, and downregulation of genes associated with synaptic signaling and plasticity. Next, we generated one of the first atlases of molecular profiles for LS cell types using single nucleus RNA-sequencing (snRNA-seq). We identified markers for the septum broadly, and the LS specifically, as well as for all neuronal cell types. We then investigated whether the differentially expressed genes (DEGs) induced by TrkB knockdown map to specific LS cell types. Enrichment testing identified that downregulated DEGs are broadly expressed across neuronal clusters. Enrichment analyses of these DEGs demonstrated that downregulated genes are uniquely expressed in the LS, and associated with either synaptic plasticity or neurodevelopmental disorders. Upregulated genes are enriched in LS microglia, associated with immune response and inflammation, and linked to both neurodegenerative disease and neuropsychiatric disorders. In addition, many of these genes are implicated in regulating social behaviors. In summary, the findings implicate TrkB signaling in the LS as a critical regulator of gene networks associated with psychiatric disorders that display social deficits, including schizophrenia and autism, and with neurodegenerative diseases, including Alzheimer's.
Affiliation(s)
- Lionel A Rodriguez
  - Department of Neuroscience, Johns Hopkins School of Medicine, Baltimore, MD, 21205, USA
  - Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, MD, 21205, USA
- Matthew Nguyen Tran
  - Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, MD, 21205, USA
- Renee Garcia-Flores
  - Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, MD, 21205, USA
- Seyun Oh
  - Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, MD, 21205, USA
- Robert A Phillips
  - Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, MD, 21205, USA
- Elizabeth A Pattie
  - Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, MD, 21205, USA
- Heena R Divecha
  - Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, MD, 21205, USA
- Sun Hong Kim
  - Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, MD, 21205, USA
- Joo Heon Shin
  - Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, MD, 21205, USA
  - Department of Neurology, Johns Hopkins School of Medicine, Baltimore, MD, 21205, USA
- Yong Kyu Lee
  - Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, MD, 21205, USA
- Carly Montoya
  - Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, MD, 21205, USA
- Andrew E Jaffe
  - Department of Neuroscience, Johns Hopkins School of Medicine, Baltimore, MD, 21205, USA
  - Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, MD, 21205, USA
  - Department of Psychiatry and Behavioral Sciences, Johns Hopkins School of Medicine, Baltimore, MD, 21205, USA
- Leonardo Collado-Torres
  - Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, MD, 21205, USA
- Stephanie C Page
  - Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, MD, 21205, USA
- Keri Martinowich
  - Department of Neuroscience, Johns Hopkins School of Medicine, Baltimore, MD, 21205, USA
  - Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, MD, 21205, USA
  - Department of Psychiatry and Behavioral Sciences, Johns Hopkins School of Medicine, Baltimore, MD, 21205, USA
  - The Kavli Neuroscience Discovery Institute, Johns Hopkins University, Baltimore, MD, 21205, USA
27
Chen Y, Zheng R, Liu J, Li M. scMLC: an accurate and robust multiplex community detection method for single-cell multi-omics data. Brief Bioinform 2024; 25:bbae101. [PMID: 38493339 PMCID: PMC10944569 DOI: 10.1093/bib/bbae101] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/05/2023] [Revised: 01/03/2024] [Accepted: 02/15/2024] [Indexed: 03/18/2024] Open
Abstract
Clustering cells based on single-cell multi-modal sequencing technologies provides an unprecedented opportunity to create high-resolution cell atlases, reveal critical cellular states and study health and disease. However, effectively integrating different sequencing data for cell clustering remains a challenging task. Motivated by the successful application of Louvain clustering to scRNA-seq data, we propose a single-cell multi-modal Louvain clustering framework, called scMLC, to tackle this problem. scMLC builds multiplex single- and cross-modal cell-to-cell networks to capture modal-specific and consistent information between modalities and then adopts a robust multiplex community detection method to obtain reliable cell clusters. In comparison with 15 state-of-the-art clustering methods on seven real datasets simultaneously measuring gene expression and chromatin accessibility, scMLC achieves better accuracy and stability in most datasets. Synthetic results also indicate that the cell-network-based integration strategy of multi-omics data is superior to other strategies in terms of generalization. Moreover, scMLC is flexible and can be extended to single-cell sequencing data with more than two modalities.
Affiliation(s)
- Yuxuan Chen
  - School of Computer Science and Engineering, Central South University, Changsha 410083, China
- Ruiqing Zheng
  - School of Computer Science and Engineering, Central South University, Changsha 410083, China
- Jin Liu
  - School of Computer Science and Engineering, Central South University, Changsha 410083, China
- Min Li
  - School of Computer Science and Engineering, Central South University, Changsha 410083, China
28
Grabski IN, Heymach JV, Kehl KL, Kopetz S, Lau KS, Riely GJ, Schrag D, Yaeger R, Irizarry RA, Haigis KM. Effects of KRAS Genetic Interactions on Outcomes in Cancers of the Lung, Pancreas, and Colorectum. Cancer Epidemiol Biomarkers Prev 2024; 33:158-169. [PMID: 37943166 PMCID: PMC10841605 DOI: 10.1158/1055-9965.epi-23-0262] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/29/2023] [Revised: 07/02/2023] [Accepted: 11/07/2023] [Indexed: 11/10/2023] Open
Abstract
BACKGROUND: KRAS is among the most commonly mutated oncogenes in cancer, and previous studies have shown associations with survival in many cancer contexts. Evidence from both clinical observations and mouse experiments further suggests that these associations are allele- and tissue-specific. These findings motivate using clinical data to understand gene interactions and clinical covariates within different alleles and tissues. METHODS: We analyze genomic and clinical data from the AACR Project GENIE Biopharma Collaborative for samples from lung, colorectal, and pancreatic cancers. For each of these cancer types, we report epidemiological associations for different KRAS alleles, apply principal component analysis (PCA) to discover groups of genes co-mutated with KRAS, and identify distinct clusters of patient profiles with implications for survival. RESULTS: KRAS mutations were associated with inferior survival in lung, colon, and pancreas, although the specific mutations implicated varied by disease. Tissue- and allele-specific associations with smoking, sex, age, and race were found. Tissue-specific genetic interactions with KRAS were identified by PCA, which were clustered to produce five, four, and two patient profiles in lung, colon, and pancreas. Membership in these profiles was associated with survival in all three cancer types. CONCLUSIONS: KRAS mutations have tissue- and allele-specific associations with inferior survival, clinical covariates, and genetic interactions. IMPACT: Our results provide greater insight into the tissue- and allele-specific associations with KRAS mutations and identify clusters of patients that are associated with survival and clinical attributes from combinations of genetic interactions with KRAS mutations.
Affiliation(s)
- Isabella N. Grabski
  - Department of Data Science, Dana-Farber Cancer Institute, Boston, Massachusetts, USA
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, USA
- John V. Heymach
  - Department of Thoracic and Head and Neck Medical Oncology, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
- Kenneth L. Kehl
  - Division of Population Sciences, Department of Medical Oncology, Dana-Farber Cancer Institute, Boston, MA, USA
- Scott Kopetz
  - Department of Gastrointestinal Medical Oncology, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
- Ken S. Lau
  - Department of Cell and Developmental Biology, Vanderbilt University School of Medicine, Nashville, TN, USA
- Gregory J. Riely
  - Department of Medicine, Memorial Sloan Kettering Cancer Center, New York, NY, USA
- Deborah Schrag
  - Department of Medicine, Memorial Sloan Kettering Cancer Center, New York, NY, USA
- Rona Yaeger
  - Department of Medicine, Memorial Sloan Kettering Cancer Center, New York, NY, USA
- Rafael A. Irizarry
  - Department of Data Science, Dana-Farber Cancer Institute, Boston, Massachusetts, USA
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, USA
- Kevin M. Haigis
  - Department of Cancer Biology, Dana-Farber Cancer Institute, Boston, Massachusetts, USA
  - Department of Medical Oncology, Dana-Farber Cancer Institute, Boston, Massachusetts, USA
29
Wang TG, Shang JL, Liu JX, Li F, Yuan S, Wang J. Joint L2,p-norm and random walk graph constrained PCA for single-cell RNA-seq data. Comput Methods Biomech Biomed Engin 2024; 27:498-511. [PMID: 36912759 DOI: 10.1080/10255842.2023.2188106] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/19/2022] [Accepted: 03/02/2023] [Indexed: 03/14/2023]
Abstract
The development and widespread utilization of high-throughput sequencing technologies in biology has fueled the rapid growth of single-cell RNA sequencing (scRNA-seq) data over the past decade. The development of scRNA-seq technology has significantly expanded researchers' understanding of cellular heterogeneity. Accurate cell type identification is the prerequisite for any research on heterogeneous cell populations. However, due to the high noise and high dimensionality of scRNA-seq data, improving the effectiveness of cell type identification remains a challenge. As an effective dimensionality reduction method, Principal Component Analysis (PCA) is an essential tool for visualizing high-dimensional scRNA-seq data and identifying cell subpopulations. However, traditional PCA has some defects when used in mining the nonlinear manifold structure of the data and usually suffers from over-density of principal components (PCs). Therefore, we present a novel method in this paper called joint L2,p-norm and random walk graph constrained PCA (RWPPCA). RWPPCA aims to retain the data's local information in the process of mapping high-dimensional data to low-dimensional space, to more accurately obtain sparse principal components and to then identify cell types more precisely. Specifically, RWPPCA combines the random walk (RW) algorithm with graph regularization to more accurately determine the local geometric relationships between data points. Moreover, to mitigate the adverse effects of dense PCs, the L2,p-norm is introduced to make the PCs sparser, thus increasing their interpretability. Then, we evaluate the effectiveness of RWPPCA on simulated data and scRNA-seq data. The results show that RWPPCA performs well in cell type identification and outperforms other comparison methods.
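As a rough illustration of why an L2,p penalty favours sparse components, the sketch below computes the commonly used definition: the p-th root of the sum of row-wise Euclidean norms raised to the power p. It is an assumption that RWPPCA applies the norm row-wise in this way; the abstract does not say. The function name `l2p_norm` and the two toy matrices are hypothetical.

```python
import math

def l2p_norm(matrix, p):
    """Compute (sum over rows of ||row||_2 ** p) ** (1 / p) for p > 0."""
    if p <= 0:
        raise ValueError("p must be positive")
    row_norms = [math.sqrt(sum(x * x for x in row)) for row in matrix]
    return sum(r ** p for r in row_norms) ** (1.0 / p)

# Hypothetical loadings: same L2,1 value, but one matrix has an all-zero row.
dense = [[3.0, 4.0], [6.0, 8.0]]
row_sparse = [[0.0, 0.0], [9.0, 12.0]]

dense_p1 = l2p_norm(dense, 1)
sparse_p1 = l2p_norm(row_sparse, 1)
dense_p_half = l2p_norm(dense, 0.5)
sparse_p_half = l2p_norm(row_sparse, 0.5)
```

Both toy matrices have the same value at p = 1, but at p = 0.5 the dense matrix is penalized more than the row-sparse one, which is the behaviour a sparsity-inducing regularizer relies on.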
Affiliation(s)
- Tai-Ge Wang
  - School of Computer Science, Qufu Normal University, Rizhao 276826, China
- Jun-Liang Shang
  - School of Computer Science, Qufu Normal University, Rizhao 276826, China
- Jin-Xing Liu
  - School of Computer Science, Qufu Normal University, Rizhao 276826, China
- Feng Li
  - School of Computer Science, Qufu Normal University, Rizhao 276826, China
- Shasha Yuan
  - School of Computer Science, Qufu Normal University, Rizhao 276826, China
- Juan Wang
  - School of Computer Science, Qufu Normal University, Rizhao 276826, China
30
Møller AF, Madsen JGS. JOINTLY: interpretable joint clustering of single-cell transcriptomes. Nat Commun 2023; 14:8473. [PMID: 38123569 PMCID: PMC10733431 DOI: 10.1038/s41467-023-44279-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/14/2023] [Accepted: 12/06/2023] [Indexed: 12/23/2023] Open
Abstract
Single-cell and single-nucleus RNA-sequencing (sxRNA-seq) is increasingly being used to characterise the transcriptomic state of cell types at homeostasis, during development and in disease. However, this is a challenging task, as biological effects can be masked by technical variation. Here, we present JOINTLY, an algorithm enabling joint clustering of sxRNA-seq datasets across batches. JOINTLY performs on par or better than state-of-the-art batch integration methods in clustering tasks and outperforms other intrinsically interpretable methods. We demonstrate that JOINTLY is robust against over-correction while retaining subtle cell state differences between biological conditions and highlight how the interpretation of JOINTLY can be used to annotate cell types and identify active signalling programs across cell types and pseudo-time. Finally, we use JOINTLY to construct a reference atlas of white adipose tissue (WATLAS), an expandable and comprehensive community resource, in which we describe four adipocyte subpopulations and map compositional changes in obesity and between depots.
Affiliation(s)
- Andreas Fønss Møller
  - Institute of Biochemistry and Molecular Biology, University of Southern Denmark, Odense, Denmark
  - Sino-Danish College (SDC), University of Chinese Academy of Sciences, Beijing, China
- Jesper Grud Skat Madsen
  - Institute of Biochemistry and Molecular Biology, University of Southern Denmark, Odense, Denmark
  - Institute of Mathematics and Computer Science, University of Southern Denmark, Odense, Denmark
  - Center for Functional Genomics and Tissue Plasticity (ATLAS), Odense M, 5230, Denmark
  - The Novo Nordisk Foundation Center for Genomic Mechanisms of Disease, Broad Institute of MIT and Harvard, Cambridge, MA, 02142, USA
31
Gilis J, Perin L, Malfait M, Van den Berge K, Takele Assefa A, Verbist B, Risso D, Clement L. Differential detection workflows for multi-sample single-cell RNA-seq data. bioRxiv: the preprint server for biology 2023:2023.12.17.572043. [PMID: 38187695 PMCID: PMC10769270 DOI: 10.1101/2023.12.17.572043] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/09/2024]
Abstract
In single-cell transcriptomics, differential gene expression (DE) analyses typically focus on testing differences in the average expression of genes between cell types or conditions of interest. Single-cell transcriptomics, however, also has the promise to prioritise genes for which the expression differs in other aspects of the distribution. Here we develop a workflow for assessing differential detection (DD), which tests for differences in the average fraction of samples or cells in which a gene is detected. After benchmarking eight different DD data analysis strategies, we provide a unified workflow for jointly assessing DE and DD. Using simulations and two case studies, we show that DE and DD analysis provide complementary information, both in terms of the individual genes they report and in the functional interpretation of those genes.
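The difference between the two summaries, average expression for DE and fraction of cells with a detected gene for DD, can be made concrete with a small sketch. It only computes the descriptive quantities; the statistical tests benchmarked in the paper are not reproduced. The counts and the helper names `detection_fraction` and `summarize` are hypothetical.

```python
from statistics import mean

def detection_fraction(cell_counts):
    """Fraction of cells in one sample with at least one count for the gene."""
    return sum(1 for c in cell_counts if c > 0) / len(cell_counts)

def summarize(group):
    """Average expression and average detection fraction across samples."""
    avg_expression = mean(mean(sample) for sample in group)
    avg_detection = mean(detection_fraction(sample) for sample in group)
    return avg_expression, avg_detection

# Hypothetical counts of one gene: each inner list is a sample, each entry a cell.
group_a = [[0, 0, 4, 4], [0, 4, 0, 4]]
group_b = [[2, 2, 2, 2], [2, 2, 2, 2]]

expr_a, det_a = summarize(group_a)
expr_b, det_b = summarize(group_b)
```

In this toy case the two groups have the same average expression, so a comparison of means sees no difference, while the detection fraction differs by a factor of two. That is the kind of complementary signal the abstract attributes to DD.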
Affiliation(s)
- Jeroen Gilis
  - These authors contributed equally
  - Applied Mathematics, Computer science and Statistics, Ghent University, Ghent, 9000, Belgium
  - Bioinformatics Institute, Ghent University, Ghent, 9000, Belgium
  - Data Mining and Modeling for Biomedicine, VIB Flemish Institute for Biotechnology, Ghent, 9000, Belgium
- Laura Perin
  - These authors contributed equally
  - Department of Statistical Sciences, University of Padova, Padova, Italy
- Milan Malfait
  - Applied Mathematics, Computer science and Statistics, Ghent University, Ghent, 9000, Belgium
- Koen Van den Berge
  - Statistics and Decision Sciences, Johnson and Johnson Innovative Medicine, Beerse, Belgium
- Alemu Takele Assefa
  - Statistics and Decision Sciences, Johnson and Johnson Innovative Medicine, Beerse, Belgium
- Bie Verbist
  - Statistics and Decision Sciences, Johnson and Johnson Innovative Medicine, Beerse, Belgium
- Davide Risso
  - Department of Statistical Sciences, University of Padova, Padova, Italy
  - Padua Center for Network Medicine, University of Padova, Padova, Italy
- Lieven Clement
  - Applied Mathematics, Computer science and Statistics, Ghent University, Ghent, 9000, Belgium
  - Bioinformatics Institute, Ghent University, Ghent, 9000, Belgium
32
Neufeld A, Gao LL, Popp J, Battle A, Witten D. Inference after latent variable estimation for single-cell RNA sequencing data. Biostatistics 2023; 25:270-287. [PMID: 36511385 DOI: 10.1093/biostatistics/kxac047] [Citation(s) in RCA: 6] [Impact Index Per Article: 6.0] [Received: 06/15/2022] [Revised: 10/17/2022] [Accepted: 11/26/2022] [Indexed: 12/14/2022] Open
Abstract
In the analysis of single-cell RNA sequencing data, researchers often characterize the variation between cells by estimating a latent variable, such as cell type or pseudotime, representing some aspect of the cell's state. They then test each gene for association with the estimated latent variable. If the same data are used for both of these steps, then standard methods for computing p-values in the second step will fail to achieve statistical guarantees such as Type 1 error control. Furthermore, approaches such as sample splitting that can be applied to solve similar problems in other settings are not applicable in this context. In this article, we introduce count splitting, a flexible framework that allows us to carry out valid inference in this setting, for virtually any latent variable estimation technique and inference approach, under a Poisson assumption. We demonstrate the Type 1 error control and power of count splitting in a simulation study and apply count splitting to a data set of pluripotent stem cells differentiating to cardiomyocytes.
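The splitting step can be sketched as binomial thinning: each observed count is divided at random into a training part and a test part that always sum back to the original. Under the Poisson assumption named in the abstract, the two parts are independent, so a latent variable estimated on one can be tested on the other. This is a minimal sketch, not the authors' software; the function name `count_split`, the tuning parameter `eps`, and the toy matrix are assumptions made for illustration.

```python
import random

def count_split(counts, eps=0.5, seed=0):
    """Split a count matrix into train and test parts by binomial thinning.

    Each count x yields train ~ Binomial(x, eps) and test = x - train.
    """
    rng = random.Random(seed)
    train, test = [], []
    for row in counts:
        # Draw Binomial(x, eps) as the number of successes in x Bernoulli trials.
        train_row = [sum(1 for _ in range(x) if rng.random() < eps) for x in row]
        test_row = [x - t for x, t in zip(row, train_row)]
        train.append(train_row)
        test.append(test_row)
    return train, test

# Hypothetical toy matrix: rows are cells, columns are genes.
counts = [[3, 0, 12], [7, 1, 0], [0, 5, 9]]
train, test = count_split(counts, eps=0.5, seed=1)
```

A typical use would be to estimate clusters or pseudotime from `train` and then test each gene for association using `test`. Unlike sample splitting, every cell appears in both matrices.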
Affiliation(s)
- Anna Neufeld
  - Department of Statistics, University of Washington, Seattle, WA 98195, USA
- Lucy L Gao
  - Department of Statistics, University of British Columbia, BC V6T 1Z4, Canada
- Joshua Popp
  - Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21218, USA
- Alexis Battle
  - Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21218, USA
  - Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218, USA
- Daniela Witten
  - Department of Statistics, University of Washington, Seattle, WA 98195, USA
  - Department of Biostatistics, University of Washington, Seattle, WA 98195, USA
33
Edrisi M, Huang X, Ogilvie HA, Nakhleh L. Accurate integration of single-cell DNA and RNA for analyzing intratumor heterogeneity using MaCroDNA. Nat Commun 2023; 14:8262. [PMID: 38092737 PMCID: PMC10719311 DOI: 10.1038/s41467-023-44014-3] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 07/28/2022] [Accepted: 11/27/2023] [Indexed: 12/17/2023] Open
Abstract
Cancers develop and progress as mutations accumulate, and with the advent of single-cell DNA and RNA sequencing, researchers can observe these mutations and their transcriptomic effects and predict proteomic changes with remarkable temporal and spatial precision. However, to connect genomic mutations with their transcriptomic and proteomic consequences, cells with either only DNA data or only RNA data must be mapped to a common domain. For this purpose, we present MaCroDNA, a method that uses maximum weighted bipartite matching of per-gene read counts from single-cell DNA and RNA-seq data. Using ground truth information from colorectal cancer data, we demonstrate the advantage of MaCroDNA over existing methods in accuracy and speed. Exemplifying the utility of single-cell data integration in cancer research, we suggest, based on results derived using MaCroDNA, that genomic mutations of large effect size increasingly contribute to differential expression between cells as Barrett's esophagus progresses to esophageal cancer, reaffirming the findings of the previous studies.
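Maximum weighted bipartite matching, the technique named in this abstract, pairs each item on one side with a distinct item on the other so that the total pairing score is as large as possible. The sketch below solves a tiny instance by brute force over permutations. It is not MaCroDNA itself: the score matrix is hypothetical, how the method derives scores from per-gene read counts is not specified here, and a realistic dataset would need a polynomial-time solver such as the Hungarian algorithm.

```python
from itertools import permutations

def max_weight_matching(scores):
    """Return (pairs, total) for the best one-to-one assignment of rows to columns.

    Brute force over all permutations, so only suitable for very small matrices.
    """
    n = len(scores)
    best_total, best_perm = None, None
    for perm in permutations(range(n)):
        total = sum(scores[i][perm[i]] for i in range(n))
        if best_total is None or total > best_total:
            best_total, best_perm = total, perm
    return list(enumerate(best_perm)), best_total

# Hypothetical similarity scores: rows are RNA cells, columns are DNA cells.
scores = [
    [1, 9, 3],
    [8, 2, 4],
    [3, 2, 7],
]
pairs, total = max_weight_matching(scores)
```

Each pair in `pairs` is `(rna_cell_index, dna_cell_index)`. The matching is chosen jointly, so a cell may not receive its individually best partner if that partner is worth more to another cell.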
Affiliation(s)
- Xiru Huang
  - Department of Computer Science, Rice University, Houston, Texas, USA
- Huw A Ogilvie
  - Department of Computer Science, Rice University, Houston, Texas, USA
- Luay Nakhleh
  - Department of Computer Science, Rice University, Houston, Texas, USA
34
Ting KK, Coleman P, Kim HJ, Zhao Y, Mulangala J, Cheng NC, Li W, Gunatilake D, Johnstone DM, Loo L, Neely GG, Yang P, Götz J, Vadas MA, Gamble JR. Vascular senescence and leak are features of the early breakdown of the blood-brain barrier in Alzheimer's disease models. GeroScience 2023; 45:3307-3331. [PMID: 37782439 PMCID: PMC10643714 DOI: 10.1007/s11357-023-00927-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/20/2022] [Accepted: 08/27/2023] [Indexed: 10/03/2023] Open
Abstract
Alzheimer's disease (AD) is an age-related disease, with loss of integrity of the blood-brain barrier (BBB) being an early feature. Cellular senescence is one of the reported nine hallmarks of aging. Here, we show for the first time the presence of senescent cells in the vasculature in AD patients and mouse models of AD. Senescent endothelial cells and pericytes are present in APP/PS1 transgenic mice but not in wild-type littermates at the time of amyloid deposition. In vitro, senescent endothelial cells display altered VE-cadherin expression and loss of cell junction formation and increased permeability. Consistent with this, senescent endothelial cells in APP/PS1 mice are present at areas of vascular leak that have decreased claudin-5 and VE-cadherin expression confirming BBB breakdown. Furthermore, single cell sequencing of endothelial cells from APP/PS1 transgenic mice confirms that adhesion molecule pathways are among the most highly altered pathways in these cells. At the pre-plaque stage, the vasculature shows significant signs of breakdown, with a general loss of VE-cadherin, leakage within the microcirculation, and obvious pericyte perturbation. Although senescent vascular cells were not directly observed at sites of vascular leak, senescent cells were close to the leak area. Thus, we would suggest in AD that there is a progressive induction of senescence in constituents of the neurovascular unit contributing to an increasing loss of vascular integrity. Targeting the vasculature early in AD, either with senolytics or with drugs that improve the integrity of the BBB may be valid therapeutic strategies.
Affiliation(s)
- Ka Ka Ting
  - Vascular Biology Program, Centenary Institute, Camperdown, NSW, Australia
  - School of Medical Sciences, University of Sydney, Camperdown, NSW, Australia
- Paul Coleman
  - Vascular Biology Program, Centenary Institute, Camperdown, NSW, Australia
  - School of Medical Sciences, University of Sydney, Camperdown, NSW, Australia
- Hani Jieun Kim
  - Computational Systems Biology Group, Children's Medical Research Institute, Faculty of Medicine and Health, The University of Sydney, Westmead, NSW, 2145, Australia
- Yang Zhao
  - School of Medicine & Holistic Integrative Medicine, Nanjing University of Chinese Medicine, Nanjing, 210023, Jiangsu, China
- Jocelyne Mulangala
  - Vascular Biology Program, Centenary Institute, Camperdown, NSW, Australia
- Ngan Ching Cheng
  - Vascular Biology Program, Centenary Institute, Camperdown, NSW, Australia
- Wan Li
  - Department of General Surgery, Jiangsu Province Hospital of Chinese Medicine, Affiliated Hospital of Nanjing University of Chinese Medicine, Nanjing, 210029, China
- Dilini Gunatilake
  - Vascular Biology Program, Centenary Institute, Camperdown, NSW, Australia
- Daniel M Johnstone
  - School of Medical Sciences, University of Sydney, Camperdown, NSW, Australia
  - School of Biomedical Sciences & Pharmacy, University of Newcastle, Callaghan, NSW, Australia
- Lipin Loo
  - Charles Perkins Centre, Dr. John and Anne Chong Lab for Functional Genomics, Centenary Institute, & School of Life and Environmental Sciences, University of Sydney, Camperdown, NSW, Australia
- G Gregory Neely
  - Charles Perkins Centre, Dr. John and Anne Chong Lab for Functional Genomics, Centenary Institute, & School of Life and Environmental Sciences, University of Sydney, Camperdown, NSW, Australia
- Pengyi Yang
  - Computational Systems Biology Group, Children's Medical Research Institute, Faculty of Medicine and Health, The University of Sydney, Westmead, NSW, 2145, Australia
- Jürgen Götz
  - Clem Jones Centre for Ageing Dementia Research, Queensland Brain Institute, The University of Queensland, Brisbane, Australia
- Mathew A Vadas
  - Vascular Biology Program, Centenary Institute, Camperdown, NSW, Australia
  - Heart Research Institute, Sydney, NSW, Australia
- Jennifer R Gamble
  - Vascular Biology Program, Centenary Institute, Camperdown, NSW, Australia
  - School of Medical Sciences, University of Sydney, Camperdown, NSW, Australia
35
Apenteng OO, Aarestrup FM, Vigre H. Modelling the effectiveness of surveillance based on metagenomics in detecting, monitoring, and forecasting antimicrobial resistance in livestock production under economic constraints. Sci Rep 2023; 13:20410. [PMID: 37990114 PMCID: PMC10663573 DOI: 10.1038/s41598-023-47754-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/24/2022] [Accepted: 11/17/2023] [Indexed: 11/23/2023] Open
Abstract
Current surveillance of antimicrobial resistance (AMR) is mostly based on testing indicator bacteria using minimum inhibitory concentration (MIC) panels. Metagenomics has the potential to identify all known antimicrobial resistance genes (ARGs) in complex samples and thereby detect changes in their occurrence earlier. Here, we simulate the results of an AMR surveillance program based on metagenomics in the Danish pig population. We modelled both an increase in the occurrence of ARGs and an introduction of a new ARG in a few farms and the subsequent spread to the entire population. To make the simulation realistic, the total cost of the surveillance was constrained, and the sampling schedule was set at one pool per month with 5, 20, 50, or 100 samples. Our simulations demonstrate that a pool of 20-50 samples and a sequencing depth of 250 million fragments resulted in the shortest time to detection in both scenarios, with a time delay to detection of change of [Formula: see text]15 months in all scenarios. Compared with culture-based surveillance, our simulation indicates that there are neither significant reductions nor increases in time to detect a change using metagenomics. The benefit of metagenomics is that it is possible to monitor all known resistance in one sampling and laboratory procedure, in contrast to the current monitoring that is based on the phenotypic characterisation of selected indicator bacterial species. Therefore, overall changes in AMR in a population will be detected earlier using metagenomics because the resistance gene does not have to be transferred to and expressed by an indicator bacterium before it can be detected.
Affiliation(s)
- Ofosuhene O Apenteng
  - Research Group for Genomic Epidemiology, National Food Institute, Technical University of Denmark, Kongens Lyngby, Denmark
  - Section of Animal Welfare and Disease Control, Department of Veterinary and Animal Sciences, University of Copenhagen, Copenhagen, Denmark
- Frank M Aarestrup
  - Research Group for Genomic Epidemiology, National Food Institute, Technical University of Denmark, Kongens Lyngby, Denmark
- Håkan Vigre
  - Research Group for Genomic Epidemiology, National Food Institute, Technical University of Denmark, Kongens Lyngby, Denmark
36
Huang H, Liu C, Wagle MM, Yang P. Evaluation of deep learning-based feature selection for single-cell RNA sequencing data analysis. Genome Biol 2023; 24:259. [PMID: 37950331 PMCID: PMC10638755 DOI: 10.1186/s13059-023-03100-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/09/2022] [Accepted: 10/24/2023] [Indexed: 11/12/2023] Open
Abstract
BACKGROUND Feature selection is an essential task in single-cell RNA-seq (scRNA-seq) data analysis and can be critical for gene dimension reduction and downstream analyses, such as gene marker identification and cell type classification. Most popular methods for feature selection from scRNA-seq data are based on the concept of differential distribution wherein a statistical model is used to detect changes in gene expression among cell types. Recent development of deep learning-based feature selection methods provides an alternative approach compared to traditional differential distribution-based methods in that the importance of a gene is determined by neural networks. RESULTS In this work, we explore the utility of various deep learning-based feature selection methods for scRNA-seq data analysis. We sample from Tabula Muris and Tabula Sapiens atlases to create scRNA-seq datasets with a range of data properties and evaluate the performance of traditional and deep learning-based feature selection methods for cell type classification, feature selection reproducibility and diversity, and computational time. CONCLUSIONS Our study provides a reference for future development and application of deep learning-based feature selection methods for single-cell omics data analyses.
Affiliation(s)
- Hao Huang
- Computational Systems Biology Unit, Faculty of Medicine and Health, Children's Medical Research Institute, University of Sydney, Westmead, NSW, 2145, Australia
- School of Mathematics and Statistics, Faculty of Science, University of Sydney, Camperdown, NSW, 2006, Australia
- Sydney Precision Data Science Centre, University of Sydney, Camperdown, NSW, 2006, Australia
| | - Chunlei Liu
- Computational Systems Biology Unit, Faculty of Medicine and Health, Children's Medical Research Institute, University of Sydney, Westmead, NSW, 2145, Australia
- Sydney Precision Data Science Centre, University of Sydney, Camperdown, NSW, 2006, Australia
| | - Manoj M Wagle
- Computational Systems Biology Unit, Faculty of Medicine and Health, Children's Medical Research Institute, University of Sydney, Westmead, NSW, 2145, Australia
- School of Mathematics and Statistics, Faculty of Science, University of Sydney, Camperdown, NSW, 2006, Australia
- Sydney Precision Data Science Centre, University of Sydney, Camperdown, NSW, 2006, Australia
| | - Pengyi Yang
- Computational Systems Biology Unit, Faculty of Medicine and Health, Children's Medical Research Institute, University of Sydney, Westmead, NSW, 2145, Australia.
- School of Mathematics and Statistics, Faculty of Science, University of Sydney, Camperdown, NSW, 2006, Australia.
- Sydney Precision Data Science Centre, University of Sydney, Camperdown, NSW, 2006, Australia.
- Charles Perkins Centre, University of Sydney, Camperdown, NSW, 2006, Australia.
|
37
|
Zelig A, Kariti H, Kaplan N. KMD clustering: robust general-purpose clustering of biological data. Commun Biol 2023; 6:1110. [PMID: 37919399 PMCID: PMC10622433 DOI: 10.1038/s42003-023-05480-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/02/2023] [Accepted: 10/18/2023] [Indexed: 11/04/2023] Open
Abstract
The noisy and high-dimensional nature of biological data has spawned advanced clustering algorithms that are tailored for specific biological datatypes. However, the performance of such methods varies greatly between datasets and they require post hoc tuning of cryptic hyperparameters. We present k minimal distance (KMD) clustering, a general-purpose method based on a generalization of single and average linkage hierarchical clustering. We introduce a generalized silhouette-like function to eliminate the cryptic hyperparameter k, and use sampling to enable application to million-object datasets. Rigorous comparisons to general and specialized clustering methods on simulated, mass cytometry and scRNA-seq datasets show consistent high performance of KMD clustering across all datasets.
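The linkage idea named in this abstract can be illustrated with a minimal sketch. It assumes, since the abstract does not give the definition, that the "k minimal distance" between two clusters is the mean of the k smallest pairwise distances between their points. Under that assumption, k = 1 reduces to single linkage and k equal to the number of pairs reduces to average linkage. The function name `kmd_linkage` and the toy points are hypothetical, and the silhouette-like selection of k and the sampling step are not shown.

```python
import numpy as np

def kmd_linkage(a, b, k):
    """Mean of the k smallest pairwise Euclidean distances between clusters a and b.

    Assumed definition for illustration only; not taken from the paper's code.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # All pairwise distances between points of a and points of b, flattened.
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1).ravel()
    k = min(k, d.size)  # cap k at the number of available pairs
    return float(np.sort(d)[:k].mean())

# Two toy clusters in 2D (hypothetical data).
a = [[0.0, 0.0], [0.0, 1.0]]
b = [[3.0, 0.0], [4.0, 0.0]]

single = kmd_linkage(a, b, k=1)    # equals the minimum pairwise distance
average = kmd_linkage(a, b, k=4)   # equals the mean over all 4 pairs
middle = kmd_linkage(a, b, k=2)    # lies between the two extremes
```

Intermediate values of k trade the noise sensitivity of single linkage against the tendency of average linkage to blur elongated clusters, which is presumably why k is treated as a hyperparameter.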
Affiliation(s)
- Aviv Zelig
- Data Science & Engineering Program, Faculty of Industrial Engineering & Management, Technion - Israel Institute of Technology, Haifa, Israel
- Department of Physiology, Biophysics & Systems Biology, Rappaport Faculty of Medicine, Technion - Israel Institute of Technology, Haifa, Israel
| | - Hagai Kariti
- Department of Physiology, Biophysics & Systems Biology, Rappaport Faculty of Medicine, Technion - Israel Institute of Technology, Haifa, Israel
| | - Noam Kaplan
- Department of Physiology, Biophysics & Systems Biology, Rappaport Faculty of Medicine, Technion - Israel Institute of Technology, Haifa, Israel.
|
38
|
Ahsanuddin S, Wu AY. Single-cell transcriptomics of the ocular anterior segment: a comprehensive review. Eye (Lond) 2023; 37:3334-3350. [PMID: 37138096 PMCID: PMC10156079 DOI: 10.1038/s41433-023-02539-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/21/2022] [Revised: 03/07/2023] [Accepted: 04/11/2023] [Indexed: 05/05/2023] Open
Abstract
Elucidating the cellular and genetic composition of ocular tissues is essential for uncovering the pathophysiology of ocular diseases. Since the introduction of single-cell RNA sequencing (scRNA-seq) in 2009, vision researchers have performed extensive single-cell analyses to better understand transcriptome complexity and heterogeneity of ocular structures. This technology has revolutionized our ability to identify rare cell populations and to make cross-species comparisons of gene expression in both steady state and disease conditions. Importantly, single-cell transcriptomic analyses have enabled the identification of cell-type specific gene markers and signalling pathways between ocular cell populations. While most scRNA-seq studies have been conducted on retinal tissues, large-scale transcriptomic atlases pertaining to the ocular anterior segment have also been constructed in the past three years. This timely review provides vision researchers with an overview of scRNA-seq experimental design, technical limitations, and clinical applications in a variety of anterior segment-related ocular pathologies. We review open-access anterior segment-related scRNA-seq datasets and illustrate how scRNA-seq can be an indispensable tool for the development of targeted therapeutics.
Affiliation(s)
- Sofia Ahsanuddin
- Department of Ophthalmology, Byers Eye Institute, Stanford University School of Medicine, Stanford, CA, USA
- Department of Ophthalmology, New York Eye and Ear Infirmary of Mount Sinai, New York City, NY, USA
- Department of Ophthalmology, Icahn School of Medicine at Mount Sinai, New York City, NY, USA
| | - Albert Y Wu
- Department of Ophthalmology, Byers Eye Institute, Stanford University School of Medicine, Stanford, CA, USA.
|
39
|
Zhou Y, Luo K, Liang L, Chen M, He X. A new Bayesian factor analysis method improves detection of genes and biological processes affected by perturbations in single-cell CRISPR screening. Nat Methods 2023; 20:1693-1703. [PMID: 37770710 PMCID: PMC10630124 DOI: 10.1038/s41592-023-02017-4] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 02/11/2022] [Accepted: 08/18/2023] [Indexed: 09/30/2023]
Abstract
Clustered regularly interspaced short palindromic repeats (CRISPR) screening coupled with single-cell RNA sequencing has emerged as a powerful tool to characterize the effects of genetic perturbations on the whole transcriptome at a single-cell level. However, due to its sparsity and complex structure, analysis of single-cell CRISPR screening data is challenging. In particular, standard differential expression analysis methods are often underpowered to detect genes affected by CRISPR perturbations. We developed a statistical method for such data, called guided sparse factor analysis (GSFA). GSFA infers latent factors that represent coregulated genes or gene modules; by borrowing information from these factors, it infers the effects of genetic perturbations on individual genes. We demonstrated through extensive simulation studies that GSFA detects perturbation effects with much higher power than state-of-the-art methods. Using single-cell CRISPR data from human CD8+ T cells and neural progenitor cells, we showed that GSFA identified biologically relevant gene modules and specific genes affected by CRISPR perturbations, many of which were missed by existing methods, providing new insights into the functions of genes involved in T cell activation and neurodevelopment.
Affiliation(s)
- Yifan Zhou
- Graduate Program of Biophysical Sciences, University of Chicago, Chicago, IL, USA
- Department of Human Genetics, University of Chicago, Chicago, IL, USA
| | - Kaixuan Luo
- Department of Human Genetics, University of Chicago, Chicago, IL, USA
| | - Lifan Liang
- Department of Human Genetics, University of Chicago, Chicago, IL, USA
| | - Mengjie Chen
- Department of Human Genetics, University of Chicago, Chicago, IL, USA.
- Department of Medicine, University of Chicago, Chicago, IL, USA.
| | - Xin He
- Department of Human Genetics, University of Chicago, Chicago, IL, USA.
|
40
|
Carbonetto P, Luo K, Sarkar A, Hung A, Tayeb K, Pott S, Stephens M. GoM DE: interpreting structure in sequence count data with differential expression analysis allowing for grades of membership. Genome Biol 2023; 24:236. [PMID: 37858253 PMCID: PMC10588049 DOI: 10.1186/s13059-023-03067-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/03/2023] [Accepted: 09/20/2023] [Indexed: 10/21/2023] Open
Abstract
Parts-based representations, such as non-negative matrix factorization and topic modeling, have been used to identify structure from single-cell sequencing data sets, in particular structure that is not as well captured by clustering or other dimensionality reduction methods. However, interpreting the individual parts remains a challenge. To address this challenge, we extend methods for differential expression analysis by allowing cells to have partial membership to multiple groups. We call this grade of membership differential expression (GoM DE). We illustrate the benefits of GoM DE for annotating topics identified in several single-cell RNA-seq and ATAC-seq data sets.
Affiliation(s)
- Peter Carbonetto
- Department of Human Genetics, University of Chicago, Chicago, IL, USA
- Research Computing Center, University of Chicago, Chicago, IL, USA
| | - Kaixuan Luo
- Department of Human Genetics, University of Chicago, Chicago, IL, USA
| | - Abhishek Sarkar
- Department of Human Genetics, University of Chicago, Chicago, IL, USA
- Vesalius Therapeutics, Cambridge, MA, USA
| | - Anthony Hung
- Department of Human Genetics, University of Chicago, Chicago, IL, USA
- Section of Genetic Medicine, University of Chicago, Chicago, IL, USA
| | - Karl Tayeb
- Department of Human Genetics, University of Chicago, Chicago, IL, USA
- Committee on Genetics, Genomics and Systems Biology, University of Chicago, Chicago, IL, USA
| | - Sebastian Pott
- Department of Human Genetics, University of Chicago, Chicago, IL, USA
- Section of Genetic Medicine, University of Chicago, Chicago, IL, USA
| | - Matthew Stephens
- Department of Human Genetics, University of Chicago, Chicago, IL, USA.
- Department of Statistics, University of Chicago, Chicago, IL, USA.
|
41
|
Xia L, Lee C, Li JJ. scDEED: a statistical method for detecting dubious 2D single-cell embeddings and optimizing t-SNE and UMAP hyperparameters. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.04.21.537839. [PMID: 37163087 PMCID: PMC10168265 DOI: 10.1101/2023.04.21.537839] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 05/11/2023]
Abstract
Two-dimensional (2D) embedding methods are crucial for single-cell data visualization. Popular methods such as t-SNE and UMAP are commonly used for visualizing cell clusters; however, it is well known that t-SNE and UMAP's 2D embedding might not reliably inform the similarities among cell clusters. Motivated by this challenge, we developed a statistical method, scDEED, for detecting dubious cell embeddings output by any 2D-embedding method. By calculating a reliability score for every cell embedding, scDEED identifies the cell embeddings with low reliability scores as dubious and those with high reliability scores as trustworthy. Moreover, by minimizing the number of dubious cell embeddings, scDEED provides intuitive guidance for optimizing the hyperparameters of an embedding method. Applied to multiple scRNA-seq datasets, scDEED demonstrates its effectiveness for detecting dubious cell embeddings and optimizing the hyperparameters of t-SNE and UMAP.
|
42
|
Carbonetto P, Luo K, Sarkar A, Hung A, Tayeb K, Pott S, Stephens M. GoM DE: interpreting structure in sequence count data with differential expression analysis allowing for grades of membership. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.03.03.531029. [PMID: 36945441 PMCID: PMC10028846 DOI: 10.1101/2023.03.03.531029] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/11/2023]
Abstract
Parts-based representations, such as non-negative matrix factorization and topic modeling, have been used to identify structure from single-cell sequencing data sets, in particular structure that is not as well captured by clustering or other dimensionality reduction methods. However, interpreting the individual parts remains a challenge. To address this challenge, we extend methods for differential expression analysis by allowing cells to have partial membership to multiple groups. We call this grade of membership differential expression (GoM DE). We illustrate the benefits of GoM DE for annotating topics identified in several single-cell RNA-seq and ATAC-seq data sets.
Affiliation(s)
- Peter Carbonetto
- Department of Human Genetics, University of Chicago, Chicago, IL, USA
- Research Computing Center, University of Chicago, Chicago, IL, USA
| | - Kaixuan Luo
- Department of Human Genetics, University of Chicago, Chicago, IL, USA
| | - Abhishek Sarkar
- Department of Human Genetics, University of Chicago, Chicago, IL, USA
- Vesalius Therapeutics, Cambridge, MA, USA
| | - Anthony Hung
- Department of Human Genetics, University of Chicago, Chicago, IL, USA
- Section of Genetic Medicine, University of Chicago, Chicago, IL, USA
| | - Karl Tayeb
- Department of Human Genetics, University of Chicago, Chicago, IL, USA
- Committee on Genetics, Genomics and Systems Biology, University of Chicago, Chicago, IL, USA
| | - Sebastian Pott
- Department of Human Genetics, University of Chicago, Chicago, IL, USA
- Section of Genetic Medicine, University of Chicago, Chicago, IL, USA
| | - Matthew Stephens
- Department of Human Genetics, University of Chicago, Chicago, IL, USA
- Department of Statistics, University of Chicago, Chicago, IL, USA
|
43
|
O’Connor LM, O’Connor BA, Zeng J, Lo CH. Data Mining of Microarray Datasets in Translational Neuroscience. Brain Sci 2023; 13:1318. [PMID: 37759919 PMCID: PMC10527016 DOI: 10.3390/brainsci13091318] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/25/2023] [Revised: 09/04/2023] [Accepted: 09/10/2023] [Indexed: 09/29/2023] Open
Abstract
Data mining involves the computational analysis of a plethora of publicly available datasets to generate new hypotheses that can be further validated by experiments for the improved understanding of the pathogenesis of neurodegenerative diseases. Although the number of sequencing datasets is on the rise, microarray analyses conducted on diverse biological samples represent a large collection of datasets with multiple web-based programs that enable efficient and convenient data analysis. In this review, we first discuss the selection of biological samples associated with neurological disorders, and the possibility of a combination of datasets, from various types of samples, to conduct an integrated analysis in order to achieve a holistic understanding of the alterations in the examined biological system. We then summarize key approaches and studies that have made use of the data mining of microarray datasets to obtain insights into translational neuroscience applications, including biomarker discovery, therapeutic development, and the elucidation of the pathogenic mechanisms of neurodegenerative diseases. We further discuss the gap to be bridged between microarray and sequencing studies to improve the utilization and combination of different types of datasets, together with experimental validation, for more comprehensive analyses. We conclude by providing future perspectives on integrating multi-omics to advance precision phenotyping and personalized medicine for neurodegenerative diseases.
Affiliation(s)
- Lance M. O’Connor
- College of Biological Sciences, University of Minnesota, Minneapolis, MN 55455, USA
| | - Blake A. O’Connor
- School of Pharmacy, University of Wisconsin, Madison, WI 53705, USA
| | - Jialiu Zeng
- Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore 308232, Singapore
| | - Chih Hung Lo
- Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore 308232, Singapore
|
44
|
Schuster V, Krogh A. The Deep Generative Decoder: MAP estimation of representations improves modelling of single-cell RNA data. Bioinformatics 2023; 39:btad497. [PMID: 37572301 PMCID: PMC10483129 DOI: 10.1093/bioinformatics/btad497] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/29/2022] [Revised: 07/12/2023] [Accepted: 08/10/2023] [Indexed: 08/14/2023] Open
Abstract
MOTIVATION Learning low-dimensional representations of single-cell transcriptomics has become instrumental to its downstream analysis. The state of the art is currently represented by neural network models, such as variational autoencoders, which use a variational approximation of the likelihood for inference. RESULTS We here present the Deep Generative Decoder (DGD), a simple generative model that computes model parameters and representations directly via maximum a posteriori estimation. The DGD handles complex parameterized latent distributions naturally, unlike variational autoencoders, which typically use a fixed Gaussian distribution because of the complexity of adding other types. We first show its general functionality on a commonly used benchmark set, Fashion-MNIST. Secondly, we apply the model to multiple single-cell datasets. Here, the DGD learns low-dimensional, meaningful, and well-structured latent representations with sub-clustering beyond the provided labels. The advantages of this approach are its simplicity and its capability to provide representations of much smaller dimensionality than a comparable variational autoencoder. AVAILABILITY AND IMPLEMENTATION scDGD is available as a Python package at https://github.com/Center-for-Health-Data-Science/scDGD. The remaining code is made available here: https://github.com/Center-for-Health-Data-Science/dgd.
Affiliation(s)
- Viktoria Schuster
- Center for Health Data Science, University of Copenhagen, 2200 Copenhagen, Denmark
| | - Anders Krogh
- Center for Health Data Science, University of Copenhagen, 2200 Copenhagen, Denmark
- Department of Computer Science, University of Copenhagen, 2100 Copenhagen, Denmark
|
45
|
Nelson ED, Maynard KR, Nicholas KR, Tran MN, Divecha HR, Collado-Torres L, Hicks SC, Martinowich K. Activity-regulated gene expression across cell types of the mouse hippocampus. Hippocampus 2023; 33:1009-1027. [PMID: 37226416 PMCID: PMC11129873 DOI: 10.1002/hipo.23548] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 11/23/2022] [Revised: 04/19/2023] [Accepted: 05/06/2023] [Indexed: 05/26/2023]
Abstract
Activity-regulated gene (ARG) expression patterns in the hippocampus (HPC) regulate synaptic plasticity, learning, and memory, and are linked to both risk and treatment responses for many neuropsychiatric disorders. The HPC contains discrete classes of neurons with specialized functions, but cell type-specific activity-regulated transcriptional programs are not well characterized. Here, we used single-nucleus RNA-sequencing (snRNA-seq) in a mouse model of acute electroconvulsive seizures (ECS) to identify cell type-specific molecular signatures associated with induced activity in HPC neurons. We used unsupervised clustering and a priori marker genes to computationally annotate 15,990 high-quality HPC neuronal nuclei from N = 4 mice across all major HPC subregions and neuron types. Activity-induced transcriptomic responses were divergent across neuron populations, with dentate granule cells being particularly responsive to activity. Differential expression analysis identified both upregulated and downregulated cell type-specific gene sets in neurons following ECS. Within these gene sets, we identified enrichment of pathways associated with varying biological processes such as synapse organization, cellular signaling, and transcriptional regulation. Finally, we used matrix factorization to reveal continuous gene expression patterns differentially associated with cell type, ECS, and biological processes. This work provides a rich resource for interrogating activity-regulated transcriptional responses in HPC neurons at single-nuclei resolution in the context of ECS, which can provide biological insight into the roles of defined neuronal subtypes in HPC function.
Affiliation(s)
- Erik D. Nelson
- Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, MD, 21205, USA
- Department of Psychiatry and Behavioral Sciences, Johns Hopkins University School of Medicine, Baltimore, MD, 21205, USA
- Department of Neuroscience, Johns Hopkins University School of Medicine, Baltimore, MD, 21205, USA
| | - Kristen R. Maynard
- Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, MD, 21205, USA
- Department of Psychiatry and Behavioral Sciences, Johns Hopkins University School of Medicine, Baltimore, MD, 21205, USA
| | - Kyndall R. Nicholas
- Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, MD, 21205, USA
| | - Matthew N Tran
- Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, MD, 21205, USA
| | - Heena R. Divecha
- Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, MD, 21205, USA
| | - Leonardo Collado-Torres
- Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, MD, 21205, USA
| | - Stephanie C. Hicks
- Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, 21205, USA
| | - Keri Martinowich
- Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, MD, 21205, USA
- Department of Psychiatry and Behavioral Sciences, Johns Hopkins University School of Medicine, Baltimore, MD, 21205, USA
- Department of Neuroscience, Johns Hopkins University School of Medicine, Baltimore, MD, 21205, USA
- The Kavli Neuroscience Discovery Institute, Johns Hopkins University, Baltimore, MD, 21205
|
46
|
Bailey R, Sarkar A, Singh A, Dobra A, Kahveci T. Optimal Supervised Reduction of High Dimensional Transcription Data. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:3093-3105. [PMID: 37276117 DOI: 10.1109/tcbb.2023.3280557] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/07/2023]
Abstract
The plight of navigating high-dimensional transcription datasets remains a persistent problem. This problem is further amplified for complex disorders, such as cancer, as these disorders are often multigenic traits with multiple subsets of genes collectively affecting the type, stage, and severity of the trait. We are often faced with a trade-off between reducing the dimensionality of our datasets and maintaining the integrity of our data. To accomplish both tasks simultaneously for very high dimensional transcriptome for complex multigenic traits, we propose a new supervised technique, Class Separation Transformation (CST). CST accomplishes both tasks simultaneously by significantly reducing the dimensionality of the input space into a one-dimensional transformed space that provides optimal separation between the differing classes. Furthermore, CST offers a means of explainable ML, as it computes the relative importance of each feature for its contribution to class distinction, which can thus lead to deeper insights and discovery. We compare our method with existing state-of-the-art methods using both real and synthetic datasets, demonstrating that CST is the more accurate, robust, scalable, and computationally advantageous technique relative to existing methods. Code used in this paper is available at https://github.com/richiebailey74/CST.
|
47
|
Kang JB, Raveane A, Nathan A, Soranzo N, Raychaudhuri S. Methods and Insights from Single-Cell Expression Quantitative Trait Loci. Annu Rev Genomics Hum Genet 2023; 24:277-303. [PMID: 37196361 PMCID: PMC10784788 DOI: 10.1146/annurev-genom-101422-100437] [Citation(s) in RCA: 6] [Impact Index Per Article: 6.0] [Indexed: 05/19/2023]
Abstract
Recent advancements in single-cell technologies have enabled expression quantitative trait locus (eQTL) analysis across many individuals at single-cell resolution. Compared with bulk RNA sequencing, which averages gene expression across cell types and cell states, single-cell assays capture the transcriptional states of individual cells, including fine-grained, transient, and difficult-to-isolate populations at unprecedented scale and resolution. Single-cell eQTL (sc-eQTL) mapping can identify context-dependent eQTLs that vary with cell states, including some that colocalize with disease variants identified in genome-wide association studies. By uncovering the precise contexts in which these eQTLs act, single-cell approaches can unveil previously hidden regulatory effects and pinpoint important cell states underlying molecular mechanisms of disease. Here, we present an overview of recently deployed experimental designs in sc-eQTL studies. In the process, we consider the influence of study design choices such as cohort, cell states, and ex vivo perturbations. We then discuss current methodologies, modeling approaches, and technical challenges as well as future opportunities and applications.
Affiliation(s)
- Joyce B Kang
- Center for Data Sciences and Divisions of Genetics and Rheumatology, Department of Medicine, Brigham and Women's Hospital, Boston, Massachusetts, USA
- Department of Biomedical Informatics, Harvard Medical School, Boston, Massachusetts, USA
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, Massachusetts, USA
| | | | - Aparna Nathan
- Center for Data Sciences and Divisions of Genetics and Rheumatology, Department of Medicine, Brigham and Women's Hospital, Boston, Massachusetts, USA
- Department of Biomedical Informatics, Harvard Medical School, Boston, Massachusetts, USA
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, Massachusetts, USA
| | - Nicole Soranzo
- Human Technopole, Milan, Italy
- Department of Human Genetics, Wellcome Sanger Institute, Hinxton, United Kingdom
- British Heart Foundation Centre of Research Excellence and Department of Haematology, University of Cambridge, Cambridge, United Kingdom
| | - Soumya Raychaudhuri
- Center for Data Sciences and Divisions of Genetics and Rheumatology, Department of Medicine, Brigham and Women's Hospital, Boston, Massachusetts, USA
- Department of Biomedical Informatics, Harvard Medical School, Boston, Massachusetts, USA
- Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, Massachusetts, USA
- Centre for Genetics and Genomics Versus Arthritis, University of Manchester, Manchester, United Kingdom
|
48
|
Casey MJ, Fliege J, Sánchez-García RJ, MacArthur BD. An information-theoretic approach to single cell sequencing analysis. BMC Bioinformatics 2023; 24:311. [PMID: 37573291 PMCID: PMC10422744 DOI: 10.1186/s12859-023-05424-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/22/2022] [Accepted: 07/18/2023] [Indexed: 08/14/2023] Open
Abstract
BACKGROUND Single-cell sequencing (sc-Seq) experiments are producing increasingly large data sets. However, large data sets do not necessarily contain large amounts of information. RESULTS Here, we formally quantify the information obtained from a sc-Seq experiment and show that it corresponds to an intuitive notion of gene expression heterogeneity. We demonstrate a natural relation between our notion of heterogeneity and that of cell type, decomposing heterogeneity into that component attributable to differential expression between cell types (inter-cluster heterogeneity) and that remaining (intra-cluster heterogeneity). We test our definition of heterogeneity as the objective function of a clustering algorithm, and show that it is a useful descriptor for gene expression patterns associated with different cell types. CONCLUSIONS Thus, our definition of gene heterogeneity leads to a biologically meaningful notion of cell type, as groups of cells that are statistically equivalent with respect to their patterns of gene expression. Our measure of heterogeneity, and its decomposition into inter- and intra-cluster, is non-parametric, intrinsic, unbiased, and requires no additional assumptions about expression patterns. Based on this theory, we develop an efficient method for the automatic unsupervised clustering of cells from sc-Seq data, and provide an R package implementation.
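The inter-/intra-cluster split described in this abstract can be illustrated with a short sketch. It assumes, since the abstract does not state the formula, that a gene's heterogeneity is the Kullback-Leibler divergence between its normalized expression across cells and the uniform distribution. Under that assumption the split is the standard grouping property of the KL divergence and is exact. The function names and the toy counts are hypothetical.

```python
import numpy as np

def heterogeneity(counts):
    """KL divergence (in nats) of a gene's normalized counts from the uniform distribution."""
    x = np.asarray(counts, dtype=float)
    p = x / x.sum()
    nz = p > 0  # 0 * log(0) is taken as 0
    return float(np.sum(p[nz] * np.log(x.size * p[nz])))

def decompose(counts, labels):
    """Split heterogeneity into inter-cluster and intra-cluster parts."""
    x = np.asarray(counts, dtype=float)
    labels = np.asarray(labels)
    p = x / x.sum()
    inter, intra = 0.0, 0.0
    for c in np.unique(labels):
        members = labels == c
        mass = p[members].sum()  # share of expression carried by this cluster
        if mass == 0:
            continue  # a cluster with no expression contributes nothing
        share = members.sum() / x.size  # share of cells in this cluster
        inter += mass * np.log(mass / share)
        intra += mass * heterogeneity(x[members])
    return float(inter), float(intra)

# One gene measured in six cells, two clusters (hypothetical data).
counts = [0, 0, 1, 1, 10, 12]
labels = [0, 0, 0, 0, 1, 1]
total = heterogeneity(counts)
inter, intra = decompose(counts, labels)
flat = heterogeneity([5, 5, 5, 5])  # uniform expression carries no heterogeneity
```

Here `inter + intra` reproduces `total` up to floating-point error, so a clustering that maximizes the inter-cluster term leaves as little unexplained heterogeneity as possible within clusters.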
Affiliation(s)
- Michael J Casey
- Mathematical Sciences, University of Southampton, Southampton, UK
- Institute for Life Sciences, University of Southampton, Southampton, UK
| | - Jörg Fliege
- Mathematical Sciences, University of Southampton, Southampton, UK
| | - Rubén J Sánchez-García
- Mathematical Sciences, University of Southampton, Southampton, UK
- Institute for Life Sciences, University of Southampton, Southampton, UK
- The Alan Turing Institute, London, UK
| | - Ben D MacArthur
- Mathematical Sciences, University of Southampton, Southampton, UK
- Institute for Life Sciences, University of Southampton, Southampton, UK
- The Alan Turing Institute, London, UK
- Centre for Human Development, Stem Cells and Regeneration, Faculty of Medicine, University of Southampton, Southampton, UK
| |
|
49
|
Su C, Xu Z, Shan X, Cai B, Zhao H, Zhang J. Cell-type-specific co-expression inference from single cell RNA-sequencing data. Nat Commun 2023; 14:4846. [PMID: 37563115 PMCID: PMC10415381 DOI: 10.1038/s41467-023-40503-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/05/2023] [Accepted: 07/28/2023] [Indexed: 08/12/2023] Open
Abstract
The advancement of single cell RNA-sequencing (scRNA-seq) technology has enabled the direct inference of co-expressions in specific cell types, facilitating our understanding of cell-type-specific biological functions. For this task, the high sequencing depth variations and measurement errors in scRNA-seq data present two significant challenges, and they have not been adequately addressed by existing methods. We propose a statistical approach, CS-CORE, for estimating and testing cell-type-specific co-expressions, that explicitly models sequencing depth variations and measurement errors in scRNA-seq data. Systematic evaluations show that most existing methods suffered from inflated false positives as well as biased co-expression estimates and clustering analysis, whereas CS-CORE gave accurate estimates in these experiments. When applied to scRNA-seq data from postmortem brain samples from Alzheimer's disease patients/controls and blood samples from COVID-19 patients/controls, CS-CORE identified cell-type-specific co-expressions and differential co-expressions that were more reproducible and/or more enriched for relevant biological pathways than those inferred from existing methods.
Affiliation(s)
- Chang Su
- Department of Biostatistics, Yale University, New Haven, CT, USA
- Department of Biostatistics and Bioinformatics, Emory University, Atlanta, GA, USA
| | - Zichun Xu
- Department of Biostatistics, Yale University, New Haven, CT, USA
- Department of Biostatistics, University of Washington, Seattle, WA, USA
| | - Xinning Shan
- Department of Biostatistics, Yale University, New Haven, CT, USA
| | - Biao Cai
- Department of Biostatistics, Yale University, New Haven, CT, USA
- Department of Mathematical Sciences, University of Cincinnati, Cincinnati, OH, USA
| | - Hongyu Zhao
- Department of Biostatistics, Yale University, New Haven, CT, USA
| | - Jingfei Zhang
- Information Systems and Operations Management, Emory University, Atlanta, GA, USA
| |
|
50
|
Lause J, Ziegenhain C, Hartmanis L, Berens P, Kobak D. Compound models and Pearson residuals for normalization of single-cell RNA-seq data without UMIs. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.08.02.551637. [PMID: 37577688 PMCID: PMC10418209 DOI: 10.1101/2023.08.02.551637] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/15/2023]
Abstract
Before downstream analysis can reveal biological signals in single-cell RNA sequencing data, normalization and variance stabilization are required to remove technical noise. Recently, Pearson residuals based on negative binomial models have been suggested as an efficient normalization approach. These methods were developed for UMI-based sequencing protocols, where unique molecular identifiers (UMIs) help to remove PCR amplification noise by keeping track of the original molecules. In contrast, full-length protocols such as Smart-seq2 lack UMIs and retain amplification noise, making negative binomial models inapplicable. Here, we extend Pearson residuals to such read count data by modeling them as a compound process: we assume that the captured RNA molecules follow the negative binomial distribution, but are replicated according to an amplification distribution. Based on this model, we introduce compound Pearson residuals and show that they can be analytically obtained without explicit knowledge of the amplification distribution. Further, we demonstrate that compound Pearson residuals lead to a biologically meaningful gene selection and low-dimensional embeddings of complex Smart-seq2 datasets. Finally, we empirically study amplification distributions across several sequencing protocols, and suggest that they can be described by a broken power law. We show that the resulting compound distribution captures overdispersion and zero-inflation patterns characteristic of read count data. In summary, compound Pearson residuals provide an efficient and effective way to normalize read count data based on simple mechanistic assumptions.
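For orientation, the UMI-based starting point that this abstract extends can be sketched as follows. This is only the standard negative binomial Pearson residual, not the compound residual introduced in the preprint, whose exact form the abstract does not give. The function name, the toy matrix, the overdispersion value `theta=100`, and the clipping at the square root of the number of cells are all assumptions chosen for illustration.

```python
import numpy as np


def nb_pearson_residuals(counts, theta=100.0):
    """Pearson residuals under a negative binomial model with fixed overdispersion.

    counts: array of shape (cells, genes) with raw UMI counts.
    ASSUMPTIONS: theta and the sqrt(n_cells) clipping are illustrative defaults;
    every cell and every gene must have a nonzero total, otherwise mu is 0
    and the residual is undefined.
    """
    counts = np.asarray(counts, dtype=float)
    n_cells = counts.shape[0]
    cell_depth = counts.sum(axis=1, keepdims=True)   # total counts per cell
    gene_total = counts.sum(axis=0, keepdims=True)   # total counts per gene
    mu = cell_depth * gene_total / counts.sum()      # expected count per entry
    residuals = (counts - mu) / np.sqrt(mu + mu**2 / theta)
    clip = np.sqrt(n_cells)
    return np.clip(residuals, -clip, clip)


# Toy count matrix: 4 cells x 3 genes (hypothetical values).
toy_counts = np.array([
    [5, 0, 1],
    [3, 1, 0],
    [0, 8, 2],
    [1, 7, 1],
])
toy_residuals = nb_pearson_residuals(toy_counts)
print(toy_residuals.round(2))
```

Entries whose observed count exceeds the depth-adjusted expectation receive positive residuals, so genes with large residual variance across cells stand out as candidates for selection.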
Affiliation(s)
- Jan Lause
- Hertie Institute for AI in Brain Health, University of Tübingen, Germany
- Tübingen AI Center, Tübingen, Germany
| | | | - Leonard Hartmanis
- Department of Cell & Molecular Biology, Karolinska Institutet, Sweden
| | - Philipp Berens
- Hertie Institute for AI in Brain Health, University of Tübingen, Germany
- Tübingen AI Center, Tübingen, Germany
| | - Dmitry Kobak
- Hertie Institute for AI in Brain Health, University of Tübingen, Germany
- Tübingen AI Center, Tübingen, Germany
| |
|