1
Zhu X, Zhao L, Teng F, Meng S, Xie M. ScAGCN: Graph Convolutional Network with Adaptive Aggregation Mechanism for scRNA-seq Data Dimensionality Reduction. Interdiscip Sci 2025:10.1007/s12539-025-00702-w. [PMID: 40281370 DOI: 10.1007/s12539-025-00702-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/17/2024] [Revised: 03/10/2025] [Accepted: 03/12/2025] [Indexed: 04/29/2025]
Abstract
With the development of single-cell RNA-sequencing (scRNA-seq) technology, scRNA-seq data analysis faces major challenges due to large scale, high dimensionality, high noise, and high sparsity. To obtain accurate embedded representations of large-scale scRNA-seq data, we design a novel graph convolutional network with an adaptive aggregation mechanism. Based on the assumption that the aggregation order of different cells would be different, a graph convolutional network with an adaptive aggregation-based dimensionality reduction algorithm for scRNA-seq data is developed, named scAGCN. In scAGCN, a preprocessing step consisting of quality control and feature selection is implemented. Then, an approximate nearest neighbor graph is rapidly constructed. Finally, a graph convolutional network with an adaptive aggregation mechanism is constructed, in which the neighborhood selection strategy based on node distribution and similarity boxplots is designed, and the aggregation function is optimized by defining a similarity measurement between neighborhood nodes and the central node. The results show that scAGCN outperforms existing dimensionality reduction methods on 15 real scRNA-seq datasets, especially on 10 large-scale scRNA-seq datasets.
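The aggregation idea in this abstract (neighbors contribute according to their similarity to the central node) can be illustrated with a minimal sketch. This is not the scAGCN implementation: the cosine similarity, the self-loop weight of 1, and the names `cosine` and `aggregate` are assumptions made for illustration, and the learned weight matrix, non-linearity, and boxplot-based neighbor selection described by the authors are omitted.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def aggregate(features, neighbors, node):
    """One similarity-weighted aggregation step for `node` (illustrative only)."""
    center = features[node]
    # Assumption: each neighbor is weighted by its non-negative cosine
    # similarity to the central node; dissimilar neighbors contribute nothing.
    weights = {j: max(cosine(center, features[j]), 0.0) for j in neighbors[node]}
    total = 1.0 + sum(weights.values())  # the central node keeps weight 1
    out = list(center)
    for j, w in weights.items():
        for k, x in enumerate(features[j]):
            out[k] += w * x
    return [x / total for x in out]

# Hypothetical toy graph: cell 0 has one similar and one orthogonal neighbor.
features = {0: [1, 0], 1: [1, 0], 2: [0, 1]}
neighbors = {0: [1, 2]}
print(aggregate(features, neighbors, 0))
```

In this toy graph the orthogonal neighbor receives weight zero, so the aggregated vector for cell 0 stays aligned with its similar neighbor.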
Affiliation(s)
- Xiaoshu Zhu
  - School of Computer and Information Security, Guangxi Key Laboratory of Trusted Software, Guilin University of Electronic Technology, Guilin, 541004, China
- Liquan Zhao
  - School of Computer and Information Security, Guangxi Key Laboratory of Trusted Software, Guilin University of Electronic Technology, Guilin, 541004, China
- Fei Teng
  - School of Computer and Information Security, Guangxi Key Laboratory of Trusted Software, Guilin University of Electronic Technology, Guilin, 541004, China
- Shuang Meng
  - School of Computer Science and Engineering, Guangxi Normal University, Guilin, 541006, China
- Miao Xie
  - School of Computer Science and Engineering, Yulin Normal University, Yulin, 537000, China
2
Kedzierska KZ, Crawford L, Amini AP, Lu AX. Zero-shot evaluation reveals limitations of single-cell foundation models. Genome Biol 2025; 26:101. [PMID: 40251685 PMCID: PMC12007350 DOI: 10.1186/s13059-025-03574-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/23/2024] [Accepted: 04/09/2025] [Indexed: 04/20/2025] [Open access]
Abstract
Foundation models such as scGPT and Geneformer have not been rigorously evaluated in a setting where they are used without any further training (i.e., zero-shot). Understanding the performance of models in zero-shot settings is critical to applications that exclude the ability to fine-tune, such as discovery settings where labels are unknown. Our evaluation of the zero-shot performance of Geneformer and scGPT suggests that, in some cases, these models may face reliability challenges and could be outperformed by simpler methods. Our findings underscore the importance of zero-shot evaluations in development and deployment of foundation models in single-cell research.
Affiliation(s)
- Alex X Lu
  - Microsoft Research, Cambridge, MA, USA
3
Xie Y, Jing Z, Pan H, Xu X, Fang Q. Redefining the high variable genes by optimized LOESS regression with positive ratio. BMC Bioinformatics 2025; 26:104. [PMID: 40234751 PMCID: PMC12001687 DOI: 10.1186/s12859-025-06112-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/18/2024] [Accepted: 03/10/2025] [Indexed: 04/17/2025] [Open access]
Abstract
BACKGROUND Single-cell RNA sequencing allows for the exploration of transcriptomic features at the individual cell level, but the high dimensionality and sparsity of the data pose substantial challenges for downstream analysis. Feature selection, therefore, is a critical step to reduce dimensionality and enhance interpretability. RESULTS We developed a robust feature selection algorithm that leverages optimized locally estimated scatterplot smoothing regression (LOESS) to precisely capture the relationship between gene average expression level and positive ratio while minimizing overfitting. Our evaluations showed that our algorithm consistently outperforms eight leading feature selection methods across three benchmark criteria and helps improve downstream analysis, thus offering a significant improvement in gene subset selection. CONCLUSIONS By preserving key biological information through feature selection, GLP provides informative features to enhance the accuracy and effectiveness of downstream analyses.
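To make the two quantities in this abstract concrete, the sketch below computes each gene's average expression and positive ratio (the fraction of cells with a non-zero count) and smooths one against the other with a basic local linear fit using tricube weights. It is only an illustration of the general LOESS idea, not the authors' optimized procedure; the function names, the `frac` parameter, the choice of positive ratio as the predictor, and the toy count matrix are all assumptions.

```python
def gene_stats(counts):
    """counts: list of cells, each a list of per-gene counts.
    Returns (mean expression per gene, positive ratio per gene)."""
    n_cells = len(counts)
    n_genes = len(counts[0])
    means = [sum(cell[g] for cell in counts) / n_cells for g in range(n_genes)]
    pos = [sum(1 for cell in counts if cell[g] > 0) / n_cells for g in range(n_genes)]
    return means, pos

def loess_point(xs, ys, x0, frac=0.75):
    """Local linear fit at x0 with tricube weights over the nearest points."""
    n = len(xs)
    k = min(n, max(2, int(round(frac * n))))
    nearest = sorted(range(n), key=lambda i: abs(xs[i] - x0))[:k]
    dists = [abs(xs[i] - x0) for i in nearest]
    h = max(dists) or 1.0  # bandwidth: distance to the farthest neighbor used
    w = [(1 - (d / h) ** 3) ** 3 for d in dists]
    sw = sum(w)
    if sw == 0:  # all neighbors sit exactly at the bandwidth: fall back to a mean
        return sum(ys[i] for i in nearest) / k
    mx = sum(wi * xs[i] for wi, i in zip(w, nearest)) / sw
    my = sum(wi * ys[i] for wi, i in zip(w, nearest)) / sw
    sxx = sum(wi * (xs[i] - mx) ** 2 for wi, i in zip(w, nearest))
    sxy = sum(wi * (xs[i] - mx) * (ys[i] - my) for wi, i in zip(w, nearest))
    slope = sxy / sxx if sxx > 0 else 0.0
    return my + slope * (x0 - mx)

# Hypothetical 4-cell x 3-gene count matrix.
counts = [[0, 2, 5], [4, 0, 3], [2, 0, 6], [0, 2, 2]]
means, pos = gene_stats(counts)
# Assumption: positive ratio as predictor, mean expression as response.
fitted = [loess_point(pos, means, p, frac=1.0) for p in pos]
residuals = [m - f for m, f in zip(means, fitted)]
print(means, pos, residuals)
```

Genes whose observed mean sits well above the fitted curve would be candidates for selection under this kind of scheme; the actual selection rule is defined in the paper, not here.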
Affiliation(s)
- Yue Xie
  - College of Life Sciences, University of Chinese Academy of Sciences, Beijing, 100049, China
  - BGI Research, Shenzhen, 518083, China
  - BGI Research, Hangzhou, 310030, China
- Zehua Jing
  - College of Life Sciences, University of Chinese Academy of Sciences, Beijing, 100049, China
  - BGI Research, Shenzhen, 518083, China
  - BGI Research, Hangzhou, 310030, China
- Xun Xu
  - College of Life Sciences, University of Chinese Academy of Sciences, Beijing, 100049, China
  - BGI Research, Shenzhen, 518083, China
- Qi Fang
  - BGI Research, Shenzhen, 518083, China
4
Wu CH, Zhou X, Chen M. Exploring and mitigating shortcomings in single-cell differential expression analysis with a new statistical paradigm. Genome Biol 2025; 26:58. [PMID: 40098192 PMCID: PMC11912664 DOI: 10.1186/s13059-025-03525-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/18/2023] [Accepted: 03/05/2025] [Indexed: 03/19/2025] [Open access]
Abstract
BACKGROUND Differential expression analysis is pivotal in single-cell transcriptomics for unraveling cell-type-specific responses to stimuli. While numerous methods are available to identify differentially expressed genes in single-cell data, recent evaluations of both single-cell-specific methods and methods adapted from bulk studies have revealed significant shortcomings in performance. In this paper, we dissect the four major challenges in single-cell differential expression analysis: excessive zeros, normalization, donor effects, and cumulative biases. These "curses" underscore the limitations and conceptual pitfalls in existing workflows. RESULTS To address the limitations of current single-cell differential expression analysis methods, we propose GLIMES, a statistical framework that leverages UMI counts and zero proportions within a generalized Poisson/Binomial mixed-effects model to account for batch effects and within-sample variation. We rigorously benchmarked GLIMES against six existing differential expression methods using three case studies and simulations across different experimental scenarios, including comparisons across cell types, tissue regions, and cell states. Our results demonstrate that GLIMES is more adaptable to diverse experimental designs in single-cell studies and effectively mitigates key shortcomings of current approaches, particularly those related to normalization procedures. By preserving biologically meaningful signals, GLIMES offers improved performance in detecting differentially expressed genes. CONCLUSIONS By using absolute RNA expression rather than relative abundance, GLIMES improves sensitivity, reduces false discoveries, and enhances biological interpretability. This paradigm shift challenges existing workflows and highlights the need for careful consideration of normalization strategies, ultimately paving the way for more accurate and robust single-cell transcriptomic analyses.
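The contrast drawn here between absolute expression and relative abundance can be seen in a small worked example. The sketch below is not GLIMES and fits no mixed-effects model; it only shows, on hypothetical UMI counts, how library-size normalization can create an apparent difference for a gene whose absolute count is identical in two cells, and how a zero proportion is computed. The gene names and helper functions are assumptions.

```python
def relative_abundance(cell, scale=1_000_000):
    """Counts-per-million style normalization of one cell's UMI counts."""
    total = sum(cell.values())
    return {gene: scale * count / total for gene, count in cell.items()}

def zero_proportion(values):
    """Fraction of observations that are exactly zero."""
    return sum(1 for v in values if v == 0) / len(values)

# Hypothetical cells: GENE1 has the same absolute UMI count in both,
# but cell_b contains ten times more total RNA.
cell_a = {"GENE1": 10, "GENE2": 90}
cell_b = {"GENE1": 10, "GENE2": 990}

print(cell_a["GENE1"], cell_b["GENE1"])       # absolute counts agree
print(relative_abundance(cell_a)["GENE1"],    # relative abundances differ
      relative_abundance(cell_b)["GENE1"])
print(zero_proportion([0, 0, 3, 1]))
```

After normalization GENE1 appears ten times lower in the second cell even though its UMI count is unchanged, which is the kind of normalization-driven artifact the abstract refers to.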
Affiliation(s)
- Chih-Hsuan Wu
  - Department of Statistics, University of Chicago, Chicago, USA
- Xiang Zhou
  - Department of Biostatistics, University of Michigan, Ann Arbor, USA
- Mengjie Chen
  - Department of Human Genetics and Department of Medicine, University of Chicago, Chicago, USA
5
Zhang Y, Wang Y, Liu X, Feng X. PbImpute: Precise Zero Discrimination and Balanced Imputation in Single-Cell RNA Sequencing Data. J Chem Inf Model 2025; 65:2670-2684. [PMID: 39957720 PMCID: PMC11898086 DOI: 10.1021/acs.jcim.4c02125] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/15/2024] [Revised: 01/31/2025] [Accepted: 02/03/2025] [Indexed: 02/18/2025]
Abstract
Single-cell RNA sequencing (scRNA-seq) has emerged as a transformative technology for elucidating cellular heterogeneity at unprecedented resolution. However, technical limitations such as limited sequencing depth and mRNA capture efficiency often result in zero counts, commonly referred to as "dropout zeros" in scRNA-seq data. These zeros pose significant challenges to downstream analysis, as they can distort the interpretation of cellular transcriptomes. While numerous computational methods have been developed to address this challenge, existing approaches frequently suffer from either insufficient imputation of zeros (under-imputation) or excessive modification of zeros (over-imputation). Here, we propose a precisely balanced imputation (PbImpute) method designed to achieve optimal equilibrium between dropout recovery and biological zero preservation in scRNA-seq data. PbImpute employs a multistage approach: (1) Initial discrimination between technical dropouts and biological zeros through parameter optimization of a new zero-inflated negative binomial (ZINB) distribution model, followed by initial imputation; (2) Application of a uniquely designed static repair algorithm to enhance data fidelity; (3) Secondary dropout identification based on gene expression frequency and partition-specific coefficient of variation; (4) Graph-embedding neural network-based imputation; and (5) Implementation of a uniquely designed dynamic repair mechanism to mitigate over-imputation effects. PbImpute distinguishes itself by uniquely integrating ZINB modeling with static and dynamic repair. This advantageous combined approach achieves a balance between over- and under-imputation, while simultaneously preserving true biological zeros and reducing signal distortion. 
Comprehensive evaluation using both simulated and real scRNA-seq data sets demonstrated that PbImpute achieves superior performance (F1 Score = 0.88 at 83% dropout rate, ARI = 0.78 on PBMC) in discriminating between technical dropouts and biological zeros compared to state-of-the-art methods. The method significantly improves gene-gene and cell-cell correlation structures, enhances differential expression analysis sensitivity, optimizes clustering resolution and dimensional reduction visualization, and facilitates more accurate trajectory inference. Ablation studies confirmed the essential contribution of both the imputation and repair modules to the method's performance. The code is available at https://github.com/WyBioTeam/PbImpute. By enhancing the accuracy of scRNA-seq data imputation, PbImpute can improve the identification of cell subpopulations and the detection of differentially expressed genes, thereby facilitating more precise analyses of cellular heterogeneity and advancing disease research.
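As background for step (1), the sketch below shows how a standard zero-inflated negative binomial (ZINB) model splits an observed zero into a zero-inflation part and a negative binomial sampling part. It uses the textbook parametrization (zero-inflation probability `pi`, mean `mu`, dispersion `r`) and treats the zero-inflation component as the technical dropout, which is an assumption; the "new" ZINB model and the parameter optimization in PbImpute are not reproduced here.

```python
def nb_zero_prob(mu, r):
    """P(count = 0) under a negative binomial with mean mu and dispersion r."""
    return (r / (r + mu)) ** r

def zinb_zero_prob(pi, mu, r):
    """P(count = 0) under ZINB: a zero comes from the inflation component
    with probability pi, otherwise from the negative binomial itself."""
    return pi + (1 - pi) * nb_zero_prob(mu, r)

def dropout_posterior(pi, mu, r):
    """Posterior probability that an observed zero came from the
    zero-inflation component (interpreted here as a dropout)."""
    return pi / zinb_zero_prob(pi, mu, r)

# Hypothetical parameters for two genes with the same inflation rate.
print(dropout_posterior(pi=0.5, mu=1.0, r=1.0))   # lowly expressed gene
print(dropout_posterior(pi=0.5, mu=50.0, r=1.0))  # highly expressed gene
```

For a highly expressed gene the negative binomial rarely produces a zero, so an observed zero is attributed almost entirely to the inflation component; for a lowly expressed gene the two sources are harder to separate.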
Affiliation(s)
- Yi Zhang
  - School of Computer Science and Engineering, Guilin University of Technology, 12 Jiangan Road, Qixing District, Guilin 541004, China
  - Guangxi Key Laboratory of Embedded Technology and Intelligent System, Guilin University of Technology, 12 Jiangan Road, Qixing District, Guilin 541004, China
- Yin Wang
  - School of Computer Science and Engineering, Guilin University of Technology, 12 Jiangan Road, Qixing District, Guilin 541004, China
  - Guangxi Key Laboratory of Embedded Technology and Intelligent System, Guilin University of Technology, 12 Jiangan Road, Qixing District, Guilin 541004, China
- Xinyuan Liu
  - School of Computer Science and Engineering, Guilin University of Technology, 12 Jiangan Road, Qixing District, Guilin 541004, China
  - Guangxi Key Laboratory of Embedded Technology and Intelligent System, Guilin University of Technology, 12 Jiangan Road, Qixing District, Guilin 541004, China
- Xi Feng
  - School of Computer Science and Engineering, Guilin University of Technology, 12 Jiangan Road, Qixing District, Guilin 541004, China
  - Guangxi Key Laboratory of Embedded Technology and Intelligent System, Guilin University of Technology, 12 Jiangan Road, Qixing District, Guilin 541004, China
6
Pouyabahar D, Andrews T, Bader GD. Interpretable single-cell factor decomposition using sciRED. Nat Commun 2025; 16:1878. [PMID: 39987196 PMCID: PMC11846867 DOI: 10.1038/s41467-025-57157-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/29/2024] [Accepted: 02/10/2025] [Indexed: 02/24/2025] [Open access]
Abstract
Single-cell RNA sequencing maps gene expression heterogeneity within a tissue. However, identifying biological signals in this data is challenging due to confounding technical factors, sparsity, and high dimensionality. Data factorization methods address this by separating and identifying signals in the data, such as gene expression programs, but the resulting factors must be manually interpreted. We developed Single-Cell Interpretable REsidual Decomposition (sciRED) to improve the interpretation of scRNA-seq factor analysis. sciRED removes known confounding effects, uses rotations to improve factor interpretability, maps factors to known covariates, identifies unexplained factors that may capture hidden biological phenomena, and determines the genes and biological processes represented by the resulting factors. We apply sciRED to multiple scRNA-seq datasets and identify sex-specific variation in a kidney map, discern strong and weak immune stimulation signals in a PBMC dataset, reduce ambient RNA contamination in a rat liver atlas to help identify strain variation and reveal rare cell type signatures and anatomical zonation gene programs in a healthy human liver map. These demonstrate that sciRED is useful in characterizing diverse biological signals within scRNA-seq data.
Affiliation(s)
- Delaram Pouyabahar
  - Department of Molecular Genetics, University of Toronto, Toronto, ON, Canada
  - The Donnelly Centre, University of Toronto, Toronto, ON, Canada
- Tallulah Andrews
  - Department of Biochemistry, Schulich School of Medicine and Dentistry, University of Western Ontario, London, ON, Canada
  - Department of Computer Science, University of Western Ontario, London, ON, Canada
- Gary D Bader
  - Department of Molecular Genetics, University of Toronto, Toronto, ON, Canada
  - The Donnelly Centre, University of Toronto, Toronto, ON, Canada
  - Department of Computer Science, University of Toronto, Toronto, ON, Canada
  - Lunenfeld-Tanenbaum Research Institute, Toronto, ON, Canada
  - Princess Margaret Research Institute, University Health Network, Toronto, ON, Canada
  - CIFAR Multiscale Human Program, CIFAR, Toronto, ON, Canada
7
Chockalingam SP, Aluru M, Aluru S. SCEMENT: scalable and memory efficient integration of large-scale single-cell RNA-sequencing data. Bioinformatics 2025; 41:btaf057. [PMID: 39985442 PMCID: PMC12013815 DOI: 10.1093/bioinformatics/btaf057] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/28/2024] [Revised: 11/18/2024] [Accepted: 02/20/2025] [Indexed: 02/24/2025] [Open access]
Abstract
MOTIVATION Integrative analysis of large-scale single-cell data collected from diverse cell populations promises an improved understanding of complex biological systems. While several algorithms have been developed for single-cell RNA-sequencing data integration, many lack the scalability to handle large numbers of datasets and/or millions of cells due to their memory and run time requirements. The few tools that can handle large data do so by reducing the computational burden through strategies such as subsampling of the data or selecting a reference dataset to improve computational efficiency and scalability. Such shortcuts, however, hamper the accuracy of downstream analyses, especially those requiring quantitative gene expression information. RESULTS We present SCEMENT, a SCalablE and Memory-Efficient iNTegration method, to overcome these limitations. Our new parallel algorithm builds upon and extends the linear regression model previously applied in ComBat to an unsupervised sparse matrix setting to enable accurate integration of diverse and large collections of single-cell RNA-sequencing data. Using tens to hundreds of real single-cell RNA-seq datasets, we show that SCEMENT outperforms ComBat as well as FastIntegration and Scanorama in runtime (up to 214× faster) and memory usage (up to 17.5× less). It not only performs batch correction and integration of millions of cells in under 25 min, but also facilitates the discovery of new rare cell types and more robust reconstruction of gene regulatory networks with full quantitative gene expression information. AVAILABILITY AND IMPLEMENTATION Source code freely available for download at https://github.com/AluruLab/scement, implemented in C++ and supported on Linux.
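For readers unfamiliar with regression-based batch correction, the sketch below shows its simplest ingredient: removing a per-batch location shift for one gene and restoring the overall mean. This is a deliberately reduced illustration, not SCEMENT or ComBat; the scale adjustment, empirical Bayes shrinkage, covariates, sparse-matrix handling, and parallelism are all omitted, and the function name and toy values are assumptions.

```python
def center_batches(values, batches):
    """Location-only batch adjustment for a single gene.

    values:  expression of the gene in each cell
    batches: batch label of each cell (same order as `values`)
    Each cell's value is shifted so that every batch has the grand mean.
    """
    grand = sum(values) / len(values)
    sums, counts = {}, {}
    for v, b in zip(values, batches):
        sums[b] = sums.get(b, 0.0) + v
        counts[b] = counts.get(b, 0) + 1
    batch_mean = {b: sums[b] / counts[b] for b in sums}
    return [v - batch_mean[b] + grand for v, b in zip(values, batches)]

# Hypothetical gene measured in two batches with a strong offset.
values = [1.0, 3.0, 10.0, 14.0]
batches = ["a", "a", "b", "b"]
print(center_batches(values, batches))  # [6.0, 8.0, 5.0, 9.0]
```

Within-batch differences are preserved while the offset between batches disappears; a full method would additionally protect biological covariates so that real differences between batches are not removed.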
Affiliation(s)
- Sriram P Chockalingam
  - Institute for Data Engineering and Science, Georgia Institute of Technology, Atlanta, GA-30332, United States
- Maneesha Aluru
  - School of Biological Sciences, Georgia Institute of Technology, Atlanta, GA-30332, United States
- Srinivas Aluru
  - School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, GA-30332, United States
8
Golchin A, Shams F, Moradi F, Sadrabadi AE, Parviz S, Alipour S, Ranjbarvan P, Hemmati Y, Rahnama M, Rasmi Y, Aziz SGG. Single-cell Technology in Stem Cell Research. Curr Stem Cell Res Ther 2025; 20:9-32. [PMID: 38243989 DOI: 10.2174/011574888x265479231127065541] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/11/2023] [Revised: 09/23/2023] [Accepted: 10/04/2023] [Indexed: 01/22/2024]
Abstract
Single-cell technology (SCT), which enables the examination of the fundamental units comprising biological organs, tissues, and cells, has emerged as a powerful tool, particularly in the field of biology, with a profound impact on stem cell research. This innovative technology opens new pathways for acquiring cell-specific data and gaining insights into the molecular pathways governing organ function and biology. SCT is not only frequently used to explore rare and diverse cell types, including stem cells, but it also unveils the intricacies of cellular diversity and dynamics. This perspective, crucial for advancing stem cell research, facilitates non-invasive analyses of molecular dynamics and cellular functions over time. Despite numerous investigations into potential stem cell therapies for genetic disorders, degenerative conditions, and severe injuries, the number of approved stem cell-based treatments remains limited. This limitation is attributed to the various heterogeneities present among stem cell sources, hindering their widespread clinical utilization. Furthermore, stem cell research is intimately connected with cutting-edge technologies, such as microfluidic organoids, CRISPR technology, and cell/tissue engineering. Each strategy developed to overcome the constraints of stem cell research has the potential to significantly impact advanced stem cell therapies. Drawing on the advantages and progress achieved through SCT-based approaches, this study aims to provide an overview of the advancements and concepts associated with the utilization of SCT in stem cell research and its related fields.
Affiliation(s)
- Ali Golchin
  - Cellular and Molecular Research Center, Cellular and Molecular Medicine Institute, Urmia University of Medical Sciences, Urmia, Iran
  - Department of Clinical Biochemistry and Applied Cell Sciences, School of Medicine, Urmia University of Medical Sciences, Urmia, Iran
- Forough Shams
  - Department of Medical Biotechnology, School of Advanced Technologies in Medicine, Shahid Beheshti University of Medical Sciences, Tehran, Iran
- Faezeh Moradi
  - Department of Tissue Engineering, School of Medicine, Tarbiat Modares University, Tehran, Iran
- Amin Ebrahimi Sadrabadi
  - Department of Stem Cells and Developmental Biology, Cell Science Research Center, Royan Institute for Stem Cell Biology and Technology, ACECR, Tehran, Iran
- Shima Parviz
  - Department of Tissue Engineering and Applied Cell Sciences, School of Advanced Medical Sciences and Technologies, Shiraz University of Medical Sciences, Shiraz, Iran
- Shahriar Alipour
  - Cellular and Molecular Research Center, Cellular and Molecular Medicine Institute, Urmia University of Medical Sciences, Urmia, Iran
  - Department of Clinical Biochemistry and Applied Cell Sciences, School of Medicine, Urmia University of Medical Sciences, Urmia, Iran
- Parviz Ranjbarvan
  - Cellular and Molecular Research Center, Cellular and Molecular Medicine Institute, Urmia University of Medical Sciences, Urmia, Iran
  - Department of Clinical Biochemistry and Applied Cell Sciences, School of Medicine, Urmia University of Medical Sciences, Urmia, Iran
- Yaser Hemmati
  - Department of Prosthodontics, Dental Faculty, Urmia University of Medical Science, Urmia, Iran
- Maryam Rahnama
  - Department of Clinical Biochemistry and Applied Cell Sciences, School of Medicine, Urmia University of Medical Sciences, Urmia, Iran
- Yousef Rasmi
  - Department of Clinical Biochemistry and Applied Cell Sciences, School of Medicine, Urmia University of Medical Sciences, Urmia, Iran
- Shiva Gholizadeh-Ghaleh Aziz
  - Department of Clinical Biochemistry and Applied Cell Sciences, School of Medicine, Urmia University of Medical Sciences, Urmia, Iran
9
Pouyabahar D, Andrews T, Bader GD. Interpretable single-cell factor decomposition using sciRED. bioRxiv 2024:2024.08.01.605536. [PMID: 39149356 PMCID: PMC11326131 DOI: 10.1101/2024.08.01.605536] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/17/2024]
Abstract
Single-cell RNA sequencing (scRNA-seq) maps gene expression heterogeneity within a tissue. However, identifying biological signals in this data is challenging due to confounding technical factors, sparsity, and high dimensionality. Data factorization methods address this by separating and identifying signals in the data, such as gene expression programs, but the resulting factors must be manually interpreted. We developed Single-Cell Interpretable REsidual Decomposition (sciRED) to improve the interpretation of scRNA-seq factor analysis. sciRED removes known confounding effects, uses rotations to improve factor interpretability, maps factors to known covariates, identifies unexplained factors that may capture hidden biological phenomena and determines the genes and biological processes represented by the resulting factors. We apply sciRED to multiple scRNA-seq datasets and identify sex-specific variation in a kidney map, discern strong and weak immune stimulation signals in a PBMC dataset, reduce ambient RNA contamination in a rat liver atlas to help identify strain variation, and reveal rare cell type signatures and anatomical zonation gene programs in a healthy human liver map. These demonstrate that sciRED is useful in characterizing diverse biological signals within scRNA-seq data.
Affiliation(s)
- Delaram Pouyabahar
  - Department of Molecular Genetics, University of Toronto, Toronto, Ontario, Canada
  - The Donnelly Centre, University of Toronto, Toronto, Ontario, Canada
- Tallulah Andrews
  - Department of Biochemistry, Schulich School of Medicine and Dentistry, University of Western Ontario, London, Ontario, Canada
  - Department of Computer Science, University of Western Ontario, London, Ontario, Canada
- Gary D Bader
  - Department of Molecular Genetics, University of Toronto, Toronto, Ontario, Canada
  - The Donnelly Centre, University of Toronto, Toronto, Ontario, Canada
  - Department of Computer Science, University of Toronto, Toronto, Ontario, Canada
  - Lunenfeld-Tanenbaum Research Institute, Toronto, Ontario, Canada
  - Princess Margaret Research Institute, University Health Network, Toronto, Ontario, Canada
  - CIFAR Multiscale Human Program, CIFAR, Toronto, Ontario, Canada
10
Shi M, Li X. Addressing scalability and managing sparsity and dropout events in single-cell representation identification with ZIGACL. Brief Bioinform 2024; 26:bbae703. [PMID: 39775477 PMCID: PMC11705091 DOI: 10.1093/bib/bbae703] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/30/2024] [Revised: 11/06/2024] [Accepted: 12/23/2024] [Indexed: 01/11/2025] [Open access]
Abstract
Despite significant advancements in single-cell representation learning, scalability and managing sparsity and dropout events continue to challenge the field as scRNA-seq datasets expand. While current computational tools struggle to maintain both efficiency and accuracy, the accurate connection of these dropout events to specific biological functions usually requires additional, complex experiments, often hampered by potential inaccuracies in cell-type annotation. To tackle these challenges, the Zero-Inflated Graph Attention Collaborative Learning (ZIGACL) method has been developed. This innovative approach combines a Zero-Inflated Negative Binomial model with a Graph Attention Network, leveraging mutual information from neighboring cells to enhance dimensionality reduction and apply dynamic adjustments to the learning process through a co-supervised deep graph clustering model. ZIGACL's integration of denoising and topological embedding significantly improves clustering accuracy and ensures similar cells are grouped closely in the latent space. Comparative analyses across nine real scRNA-seq datasets have shown that ZIGACL significantly enhances single-cell data analysis by offering superior clustering performance and improved stability in cell representations, effectively addressing scalability and managing sparsity and dropout events, thereby advancing our understanding of cellular heterogeneity.
Affiliation(s)
- Mingguang Shi
  - School of Electrical Engineering and Automation, Hefei University of Technology, Hefei, Anhui, China
- Xuefeng Li
  - School of Electrical Engineering and Automation, Hefei University of Technology, Hefei, Anhui, China
11
Guo W, Li X, Wang D, Yan N, Hu Q, Yang F, Zhang X, Yao J, Gu J. scStateDynamics: deciphering the drug-responsive tumor cell state dynamics by modeling single-cell level expression changes. Genome Biol 2024; 25:297. [PMID: 39574111 PMCID: PMC11583649 DOI: 10.1186/s13059-024-03436-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/14/2024] [Accepted: 11/15/2024] [Indexed: 11/24/2024] [Open access]
Abstract
Understanding tumor cell heterogeneity and plasticity is crucial for overcoming drug resistance. Single-cell technologies enable analyzing cell states at a given condition, but catenating static cell snapshots to characterize dynamic drug responses remains challenging. Here, we propose scStateDynamics, an algorithm to infer tumor cell state dynamics and identify common drug effects by modeling single-cell level gene expression changes. Its reliability is validated on both simulated and lineage tracing data. Application to real tumor drug treatment datasets identifies more subtle cell subclusters with different drug responses beyond static transcriptome similarity and disentangles drug action mechanisms from the cell-level expression changes.
Affiliation(s)
- Wenbo Guo
  - MOE Key Lab of Bioinformatics, Department of Automation, BNRIST Bioinformatics Division, Tsinghua University, Beijing, China
- Xinqi Li
  - MOE Key Lab of Bioinformatics, Department of Automation, BNRIST Bioinformatics Division, Tsinghua University, Beijing, China
- Dongfang Wang
  - Biomedical Pioneering Innovation Center (BIOPIC), Peking University, Beijing, China
- Nan Yan
  - MOE Key Lab of Bioinformatics, Department of Automation, BNRIST Bioinformatics Division, Tsinghua University, Beijing, China
- Qifan Hu
  - MOE Key Lab of Bioinformatics, Department of Automation, BNRIST Bioinformatics Division, Tsinghua University, Beijing, China
- Fan Yang
  - AI Lab, Shenzhen, Tencent, China
- Xuegong Zhang
  - MOE Key Lab of Bioinformatics, Department of Automation, BNRIST Bioinformatics Division, Tsinghua University, Beijing, China
  - Center for Synthetic and Systems Biology, School of Life Sciences and School of Medicine, Tsinghua University, Beijing, China
- Jin Gu
  - MOE Key Lab of Bioinformatics, Department of Automation, BNRIST Bioinformatics Division, Tsinghua University, Beijing, China
12
Yang J, Wang L, Liu L, Zheng X. GraphPCA: a fast and interpretable dimension reduction algorithm for spatial transcriptomics data. Genome Biol 2024; 25:287. [PMID: 39511664 PMCID: PMC11545739 DOI: 10.1186/s13059-024-03429-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/22/2024] [Accepted: 10/29/2024] [Indexed: 11/15/2024] [Open access]
Abstract
The rapid advancement of spatial transcriptomics technologies has revolutionized our understanding of cell heterogeneity and intricate spatial structures within tissues and organs. However, the high dimensionality and noise in spatial transcriptomic data present significant challenges for downstream data analyses. Here, we develop GraphPCA, an interpretable and quasi-linear dimension reduction algorithm that leverages the strengths of graphical regularization and principal component analysis. Comprehensive evaluations on simulated and multi-resolution spatial transcriptomic datasets generated from various platforms demonstrate the capacity of GraphPCA to enhance downstream analysis tasks including spatial domain detection, denoising, and trajectory inference compared to other state-of-the-art methods.
Collapse
Affiliation(s)
- Jiyuan Yang
- Center for Single-Cell Omics, School of Public Health, Shanghai Jiao Tong University School of Medicine, Shanghai, China
- Lu Wang
- Center for Single-Cell Omics, School of Public Health, Shanghai Jiao Tong University School of Medicine, Shanghai, China
- The Guangxi Key Laboratory of Intelligent Precision Medicine, Guangxi Zhuang Autonomous Region, Nanning, China
- Lin Liu
- Institute of Natural Sciences, MOE-LSC, School of Mathematical Sciences, CMA-Shanghai, SJTU-Yale Joint Center for Biostatistics and Data Science, Shanghai Jiao Tong University and Shanghai Artificial Intelligence Laboratory, Shanghai, China
- Xiaoqi Zheng
- Center for Single-Cell Omics, School of Public Health, Shanghai Jiao Tong University School of Medicine, Shanghai, China.
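The GraphPCA abstract in this entry describes combining graph regularization with principal component analysis. The Python sketch below illustrates a generic graph-regularized PCA with a closed-form solution; it is an illustration under stated assumptions, not the authors' implementation. The function name `graph_regularized_pca`, the k-nearest-neighbour spatial graph, and the parameters `k` and `lam` are hypothetical choices made for this example.

```python
import numpy as np

def graph_regularized_pca(X, coords, n_components=2, k=4, lam=1.0):
    """Generic graph-regularized PCA (illustrative sketch, assumed formulation).

    Minimizes ||Xc - Z W^T||_F^2 + lam * tr(Z^T L Z) subject to W^T W = I,
    where L is the Laplacian of a spatial k-NN graph.
    X: (n_spots, n_genes) expression matrix; coords: (n_spots, 2) positions.
    """
    n = X.shape[0]
    Xc = X - X.mean(axis=0)  # center each gene

    # Symmetric k-nearest-neighbour adjacency built from spatial coordinates.
    d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)  # exclude self-neighbours
    nn = np.argsort(d2, axis=1)[:, :k]
    A = np.zeros((n, n))
    A[np.repeat(np.arange(n), k), nn.ravel()] = 1.0
    A = np.maximum(A, A.T)
    L = np.diag(A.sum(axis=1)) - A  # unnormalized graph Laplacian

    # For fixed W the optimal embedding is Z = (I + lam L)^{-1} Xc W, and the
    # optimal W spans the top eigenvectors of Xc^T (I + lam L)^{-1} Xc.
    S = np.linalg.solve(np.eye(n) + lam * L, Xc)
    M = Xc.T @ S
    M = (M + M.T) / 2  # remove round-off asymmetry before eigh
    _, evecs = np.linalg.eigh(M)  # eigenvalues in ascending order
    W = evecs[:, ::-1][:, :n_components]
    Z = S @ W
    return Z, W
```

With `lam=0` the sketch reduces to ordinary PCA; larger values pull the embeddings of neighbouring spots closer together.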
13
Usman K, Wan F, Zhao D, Peng J, Zeng J. Analyzing Large-Scale Single-Cell RNA-Seq Data Using Coreset. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2024; 21:1784-1793. [PMID: 38913513 DOI: 10.1109/tcbb.2024.3418078] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/26/2024]
Abstract
The recent boom in single-cell sequencing technologies provides valuable insights into the transcriptomes of individual cells. Through single-cell data analyses, a number of biological discoveries, such as novel cell types, developmental cell lineage trajectories, and gene regulatory networks, have been uncovered. However, the massive and rapidly accumulating single-cell datasets also pose a serious computational and analytical challenge for researchers. To address this issue, one typically applies dimensionality reduction approaches to reduce the large-scale datasets. However, these approaches are generally computationally infeasible for tall matrices. In addition, downstream analysis tasks such as clustering still have high time complexity even on the dimension-reduced datasets. We present single-cell Coreset (scCoreset), a data summarization framework that extracts a small weighted subset of cells from a huge, sparse single-cell RNA-seq dataset to facilitate downstream data analysis tasks. Single-cell data analyses run on the extracted subset yield results similar to those derived from the original uncompressed data. Tests on various single-cell datasets show that scCoreset outperforms existing data summarization approaches for common downstream tasks such as visualization and clustering. We believe that scCoreset can serve as a useful plug-in tool to improve the efficiency of current single-cell RNA-seq data analyses.
14
Wang H, Torous W, Gong B, Purdom E. Visualizing scRNA-Seq data at population scale with GloScope. Genome Biol 2024; 25:259. [PMID: 39380041 PMCID: PMC11463121 DOI: 10.1186/s13059-024-03398-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/06/2023] [Accepted: 09/20/2024] [Indexed: 10/10/2024]
Abstract
Increasingly, scRNA-Seq studies explore cell populations across different samples and the effect of sample heterogeneity on an organism's phenotype. However, relatively few bioinformatic methods have been developed that adequately address the variation between samples for such population-level analyses. We propose a framework for representing the entire single-cell profile of a sample, which we call a GloScope representation. We implement GloScope on scRNA-Seq datasets from study designs ranging from 12 to over 300 samples and demonstrate how GloScope allows researchers to perform essential bioinformatic tasks at the sample level, in particular visualization and quality control assessment.
Affiliation(s)
- Hao Wang
- Division of Biostatistics, University of California, Berkeley, CA, USA
- William Torous
- Department of Statistics, University of California, Berkeley, CA, USA
- Boying Gong
- Division of Biostatistics, University of California, Berkeley, CA, USA
- Elizabeth Purdom
- Department of Statistics, University of California, Berkeley, CA, USA.
- Center for Computational Biology, University of California, Berkeley, CA, USA.
15
Xu Y, Lv D, Zou X, Wu L, Xu X, Zhao X. BFAST: joint dimension reduction and spatial clustering with Bayesian factor analysis for zero-inflated spatial transcriptomics data. Brief Bioinform 2024; 25:bbae594. [PMID: 39552067 PMCID: PMC11570543 DOI: 10.1093/bib/bbae594] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/10/2024] [Revised: 09/03/2024] [Accepted: 11/01/2024] [Indexed: 11/19/2024]
Abstract
The development of spatially resolved transcriptomics (ST) technologies has made it possible to measure gene expression profiles coupled with cellular spatial context and to assist biologists in comprehensively characterizing cellular phenotype heterogeneity and the tissue microenvironment. Spatial clustering is vital for downstream biological analysis. However, high noise and dropout events make clustering spatial transcriptomics data challenging, and effective algorithms are lacking. Here we develop a novel method that jointly performs dimension reduction and spatial clustering with Bayesian Factor Analysis for zero-inflated Spatial Transcriptomics data (BFAST). BFAST shows exceptional performance on simulated data and real spatial transcriptomics datasets when benchmarked against currently available methods. It extracts more biologically informative low-dimensional features than traditional dimensionality reduction approaches, thereby enhancing the accuracy and precision of clustering.
Affiliation(s)
- Yang Xu
- BGI-Research, 313, Gaoteng Avenue, Jiulongpo, Chongqing 400039, China
- BGI-Research, 9, Yunhua Road, Yantian, Shenzhen 518083, China
- Dian Lv
- BGI-Research, 313, Gaoteng Avenue, Jiulongpo, Chongqing 400039, China
- BGI-Research, 9, Yunhua Road, Yantian, Shenzhen 518083, China
- Xuanxuan Zou
- BGI-Research, 313, Gaoteng Avenue, Jiulongpo, Chongqing 400039, China
- BGI-Research, 9, Yunhua Road, Yantian, Shenzhen 518083, China
- Liang Wu
- BGI-Research, 313, Gaoteng Avenue, Jiulongpo, Chongqing 400039, China
- BGI-Research, 9, Yunhua Road, Yantian, Shenzhen 518083, China
- Xun Xu
- BGI-Research, 9, Yunhua Road, Yantian, Shenzhen 518083, China
- Xin Zhao
- BGI-Research, 313, Gaoteng Avenue, Jiulongpo, Chongqing 400039, China
- BGI-Research, 9, Yunhua Road, Yantian, Shenzhen 518083, China
16
Sparta B, Hamilton T, Natesan G, Aragones SD, Deeds EJ. Binomial models uncover biological variation during feature selection of droplet-based single-cell RNA sequencing. PLoS Comput Biol 2024; 20:e1012386. [PMID: 39241106 PMCID: PMC11410258 DOI: 10.1371/journal.pcbi.1012386] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/30/2024] [Revised: 09/18/2024] [Accepted: 08/05/2024] [Indexed: 09/08/2024]
Abstract
Effective analysis of single-cell RNA sequencing (scRNA-seq) data requires a rigorous distinction between technical noise and biological variation. In this work, we propose a simple feature selection model, termed "Differentially Distributed Genes" or DDGs, where a binomial sampling process for each mRNA species produces a null model of technical variation. Using scRNA-seq data where cell identities have been established a priori, we find that the DDG model of biological variation outperforms existing methods. We demonstrate that DDGs distinguish a validated set of real biologically varying genes, minimize neighborhood distortion, and enable accurate partitioning of cells into their established cell-type groups.
Affiliation(s)
- Breanne Sparta
- Department of Integrative Biology and Physiology, University of California, Los Angeles, California, United States of America
- Institute for Quantitative and Computational Biosciences, University of California, Los Angeles, California, United States of America
- Timothy Hamilton
- Institute for Quantitative and Computational Biosciences, University of California, Los Angeles, California, United States of America
- Bioinformatics Interdepartmental Program, University of California, Los Angeles, California, United States of America
- Gunalan Natesan
- Institute for Quantitative and Computational Biosciences, University of California, Los Angeles, California, United States of America
- Samuel D Aragones
- Institute for Quantitative and Computational Biosciences, University of California, Los Angeles, California, United States of America
- Eric J Deeds
- Department of Integrative Biology and Physiology, University of California, Los Angeles, California, United States of America
- Institute for Quantitative and Computational Biosciences, University of California, Los Angeles, California, United States of America
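The abstract in this entry describes a binomial sampling process per mRNA species as a null model of technical variation. One simple way to build such a null is sketched below in Python: if every cell shared the same expression proportions, the count of gene g in a cell of depth n_c would be Binomial(n_c, p_g), so the expected number of zero-count cells can be compared with the observed number. This is an illustrative sketch, not necessarily the statistic used by the DDG authors; the function name `binomial_zero_excess` and the z-score summary are assumptions made for this example.

```python
import numpy as np

def binomial_zero_excess(counts):
    """Z-score of observed vs. expected zero counts per gene under a binomial null.

    counts: (n_cells, n_genes) matrix of non-negative integer UMI counts.
    Null hypothesis (assumed for illustration): each gene has one pooled
    proportion p_g shared by all cells, so P(count == 0) = (1 - p_g) ** n_c.
    """
    counts = np.asarray(counts, dtype=float)
    depth = counts.sum(axis=1)             # n_c: total counts per cell
    p = counts.sum(axis=0) / counts.sum()  # p_g: pooled gene proportion

    # Probability of a zero for every (cell, gene) pair, computed in log space.
    q = np.exp(depth[:, None] * np.log1p(-p[None, :]))
    expected = q.sum(axis=0)                 # expected number of zero cells
    sd = np.sqrt((q * (1.0 - q)).sum(axis=0))  # Poisson-binomial std. dev.
    observed = (counts == 0).sum(axis=0)

    with np.errstate(divide="ignore", invalid="ignore"):
        z = np.where(sd > 0, (observed - expected) / sd, 0.0)
    return z
```

Genes with many more zeros than the null predicts get large positive scores, which marks them as candidates for biological rather than purely technical variation.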
17
Marghi Y, Gala R, Baftizadeh F, Sümbül U. Joint inference of discrete cell types and continuous type-specific variability in single-cell datasets with MMIDAS. Nature Computational Science 2024; 4:706-722. [PMID: 39317764 DOI: 10.1038/s43588-024-00683-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/08/2024] [Accepted: 08/06/2024] [Indexed: 09/26/2024]
Abstract
Reproducible definition and identification of cell types is essential to enable investigations into their biological function and to understand their relevance in the context of development, disease and evolution. Current approaches model variability in data as continuous latent factors, followed by clustering as a separate step, or immediately apply clustering on the data. We show that such approaches can suffer from qualitative mistakes in identifying cell types robustly, particularly when the number of such cell types is in the hundreds or even thousands. Here we propose an unsupervised method, Mixture Model Inference with Discrete-coupled AutoencoderS (MMIDAS), which combines a generalized mixture model with a multi-armed deep neural network to jointly infer the discrete type and continuous type-specific variability. Using four recent datasets of brain cells spanning different technologies, species and conditions, we demonstrate that MMIDAS can identify reproducible cell types and infer cell type-dependent continuous variability in both unimodal and multimodal datasets.
Affiliation(s)
- Uygar Sümbül
- Allen Institute, Seattle, WA, USA.
- Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA, USA.
18
Pouyabahar D, Andrews T, Bader GD. Interpretable single-cell factor decomposition using sciRED. Research Square 2024:rs.3.rs-4819117. [PMID: 39149508 PMCID: PMC11326389 DOI: 10.21203/rs.3.rs-4819117/v1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/17/2024]
Abstract
Single-cell RNA sequencing (scRNA-seq) maps gene expression heterogeneity within a tissue. However, identifying biological signals in this data is challenging due to confounding technical factors, sparsity, and high dimensionality. Data factorization methods address this by separating and identifying signals in the data, such as gene expression programs, but the resulting factors must be manually interpreted. We developed Single-Cell Interpretable Residual Decomposition (sciRED) to improve the interpretation of scRNA-seq factor analysis. sciRED removes known confounding effects, uses rotations to improve factor interpretability, maps factors to known covariates, identifies unexplained factors that may capture hidden biological phenomena and determines the genes and biological processes represented by the resulting factors. We apply sciRED to multiple scRNA-seq datasets and identify sex-specific variation in a kidney map, discern strong and weak immune stimulation signals in a PBMC dataset, reduce ambient RNA contamination in a rat liver atlas to help identify strain variation, and reveal rare cell type signatures and anatomical zonation gene programs in a healthy human liver map. These demonstrate that sciRED is useful in characterizing diverse biological signals within scRNA-seq data.
Affiliation(s)
- Delaram Pouyabahar
- Department of Molecular Genetics, University of Toronto, Toronto, Ontario, Canada
- The Donnelly Centre, University of Toronto, Toronto, Ontario, Canada
- Tallulah Andrews
- Department of Biochemistry, Schulich School of Medicine and Dentistry, University of Western Ontario, London, Ontario, Canada
- Department of Computer Science, University of Western Ontario, London, Ontario, Canada
- Gary D Bader
- Department of Molecular Genetics, University of Toronto, Toronto, Ontario, Canada
- The Donnelly Centre, University of Toronto, Toronto, Ontario, Canada
- Department of Computer Science, University of Toronto, Toronto, Ontario, Canada
- Lunenfeld-Tanenbaum Research Institute, Toronto, Ontario, Canada
- Princess Margaret Research Institute, University Health Network, Toronto, Ontario, Canada
- CIFAR Multiscale Human Program, CIFAR, Toronto, Ontario, Canada
19
Wang H, Torous W, Gong B, Purdom E. Visualizing scRNA-Seq Data at Population Scale with GloScope. bioRxiv 2024:2023.05.29.542786. [PMID: 37398321 PMCID: PMC10312527 DOI: 10.1101/2023.05.29.542786] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/04/2023]
Abstract
Increasingly, scRNA-Seq studies explore cell populations across different samples and the effect of sample heterogeneity on an organism's phenotype. However, relatively few bioinformatic methods have been developed that adequately address the variation between samples for such population-level analyses. We propose a framework for representing the entire single-cell profile of a sample, which we call a GloScope representation. We implement GloScope on scRNA-Seq datasets from study designs ranging from 12 to over 300 samples and demonstrate how GloScope allows researchers to perform essential bioinformatic tasks at the sample level, in particular visualization and quality control assessment.
Affiliation(s)
- Hao Wang
- Division of Biostatistics, University of California, Berkeley, CA, USA
- William Torous
- Department of Statistics, University of California, Berkeley, CA, USA
- Boying Gong
- Division of Biostatistics, University of California, Berkeley, CA, USA
- Elizabeth Purdom
- Department of Statistics, University of California, Berkeley, CA, USA
- Center for Computational Biology, University of California, Berkeley, CA, USA
20
Jiang H, Wang MN, Huang YA, Huang Y. Graph-Regularized Non-Negative Matrix Factorization for Single-Cell Clustering in scRNA-Seq Data. IEEE J Biomed Health Inform 2024; 28:4986-4994. [PMID: 38787664 DOI: 10.1109/jbhi.2024.3400050] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/26/2024]
Abstract
The advent of single-cell RNA sequencing (scRNA-seq) has brought fresh perspectives on intricate biological processes, revealing the nuances and divergences among distinct cells. Accurate single-cell analysis is a crucial prerequisite for in-depth investigation into the underlying mechanisms of heterogeneity. Because of various sources of technical noise, such as dropout, scRNA-seq data remain challenging to interpret. In this work, we propose an unsupervised learning framework for scRNA-seq data analysis (Sc-GNNMF). Based on the non-negativity and sparsity of scRNA-seq data, we employ a graph-regularized non-negative matrix factorization (GNNMF) algorithm, which involves estimating cell-cell sparse similarity and gene-gene sparse similarity through Laplacian kernels and p-nearest neighbor graphs (p-NNG). By assuming intrinsic geometric local invariance, we use weighted p-nearest known neighbors (p-NKN) to optimize the scRNA-seq data. The optimized scRNA-seq data then participates in the matrix decomposition process, promoting the closeness of cells of similar types in the cell-gene data space and determining a more suitable embedding space for clustering. Sc-GNNMF demonstrates superior performance compared to other methods and maintains satisfactory compatibility and robustness, as evidenced by experiments on 11 real scRNA-seq datasets. Furthermore, Sc-GNNMF yields excellent results in clustering tasks, extracting useful gene markers, and pseudo-temporal analysis.
21
Xu Y, Wang Y, Ma S. SingleCellGGM enables gene expression program identification from single-cell transcriptomes and facilitates universal cell label transfer. Cell Reports Methods 2024; 4:100813. [PMID: 38971150 PMCID: PMC11294836 DOI: 10.1016/j.crmeth.2024.100813] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/05/2023] [Revised: 04/23/2024] [Accepted: 06/13/2024] [Indexed: 07/08/2024]
Abstract
Gene co-expression analysis of single-cell transcriptomes, aiming to define functional relationships between genes, is challenging due to excessive dropout values. Here, we developed a single-cell graphical Gaussian model (SingleCellGGM) algorithm to conduct single-cell gene co-expression network analysis. When applied to mouse single-cell datasets, SingleCellGGM constructed networks from which gene co-expression modules with highly significant functional enrichment were identified. We considered the modules as gene expression programs (GEPs). These GEPs enable direct cell-type annotation of individual cells without cell clustering, and they are enriched with genes required for the functions of the corresponding cells, sometimes at levels greater than 10-fold. The GEPs are conserved across datasets and enable universal cell-type label transfer across different studies. We also proposed a dimension-reduction method through averaging by GEPs for single-cell analysis, enhancing the interpretability of results. Thus, SingleCellGGM offers a unique GEP-based perspective to analyze single-cell transcriptomes and reveals biological insights shared by different single-cell datasets.
Affiliation(s)
- Yupu Xu
- MOE Key Laboratory for Cellular Dynamics, School of Life Sciences, Division of Life Sciences and Medicine, University of Science and Technology of China, Innovation Academy for Seed Design, Chinese Academy of Sciences, Hefei, China
- Yuzhou Wang
- MOE Key Laboratory for Cellular Dynamics, School of Life Sciences, Division of Life Sciences and Medicine, University of Science and Technology of China, Innovation Academy for Seed Design, Chinese Academy of Sciences, Hefei, China; The First Affiliated Hospital of USTC, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, China
- Shisong Ma
- MOE Key Laboratory for Cellular Dynamics, School of Life Sciences, Division of Life Sciences and Medicine, University of Science and Technology of China, Innovation Academy for Seed Design, Chinese Academy of Sciences, Hefei, China; School of Data Science, University of Science and Technology of China, Hefei, China.
22
Bilous M, Hérault L, Gabriel AA, Teleman M, Gfeller D. Building and analyzing metacells in single-cell genomics data. Mol Syst Biol 2024; 20:744-766. [PMID: 38811801 PMCID: PMC11220014 DOI: 10.1038/s44320-024-00045-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/04/2024] [Revised: 05/03/2024] [Accepted: 05/08/2024] [Indexed: 05/31/2024]
Abstract
The advent of high-throughput single-cell genomics technologies has fundamentally transformed biological sciences. Currently, millions of cells from complex biological tissues can be phenotypically profiled across multiple modalities. The scaling of computational methods to analyze and visualize such data is a constant challenge, and tools need to be regularly updated, if not redesigned, to cope with ever-growing numbers of cells. Over the last few years, metacells have been introduced to reduce the size and complexity of single-cell genomics data while preserving biologically relevant information and improving interpretability. Here, we review recent studies that capitalize on the concept of metacells, and the many variants in nomenclature that have been used. We further outline how and when metacells should (or should not) be used to analyze single-cell genomics data and what should be considered when analyzing such data at the metacell level. To facilitate the exploration of metacells, we provide a comprehensive tutorial on the construction and analysis of metacells from single-cell RNA-seq data (https://github.com/GfellerLab/MetacellAnalysisTutorial) as well as a fully integrated pipeline to rapidly build, visualize and evaluate metacells with different methods (https://github.com/GfellerLab/MetacellAnalysisToolkit).
Affiliation(s)
- Mariia Bilous
- Department of Oncology, Ludwig Institute for Cancer Research Lausanne, University of Lausanne, 1011, Lausanne, Switzerland
- Agora Cancer Research Centre, 1011, Lausanne, Switzerland
- Swiss Cancer Center Leman (SCCL), Lausanne, Switzerland
- Swiss Institute of Bioinformatics (SIB), 1015, Lausanne, Switzerland
- Léonard Hérault
- Department of Oncology, Ludwig Institute for Cancer Research Lausanne, University of Lausanne, 1011, Lausanne, Switzerland
- Agora Cancer Research Centre, 1011, Lausanne, Switzerland
- Swiss Cancer Center Leman (SCCL), Lausanne, Switzerland
- Swiss Institute of Bioinformatics (SIB), 1015, Lausanne, Switzerland
- Aurélie Ag Gabriel
- Department of Oncology, Ludwig Institute for Cancer Research Lausanne, University of Lausanne, 1011, Lausanne, Switzerland
- Agora Cancer Research Centre, 1011, Lausanne, Switzerland
- Swiss Cancer Center Leman (SCCL), Lausanne, Switzerland
- Swiss Institute of Bioinformatics (SIB), 1015, Lausanne, Switzerland
- Matei Teleman
- Department of Oncology, Ludwig Institute for Cancer Research Lausanne, University of Lausanne, 1011, Lausanne, Switzerland
- Agora Cancer Research Centre, 1011, Lausanne, Switzerland
- Swiss Cancer Center Leman (SCCL), Lausanne, Switzerland
- Swiss Institute of Bioinformatics (SIB), 1015, Lausanne, Switzerland
- David Gfeller
- Department of Oncology, Ludwig Institute for Cancer Research Lausanne, University of Lausanne, 1011, Lausanne, Switzerland.
- Agora Cancer Research Centre, 1011, Lausanne, Switzerland.
- Swiss Cancer Center Leman (SCCL), Lausanne, Switzerland.
- Swiss Institute of Bioinformatics (SIB), 1015, Lausanne, Switzerland.
23
Kuo A, Hansen KD, Hicks SC. Quantification and statistical modeling of droplet-based single-nucleus RNA-sequencing data. Biostatistics 2024; 25:801-817. [PMID: 37257175 PMCID: PMC11247185 DOI: 10.1093/biostatistics/kxad010] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 05/20/2022] [Revised: 03/22/2023] [Accepted: 04/19/2023] [Indexed: 06/02/2023]
Abstract
In complex tissues containing cells that are difficult to dissociate, single-nucleus RNA-sequencing (snRNA-seq) has become the preferred experimental technology over single-cell RNA-sequencing (scRNA-seq) to measure gene expression. To accurately model these data in downstream analyses, previous work has shown that droplet-based scRNA-seq data are not zero-inflated, but whether droplet-based snRNA-seq data follow the same probability distributions has not been systematically evaluated. Using pseudonegative control data from nuclei in mouse cortex sequenced with the 10x Genomics Chromium system and mouse kidney sequenced with the DropSeq system, we found that droplet-based snRNA-seq data follow a negative binomial distribution, suggesting that parametric statistical models applied to scRNA-seq are transferable to snRNA-seq. Furthermore, we found that the quantification choices in adapting quantification mapping strategies from scRNA-seq to snRNA-seq can play a significant role in downstream analyses and biological interpretation. In particular, reference transcriptomes that do not include intronic regions result in significantly smaller library sizes and incongruous cell type classifications. We also confirmed the presence of a gene length bias in snRNA-seq data, which we show is present in both exonic and intronic reads, and investigate potential causes for the bias.
Affiliation(s)
- Albert Kuo
- Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, 615 N Wolfe St, Baltimore, MD 21205, USA
- Kasper D Hansen
- Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, 615 N Wolfe St, Baltimore, MD 21205, USA
- Department of Genetic Medicine, Johns Hopkins School of Medicine, 733 N Broadway, Baltimore, MD 21205, USA
- Stephanie C Hicks
- Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, 615 N Wolfe St, Baltimore, MD 21205, USA
24
Xiong J, Gong F, Ma L, Wan L. scVIC: deep generative modeling of heterogeneity for scRNA-seq data. Bioinformatics Advances 2024; 4:vbae086. [PMID: 39027640 PMCID: PMC11256938 DOI: 10.1093/bioadv/vbae086] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/01/2024] [Revised: 05/15/2024] [Accepted: 06/12/2024] [Indexed: 07/20/2024]
Abstract
Motivation: Single-cell RNA sequencing (scRNA-seq) has become a valuable tool for studying cellular heterogeneity. However, the analysis of scRNA-seq data is challenging because of inherent noise and technical variability. Existing methods often struggle to simultaneously explore heterogeneity across cells, handle dropout events, and account for batch effects. These drawbacks call for a robust and comprehensive method that can address these challenges and provide accurate insights into heterogeneity at the single-cell level. Results: In this study, we introduce scVIC, an algorithm designed to account for variational inference, while simultaneously handling biological heterogeneity and batch effects at the single-cell level. scVIC explicitly models both biological heterogeneity and technical variability to learn cellular heterogeneity in a manner free from dropout events and the bias of batch effects. By leveraging variational inference, we provide a robust framework for inferring the parameters of scVIC. To test the performance of scVIC, we employed both simulated and biological scRNA-seq datasets, either including, or not, batch effects. scVIC was found to outperform other approaches because of its superior clustering ability and circumvention of the batch effects problem. Availability and implementation: The code of scVIC and replication for this study are available at https://github.com/HiBearME/scVIC/tree/v1.0.
Affiliation(s)
- Jiankang Xiong
- National Center for Mathematics and Interdisciplinary Sciences, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China
- School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing 100049, China
- Fuzhou Gong
- National Center for Mathematics and Interdisciplinary Sciences, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China
- School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing 100049, China
- Liang Ma
- School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing 100049, China
- Key Laboratory of Zoological Systematics and Evolution, Institute of Zoology, Chinese Academy of Sciences, Beijing 100101, China
- Lin Wan
- National Center for Mathematics and Interdisciplinary Sciences, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China
- School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing 100049, China
25
Chi J, Ye J, Zhou Y. A GLM-based zero-inflated generalized Poisson factor model for analyzing microbiome data. Front Microbiol 2024; 15:1394204. [PMID: 38873138 PMCID: PMC11173601 DOI: 10.3389/fmicb.2024.1394204] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/01/2024] [Accepted: 05/20/2024] [Indexed: 06/15/2024]
Abstract
Motivation: High-throughput sequencing technology facilitates the quantitative analysis of microbial communities, improving the capacity to investigate the associations between the human microbiome and diseases. Our primary motivating application is to explore the association between gut microbes and obesity. The complex characteristics of microbiome data, including high dimensionality, zero inflation, and over-dispersion, pose new statistical challenges for downstream analysis. Results: We propose a GLM-based zero-inflated generalized Poisson factor analysis (GZIGPFA) model to analyze microbiome data with complex characteristics. The GZIGPFA model is based on a zero-inflated generalized Poisson (ZIGP) distribution for modeling microbiome count data. A link function between the generalized Poisson rate and the probability of excess zeros is established within the generalized linear model (GLM) framework. The latent parameters of the GZIGPFA model constitute a low-rank matrix comprising a low-dimensional score matrix and a loading matrix. An alternating maximum likelihood algorithm is employed to estimate the unknown parameters, and cross-validation is utilized to determine the rank of the model in this study. The proposed GZIGPFA model demonstrates superior performance and advantages through comprehensive simulation studies and real data applications.
Affiliation(s)
- Jinling Chi
- School of Mathematics and Statistics, Xidian University, Xi'an, China
- Jimin Ye
- School of Mathematics and Statistics, Xidian University, Xi'an, China
- Ying Zhou
- School of Mathematical Sciences, Heilongjiang University, Harbin, China
26
Gao Y, Dong K, Gao Y, Jin X, Yang J, Yan G, Liu Q. Unified cross-modality integration and analysis of T cell receptors and T cell transcriptomes by low-resource-aware representation learning. Cell Genomics 2024; 4:100553. [PMID: 38688285 PMCID: PMC11099349 DOI: 10.1016/j.xgen.2024.100553] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/13/2023] [Revised: 03/09/2024] [Accepted: 04/06/2024] [Indexed: 05/02/2024]
Abstract
Single-cell RNA sequencing (scRNA-seq) and T cell receptor sequencing (TCR-seq) are pivotal for investigating T cell heterogeneity. Integrating these modalities, which is expected to uncover insights in immunology that might go unnoticed with a single modality, faces computational challenges due to the low-resource characteristics of the multimodal data. Herein, we present UniTCR, a novel low-resource-aware multimodal representation learning framework designed for unified cross-modality integration, enabling comprehensive T cell analysis. By designing a dual-modality contrastive learning module and a single-modality preservation module to effectively embed each modality into a common latent space, UniTCR demonstrates versatility in connecting TCR sequences with T cell transcriptomes across various tasks, including single-modality analysis, modality gap analysis, epitope-TCR binding prediction, and TCR profile cross-modality generation, in a low-resource-aware way. Extensive evaluations conducted on multiple scRNA-seq/TCR-seq paired datasets showed the superior performance of UniTCR, demonstrating its ability to explore the complexity of the immune system.
Affiliation(s)
- Yicheng Gao
- Key Laboratory of Spine and Spinal Cord Injury Repair and Regeneration (Tongji University), Ministry of Education, Tongji Hospital, School of Medicine, Frontier Science Center for Stem Cell Research, Bioinformatics Department, School of Life Sciences and Technology, Tongji University, Shanghai 200092, China; State Key Laboratory of Cardiology and Medical Innovation Center, Shanghai East Hospital, Frontier Science Center for Stem Cell Research, Bioinformatics Department, School of Life Sciences and Technology, Tongji University, Shanghai 200092, China
| | - Kejing Dong
- Key Laboratory of Spine and Spinal Cord Injury Repair and Regeneration (Tongji University), Ministry of Education, Tongji Hospital, School of Medicine, Frontier Science Center for Stem Cell Research, Bioinformatics Department, School of Life Sciences and Technology, Tongji University, Shanghai 200092, China; State Key Laboratory of Cardiology and Medical Innovation Center, Shanghai East Hospital, Frontier Science Center for Stem Cell Research, Bioinformatics Department, School of Life Sciences and Technology, Tongji University, Shanghai 200092, China
| | - Yuli Gao
- Key Laboratory of Spine and Spinal Cord Injury Repair and Regeneration (Tongji University), Ministry of Education, Tongji Hospital, School of Medicine, Frontier Science Center for Stem Cell Research, Bioinformatics Department, School of Life Sciences and Technology, Tongji University, Shanghai 200092, China; State Key Laboratory of Cardiology and Medical Innovation Center, Shanghai East Hospital, Frontier Science Center for Stem Cell Research, Bioinformatics Department, School of Life Sciences and Technology, Tongji University, Shanghai 200092, China
| | - Xuan Jin
- Key Laboratory of Spine and Spinal Cord Injury Repair and Regeneration (Tongji University), Ministry of Education, Tongji Hospital, School of Medicine, Frontier Science Center for Stem Cell Research, Bioinformatics Department, School of Life Sciences and Technology, Tongji University, Shanghai 200092, China; State Key Laboratory of Cardiology and Medical Innovation Center, Shanghai East Hospital, Frontier Science Center for Stem Cell Research, Bioinformatics Department, School of Life Sciences and Technology, Tongji University, Shanghai 200092, China
| | - Jingya Yang
- Shanghai Research Institute for Intelligent Autonomous Systems, Shanghai 201804, China
| | - Gang Yan
- Shanghai Research Institute for Intelligent Autonomous Systems, Shanghai 201804, China.
| | - Qi Liu
- Key Laboratory of Spine and Spinal Cord Injury Repair and Regeneration (Tongji University), Ministry of Education, Tongji Hospital, School of Medicine, Frontier Science Center for Stem Cell Research, Bioinformatics Department, School of Life Sciences and Technology, Tongji University, Shanghai 200092, China; State Key Laboratory of Cardiology and Medical Innovation Center, Shanghai East Hospital, Frontier Science Center for Stem Cell Research, Bioinformatics Department, School of Life Sciences and Technology, Tongji University, Shanghai 200092, China; Shanghai Research Institute for Intelligent Autonomous Systems, Shanghai 201804, China; Research Institute of Intelligent Computing, Zhejiang Lab, Hangzhou 311121, China.
| |
|
27
|
Jiao F, Li J, Liu T, Zhu Y, Che W, Bleris L, Jia C. What can we learn when fitting a simple telegraph model to a complex gene expression model? PLoS Comput Biol 2024; 20:e1012118. [PMID: 38743803 PMCID: PMC11125521 DOI: 10.1371/journal.pcbi.1012118] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/06/2024] [Revised: 05/24/2024] [Accepted: 04/27/2024] [Indexed: 05/16/2024] Open
Abstract
In experiments, the distributions of mRNA or protein numbers in single cells are often fitted to the random telegraph model which includes synthesis and decay of mRNA or protein, and switching of the gene between active and inactive states. While commonly used, this model does not describe how fluctuations are influenced by crucial biological mechanisms such as feedback regulation, non-exponential gene inactivation durations, and multiple gene activation pathways. Here we investigate the dynamical properties of four relatively complex gene expression models by fitting their steady-state mRNA or protein number distributions to the simple telegraph model. We show that despite the underlying complex biological mechanisms, the telegraph model with three effective parameters can accurately capture the steady-state gene product distributions, as well as the conditional distributions in the active gene state, of the complex models. Some effective parameters are reliable and can reflect realistic dynamic behaviors of the complex models, while others may deviate significantly from their real values in the complex models. The effective parameters can also be applied to characterize the capability for a complex model to exhibit multimodality. Using additional information such as single-cell data at multiple time points, we provide an effective method of distinguishing the complex models from the telegraph model. Furthermore, using measurements under varying experimental conditions, we show that fitting the mRNA or protein number distributions to the telegraph model may even reveal the underlying gene regulation mechanisms of the complex models. The effectiveness of these methods is confirmed by analysis of single-cell data for E. coli and mammalian cells. All these results are robust with respect to cooperative transcriptional regulation and extrinsic noise. 
In particular, we find that faster relaxation speed to the steady state results in more precise parameter inference under large extrinsic noise.
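The random telegraph model described in this abstract (a gene switching between active and inactive states, with synthesis in the active state and first-order decay) can be illustrated with a standard Gillespie stochastic simulation. The sketch below is only an illustration of the model itself, not the authors' fitting procedure; the function names and the rate names `k_on`, `k_off`, `k_syn`, and `k_deg` are hypothetical labels chosen here.

```python
import random


def simulate_telegraph(k_on, k_off, k_syn, k_deg, t_end, seed=0):
    """Gillespie simulation of the random telegraph model.

    Starts from an inactive gene with zero mRNA and returns the mRNA
    copy number at time t_end. All rate names are illustrative.
    """
    rng = random.Random(seed)
    t, gene_on, mrna = 0.0, False, 0
    while True:
        # Propensities of the four reactions in the current state.
        rates = [
            0.0 if gene_on else k_on,   # gene activation
            k_off if gene_on else 0.0,  # gene inactivation
            k_syn if gene_on else 0.0,  # synthesis (active state only)
            k_deg * mrna,               # first-order decay
        ]
        total = sum(rates)
        if total == 0.0:
            return mrna  # no reaction can fire any more
        # Waiting time to the next reaction is exponential with rate `total`.
        t += rng.expovariate(total)
        if t > t_end:
            return mrna
        # Choose which reaction fires, proportionally to its propensity.
        u = rng.random() * total
        if u < rates[0]:
            gene_on = True
        elif u < rates[0] + rates[1]:
            gene_on = False
        elif u < rates[0] + rates[1] + rates[2]:
            mrna += 1
        else:
            mrna -= 1


def telegraph_mean(k_on, k_off, k_syn, k_deg):
    """Steady-state mean copy number of the telegraph model."""
    return (k_syn / k_deg) * k_on / (k_on + k_off)
```

Running `simulate_telegraph` over many seeds yields an empirical copy-number distribution whose mean can be compared against `telegraph_mean`; a fit to data would adjust the rates until the simulated or analytical distribution matches the measured one.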
Affiliation(s)
- Feng Jiao
- Guangzhou Center for Applied Mathematics, Guangzhou University, Guangzhou, China
| | - Jing Li
- Guangzhou Center for Applied Mathematics, Guangzhou University, Guangzhou, China
| | - Ting Liu
- Guangzhou Center for Applied Mathematics, Guangzhou University, Guangzhou, China
| | - Yifeng Zhu
- Guangzhou Center for Applied Mathematics, Guangzhou University, Guangzhou, China
| | - Wenhao Che
- Guangzhou Center for Applied Mathematics, Guangzhou University, Guangzhou, China
| | - Leonidas Bleris
- Bioengineering Department, The University of Texas at Dallas, Richardson, Texas, United States of America
- Center for Systems Biology, The University of Texas at Dallas, Richardson, Texas, United States of America
- Department of Biological Sciences, The University of Texas at Dallas, Richardson, Texas, United States of America
| | - Chen Jia
- Applied and Computational Mathematics Division, Beijing Computational Science Research Center, Beijing, China
| |
|
28
|
Ranek JS, Stallaert W, Milner JJ, Redick M, Wolff SC, Beltran AS, Stanley N, Purvis JE. DELVE: feature selection for preserving biological trajectories in single-cell data. Nat Commun 2024; 15:2765. [PMID: 38553455 PMCID: PMC10980758 DOI: 10.1038/s41467-024-46773-z] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 05/18/2023] [Accepted: 03/07/2024] [Indexed: 04/02/2024] Open
Abstract
Single-cell technologies can measure the expression of thousands of molecular features in individual cells undergoing dynamic biological processes. While examining cells along a computationally ordered pseudotime trajectory can reveal how changes in gene or protein expression impact cell fate, identifying such dynamic features is challenging due to the inherent noise in single-cell data. Here, we present DELVE, an unsupervised feature selection method for identifying a representative subset of molecular features that robustly recapitulate cellular trajectories. In contrast to previous work, DELVE uses a bottom-up approach to mitigate the effects of confounding sources of variation, and instead models cell states from dynamic gene or protein modules based on core regulatory complexes. Using simulations, single-cell RNA sequencing, and iterative immunofluorescence imaging data in the context of cell cycle and cellular differentiation, we demonstrate how DELVE selects features that better define cell types and cell-type transitions. DELVE is available as an open-source Python package: https://github.com/jranek/delve.
Affiliation(s)
- Jolene S Ranek
- Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
- Computational Medicine Program, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
| | - Wayne Stallaert
- Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, PA, USA
| | - J Justin Milner
- Department of Microbiology and Immunology, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
- Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill School of Medicine, Chapel Hill, NC, USA
| | - Margaret Redick
- Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
- Computational Medicine Program, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
| | - Samuel C Wolff
- Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
- Computational Medicine Program, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
| | - Adriana S Beltran
- Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
- Human Pluripotent Cell Core, University of North Carolina at Chapel Hill School of Medicine, Chapel Hill, NC, USA
| | - Natalie Stanley
- Computational Medicine Program, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA.
- Department of Computer Science, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA.
| | - Jeremy E Purvis
- Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA.
- Computational Medicine Program, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA.
| |
|
29
|
Li T, Qian K, Wang X, Li WV, Li H. scBiG for representation learning of single-cell gene expression data based on bipartite graph embedding. NAR Genom Bioinform 2024; 6:lqae004. [PMID: 38288376 PMCID: PMC10823585 DOI: 10.1093/nargab/lqae004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/19/2023] [Revised: 12/19/2023] [Accepted: 01/09/2024] [Indexed: 01/31/2024] Open
Abstract
Analyzing single-cell RNA sequencing (scRNA-seq) data remains a challenge due to its high dimensionality, sparsity and technical noise. Recognizing the benefits of dimensionality reduction in simplifying complexity and enhancing the signal-to-noise ratio, we introduce scBiG, a novel graph node embedding method designed for representation learning in scRNA-seq data. scBiG establishes a bipartite graph connecting cells and expressed genes, and then constructs a multilayer graph convolutional network to learn cell and gene embeddings. Through a series of extensive experiments, we demonstrate that scBiG surpasses commonly used dimensionality reduction techniques in various analytical tasks. Downstream tasks encompass unsupervised cell clustering, cell trajectory inference, gene expression reconstruction and gene co-expression analysis. Additionally, scBiG exhibits notable computational efficiency and scalability. In summary, scBiG offers a useful graph neural network framework for representation learning in scRNA-seq data, empowering a diverse array of downstream analyses.
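The bipartite graph mentioned in this abstract, which connects each cell to the genes it expresses, can be sketched in a few lines. The code below is an assumption-laden illustration rather than the scBiG implementation: the function names are hypothetical, an edge is drawn whenever a count is greater than zero, and the aggregation step is a plain unweighted mean standing in for a learned graph convolution layer.

```python
def build_bipartite_graph(counts):
    """Build cell-gene adjacency lists from a count matrix.

    counts: list of rows, one per cell, each a list of per-gene counts.
    An edge (cell i, gene j) exists when counts[i][j] > 0.
    Returns (cell_to_genes, gene_to_cells).
    """
    n_genes = len(counts[0]) if counts else 0
    cell_to_genes = [[j for j, c in enumerate(row) if c > 0] for row in counts]
    gene_to_cells = [[] for _ in range(n_genes)]
    for i, genes in enumerate(cell_to_genes):
        for j in genes:
            gene_to_cells[j].append(i)
    return cell_to_genes, gene_to_cells


def aggregate_from_genes(cell_to_genes, gene_features):
    """One unweighted mean-aggregation step over the bipartite graph.

    Each cell receives the average feature vector of its expressed genes;
    cells with no expressed genes receive a zero vector.
    """
    dim = len(gene_features[0])
    cell_features = []
    for genes in cell_to_genes:
        if not genes:
            cell_features.append([0.0] * dim)
            continue
        cell_features.append(
            [sum(gene_features[j][d] for j in genes) / len(genes) for d in range(dim)]
        )
    return cell_features
```

Stacking several such steps, alternating between the cell side and the gene side, is the general idea behind multilayer message passing on a bipartite graph; a trained network would add learnable weights and a loss function, which this sketch omits.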
Affiliation(s)
- Ting Li
- School of Mathematics and Physics, China University of Geosciences, Wuhan 430074, China
| | - Kun Qian
- School of Mathematics and Physics, China University of Geosciences, Wuhan 430074, China
| | - Xiang Wang
- School of Mathematics and Physics, China University of Geosciences, Wuhan 430074, China
| | - Wei Vivian Li
- Department of Statistics, University of California, Riverside, Riverside, CA 92507, USA
| | - Hongwei Li
- School of Mathematics and Physics, China University of Geosciences, Wuhan 430074, China
| |
|
30
|
Xia L, Lee C, Li JJ. Statistical method scDEED for detecting dubious 2D single-cell embeddings and optimizing t-SNE and UMAP hyperparameters. Nat Commun 2024; 15:1753. [PMID: 38409103 PMCID: PMC10897166 DOI: 10.1038/s41467-024-45891-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/10/2023] [Accepted: 02/06/2024] [Indexed: 02/28/2024] Open
Abstract
Two-dimensional (2D) embedding methods are crucial for single-cell data visualization. Popular methods such as t-distributed stochastic neighbor embedding (t-SNE) and uniform manifold approximation and projection (UMAP) are commonly used for visualizing cell clusters; however, it is well known that t-SNE and UMAP's 2D embeddings might not reliably inform the similarities among cell clusters. Motivated by this challenge, we present a statistical method, scDEED, for detecting dubious cell embeddings output by a 2D-embedding method. By calculating a reliability score for every cell embedding based on the similarity between the cell's 2D-embedding neighbors and pre-embedding neighbors, scDEED identifies the cell embeddings with low reliability scores as dubious and those with high reliability scores as trustworthy. Moreover, by minimizing the number of dubious cell embeddings, scDEED provides intuitive guidance for optimizing the hyperparameters of an embedding method. We show the effectiveness of scDEED on multiple datasets for detecting dubious cell embeddings and optimizing the hyperparameters of t-SNE and UMAP.
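The idea of scoring each cell by how well its 2D-embedding neighbors agree with its pre-embedding neighbors can be illustrated with a simple neighbor-overlap measure. The abstract does not define the reliability score, so the Jaccard overlap of k-nearest-neighbor sets used below is a stand-in assumption, not the scDEED statistic; the function names are likewise hypothetical.

```python
import math


def knn(points, i, k):
    """Indices of the k nearest neighbors of points[i] (Euclidean, excluding i)."""
    dists = [(math.dist(points[i], p), j) for j, p in enumerate(points) if j != i]
    return {j for _, j in sorted(dists)[:k]}


def neighbor_overlap_scores(high_dim, embedding, k):
    """Per-cell Jaccard overlap between neighbor sets before and after embedding.

    A score near 1 means the embedding kept the cell's neighborhood;
    a score near 0 flags the cell's embedding as potentially dubious.
    """
    scores = []
    for i in range(len(high_dim)):
        before = knn(high_dim, i, k)
        after = knn(embedding, i, k)
        scores.append(len(before & after) / len(before | after))
    return scores
```

Under this simplified score, hyperparameter tuning in the spirit of the abstract would amount to choosing the t-SNE or UMAP settings that leave the fewest low-scoring cells.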
Affiliation(s)
- Lucy Xia
- Department of ISOM, School of Business and Management, Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong, China
| | - Christy Lee
- Department of Statistics and Data Science, University of California, Los Angeles, Los Angeles, CA, USA
| | - Jingyi Jessica Li
- Department of Statistics and Data Science, University of California, Los Angeles, Los Angeles, CA, USA.
- Department of Biostatistics, University of California, Los Angeles, Los Angeles, CA, USA.
- Department of Computational Medicine, University of California, Los Angeles, Los Angeles, CA, USA.
- Department of Human Genetics, University of California, Los Angeles, Los Angeles, CA, USA.
- Radcliffe Institute of Advanced Study, Harvard University, Cambridge, MA, USA.
| |
|
31
|
Zhu Q, Conrad DN, Gartner ZJ. deMULTIplex2: robust sample demultiplexing for scRNA-seq. Genome Biol 2024; 25:37. [PMID: 38291503 PMCID: PMC10829271 DOI: 10.1186/s13059-024-03177-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/27/2023] [Accepted: 01/18/2024] [Indexed: 02/01/2024] Open
Abstract
Sample multiplexing enables pooled analysis during single-cell RNA sequencing workflows, thereby increasing throughput and reducing batch effects. A challenge for all multiplexing techniques is to link sample-specific barcodes with cell-specific barcodes, then demultiplex sample identity post-sequencing. However, existing demultiplexing tools fail under many real-world conditions where barcode cross-contamination is an issue. We therefore developed deMULTIplex2, an algorithm inspired by a mechanistic model of barcode cross-contamination. deMULTIplex2 employs generalized linear models and expectation-maximization to probabilistically determine the sample identity of each cell. Benchmarking reveals superior performance across various experimental conditions, particularly on large or noisy datasets with unbalanced sample compositions.
Affiliation(s)
- Qin Zhu
- Department of Pharmaceutical Chemistry, University of California San Francisco, San Francisco, CA, 94158, USA.
| | - Daniel N Conrad
- Department of Pharmaceutical Chemistry, University of California San Francisco, San Francisco, CA, 94158, USA
| | - Zev J Gartner
- Department of Pharmaceutical Chemistry, University of California San Francisco, San Francisco, CA, 94158, USA.
- Chan Zuckerberg Biohub, San Francisco, CA, 94158, USA.
- Center for Cellular Construction, University of California, San Francisco, CA, 94158, USA.
| |
|
32
|
Cho H, She J, De Marchi D, El-Zaatari H, Barnes EL, Kahkoska AR, Kosorok MR, Virkud AV. Machine Learning and Health Science Research: Tutorial. J Med Internet Res 2024; 26:e50890. [PMID: 38289657 PMCID: PMC10865203 DOI: 10.2196/50890] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/15/2023] [Revised: 11/30/2023] [Accepted: 12/21/2023] [Indexed: 02/01/2024] Open
Abstract
Machine learning (ML) has seen impressive growth in health science research due to its capacity for handling complex data to perform a range of tasks, including unsupervised learning, supervised learning, and reinforcement learning. To aid health science researchers in understanding the strengths and limitations of ML and to facilitate its integration into their studies, we present here a guideline for integrating ML into an analysis through a structured framework, covering steps from framing a research question to study design and analysis techniques for specialized data types.
Affiliation(s)
- Hunyong Cho
- Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC, United States
| | - Jane She
- Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC, United States
| | - Daniel De Marchi
- Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC, United States
| | - Helal El-Zaatari
- Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC, United States
| | - Edward L Barnes
- Division of Gastroenterology and Hepatology, University of North Carolina at Chapel Hill, Chapel Hill, NC, United States
- Center for Gastrointestinal Biology and Diseases, University of North Carolina at Chapel Hill, Chapel Hill, NC, United States
| | - Anna R Kahkoska
- Department of Nutrition, University of North Carolina at Chapel Hill, Chapel Hill, NC, United States
- Division of Endocrinology and Metabolism, University of North Carolina at Chapel Hill, Chapel Hill, NC, United States
- Center for Aging and Health, University of North Carolina at Chapel Hill, Chapel Hill, NC, United States
| | - Michael R Kosorok
- Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC, United States
| | - Arti V Virkud
- Kidney Center, University of North Carolina at Chapel Hill, Chapel Hill, NC, United States
| |
|
33
|
Wang TG, Shang JL, Liu JX, Li F, Yuan S, Wang J. Joint L2,p-norm and random walk graph constrained PCA for single-cell RNA-seq data. Comput Methods Biomech Biomed Engin 2024; 27:498-511. [PMID: 36912759 DOI: 10.1080/10255842.2023.2188106] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/19/2022] [Accepted: 03/02/2023] [Indexed: 03/14/2023]
Abstract
The development and widespread utilization of high-throughput sequencing technologies in biology has fueled the rapid growth of single-cell RNA sequencing (scRNA-seq) data over the past decade. The development of scRNA-seq technology has significantly expanded researchers' understanding of cellular heterogeneity. Accurate cell type identification is the prerequisite for any research on heterogeneous cell populations. However, due to the high noise and high dimensionality of scRNA-seq data, improving the effectiveness of cell type identification remains a challenge. As an effective dimensionality reduction method, Principal Component Analysis (PCA) is an essential tool for visualizing high-dimensional scRNA-seq data and identifying cell subpopulations. However, traditional PCA has some defects when used in mining the nonlinear manifold structure of the data and usually suffers from over-density of principal components (PCs). Therefore, we present a novel method in this paper called joint L2,p-norm and random walk graph constrained PCA (RWPPCA). RWPPCA aims to retain the data's local information in the process of mapping high-dimensional data to low-dimensional space, to more accurately obtain sparse principal components and to then identify cell types more precisely. Specifically, RWPPCA combines the random walk (RW) algorithm with graph regularization to more accurately determine the local geometric relationships between data points. Moreover, to mitigate the adverse effects of dense PCs, the L2,p-norm is introduced to make the PCs sparser, thus increasing their interpretability. Then, we evaluate the effectiveness of RWPPCA on simulated data and scRNA-seq data. The results show that RWPPCA performs well in cell type identification and outperforms other comparison methods.
Affiliation(s)
- Tai-Ge Wang
- School of Computer Science, Qufu Normal University, Rizhao 276826, China
| | - Jun-Liang Shang
- School of Computer Science, Qufu Normal University, Rizhao 276826, China
| | - Jin-Xing Liu
- School of Computer Science, Qufu Normal University, Rizhao 276826, China
| | - Feng Li
- School of Computer Science, Qufu Normal University, Rizhao 276826, China
| | - Shasha Yuan
- School of Computer Science, Qufu Normal University, Rizhao 276826, China
| | - Juan Wang
- School of Computer Science, Qufu Normal University, Rizhao 276826, China
| |
|
34
|
Berg M, Petoukhov I, van den Ende I, Meyer KB, Guryev V, Vonk JM, Carpaij O, Banchero M, Hendriks RW, van den Berge M, Nawijn MC. FastCAR: fast correction for ambient RNA to facilitate differential gene expression analysis in single-cell RNA-sequencing datasets. BMC Genomics 2023; 24:722. [PMID: 38030970 PMCID: PMC10687889 DOI: 10.1186/s12864-023-09822-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/14/2023] [Accepted: 11/20/2023] [Indexed: 12/01/2023] Open
Abstract
Cell type-specific differential gene expression analyses based on single-cell transcriptome datasets are sensitive to the presence of cell-free mRNA in the droplets containing single cells. This so-called ambient RNA contamination may differ between samples obtained from patients and healthy controls. Current ambient RNA correction methods were not developed specifically for single-cell differential gene expression (sc-DGE) analyses and might therefore not sufficiently correct for ambient RNA-derived signals. Here, we show that ambient RNA levels are highly sample-specific. We found that without ambient RNA correction, sc-DGE analyses erroneously identify transcripts originating from ambient RNA as cell type-specific disease-associated genes. We therefore developed a computationally lean and intuitive correction method, Fast Correction for Ambient RNA (FastCAR), optimized for sc-DGE analysis of scRNA-Seq datasets generated by droplet-based methods including the 10x Genomics Chromium platform. FastCAR uses the profile of transcripts observed in libraries that likely represent empty droplets to determine the level of ambient RNA in each individual sample, and then corrects for these ambient RNA gene expression values. FastCAR can be applied as part of the data pre-processing and QC in sc-DGE workflows comparing scRNA-Seq data in a health versus disease experimental design. We compared FastCAR with two methods previously developed to remove ambient RNA, SoupX and CellBender. All three methods identified additional genes in sc-DGE analyses that were not identified in the absence of ambient RNA correction. However, we show that FastCAR performs better at correcting gene expression values attributed to ambient RNA, resulting in a lower frequency of false-positive observations. Moreover, the use of FastCAR in an sc-DGE workflow increases the cell-type specificity of sc-DGE analyses across disease conditions.
Affiliation(s)
- Marijn Berg
- Department of Pathology and Medical Biology, University of Groningen, University Medical Center Groningen, Groningen, The Netherlands.
- University of Groningen, University Medical Center Groningen, Groningen Research Institute for Asthma and COPD (GRIAC), Groningen, The Netherlands.
| | - Kerstin B Meyer
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, UK
| | - Victor Guryev
- University of Groningen, University Medical Center Groningen, Groningen Research Institute for Asthma and COPD (GRIAC), Groningen, The Netherlands
- European Research Institute for the Biology of Ageing (ERIBA), University of Groningen, University Medical Center Groningen, Groningen, The Netherlands
| | - Judith M Vonk
- University of Groningen, University Medical Center Groningen, Groningen Research Institute for Asthma and COPD (GRIAC), Groningen, The Netherlands
- Department of Epidemiology, University of Groningen, University Medical Center Groningen, Groningen, The Netherlands
| | - Orestes Carpaij
- University of Groningen, University Medical Center Groningen, Groningen Research Institute for Asthma and COPD (GRIAC), Groningen, The Netherlands
- Department of Pulmonology, University of Groningen, University Medical Center Groningen, Groningen, The Netherlands
| | - Martin Banchero
- Department of Pathology and Medical Biology, University of Groningen, University Medical Center Groningen, Groningen, The Netherlands
- University of Groningen, University Medical Center Groningen, Groningen Research Institute for Asthma and COPD (GRIAC), Groningen, The Netherlands
| | - Rudi W Hendriks
- Department of Pulmonary Medicine, Erasmus MC, University Medical Center, Rotterdam, The Netherlands
| | - Maarten van den Berge
- University of Groningen, University Medical Center Groningen, Groningen Research Institute for Asthma and COPD (GRIAC), Groningen, The Netherlands
- Department of Pulmonology, University of Groningen, University Medical Center Groningen, Groningen, The Netherlands
| | - Martijn C Nawijn
- Department of Pathology and Medical Biology, University of Groningen, University Medical Center Groningen, Groningen, The Netherlands
- University of Groningen, University Medical Center Groningen, Groningen Research Institute for Asthma and COPD (GRIAC), Groningen, The Netherlands
| |
|
35
|
Li J, Wang J, Lin Z. SGCAST: symmetric graph convolutional auto-encoder for scalable and accurate study of spatial transcriptomics. Brief Bioinform 2023; 25:bbad490. [PMID: 38171928 PMCID: PMC10782917 DOI: 10.1093/bib/bbad490] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/21/2023] [Revised: 08/02/2023] [Accepted: 12/07/2023] [Indexed: 01/05/2024] Open
Abstract
Recent advances in spatial transcriptomics (ST) have enabled comprehensive profiling of gene expression with spatial information in the context of the tissue microenvironment. However, with the improvements in the resolution and scale of ST data, deciphering spatial domains precisely while ensuring efficiency and scalability is still challenging. Here, we develop SGCAST, an efficient auto-encoder framework to identify spatial domains. SGCAST adopts a symmetric graph convolutional auto-encoder to learn aggregated latent embeddings by integrating the gene expression similarity and the proximity of the spatial spots. This framework enables a mini-batch training strategy, which makes SGCAST memory-efficient and scalable to high-resolution spatial transcriptomic data with a large number of spots. SGCAST improves the overall accuracy of spatial domain identification on benchmarking data. We also validated the performance of SGCAST on ST datasets at various scales across multiple platforms. Our study illustrates the superior capacity of SGCAST in analyzing spatial transcriptomic data.
Affiliation(s)
- Jinzhao Li
- Department of Statistics, The Chinese University of Hong Kong, Sha Tin, Hong Kong, China
| | - Jiong Wang
- School of Science and Engineering, The Chinese University of Hong Kong (Shenzhen), Shenzhen, 518172, China
| | - Zhixiang Lin
- Department of Statistics, The Chinese University of Hong Kong, Sha Tin, Hong Kong, China
| |
|
36
|
Jiang H, Huang Y, Li Q, Feng B. ScLSTM: single-cell type detection by siamese recurrent network and hierarchical clustering. BMC Bioinformatics 2023; 24:417. [PMID: 37932672 PMCID: PMC10629177 DOI: 10.1186/s12859-023-05494-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/30/2023] [Accepted: 09/21/2023] [Indexed: 11/08/2023] Open
Abstract
MOTIVATION: Categorizing cells into distinct types can shed light on biological tissue functions and interactions, and uncover specific mechanisms under pathological conditions. Since gene expression throughout a population of cells is averaged out by conventional sequencing techniques, it is challenging to distinguish between different cell types. The accumulation of single-cell RNA sequencing (scRNA-seq) data provides the foundation for a more precise classification of cell types. It is crucial to build a high-accuracy clustering approach to categorize cell types, since the imbalance of cell types and differences in the distribution of scRNA-seq data affect single-cell clustering and visualization outcomes. RESULT: To achieve single-cell type detection, we propose a meta-learning-based single-cell clustering model called ScLSTM. Specifically, ScLSTM transforms the single-cell type detection problem into a hierarchical classification problem based on feature extraction by the siamese long short-term memory (LSTM) network. The similarity matrix derived from the improved sigmoid kernel is mapped to the siamese LSTM feature space to analyze the differences between cells. ScLSTM demonstrated superior classification performance on 8 scRNA-seq data sets of different platforms, species, and tissues. Further quantitative analysis and visualization of the human breast cancer data set validated the superiority and capability of ScLSTM in recognizing cell types.
Affiliation(s)
- Hanjing Jiang
- Key Laboratory of Image Information Processing and Intelligent Control of Education Ministry of China, Institute of Artificial Intelligence, School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, Wuhan, 430074, China
| | - Yabing Huang
- Department of Pathology, Renmin Hospital of Wuhan University, Wuhan, 430060, China.
| | - Qianpeng Li
- Institute of Automation, Chinese Academy of Sciences, Beijing, 100190, China
| | - Boyuan Feng
- Key Laboratory of Image Information Processing and Intelligent Control of Education Ministry of China, Institute of Artificial Intelligence, School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, Wuhan, 430074, China
| |
Collapse
|
37
|
Carbonetto P, Luo K, Sarkar A, Hung A, Tayeb K, Pott S, Stephens M. GoM DE: interpreting structure in sequence count data with differential expression analysis allowing for grades of membership. Genome Biol 2023; 24:236. [PMID: 37858253 PMCID: PMC10588049 DOI: 10.1186/s13059-023-03067-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/03/2023] [Accepted: 09/20/2023] [Indexed: 10/21/2023] Open
Abstract
Parts-based representations, such as non-negative matrix factorization and topic modeling, have been used to identify structure from single-cell sequencing data sets, in particular structure that is not as well captured by clustering or other dimensionality reduction methods. However, interpreting the individual parts remains a challenge. To address this challenge, we extend methods for differential expression analysis by allowing cells to have partial membership to multiple groups. We call this grade of membership differential expression (GoM DE). We illustrate the benefits of GoM DE for annotating topics identified in several single-cell RNA-seq and ATAC-seq data sets.
Affiliation(s)
- Peter Carbonetto
  - Department of Human Genetics, University of Chicago, Chicago, IL, USA
  - Research Computing Center, University of Chicago, Chicago, IL, USA
- Kaixuan Luo
  - Department of Human Genetics, University of Chicago, Chicago, IL, USA
- Abhishek Sarkar
  - Department of Human Genetics, University of Chicago, Chicago, IL, USA
  - Vesalius Therapeutics, Cambridge, MA, USA
- Anthony Hung
  - Department of Human Genetics, University of Chicago, Chicago, IL, USA
  - Section of Genetic Medicine, University of Chicago, Chicago, IL, USA
- Karl Tayeb
  - Department of Human Genetics, University of Chicago, Chicago, IL, USA
  - Committee on Genetics, Genomics and Systems Biology, University of Chicago, Chicago, IL, USA
- Sebastian Pott
  - Department of Human Genetics, University of Chicago, Chicago, IL, USA
  - Section of Genetic Medicine, University of Chicago, Chicago, IL, USA
- Matthew Stephens
  - Department of Human Genetics, University of Chicago, Chicago, IL, USA.
  - Department of Statistics, University of Chicago, Chicago, IL, USA.
|
38
|
Du J, Gu XR, Yu XX, Cao YJ, Hou J. Essential procedures of single-cell RNA sequencing in multiple myeloma and its translational value. BLOOD SCIENCE 2023; 5:221-236. [PMID: 37941914 PMCID: PMC10629747 DOI: 10.1097/bs9.0000000000000172] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/17/2023] [Accepted: 09/18/2023] [Indexed: 11/10/2023] Open
Abstract
Multiple myeloma (MM) is a malignant neoplasm characterized by clonal proliferation of abnormal plasma cells. In many countries, it ranks as the second most prevalent malignant neoplasm of the hematopoietic system. Although treatment methods for MM have been continuously improved and the survival of patients has been dramatically prolonged, MM remains an incurable disease with a high probability of recurrence. As such, there are still many challenges to be addressed. One promising approach is single-cell RNA sequencing (scRNA-seq), which can elucidate the transcriptome heterogeneity of individual cells and reveal previously unknown cell types or states in complex tissues. In this review, we outlined the experimental workflow of scRNA-seq in MM and listed some commonly used scRNA-seq platforms and analytical tools. In addition, with the advent of scRNA-seq, many studies have made new progress in understanding the key molecular mechanisms of MM clonal evolution, cell interactions and molecular regulation in the microenvironment, and drug resistance mechanisms in targeted therapy. We summarized the main findings and sequencing platforms for applying scRNA-seq to MM research and proposed broad directions for targeted therapies based on these findings.
Affiliation(s)
- Jun Du
  - Department of Hematology, Renji Hospital, School of Medicine, Shanghai Jiao Tong University, Shanghai 200127, China
- Xiao-Ran Gu
  - School of Medicine, Shanghai Jiao Tong University, Shanghai 200025, China
- Xiao-Xiao Yu
  - School of Medicine, Shanghai Jiao Tong University, Shanghai 200025, China
- Yang-Jia Cao
  - Department of Hematology, First Affiliated Hospital of Xi’an Jiaotong University, Xi’an, Shaanxi 710000, China
- Jian Hou
  - Department of Hematology, Renji Hospital, School of Medicine, Shanghai Jiao Tong University, Shanghai 200127, China
|
39
|
Xia L, Lee C, Li JJ. scDEED: a statistical method for detecting dubious 2D single-cell embeddings and optimizing t-SNE and UMAP hyperparameters. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.04.21.537839. [PMID: 37163087 PMCID: PMC10168265 DOI: 10.1101/2023.04.21.537839] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 05/11/2023]
Abstract
Two-dimensional (2D) embedding methods are crucial for single-cell data visualization. Popular methods such as t-SNE and UMAP are commonly used for visualizing cell clusters; however, it is well known that t-SNE and UMAP's 2D embedding might not reliably inform the similarities among cell clusters. Motivated by this challenge, we developed a statistical method, scDEED, for detecting dubious cell embeddings output by any 2D-embedding method. By calculating a reliability score for every cell embedding, scDEED identifies the cell embeddings with low reliability scores as dubious and those with high reliability scores as trustworthy. Moreover, by minimizing the number of dubious cell embeddings, scDEED provides intuitive guidance for optimizing the hyperparameters of an embedding method. Applied to multiple scRNA-seq datasets, scDEED demonstrates its effectiveness for detecting dubious cell embeddings and optimizing the hyperparameters of t-SNE and UMAP.
|
40
|
Carbonetto P, Luo K, Sarkar A, Hung A, Tayeb K, Pott S, Stephens M. GoM DE: interpreting structure in sequence count data with differential expression analysis allowing for grades of membership. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.03.03.531029. [PMID: 36945441 PMCID: PMC10028846 DOI: 10.1101/2023.03.03.531029] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/11/2023]
Abstract
Parts-based representations, such as non-negative matrix factorization and topic modeling, have been used to identify structure from single-cell sequencing data sets, in particular structure that is not as well captured by clustering or other dimensionality reduction methods. However, interpreting the individual parts remains a challenge. To address this challenge, we extend methods for differential expression analysis by allowing cells to have partial membership to multiple groups. We call this grade of membership differential expression (GoM DE). We illustrate the benefits of GoM DE for annotating topics identified in several single-cell RNA-seq and ATAC-seq data sets.
Affiliation(s)
- Peter Carbonetto
  - Department of Human Genetics, University of Chicago, Chicago, IL, USA
  - Research Computing Center, University of Chicago, Chicago, IL, USA
- Kaixuan Luo
  - Department of Human Genetics, University of Chicago, Chicago, IL, USA
- Abhishek Sarkar
  - Department of Human Genetics, University of Chicago, Chicago, IL, USA
  - Vesalius Therapeutics, Cambridge, MA, USA
- Anthony Hung
  - Department of Human Genetics, University of Chicago, Chicago, IL, USA
  - Section of Genetic Medicine, University of Chicago, Chicago, IL, USA
- Karl Tayeb
  - Department of Human Genetics, University of Chicago, Chicago, IL, USA
  - Committee on Genetics, Genomics and Systems Biology, University of Chicago, Chicago, IL, USA
- Sebastian Pott
  - Department of Human Genetics, University of Chicago, Chicago, IL, USA
  - Section of Genetic Medicine, University of Chicago, Chicago, IL, USA
- Matthew Stephens
  - Department of Human Genetics, University of Chicago, Chicago, IL, USA
  - Department of Statistics, University of Chicago, Chicago, IL, USA
|
41
|
Li Y, Wu M, Ma S, Wu M. ZINBMM: a general mixture model for simultaneous clustering and gene selection using single-cell transcriptomic data. Genome Biol 2023; 24:208. [PMID: 37697330 PMCID: PMC10496184 DOI: 10.1186/s13059-023-03046-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/11/2022] [Accepted: 08/22/2023] [Indexed: 09/13/2023] Open
Abstract
Clustering is a critical component of single-cell RNA sequencing (scRNA-seq) data analysis and can help reveal cell types and infer cell lineages. Despite considerable successes, there are few methods tailored to investigating cluster-specific genes contributing to cell heterogeneity, which can promote biological understanding of cell heterogeneity. In this study, we propose a zero-inflated negative binomial mixture model (ZINBMM) that simultaneously achieves effective scRNA-seq data clustering and gene selection. ZINBMM conducts a systemic analysis on raw counts, accommodating both batch effects and dropout events. Simulations and the analysis of five scRNA-seq datasets demonstrate the practical applicability of ZINBMM.
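As a rough illustration of the zero-inflated negative binomial (ZINB) distribution that gives ZINBMM its name, the sketch below evaluates a ZINB probability mass function using only the Python standard library. The parameterization (mean `mu`, dispersion `theta`, dropout probability `pi`) and the function names are assumptions made for this example; the abstract does not state the parameterization used in the paper, and the mixture, batch-effect, and gene-selection components of ZINBMM are not shown.

```python
import math


def nb_log_pmf(x, mu, theta):
    """Log-probability of count x under a negative binomial with mean mu > 0
    and dispersion theta > 0 (variance = mu + mu**2 / theta)."""
    return (
        math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1)
        + theta * math.log(theta / (theta + mu))
        + x * math.log(mu / (theta + mu))
    )


def zinb_pmf(x, mu, theta, pi):
    """Probability of count x under a zero-inflated negative binomial.

    With probability pi the count is a structural zero (a dropout);
    otherwise it is drawn from the negative binomial above.
    """
    nb = math.exp(nb_log_pmf(x, mu, theta))
    point_mass = pi if x == 0 else 0.0
    return point_mass + (1.0 - pi) * nb


# Zero inflation raises the probability of observing a zero count.
p_zero_nb = math.exp(nb_log_pmf(0, mu=1.0, theta=1.0))
p_zero_zinb = zinb_pmf(0, mu=1.0, theta=1.0, pi=0.2)
```

In this parameterization, setting `pi` to 0 recovers the plain negative binomial, which is one way to see dropout as an extra source of zeros on top of count overdispersion.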
Affiliation(s)
- Yang Li
  - Center for Applied Statistics and School of Statistics, Renmin University of China, Beijing, China
  - RSS and China-Re Life Joint Lab on Public Health and Risk Management, Renmin University of China, Beijing, China
  - Statistical Consulting Center, Renmin University of China, Beijing, China
- Mingcong Wu
  - Center for Applied Statistics and School of Statistics, Renmin University of China, Beijing, China
  - Statistical Consulting Center, Renmin University of China, Beijing, China
- Shuangge Ma
  - Department of Biostatistics, Yale University, New Haven, USA
- Mengyun Wu
  - School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai, China.
|
42
|
Madadi Y, Monavarfeshani A, Chen H, Stamer WD, Williams RW, Yousefi S. Artificial Intelligence Models for Cell Type and Subtype Identification Based on Single-Cell RNA Sequencing Data in Vision Science. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:2837-2852. [PMID: 37294649 PMCID: PMC10631573 DOI: 10.1109/tcbb.2023.3284795] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/11/2023]
Abstract
Single-cell RNA sequencing (scRNA-seq) provides a high-throughput, quantitative, and unbiased framework for scientists in many research fields to identify and characterize cell types within heterogeneous cell populations from various tissues. However, scRNA-seq-based identification of discrete cell types is still labor-intensive and depends on prior molecular knowledge. Artificial intelligence has provided faster, more accurate, and user-friendly approaches for cell-type identification. In this review, we discuss recent advances in cell-type identification methods using artificial intelligence techniques based on single-cell and single-nucleus RNA sequencing data in vision science. The main purpose of this review paper is to assist vision scientists not only to select suitable datasets for their problems, but also to be aware of the appropriate computational tools to perform their analysis. Developing novel methods for scRNA-seq data analysis remains to be addressed in future studies.
|
43
|
Erfanian N, Heydari AA, Feriz AM, Iañez P, Derakhshani A, Ghasemigol M, Farahpour M, Razavi SM, Nasseri S, Safarpour H, Sahebkar A. Deep learning applications in single-cell genomics and transcriptomics data analysis. Biomed Pharmacother 2023; 165:115077. [PMID: 37393865 DOI: 10.1016/j.biopha.2023.115077] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 04/24/2023] [Revised: 06/22/2023] [Accepted: 06/23/2023] [Indexed: 07/04/2023] Open
Abstract
Traditional bulk sequencing methods are limited to measuring the average signal in a group of cells, potentially masking heterogeneity and rare populations. Single-cell resolution, however, enhances our understanding of complex biological systems and diseases, such as cancer, the immune system, and chronic diseases. However, single-cell technologies generate massive amounts of data that are often high-dimensional, sparse, and complex, thus making analysis with traditional computational approaches difficult and unfeasible. To tackle these challenges, many are turning to deep learning (DL) methods as potential alternatives to the conventional machine learning (ML) algorithms for single-cell studies. DL is a branch of ML capable of extracting high-level features from raw inputs in multiple stages. Compared to traditional ML, DL models have provided significant improvements across many domains and applications. In this work, we examine DL applications in genomics, transcriptomics, spatial transcriptomics, and multi-omics integration, and address whether DL techniques will prove to be advantageous or if the single-cell omics domain poses unique challenges. Through a systematic literature review, we have found that DL has not yet revolutionized the most pressing challenges of the single-cell omics field. However, using DL models for single-cell omics has shown promising results (in many cases outperforming the previous state-of-the-art models) in data preprocessing and downstream analysis. Although developments of DL algorithms for single-cell omics have generally been gradual, recent advances reveal that DL can offer valuable resources in fast-tracking and advancing single-cell research.
Affiliation(s)
- Nafiseh Erfanian
  - Student Research Committee, Birjand University of Medical Sciences, Birjand, Iran
- A Ali Heydari
  - Department of Applied Mathematics, University of California, Merced, CA, USA; Health Sciences Research Institute, University of California, Merced, CA, USA
- Adib Miraki Feriz
  - Student Research Committee, Birjand University of Medical Sciences, Birjand, Iran
- Pablo Iañez
  - Cellular Systems Genomics Group, Josep Carreras Research Institute, Barcelona, Spain
- Afshin Derakhshani
  - Department of Biochemistry and Molecular Biology, University of Calgary, Calgary, AB, Canada
- Mohsen Farahpour
  - Department of Electronics, Faculty of Electrical and Computer Engineering, University of Birjand, Birjand, Iran
- Seyyed Mohammad Razavi
  - Department of Electronics, Faculty of Electrical and Computer Engineering, University of Birjand, Birjand, Iran
- Saeed Nasseri
  - Cellular and Molecular Research Center, Birjand University of Medical Sciences, Birjand, Iran
- Hossein Safarpour
  - Cellular and Molecular Research Center, Birjand University of Medical Sciences, Birjand, Iran.
- Amirhossein Sahebkar
  - Biotechnology Research Center, Pharmaceutical Technology Institute, Mashhad University of Medical Sciences, Mashhad, Iran; Applied Biomedical Research Center, Mashhad University of Medical Sciences, Mashhad, Iran; Department of Biotechnology, School of Pharmacy, Mashhad University of Medical Sciences, Mashhad, Iran.
|
44
|
Jones A, Townes FW, Li D, Engelhardt BE. Alignment of spatial genomics data using deep Gaussian processes. Nat Methods 2023; 20:1379-1387. [PMID: 37592182 PMCID: PMC10482692 DOI: 10.1038/s41592-023-01972-2] [Citation(s) in RCA: 24] [Impact Index Per Article: 12.0] [Received: 01/10/2022] [Accepted: 07/06/2023] [Indexed: 08/19/2023]
Abstract
Spatially resolved genomic technologies have allowed us to study the physical organization of cells and tissues, and promise an understanding of local interactions between cells. However, it remains difficult to precisely align spatial observations across slices, samples, scales, individuals and technologies. Here, we propose a probabilistic model that aligns spatially-resolved samples onto a known or unknown common coordinate system (CCS) with respect to phenotypic readouts (for example, gene expression). Our method, Gaussian Process Spatial Alignment (GPSA), consists of a two-layer Gaussian process: the first layer maps observed samples' spatial locations onto a CCS, and the second layer maps from the CCS to the observed readouts. Our approach enables complex downstream spatially aware analyses that are impossible or inaccurate with unaligned data, including an analysis of variance, creation of a dense three-dimensional (3D) atlas from sparse two-dimensional (2D) slices or association tests across data modalities.
Affiliation(s)
- Andrew Jones
  - Department of Computer Science, Princeton University, Princeton, NJ, USA
- F William Townes
  - Department of Statistics and Data Science, Carnegie Mellon University, Pittsburgh, PA, USA
- Didong Li
  - Department of Biostatistics, University of North Carolina, Chapel Hill, NC, USA
- Barbara E Engelhardt
  - Gladstone Institutes, San Francisco, CA, USA.
  - Department of Biomedical Data Science, Stanford University, Stanford, CA, USA.
|
45
|
Gunawan I, Vafaee F, Meijering E, Lock JG. An introduction to representation learning for single-cell data analysis. CELL REPORTS METHODS 2023; 3:100547. [PMID: 37671013 PMCID: PMC10475795 DOI: 10.1016/j.crmeth.2023.100547] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Indexed: 09/07/2023]
Abstract
Single-cell-resolved systems biology methods, including omics- and imaging-based measurement modalities, generate a wealth of high-dimensional data characterizing the heterogeneity of cell populations. Representation learning methods are routinely used to analyze these complex, high-dimensional data by projecting them into lower-dimensional embeddings. This facilitates the interpretation and interrogation of the structures, dynamics, and regulation of cell heterogeneity. Reflecting their central role in analyzing diverse single-cell data types, a myriad of representation learning methods exist, with new approaches continually emerging. Here, we contrast general features of representation learning methods spanning statistical, manifold learning, and neural network approaches. We consider key steps involved in representation learning with single-cell data, including data pre-processing, hyperparameter optimization, downstream analysis, and biological validation. Interdependencies and contingencies linking these steps are also highlighted. This overview is intended to guide researchers in the selection, application, and optimization of representation learning strategies for current and future single-cell research applications.
Affiliation(s)
- Ihuan Gunawan
  - School of Biomedical Sciences, Faculty of Medicine and Health, University of New South Wales, Sydney, NSW, Australia
  - School of Computer Science and Engineering, Faculty of Engineering, University of New South Wales, Sydney, NSW, Australia
- Fatemeh Vafaee
  - School of Biotechnology and Biomolecular Sciences, Faculty of Science, University of New South Wales, Sydney, NSW, Australia
  - UNSW Data Science Hub, University of New South Wales, Sydney, NSW, Australia
- Erik Meijering
  - School of Computer Science and Engineering, Faculty of Engineering, University of New South Wales, Sydney, NSW, Australia
- John George Lock
  - School of Biomedical Sciences, Faculty of Medicine and Health, University of New South Wales, Sydney, NSW, Australia
  - UNSW Data Science Hub, University of New South Wales, Sydney, NSW, Australia
  - Ingham Institute for Applied Medical Research, Liverpool, NSW, Australia
|
46
|
Xi NM, Li JJ. Exploring the optimization of autoencoder design for imputing single-cell RNA sequencing data. Comput Struct Biotechnol J 2023; 21:4079-4095. [PMID: 37671239 PMCID: PMC10475479 DOI: 10.1016/j.csbj.2023.07.041] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/10/2023] [Revised: 07/22/2023] [Accepted: 07/31/2023] [Indexed: 09/07/2023] Open
Abstract
Autoencoders are the backbones of many imputation methods that aim to relieve the sparsity issue in single-cell RNA sequencing (scRNA-seq) data. The imputation performance of an autoencoder relies on both the neural network architecture and the hyperparameter choice. So far, literature in the single-cell field lacks a formal discussion on how to design the neural network and choose the hyperparameters. Here, we conducted an empirical study to answer this question. Our study used many real and simulated scRNA-seq datasets to examine the impacts of the neural network architecture, the activation function, and the regularization strategy on imputation accuracy and downstream analyses. Our results show that (i) deeper and narrower autoencoders generally lead to better imputation performance; (ii) the sigmoid and tanh activation functions consistently outperform other commonly used functions including ReLU; (iii) regularization improves the accuracy of imputation and downstream cell clustering and DE gene analyses. Notably, our results differ from common practices in the computer vision field regarding the activation function and the regularization strategy. Overall, our study offers practical guidance on how to optimize the autoencoder design for scRNA-seq data imputation.
Affiliation(s)
- Nan Miles Xi
  - Department of Mathematics and Statistics, Loyola University Chicago, Chicago, IL 60660, USA
- Jingyi Jessica Li
  - Department of Statistics and Data Science, University of California, Los Angeles, CA 90095-1554, USA
  - Department of Human Genetics, University of California, Los Angeles, CA 90095-7088, USA
  - Department of Computational Medicine, University of California, Los Angeles, CA 90095-1766, USA
  - Department of Biostatistics, University of California, Los Angeles, CA 90095-1772, USA
|
47
|
Pan Y, Landis JT, Moorad R, Wu D, Marron JS, Dittmer DP. The Poisson distribution model fits UMI-based single-cell RNA-sequencing data. BMC Bioinformatics 2023; 24:256. [PMID: 37330471 PMCID: PMC10276395 DOI: 10.1186/s12859-023-05349-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/26/2023] [Accepted: 05/24/2023] [Indexed: 06/19/2023] Open
Abstract
BACKGROUND Modeling of single cell RNA-sequencing (scRNA-seq) data remains challenging due to a high percentage of zeros and data heterogeneity, so improved modeling has strong potential to benefit many downstream data analyses. The existing zero-inflated or over-dispersed models are based on aggregations at either the gene or the cell level. However, they typically lose accuracy due to a too crude aggregation at those two levels. RESULTS We avoid the crude approximations entailed by such aggregation through proposing an independent Poisson distribution (IPD) particularly at each individual entry in the scRNA-seq data matrix. This approach naturally and intuitively models the large number of zeros as matrix entries with a very small Poisson parameter. The critical challenge of cell clustering is approached via a novel data representation as Departures from a simple homogeneous IPD (DIPD) to capture the per-gene-per-cell intrinsic heterogeneity generated by cell clusters. Our experiments using real data and crafted experiments show that using DIPD as a data representation for scRNA-seq data can uncover novel cell subtypes that are missed or can only be found by careful parameter tuning using conventional methods. CONCLUSIONS This new method has multiple advantages, including (1) no need for prior feature selection or manual optimization of hyperparameters; (2) flexibility to combine with and improve upon other methods, such as Seurat. Another novel contribution is the use of crafted experiments as part of the validation of our newly developed DIPD-based clustering pipeline. This new clustering pipeline is implemented in the R (CRAN) package scpoisson.
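The idea that many zeros arise naturally from entries with a very small Poisson parameter can be checked numerically, since a Poisson variable with rate λ equals zero with probability exp(-λ). The sketch below is an illustration only: the rank-one rate estimate (row total times column total divided by grand total) and the function names are assumptions made for this example, not the IPD or DIPD estimators of the scpoisson package.

```python
import math


def expected_zero_fraction(counts):
    """Expected fraction of zeros if every entry (i, j) were an independent
    Poisson draw with rate row_total[i] * col_total[j] / grand_total.

    The rank-one rate is an assumption made for this illustration.
    """
    row_totals = [sum(row) for row in counts]
    col_totals = [sum(col) for col in zip(*counts)]
    grand_total = sum(row_totals)
    n_entries = len(row_totals) * len(col_totals)
    # P(X = 0) = exp(-rate) for a Poisson variable with the given rate.
    p_zero = sum(
        math.exp(-r * c / grand_total) for r in row_totals for c in col_totals
    )
    return p_zero / n_entries


def observed_zero_fraction(counts):
    """Fraction of entries in the matrix that are exactly zero."""
    entries = [value for row in counts for value in row]
    return sum(1 for value in entries if value == 0) / len(entries)


# Toy cells-by-genes matrix: every rate is 1 * 1 / 2 = 0.5.
toy = [[0, 1], [1, 0]]
expected = expected_zero_fraction(toy)
observed = observed_zero_fraction(toy)
```

Comparing the two fractions on a real UMI count matrix gives a quick, informal sense of whether a separate zero-inflation component is needed for that data set.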
Affiliation(s)
- Yue Pan
  - Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, USA
  - Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, Chapel Hill, USA
- Justin T Landis
  - Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, Chapel Hill, USA
  - Department of Microbiology and Immunology, University of North Carolina at Chapel Hill, Chapel Hill, USA
- Razia Moorad
  - Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, Chapel Hill, USA
  - Department of Microbiology and Immunology, University of North Carolina at Chapel Hill, Chapel Hill, USA
- Di Wu
  - Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, USA
  - Adams School of Dentistry, University of North Carolina at Chapel Hill, Chapel Hill, USA
- J S Marron
  - Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, USA
  - Department of Statistics and Operations Research, University of North Carolina at Chapel Hill, Chapel Hill, USA
- Dirk P Dittmer
  - Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, Chapel Hill, USA.
  - Department of Microbiology and Immunology, University of North Carolina at Chapel Hill, Chapel Hill, USA.
|
48
|
Pan W, Long F, Pan J. ScInfoVAE: interpretable dimensional reduction of single cell transcription data with variational autoencoders and extended mutual information regularization. BioData Min 2023; 16:17. [PMID: 37301826 DOI: 10.1186/s13040-023-00333-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/23/2022] [Accepted: 06/05/2023] [Indexed: 06/12/2023] Open
Abstract
Single-cell RNA-sequencing (scRNA-seq) data can serve as a good indicator of cell-to-cell heterogeneity and can aid in the study of cell growth by identifying cell types. Recently, advances in variational autoencoders (VAEs) have demonstrated their ability to learn robust feature representations for scRNA-seq. However, it has been observed that VAEs tend to ignore the latent variables when combined with a decoding distribution that is too flexible. In this paper, we introduce ScInfoVAE, a dimensional reduction method based on the mutual information variational autoencoder (InfoVAE), which can more effectively identify various cell types in scRNA-seq data of complex tissues. ScInfoVAE combines an InfoVAE deep model with a zero-inflated negative binomial distribution model, reconstructing the objective function to denoise scRNA-seq data and learn an efficient low-dimensional representation of it. We use ScInfoVAE to analyze the clustering performance of 15 real scRNA-seq datasets and demonstrate that our method provides high clustering performance. In addition, we use simulated data to investigate the interpretability of feature extraction, and visualization results show that the low-dimensional representation learned by ScInfoVAE retains the local and global neighborhood structure of the data well. Our model can also significantly improve the quality of the variational posterior.
Affiliation(s)
- Weiquan Pan
  - School of Mathematics and Statistics, Yulin Normal University, Yulin, China
- Faning Long
  - School of Computer Science and Engineering, Yulin Normal University, Yulin, China.
- Jian Pan
  - School of Mathematics and Statistics, Yulin Normal University, Yulin, China
|
49
|
Ranek JS, Stallaert W, Milner J, Stanley N, Purvis JE. Feature selection for preserving biological trajectories in single-cell data. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.05.09.540043. [PMID: 37214963 PMCID: PMC10197710 DOI: 10.1101/2023.05.09.540043] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 05/24/2023]
Abstract
Single-cell technologies can readily measure the expression of thousands of molecular features from individual cells undergoing dynamic biological processes, such as cellular differentiation, immune response, and disease progression. While examining cells along a computationally ordered pseudotime offers the potential to study how subtle changes in gene or protein expression impact cell fate decision-making, identifying characteristic features that drive continuous biological processes remains difficult in unenriched and noisy single-cell data. Given that all profiled sources of feature variation contribute to the cell-to-cell distances that define an inferred cellular trajectory, including confounding sources of biological variation (e.g. cell cycle or metabolic state) or noisy and irrelevant features (e.g. measurements with low signal-to-noise ratio) can mask the underlying trajectory of study and hinder inference. Here, we present DELVE (dynamic selection of locally covarying features), an unsupervised feature selection method for identifying a representative subset of dynamically-expressed molecular features that recapitulates cellular trajectories. In contrast to previous work, DELVE uses a bottom-up approach to mitigate the effect of unwanted sources of variation confounding inference, and instead models cell states from dynamic feature modules that constitute core regulatory complexes. Using simulations, single-cell RNA sequencing data, and iterative immunofluorescence imaging data in the context of the cell cycle and cellular differentiation, we demonstrate that DELVE selects features that more accurately characterize cell populations and improve the recovery of cell type transitions. This feature selection framework provides an alternative approach for improving trajectory inference and uncovering co-variation amongst features along a biological trajectory.
DELVE is implemented as an open-source Python package and is publicly available at: https://github.com/jranek/delve.
Collapse
Affiliation(s)
- Jolene S. Ranek
- Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
- Computational Medicine Program, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
| | - Wayne Stallaert
- Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, PA, USA
| | - Justin Milner
- Department of Microbiology and Immunology, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
| | - Natalie Stanley
- Computational Medicine Program, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
- Department of Computer Science, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
| | - Jeremy E. Purvis
- Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
- Computational Medicine Program, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
|
50
|
Bouland GA, Mahfouz A, Reinders MJT. Consequences and opportunities arising due to sparser single-cell RNA-seq datasets. Genome Biol 2023; 24:86. [PMID: 37085823 PMCID: PMC10120229 DOI: 10.1186/s13059-023-02933-w] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Received: 06/02/2022] [Accepted: 04/10/2023] [Indexed: 04/23/2023]
Abstract
With the number of cells measured in single-cell RNA sequencing (scRNA-seq) datasets increasing exponentially, and sparsity increasing concurrently as more zero counts are measured for many genes, we demonstrate here that downstream analyses on binary-based gene expression give results similar to those of count-based analyses. Moreover, a binary representation scales up to ~50-fold more cells that can be analyzed using the same computational resources. We also highlight the possibilities provided by binarized scRNA-seq data. Development of specialized tools for bit-aware implementations of downstream analytical tasks will enable a more fine-grained resolution of biological heterogeneity.
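The idea of a binary representation can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes a simulated Poisson count matrix, treats a gene as detected when its count is greater than zero, and uses hypothetical helper names (`binarize_counts`, `pack_binary`, `jaccard_similarity`). The memory ratio it prints reflects only the chosen data types and is not a reproduction of the ~50-fold figure reported in the abstract.

```python
import numpy as np


def binarize_counts(counts: np.ndarray) -> np.ndarray:
    """Return a boolean cells-by-genes matrix: True where a gene was detected (count > 0)."""
    return counts > 0


def pack_binary(binary: np.ndarray) -> np.ndarray:
    """Pack each cell's detection profile into bits (8 genes per byte)."""
    return np.packbits(binary, axis=1)


def jaccard_similarity(binary: np.ndarray) -> np.ndarray:
    """Pairwise Jaccard similarity between cells, based on shared detected genes."""
    b = binary.astype(np.int64)
    intersection = b @ b.T
    detected = b.sum(axis=1)
    union = detected[:, None] + detected[None, :] - intersection
    # Cells with no detected genes have an empty union; similarity is set to 0 there.
    return np.divide(
        intersection,
        union,
        out=np.zeros(union.shape, dtype=float),
        where=union > 0,
    )


# Simulated sparse counts (assumption: 100 cells, 2000 genes, Poisson with mean 0.2).
rng = np.random.default_rng(0)
counts = rng.poisson(lam=0.2, size=(100, 2000)).astype(np.int32)

binary = binarize_counts(counts)
packed = pack_binary(binary)
similarity = jaccard_similarity(binary)

# int32 counts use 4 bytes per entry; packed bits use 1/8 byte per entry.
print(counts.nbytes, packed.nbytes)  # 800000 25000
```

The packed matrix can be restored with `np.unpackbits(packed, axis=1)`, so the bit-level storage loses nothing relative to the boolean matrix; the only information discarded is the count magnitude.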
Affiliation(s)
- Gerard A Bouland
- Delft Bioinformatics Lab, Delft University of Technology, Delft, The Netherlands
- Department of Human Genetics, Leiden University Medical Center, Leiden, 2333ZC, The Netherlands
- Ahmed Mahfouz
- Delft Bioinformatics Lab, Delft University of Technology, Delft, The Netherlands
- Department of Human Genetics, Leiden University Medical Center, Leiden, 2333ZC, The Netherlands
- Leiden Computational Biology Center, Leiden University Medical Center, Leiden, 2333ZC, The Netherlands
- Marcel J T Reinders
- Delft Bioinformatics Lab, Delft University of Technology, Delft, The Netherlands
- Department of Human Genetics, Leiden University Medical Center, Leiden, 2333ZC, The Netherlands
- Leiden Computational Biology Center, Leiden University Medical Center, Leiden, 2333ZC, The Netherlands
|