1
Arya A, Tripathi P, Dubey N, Aier I, Kumar Varadwaj P. Navigating single-cell RNA-sequencing: protocols, tools, databases, and applications. Genomics Inform 2025; 23:13. [PMID: 40382658 DOI: 10.1186/s44342-025-00044-5] [Received: 01/22/2025] [Accepted: 04/07/2025] [Indexed: 05/20/2025] [Open Access]
Abstract
Single-cell RNA-sequencing (scRNA-seq) technology brought about a revolutionary change in the transcriptomic world, paving the way for comprehensive analysis of cellular heterogeneity in complex biological systems. It enabled researchers to see how different cells behaved at single-cell levels, providing new insights into the process. However, despite all these advancements, scRNA-seq also experiences challenges related to the complexity of data analysis, interpretation, and multi-omics data integration. In this review, these complications were discussed in detail, directly pointing at the optimization of scRNA-seq approaches and understanding the world of single-cell and its dynamics. Different protocols and currently functional single-cell databases were also covered. This review highlights different tools for the analysis of scRNA-seq and their methodologies, emphasizing innovative techniques that enhance resolution and accuracy at a single-cell level. Various applications were explored across domains including drug discovery, tumor microenvironment (TME), biomarker discovery, and microbial profiling, and case studies were discussed to explain the importance of scRNA-seq by uncovering novel and rare cell types and their identification. This review underlines a crucial aspect of scRNA-seq in the advancement of personalized medicine and highlights its potential to understand the complexity of biological systems.
Affiliation(s)
- Ankish Arya
  - Department of Applied Sciences, Indian Institute of Information Technology Allahabad, Jhalwa, Prayagraj, 211015, Uttar Pradesh, India
- Prabhat Tripathi
  - Department of Applied Sciences, Indian Institute of Information Technology Allahabad, Jhalwa, Prayagraj, 211015, Uttar Pradesh, India
- Nidhi Dubey
  - Department of Applied Sciences, Indian Institute of Information Technology Allahabad, Jhalwa, Prayagraj, 211015, Uttar Pradesh, India
- Imlimaong Aier
  - Department of Applied Sciences, Indian Institute of Information Technology Allahabad, Jhalwa, Prayagraj, 211015, Uttar Pradesh, India
- Pritish Kumar Varadwaj
  - Department of Applied Sciences, Indian Institute of Information Technology Allahabad, Jhalwa, Prayagraj, 211015, Uttar Pradesh, India
2
Li T, Wang Z, Liu Y, He S, Zou Q, Zhang Y. An overview of computational methods in single-cell transcriptomic cell type annotation. Brief Bioinform 2025; 26:bbaf207. [PMID: 40347979 PMCID: PMC12065632 DOI: 10.1093/bib/bbaf207] [Received: 12/11/2024] [Revised: 03/14/2025] [Accepted: 04/01/2025] [Indexed: 05/14/2025] [Open Access]
Abstract
The rapid accumulation of single-cell RNA sequencing data has provided unprecedented computational resources for cell type annotation, significantly advancing our understanding of cellular heterogeneity. Leveraging gene expression profiles derived from transcriptomic data, researchers can accurately infer cell types, sparking the development of numerous innovative annotation methods. These methods utilize a range of strategies, including marker genes, correlation-based matching, and supervised learning, to classify cell types. In this review, we systematically examine these annotation approaches based on transcriptomics-specific gene expression profiles and provide a comprehensive comparison and categorization of these methods. Furthermore, we focus on the main challenges in the annotation process, especially the long-tail distribution problem arising from data imbalance in rare cell types. We discuss the potential of deep learning techniques to address these issues and enhance model capability in recognizing novel cell types within an open-world framework.
Affiliation(s)
- Tianhao Li
  - School of Computer Science, Chengdu University of Information Technology, No. 24 Block 1, Xuefu Road, 610225 Chengdu, China
- Zixuan Wang
  - College of Electronics and Information Engineering, Sichuan University, No. 24 South Section 1, 1st Ring Road, 610065 Chengdu, China
- Yuhang Liu
  - Faculty of Applied Sciences, Macao Polytechnic University, 999078 Macao, China
- Sihan He
  - School of Computer Science, Chengdu University of Information Technology, No. 24 Block 1, Xuefu Road, 610225 Chengdu, China
- Quan Zou
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Shahe Campus: No. 4, Section 2, North Jianshe Road, 611731 Chengdu, China
- Yongqing Zhang
  - School of Computer Science, Chengdu University of Information Technology, No. 24 Block 1, Xuefu Road, 610225 Chengdu, China
3
Wu W, Wang S, Zhang K, Li H, Qiao S, Zhang Y, Pang S. scMDCL: A Deep Collaborative Contrastive Learning Framework for Matched Single-Cell Multiomics Data Clustering. J Chem Inf Model 2025; 65:3048-3063. [PMID: 40068854 DOI: 10.1021/acs.jcim.4c02114] [Indexed: 03/25/2025]
Abstract
Single-cell multiomics clustering integrates multiple omics data to analyze cellular heterogeneity and is crucial for uncovering complex biological processes and disease mechanisms. However, existing matched single-cell multiomics clustering methods often neglect the full utilization of intercellular relationships and the interactions and synergy between features from different omics, leading to suboptimal clustering performance. In this paper, we propose a deep collaborative contrastive learning framework for matched single-cell multiomics data clustering, named scMDCL. This framework fully leverages intercell relationships while enhancing feature interactions among identical cells across different omics data, thereby facilitating efficient clustering of multiomics data. Specifically, to fully utilize the topological information between cells, a graph autoencoder and a feature information enhancement module are designed for different omics, enabling the extraction and augmentation of cell features. Additionally, contrastive learning techniques are employed to strengthen the interactions among the different omics features of the same cell. Ultimately, multiomics deep collaborative clustering modules are utilized to achieve single-cell multiomics clustering. Extensive experiments conducted on nine publicly available single-cell multiomics datasets demonstrate the superior performance of the proposed framework in integrating multiomics data for clustering tasks.
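The cross-omics contrastive step described in this abstract can be illustrated with a generic InfoNCE-style loss. This is only a minimal sketch under stated assumptions, not the scMDCL implementation: the function name `cross_omics_info_nce`, the cosine similarity, the temperature value, and the two-view setup are all hypothetical choices. The embedding of a cell in one omics layer is treated as the positive for the same cell in the other layer, and every other cell acts as a negative.

```python
from math import exp, log, sqrt


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))


def cross_omics_info_nce(z_a, z_b, temperature=0.5):
    """Mean InfoNCE loss; row i of z_a and row i of z_b are the same cell.

    Hypothetical helper for illustration only (not taken from scMDCL).
    """
    n = len(z_a)
    total = 0.0
    for i in range(n):
        logits = [cosine(z_a[i], z_b[j]) / temperature for j in range(n)]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = peak + log(sum(exp(x - peak) for x in logits))
        total += log_denominator - logits[i]  # -log softmax of the positive pair
    return total / n


aligned = cross_omics_info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = cross_omics_info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy case the loss is lower when the two views of each cell agree (`aligned`) than when they are mismatched (`shuffled`), which is the behaviour a contrastive objective of this kind rewards.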
Affiliation(s)
- Wenhao Wu
  - Qingdao Institute of Software, College of Computer Science and Technology, State Key Laboratory of Chemical Safety, Shandong Key Laboratory of Intelligent Oil & Gas Industrial Software, China University of Petroleum (East China), Qingdao 266580, China
- Shudong Wang
  - Qingdao Institute of Software, College of Computer Science and Technology, State Key Laboratory of Chemical Safety, Shandong Key Laboratory of Intelligent Oil & Gas Industrial Software, China University of Petroleum (East China), Qingdao 266580, China
- Kuijie Zhang
  - Qingdao Institute of Software, College of Computer Science and Technology, State Key Laboratory of Chemical Safety, Shandong Key Laboratory of Intelligent Oil & Gas Industrial Software, China University of Petroleum (East China), Qingdao 266580, China
- Hengxiao Li
  - Qingdao Institute of Software, College of Computer Science and Technology, State Key Laboratory of Chemical Safety, Shandong Key Laboratory of Intelligent Oil & Gas Industrial Software, China University of Petroleum (East China), Qingdao 266580, China
- Sibo Qiao
  - School of Software, Tiangong University, Tianjin 300387, China
- Yuanyuan Zhang
  - The College of Information and Control Engineering, Qingdao University of Technology, Qingdao, Shandong 266520, China
- Shanchen Pang
  - Qingdao Institute of Software, College of Computer Science and Technology, State Key Laboratory of Chemical Safety, Shandong Key Laboratory of Intelligent Oil & Gas Industrial Software, China University of Petroleum (East China), Qingdao 266580, China
4
Wang L, Zhang H, Yi B, Xie W, Yu K, Li W, Li K, Zhao D. FactVAE: a factorized variational autoencoder for single-cell multi-omics data integration analysis. Brief Bioinform 2025; 26:bbaf157. [PMID: 40211981 PMCID: PMC11986350 DOI: 10.1093/bib/bbaf157] [Received: 01/15/2025] [Revised: 03/02/2025] [Accepted: 03/21/2025] [Indexed: 04/14/2025] [Open Access]
Abstract
Single-cell multi-omics technologies have revolutionized the study of cell states and functions by simultaneously profiling multiple molecular layers within individual cells. However, existing methods for integrating these data struggle to preserve critical feature information and fail to exploit known regulatory knowledge, which is essential for understanding cell functions. This limitation hinders their ability to provide comprehensive and accurate insights into cells. Here, we propose FactVAE, an innovative factorized variational autoencoder designed for the robust and accurate understanding of single-cell multi-omics data. FactVAE integrates the factorization principle into the variational autoencoder framework, ensuring the preservation of feature information while leveraging the non-linear capture of sample information by neural networks. Additionally, known regulatory knowledge is incorporated during model training, and a knowledge transfer strategy is employed for cell embedding optimization and data augmentation. Comparative analyses of single-cell multi-omics datasets from different protocols and the spatial multi-omics dataset demonstrate that FactVAE not only outperforms benchmark methods in clustering performance but also generates augmented data that reveals the clearest cell-type-specific motif expression. Moreover, the feature embeddings captured by FactVAE enable the inference of potential and reliable gene regulatory relationships. Overall, FactVAE's superior performance and strong scalability make it a promising new solution for single-cell multi-omics data analysis.
Affiliation(s)
- Linjie Wang
  - School of Computer Science and Engineering, Northeastern University, 110819, Shenyang, China
- Huixia Zhang
  - School of Computer Science and Engineering, Northeastern University, 110819, Shenyang, China
- Bo Yi
  - School of Computer Science and Engineering, Northeastern University, 110819, Shenyang, China
- Weidong Xie
  - School of Computer Science and Engineering, Northeastern University, 110819, Shenyang, China
- Kun Yu
  - College of Medicine and Bioinformation Engineering, Northeastern University, 110819, Shenyang, China
- Wei Li
  - Key Laboratory of Intelligent Computing in Medical Image (MIIC), Northeastern University, 110000, Shenyang, China
  - National Frontiers Science Center for Industrial Intelligence and Systems Optimization, Northeastern University, 110819, Shenyang, China
- Keqin Li
  - Department of Computer Science, State University of New York, Albany, NY 12561, United States
- Dazhe Zhao
  - School of Computer Science and Engineering, Northeastern University, 110819, Shenyang, China
5
Liang DM, Du PF. scMUG: deep clustering analysis of single-cell RNA-seq data on multiple gene functional modules. Brief Bioinform 2025; 26:bbaf138. [PMID: 40188497 PMCID: PMC11972635 DOI: 10.1093/bib/bbaf138] [Received: 11/08/2024] [Revised: 02/11/2025] [Accepted: 03/09/2025] [Indexed: 04/08/2025] [Open Access]
Abstract
Single-cell RNA sequencing (scRNA-seq) has revolutionized our understanding of cellular heterogeneity by providing gene expression data at the single-cell level. Unlike bulk RNA-seq, scRNA-seq allows identification of different cell types within a given tissue, leading to a more nuanced comprehension of cell functions. However, the analysis of scRNA-seq data presents challenges due to its sparsity and high dimensionality. Since bioinformatics plays an important role in the analysis of big data and its utility for the welfare of living beings, it has been widely applied in analyzing scRNA-seq data. To address these challenges, we introduce the scMUG computational pipeline, which incorporates gene functional module information to enhance scRNA-seq clustering analysis. The pipeline includes data preprocessing, cell representation generation, cell-cell similarity matrix construction, and clustering analysis. The scMUG pipeline also introduces a novel similarity measure that combines local density and global distribution in the latent cell representation space. As far as we can tell, this is the first attempt to integrate gene functional associations into scRNA-seq clustering analysis. We curated nine human scRNA-seq datasets to evaluate our scMUG pipeline. With the help of gene functional information and the novel similarity measure, the clustering results from scMUG pipeline present deep insights into functional relationships between gene expression patterns and cellular heterogeneity. In addition, our scMUG pipeline also presents comparable or better clustering performances than other state-of-the-art methods. All source codes of scMUG have been deposited in a GitHub repository with instructions for reproducing all results (https://github.com/degiminnal/scMUG).
Affiliation(s)
- De-Min Liang
  - College of Intelligence and Computing, Tianjin University, Tianjin 300350, China
- Pu-Feng Du
  - College of Intelligence and Computing, Tianjin University, Tianjin 300350, China
6
Li B, Zhao Y, Hu J, Zhang S, Zhang X. scSAMAC: saliency-adjusted masking induced attention contrastive learning for single-cell clustering. Brief Bioinform 2025; 26:bbaf128. [PMID: 40131310 PMCID: PMC11934584 DOI: 10.1093/bib/bbaf128] [Received: 08/11/2024] [Revised: 02/23/2025] [Accepted: 03/01/2025] [Indexed: 03/26/2025] [Open Access]
Abstract
Single-cell sequencing technology has enabled researchers to study cellular heterogeneity at the cell level. To facilitate downstream analysis, clustering single-cell data into subgroups is essential. However, the high dimensionality, sparsity, and dropout events of the data make clustering challenging. Many deep learning methods have been proposed. Nevertheless, they either fail to fully utilize pairwise distance information between similar cells or do not adequately capture their feature correlations. They also cannot effectively handle high-dimensional sparse data. Therefore, they are not suitable for high-fidelity clustering, leading to difficulties in resolving the clear cell types required for downstream analysis. The proposed scSAMAC method integrates contrastive learning and negative binomial losses into a variational autoencoder, extracting features via contrastive unit similarity while preserving the intrinsic characteristics. This enhances robustness and generalization during clustering. In the contrastive learning, it constructs a mask module by adopting a negative sample generation method with gene feature saliency adjustment, which selects features more influential in the clustering phase and simulates data missing events. Additionally, it develops a novel loss, which consists of a soft k-means loss, a Wasserstein distance, and a contrastive loss. This fully utilizes data information and improves clustering performance. Furthermore, a multi-head attention mechanism module is applied to the latent variables at each layer of the autoencoder to enhance feature correlation, integration, and information repair. Experimental results demonstrate that scSAMAC outperforms several state-of-the-art clustering methods.
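The negative binomial loss mentioned in this abstract is commonly written in the mean and dispersion parameterisation, where the variance is mean + mean²/dispersion. The sketch below assumes that parameterisation and a per-entry average; the function names are hypothetical and the abstract does not confirm that scSAMAC uses exactly this form.

```python
from math import exp, lgamma, log


def nb_negative_log_likelihood(count, mean, dispersion):
    """Negative log-likelihood of one count under NB(mean, dispersion).

    Assumes mean > 0 and dispersion > 0 (hypothetical helper, for illustration).
    """
    log_prob = (
        lgamma(count + dispersion) - lgamma(dispersion) - lgamma(count + 1)
        + dispersion * log(dispersion / (dispersion + mean))
        + count * log(mean / (dispersion + mean))
    )
    return -log_prob


def nb_loss(counts, means, dispersions):
    """Average negative log-likelihood over matching entries of three flat lists."""
    terms = [nb_negative_log_likelihood(c, m, d)
             for c, m, d in zip(counts, means, dispersions)]
    return sum(terms) / len(terms)


# Probabilities implied by the loss should sum to (approximately) one.
total_probability = sum(exp(-nb_negative_log_likelihood(x, 3.0, 2.0)) for x in range(400))
```

In an autoencoder setting, the decoder would typically output `means` and `dispersions` for each gene, and the observed counts would be scored against them with `nb_loss`.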
Affiliation(s)
- Bo Li
  - School of Computer Science and Technology, Wuhan University of Science and Technology, Huangjiahu west road 2#, Wuhan 430065, China
  - Hubei Province Key Laboratory of Intelligent Information Processing and Real-time Industrial System, Wuhan University of Science and Technology, Huangjiahu west road 2#, Wuhan 430065, China
- Yongkang Zhao
  - School of Computer Science and Technology, Wuhan University of Science and Technology, Huangjiahu west road 2#, Wuhan 430065, China
  - Hubei Province Key Laboratory of Intelligent Information Processing and Real-time Industrial System, Wuhan University of Science and Technology, Huangjiahu west road 2#, Wuhan 430065, China
- Jing Hu
  - School of Computer Science and Technology, Wuhan University of Science and Technology, Huangjiahu west road 2#, Wuhan 430065, China
  - Hubei Province Key Laboratory of Intelligent Information Processing and Real-time Industrial System, Wuhan University of Science and Technology, Huangjiahu west road 2#, Wuhan 430065, China
- Shihua Zhang
  - College of Computer Science, South-Central Minzu University, 182# Minyuan road, Hongshan District, Wuhan 430074, China
- Xiaolong Zhang
  - School of Computer Science and Technology, Wuhan University of Science and Technology, Huangjiahu west road 2#, Wuhan 430065, China
  - Hubei Province Key Laboratory of Intelligent Information Processing and Real-time Industrial System, Wuhan University of Science and Technology, Huangjiahu west road 2#, Wuhan 430065, China
7
Tang B, Chen Y. scFTAT: a novel cell annotation method integrating FFT and transformer. BMC Bioinformatics 2025; 26:62. [PMID: 39994539 PMCID: PMC11853718 DOI: 10.1186/s12859-025-06061-z] [Received: 10/07/2024] [Accepted: 01/22/2025] [Indexed: 02/26/2025] [Open Access]
Abstract
BACKGROUND Advancements in high-throughput sequencing and deep learning have boosted single-cell RNA studies. However, current methods for annotating single-cell data face challenges due to high data sparsity and tedious manual annotation on large-scale data. RESULTS Thus, we proposed a novel annotation model integrating FFT (Fast Fourier Transform) and an enhanced Transformer, named scFTAT. Initially, it reduces data sparsity using LDA (Linear Discriminant Analysis). Subsequently, automatic cell annotation is achieved through a proposed module integrating FFT and an enhanced Transformer. Moreover, the model is fine-tuned to improve training performance by incorporating techniques such as kernel approximation, position encoding enhancement, and attention enhancement modules. Compared to existing popular annotation tools, scFTAT maintains high accuracy and robustness on six typical datasets. Specifically, the model achieves an accuracy of 0.93 on the human kidney data, with an F1 score of 0.84, precision of 0.96, recall rate of 0.80, and Matthews correlation coefficient of 0.89. The highest accuracy of the compared methods is 0.92, with an F1 score of 0.71, precision of 0.75, recall rate of 0.73, and Matthews correlation coefficient of 0.85. The compiled codes and supplements are available at: https://github.com/gladex/scFTAT. CONCLUSION In summary, the proposed scFTAT effectively integrates FFT and enhanced Transformer for automatic feature learning, addressing the challenges of high sparsity and tedious manual annotation in single-cell profiling data. Experiments on six typical scRNA-seq datasets from human and mouse tissues evaluate the model using five metrics: accuracy, F1 score, precision, recall, and Matthews correlation coefficient. Performance comparisons with existing methods further demonstrate the efficiency and robustness of our proposed method.
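The five metrics named in this abstract can be computed from true and predicted labels as sketched below. The averaging scheme is an assumption: the abstract does not say how per-class scores are combined, so macro-averaging is used here for illustration, and the function name `classification_metrics` is hypothetical. The Matthews correlation coefficient uses the standard multiclass generalisation of the binary formula.

```python
from collections import Counter
from math import sqrt


def classification_metrics(y_true, y_pred):
    """Accuracy, macro precision/recall/F1, and multiclass MCC from two label lists."""
    labels = sorted(set(y_true) | set(y_pred))
    n = len(y_true)
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    true_counts = Counter(y_true)  # how often each class truly occurs
    pred_counts = Counter(y_pred)  # how often each class is predicted
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    precisions, recalls, f1s = [], [], []
    for k in labels:
        prec = hits[k] / pred_counts[k] if pred_counts[k] else 0.0
        rec = hits[k] / true_counts[k] if true_counts[k] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)

    # Multiclass MCC: covariance of true and predicted labels over its maximum.
    numerator = correct * n - sum(pred_counts[k] * true_counts[k] for k in labels)
    denominator = sqrt(
        (n * n - sum(pred_counts[k] ** 2 for k in labels))
        * (n * n - sum(true_counts[k] ** 2 for k in labels))
    )
    return {
        "accuracy": correct / n,
        "macro_precision": sum(precisions) / len(labels),
        "macro_recall": sum(recalls) / len(labels),
        "macro_f1": sum(f1s) / len(labels),
        "mcc": numerator / denominator if denominator else 0.0,
    }


m = classification_metrics(["T", "T", "T", "B", "B", "NK"],
                           ["T", "T", "B", "B", "B", "T"])
```

Under macro-averaging, the F1 score is the mean of per-class F1 values, so it need not equal the harmonic mean of the averaged precision and recall. In the toy example above the two differ (about 0.489 versus about 0.494), which is one possible reason a reported F1 can sit below the harmonic mean of the reported precision and recall.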
Affiliation(s)
- Binhua Tang
  - College of Information Science and Engineering, Hohai University, Jiangsu, 213200, China
  - Key Laboratory of Maritime Intelligent Cyberspace Technology (Hohai University), Ministry of Education, Jiangsu, 213200, China
  - BGI Research, Changzhou, 213299, Jiangsu, China
- Yiyao Chen
  - College of Information Science and Engineering, Hohai University, Jiangsu, 213200, China
  - Key Laboratory of Maritime Intelligent Cyberspace Technology (Hohai University), Ministry of Education, Jiangsu, 213200, China
8
Zhang Y, Feng X, Wang Y, Shi K. Deep learning powered single-cell clustering framework with enhanced accuracy and stability. Sci Rep 2025; 15:4107. [PMID: 39900656 PMCID: PMC11791198 DOI: 10.1038/s41598-025-87672-7] [Received: 09/18/2024] [Accepted: 01/21/2025] [Indexed: 02/05/2025] [Open Access]
Abstract
Single-cell RNA sequencing (scRNA-seq) has revolutionized the field of cellular diversity research. Unsupervised clustering, a key technique in this exploration, allows for the identification of distinct cell types within a population. Graph-based deep clustering methods have shown promise in preserving the structural relationships between cells (nodes) within the data. However, these methods often neglect the inherent distribution of nodes in the graph, leading to incomplete representations of cell populations. Additionally, conventional graph convolutional networks (GCNs) can suffer from oversmoothing, a phenomenon where the network loses the ability to differentiate between samples with similar expression profiles. To address these limitations, we proposed scG-cluster, an innovative deep structural clustering method. This method incorporates two key innovations: (1) Dual-topology adjacency graph: scG-cluster integrates information about node distribution into the traditional adjacency graph used by GCNs. This enriches the graph representation by capturing the spatial relationships between cells in addition to their pairwise similarities. (2) Dual-topology adaptive graph convolutional network (TAGCN): The framework employs a TAGCN architecture with residual concatenation. This network utilizes an attention mechanism to dynamically weight features within the graph, focusing on the most informative aspects for clustering. Additionally, residual connections are implemented to combat oversmoothing, ensuring the network retains the ability to distinguish between subtle differences in cell expression profiles. Furthermore, scG-cluster iteratively refines the clustering centers, leading to enhanced stability and accuracy in the final cluster assignments. Extensive evaluations on six diverse scRNA-seq datasets demonstrate that scG-cluster consistently outperforms existing state-of-the-art methods in terms of both clustering accuracy and scalability. 
Ablation studies are also conducted to validate the significant contributions of both the residual connections and the attention mechanism to the overall performance of the model. The source code for scG-cluster is publicly available at https://github.com/xixi-wq/scG-cluster .
Affiliation(s)
- Yi Zhang
  - Guilin University of Technology, Guilin, 541004, China
  - Guangxi Key Laboratory of Embedded Technology and Intelligent System, Guilin University of Technology, Guilin, 541004, China
- Xi Feng
  - Guilin University of Technology, Guilin, 541004, China
  - Guangxi Key Laboratory of Embedded Technology and Intelligent System, Guilin University of Technology, Guilin, 541004, China
- Yin Wang
  - Guilin University of Technology, Guilin, 541004, China
  - Guangxi Key Laboratory of Embedded Technology and Intelligent System, Guilin University of Technology, Guilin, 541004, China
- Kai Shi
  - Guilin University of Technology, Guilin, 541004, China
  - Guangxi Key Laboratory of Embedded Technology and Intelligent System, Guilin University of Technology, Guilin, 541004, China
9
Hozumi Y, Wei GW. Analyzing scRNA-seq data by CCP-assisted UMAP and tSNE. PLoS One 2024; 19:e0311791. [PMID: 39671349 PMCID: PMC11642954 DOI: 10.1371/journal.pone.0311791] [Received: 02/20/2024] [Accepted: 09/24/2024] [Indexed: 12/15/2024] [Open Access]
Abstract
Single-cell RNA sequencing (scRNA-seq) is widely used to reveal heterogeneity in cells, which has given us insights into cell-cell communication, cell differentiation, and differential gene expression. However, analyzing scRNA-seq data is a challenge due to sparsity and the large number of genes involved. Therefore, dimensionality reduction and feature selection are important for removing spurious signals and enhancing downstream analysis. Correlated clustering and projection (CCP) was recently introduced as an effective method for preprocessing scRNA-seq data. CCP utilizes gene-gene correlations to partition the genes and, based on the partition, employs cell-cell interactions to obtain super-genes. Because CCP is a data-domain approach that does not require matrix diagonalization, it can be used in many downstream machine learning tasks. In this work, we utilize CCP as an initialization tool for uniform manifold approximation and projection (UMAP) and t-distributed stochastic neighbor embedding (tSNE). Using 21 publicly available datasets, we have found that CCP significantly improves UMAP and tSNE visualization and dramatically improves their accuracy. More specifically, CCP improves UMAP by 22% in ARI, 14% in NMI and 15% in ECM, and improves tSNE by 11% in ARI, 9% in NMI and 8% in ECM.
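Two of the clustering scores reported in this abstract, ARI and NMI, have standard definitions that can be sketched directly from label lists. The code below is a minimal stdlib illustration, not the evaluation code used in the paper. The NMI normalisation (arithmetic mean of the two entropies) is an assumption, since several variants exist, and ECM is left out because the abstract does not define it.

```python
from collections import Counter
from math import comb, log


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two partitions of the same items."""
    n = len(labels_a)
    if n < 2:
        return 1.0
    same_in_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    same_in_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    same_in_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = same_in_a * same_in_b / comb(n, 2)
    maximum = (same_in_a + same_in_b) / 2
    if maximum == expected:  # degenerate case, e.g. one cluster in both partitions
        return 1.0
    return (same_in_both - expected) / (maximum - expected)


def normalized_mutual_info(labels_a, labels_b):
    """Mutual information divided by the arithmetic mean of the two entropies."""
    n = len(labels_a)
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    entropy_a = -sum(c / n * log(c / n) for c in count_a.values())
    entropy_b = -sum(c / n * log(c / n) for c in count_b.values())
    mutual_info = sum(c / n * log(c * n / (count_a[x] * count_b[y]))
                      for (x, y), c in joint.items())
    mean_entropy = (entropy_a + entropy_b) / 2
    return mutual_info / mean_entropy if mean_entropy > 0 else 1.0


same_up_to_renaming = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
unrelated = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

Both scores ignore the cluster names themselves, so a clustering that matches the reference up to relabelling receives the maximum score.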
Affiliation(s)
- Yuta Hozumi
  - Department of Mathematics, Michigan State University, East Lansing, Michigan, United States of America
- Guo-Wei Wei
  - Department of Mathematics, Michigan State University, East Lansing, Michigan, United States of America
  - Department of Electrical and Computer Engineering, Michigan State University, East Lansing, Michigan, United States of America
  - Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, Michigan, United States of America
10
Tian S, Ji C, Ni J, Wang Y, Zheng C. Using Multi-Encoder Semi-Implicit Graph Variational Autoencoder to Analyze Single-Cell RNA Sequencing Data. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2024; 21:2280-2291. [PMID: 39255084 DOI: 10.1109/tcbb.2024.3458170] [Indexed: 09/12/2024]
Abstract
Rapid advances in single-cell RNA sequencing (scRNA-seq) have made it possible to characterize cell states at high resolution across large-scale libraries. scRNA-seq data contain a great deal of biological information, which can be used mainly to discover cell subtypes and track cell development. However, traditional methods face many challenges in handling scRNA-seq data with high dimensionality and high sparsity. For better analysis of scRNA-seq data, we propose a new framework called MSVGAE based on variational graph auto-encoders and graph attention networks. Specifically, we introduce multiple encoders to learn features at different scales and control for uninformative features. Moreover, different noises are added to the encoders to promote the propagation of graph structural information and distribution uncertainty. Therefore, some complex posterior distributions can be captured by our model. MSVGAE maps scRNA-seq data with high dimensionality and high noise into a low-dimensional latent space, which is beneficial for downstream tasks. In particular, MSVGAE can handle extremely sparse data. Before the experiments, we create 24 simulated datasets to simulate various biological scenarios and collect 8 real-world datasets. The experimental results of clustering, visualization and marker gene analysis indicate that the MSVGAE model has excellent accuracy and robustness in analyzing scRNA-seq data.
11
Shu Z, Xia M, Tan K, Zhang Y, Yu Z. Multi-level multi-view network based on structural contrastive learning for scRNA-seq data clustering. Brief Bioinform 2024; 25:bbae562. [PMID: 39494609 PMCID: PMC11532661 DOI: 10.1093/bib/bbae562] [Received: 05/30/2024] [Revised: 09/23/2024] [Accepted: 10/18/2024] [Indexed: 11/05/2024] [Open Access]
Abstract
Clustering plays a crucial role in analyzing scRNA-seq data and has been widely used in studying cellular distribution over the past few years. However, the high dimensionality and complexity of scRNA-seq data pose significant challenges to achieving accurate clustering from a single perspective. To address these challenges, we propose a novel approach, called multi-level multi-view network based on structural consistency contrastive learning (scMMN), for scRNA-seq data clustering. First, the proposed method constructs shallow views through the $k$-nearest neighbor ($k$NN) and diffusion mapping (DM) algorithms, and then deep views are generated by applying graph Laplacian filters. These deep multi-view data serve as the input for representation learning. To improve the clustering performance on scRNA-seq data, contrastive learning is introduced to enhance the discrimination ability of our network. Specifically, we construct a group contrastive loss for representation features and a structural consistency contrastive loss for structural relationships. Extensive experiments on eight real scRNA-seq datasets show that the proposed method outperforms other state-of-the-art methods in scRNA-seq data clustering tasks. Our source code is available at https://github.com/szq0816/scMMN.
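The two-stage view construction in this abstract, a $k$NN graph followed by a graph Laplacian filter, can be sketched in a few lines. This is an illustration under stated assumptions rather than the scMMN code: the function names are hypothetical, Euclidean distance is assumed for the $k$NN step, and the filter shown is one common low-pass choice (repeated multiplication by the symmetrically normalised adjacency with self-loops), which may differ from the filter the authors use.

```python
from math import dist, sqrt


def knn_adjacency(points, k):
    """Symmetric 0/1 adjacency: i and j are linked if either is in the other's k nearest."""
    n = len(points)
    adjacency = [[0.0] * n for _ in range(n)]
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: dist(points[i], points[j]))
        for j in others[:k]:
            adjacency[i][j] = adjacency[j][i] = 1.0
    return adjacency


def laplacian_filter(adjacency, features, steps=1):
    """Smooth features with (I - L)^steps, where I - L = D^-1/2 (A + I) D^-1/2."""
    n = len(adjacency)
    with_loops = [[adjacency[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
                  for i in range(n)]
    degree = [sum(row) for row in with_loops]
    normalised = [[with_loops[i][j] / sqrt(degree[i] * degree[j]) for j in range(n)]
                  for i in range(n)]
    smoothed = [list(row) for row in features]
    for _ in range(steps):
        smoothed = [[sum(normalised[i][m] * smoothed[m][f] for m in range(n))
                     for f in range(len(smoothed[0]))]
                    for i in range(n)]
    return smoothed


cells = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
graph = knn_adjacency(cells, k=1)
deep_view = laplacian_filter(graph, [[1.0], [3.0], [10.0], [20.0]], steps=1)
```

In this toy example the filter averages each feature over a cell and its graph neighbour, so cells within the same neighbourhood end up with more similar representations while the two groups stay apart.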
Collapse
Affiliation(s)
- Zhenqiu Shu
- Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Chenggong, 650500, Yunnan, China
| | - Min Xia
- Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Chenggong, 650500, Yunnan, China
| | - Kaiwen Tan
- Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Chenggong, 650500, Yunnan, China
| | - Yongbing Zhang
- Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Chenggong, 650500, Yunnan, China
| | - Zhengtao Yu
- Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Chenggong, 650500, Yunnan, China
| |
|
12
|
Liu T, Jia C, Bi Y, Guo X, Zou Q, Li F. scDFN: enhancing single-cell RNA-seq clustering with deep fusion networks. Brief Bioinform 2024; 25:bbae486. [PMID: 39373051 PMCID: PMC11456827 DOI: 10.1093/bib/bbae486] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/29/2024] [Revised: 08/07/2024] [Accepted: 09/17/2024] [Indexed: 10/08/2024] Open
Abstract
Single-cell ribonucleic acid sequencing (scRNA-seq) technology can be used to perform high-resolution analysis of the transcriptomes of individual cells. Therefore, its application has gained popularity for accurately analyzing the ever-increasing content of heterogeneous single-cell datasets. Central to interpreting scRNA-seq data is the clustering of cells to decipher transcriptomic diversity and infer cell behavior patterns. However, its complexity necessitates the application of advanced methodologies capable of resolving the inherent heterogeneity and limited gene expression characteristics of single-cell data. Herein, we introduce a novel deep learning-based algorithm for single-cell clustering, designated scDFN, which can significantly enhance the clustering of scRNA-seq data through a fusion network strategy. The scDFN algorithm applies a dual mechanism involving an autoencoder to extract attribute information and an improved graph autoencoder to capture topological nuances, integrated via a cross-network information fusion mechanism complemented by a triple self-supervision strategy. This fusion is optimized through a holistic consideration of four distinct loss functions. A comparative analysis with five leading scRNA-seq clustering methodologies across multiple datasets revealed the superiority of scDFN, as determined by higher Normalized Mutual Information (NMI) and Adjusted Rand Index (ARI) scores. Additionally, scDFN demonstrated robust multi-cluster dataset performance and exceptional resilience to batch effects. Ablation studies highlighted the key roles of the autoencoder and the improved graph autoencoder components, along with the critical contribution of the four joint loss functions to the overall efficacy of the algorithm. Through these advancements, scDFN sets a new benchmark in single-cell clustering and can be used as an effective tool for the nuanced analysis of single-cell transcriptomics.
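For readers unfamiliar with the two evaluation metrics named in this abstract, the sketch below computes ARI and NMI from scratch using only the Python standard library. The function names are illustrative, and the NMI normalization (geometric mean of the two entropies) is an assumption: several conventions exist, and the abstract does not say which one the authors used.

```python
from collections import Counter
from math import comb, log, sqrt

def adjusted_rand_index(true, pred):
    """Adjusted Rand Index between two labelings of the same cells (needs >= 2 cells)."""
    n = len(true)
    sum_ij = sum(comb(c, 2) for c in Counter(zip(true, pred)).values())
    sum_a = sum(comb(c, 2) for c in Counter(true).values())
    sum_b = sum(comb(c, 2) for c in Counter(pred).values())
    expected = sum_a * sum_b / comb(n, 2)   # chance-level agreement
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:               # degenerate case, e.g. one cluster each
        return 1.0
    return (sum_ij - expected) / (max_index - expected)

def normalized_mutual_info(true, pred):
    """NMI with geometric-mean normalization (an assumed convention)."""
    n = len(true)
    pairs, a, b = Counter(zip(true, pred)), Counter(true), Counter(pred)
    mi = sum(c / n * log(c * n / (a[t] * b[p])) for (t, p), c in pairs.items())
    h_a = -sum(c / n * log(c / n) for c in a.values())
    h_b = -sum(c / n * log(c / n) for c in b.values())
    if h_a == 0 or h_b == 0:
        return 1.0 if h_a == h_b else 0.0
    return mi / sqrt(h_a * h_b)

# Both metrics ignore the label names: a pure relabeling scores 1.
perfect = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])   # 1.0
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])   # -0.5
```

ARI can be negative when agreement is worse than chance, whereas NMI is bounded below by 0.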
Affiliation(s)
- Tianxiang Liu
- School of Science, Dalian Maritime University, 1 Linghai Road, Dalian 116026, China
| | - Cangzhi Jia
- School of Science, Dalian Maritime University, 1 Linghai Road, Dalian 116026, China
| | - Yue Bi
- Department of Biochemistry and Molecular Biology, Biomedicine Discovery Institute, Monash University, Melbourne, VIC 3800, Australia
| | - Xudong Guo
- College of Information Engineering, Northwest A&F University, No. 3 Taicheng Road, Yangling, Shaanxi, China
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, No. 2006, Xiyuan Ave, West Hi-Tech Zone, 611731, Chengdu, Sichuan, China
| | - Fuyi Li
- College of Information Engineering, Northwest A&F University, No. 3 Taicheng Road, Yangling, Shaanxi, China
- South Australian Immunogenomics Cancer Institute, The University of Adelaide, 4 North Terrace, SA 5000, Australia
| |
|
13
|
Yao Z, Li B, Lu Y, Yau ST. Single-cell analysis via manifold fitting: A framework for RNA clustering and beyond. Proc Natl Acad Sci U S A 2024; 121:e2400002121. [PMID: 39226348 PMCID: PMC11406302 DOI: 10.1073/pnas.2400002121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/29/2024] [Accepted: 07/19/2024] [Indexed: 09/05/2024] Open
Abstract
Single-cell RNA sequencing (scRNA-seq) data, susceptible to noise arising from biological variability and technical errors, can distort gene expression analysis and impact cell similarity assessments, particularly in heterogeneous populations. Current methods, including deep learning approaches, often struggle to accurately characterize cell relationships due to this inherent noise. To address these challenges, we introduce scAMF (Single-cell Analysis via Manifold Fitting), a framework designed to enhance clustering accuracy and data visualization in scRNA-seq studies. At the heart of scAMF lies the manifold fitting module, which effectively denoises scRNA-seq data by unfolding their distribution in the ambient space. This unfolding aligns the gene expression vector of each cell more closely with its underlying structure, bringing it spatially closer to other cells of the same cell type. To comprehensively assess the impact of scAMF, we compile a collection of 25 publicly available scRNA-seq datasets spanning various sequencing platforms, species, and organ types, forming an extensive RNA data bank. In our comparative studies, benchmarking scAMF against existing scRNA-seq analysis algorithms in this data bank, we consistently observe that scAMF outperforms in terms of clustering efficiency and data visualization clarity. Further experimental analysis reveals that this enhanced performance stems from scAMF's ability to improve the spatial distribution of the data and capture class-consistent neighborhoods. These findings underscore the promising application potential of manifold fitting as a tool in scRNA-seq analysis, signaling a significant enhancement in the precision and reliability of data interpretation in this critical field of study.
Affiliation(s)
- Zhigang Yao
- Department of Statistics and Data Science, National University of Singapore, Singapore 117546, Republic of Singapore
| | - Bingjie Li
- Department of Statistics and Data Science, National University of Singapore, Singapore 117546, Republic of Singapore
| | - Yukun Lu
- Department of Statistics and Data Science, National University of Singapore, Singapore 117546, Republic of Singapore
| | - Shing-Tung Yau
- Yau Mathematical Sciences Center, Jingzhai, Tsinghua University, Beijing 100084, China
| |
|
14
|
Gao H, Shen W, Li R, Liu C, Wu S. Collaborative Structure-Preserved Missing Data Imputation for Single-Cell RNA-Seq Clustering. IEEE/ACM Trans Comput Biol Bioinform 2024; 21:1480-1491. [PMID: 38776196 DOI: 10.1109/tcbb.2024.3404013] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/24/2024]
Abstract
Clustering of single-cell RNA-seq (scRNA-seq) transcriptome profiles can identify cell types, which helps improve the understanding of disease progression. However, in practice, single-cell expression data often contain a significant number of missing values as a result of technical variability. Missing data is a critical challenge in scRNA-seq clustering analysis, since an unknown value does not reflect the underlying true expression level and makes it difficult to discover cell types by applying clustering algorithms directly. Various approaches have been developed to overcome the missing data issue in scRNA-seq clustering. Most of them recover missing expression values by borrowing observed data from similar cells or by synthesizing data via generative adversarial networks. As a result, the biologically meaningful cluster structure has not been sufficiently exploited. In this work, we introduce ColImpute, a collaborative structure-preserved missing data imputation approach for scRNA-seq clustering. Specifically, a cluster structure-preserved imputation module and a subspace clustering module, which respectively perform missing data imputation and cell subtype identification, are integrated into a unified optimization framework to train the two networks in a collaborative manner. Consequently, the clustering module effectively contributes cluster-structure information to guide the training process of the missing data imputation module. Simultaneously, the cluster structure-preserved imputation module reciprocally enhances the performance of the clustering module by generating more precise recovered samples. Promising experimental results show that the proposed method is effective for both data imputation and cell type identification.
|
15
|
Qin L, Zhang G, Zhang S, Chen Y. Deep Batch Integration and Denoise of Single-Cell RNA-Seq Data. Adv Sci (Weinh) 2024; 11:e2308934. [PMID: 38778573 PMCID: PMC11304254 DOI: 10.1002/advs.202308934] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/20/2023] [Revised: 03/14/2024] [Indexed: 05/25/2024]
Abstract
Numerous single-cell transcriptomic datasets from identical tissues or cell lines are generated from different laboratories or single-cell RNA sequencing (scRNA-seq) protocols. The denoising of these datasets to eliminate batch effects is crucial for data integration, ensuring accurate interpretation and comprehensive analysis of biological questions. Although many scRNA-seq data integration methods exist, most are inefficient and/or not conducive to downstream analysis. Here, DeepBID, a novel deep learning-based method for batch effect correction, non-linear dimensionality reduction, embedding, and cell clustering concurrently, is introduced. DeepBID utilizes a negative binomial-based autoencoder with dual Kullback-Leibler divergence loss functions, aligning cell points from different batches within a consistent low-dimensional latent space and progressively mitigating batch effects through iterative clustering. Extensive validation on multiple-batch scRNA-seq datasets demonstrates that DeepBID surpasses existing tools in removing batch effects and achieving superior clustering accuracy. When integrating multiple scRNA-seq datasets from patients with Alzheimer's disease, DeepBID significantly improves cell clustering, effectively annotating unidentified cells, and detecting cell-specific differentially expressed genes.
Affiliation(s)
- Lu Qin
- College of Computer and Information Engineering, Tianjin Normal University, Tianjin 300387, China
| | - Guangya Zhang
- College of Computer and Information Engineering, Tianjin Normal University, Tianjin 300387, China
| | - Shaoqiang Zhang
- College of Computer and Information Engineering, Tianjin Normal University, Tianjin 300387, China
| | - Yong Chen
- Department of Biological and Biomedical Sciences, Rowan University, NJ 08028, USA
| |
|
16
|
Alsaggaf I, Buchan D, Wan C. Improving cell type identification with Gaussian noise-augmented single-cell RNA-seq contrastive learning. Brief Funct Genomics 2024; 23:441-451. [PMID: 38242863 DOI: 10.1093/bfgp/elad059] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/01/2023] [Revised: 12/14/2023] [Accepted: 12/18/2023] [Indexed: 01/21/2024] Open
Abstract
Cell type identification is an important task for single-cell RNA-sequencing (scRNA-seq) data analysis. Many prediction methods have recently been proposed, but the predictive accuracy on difficult cell type identification tasks is still low. In this work, we propose a novel Gaussian noise augmentation-based scRNA-seq contrastive learning method (GsRCL) to learn discriminative feature representations for cell type identification tasks. A large-scale computational evaluation suggests that GsRCL successfully outperformed other state-of-the-art predictive methods on difficult cell type identification tasks, while the conventional random gene masking augmentation-based contrastive learning method also improved the accuracy of easy cell type identification tasks in general.
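The two augmentation strategies contrasted in this abstract can be illustrated with a short sketch. It only shows how two "views" of one expression profile might be generated for contrastive learning; the function names, the noise scale `sigma`, and the masking probability `mask_prob` are assumptions for illustration and are not taken from the GsRCL paper.

```python
import random

def gaussian_views(x, sigma=0.1, rng=None):
    """Return two noisy copies of expression vector x (Gaussian noise augmentation)."""
    rng = rng or random.Random()
    view_a = [v + rng.gauss(0.0, sigma) for v in x]
    view_b = [v + rng.gauss(0.0, sigma) for v in x]
    return view_a, view_b

def random_gene_mask(x, mask_prob=0.2, rng=None):
    """Return a copy of x with each gene set to zero with probability mask_prob."""
    rng = rng or random.Random()
    return [0.0 if rng.random() < mask_prob else v for v in x]

# A toy normalized expression profile for one cell.
cell = [0.0, 1.2, 3.4, 0.0, 0.7]
view_a, view_b = gaussian_views(cell, sigma=0.1, rng=random.Random(0))
masked = random_gene_mask(cell, mask_prob=0.2, rng=random.Random(0))
```

In a contrastive setup, the two views of the same cell form a positive pair, and views of different cells act as negatives.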
Affiliation(s)
- Ibrahim Alsaggaf
- School of Computing and Mathematical Sciences, Birkbeck, University of London, Malet Street, WC1E 7HX, London, United Kingdom
| | - Daniel Buchan
- Department of Computer Science, University College London, Gower Street, WC1E 6BT, London, United Kingdom
| | - Cen Wan
- School of Computing and Mathematical Sciences, Birkbeck, University of London, Malet Street, WC1E 7HX, London, United Kingdom
| |
|
17
|
Xie J, Ruan S, Tu M, Yuan Z, Hu J, Li H, Li S. Clustering single-cell RNA sequencing data via iterative smoothing and self-supervised discriminative embedding. Oncogene 2024; 43:2279-2292. [PMID: 38834657 DOI: 10.1038/s41388-024-03074-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/21/2024] [Revised: 05/22/2024] [Accepted: 05/28/2024] [Indexed: 06/06/2024]
Abstract
Single-cell transcriptome sequencing (scRNA-seq) is a high-throughput technique used to study gene expression at the single-cell level. Clustering analysis is a commonly used method in scRNA-seq data analysis, helping researchers identify cell types and uncover interactions between cells. However, the choice of a robust similarity metric in the clustering procedure is still an open challenge due to the complex underlying structures of the data and the inherent noise in data acquisition. Here, we propose a deep clustering method for scRNA-seq data called scRISE (scRNA-seq Iterative Smoothing and self-supervised discriminative Embedding model) to resolve this challenge. The model consists of two main modules: an iterative smoothing module based on graph autoencoders designed to denoise the data and refine the pairwise similarity in turn to gradually incorporate cell structural features and enrich the data information; and a self-supervised discriminative embedding module with adaptive similarity threshold for partitioning samples into correct clusters. Our approach has shown improved quality of data representation and clustering on seventeen scRNA-seq datasets against a number of state-of-the-art deep learning clustering methods. Furthermore, utilizing the scRISE method in biological analysis against the HNSCC dataset has unveiled 62 informative genes, highlighting their potential roles as therapeutic targets and biomarkers.
Affiliation(s)
- Jinxin Xie
- Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, Shanghai, 200237, China
| | - Shanshan Ruan
- Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, Shanghai, 200237, China
| | - Mingyan Tu
- Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, Shanghai, 200237, China
| | - Zhen Yuan
- Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, Shanghai, 200237, China
| | - Jianguo Hu
- Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, Shanghai, 200237, China
| | - Honglin Li
- Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, Shanghai, 200237, China.
- Innovation Center for AI and Drug Discovery, School of Pharmacy, East China Normal University, Shanghai, 200062, China.
- Lingang Laboratory, Shanghai, 200031, China.
| | - Shiliang Li
- Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, Shanghai, 200237, China.
- Innovation Center for AI and Drug Discovery, School of Pharmacy, East China Normal University, Shanghai, 200062, China.
| |
|
18
|
Xiong J, Gong F, Ma L, Wan L. scVIC: deep generative modeling of heterogeneity for scRNA-seq data. Bioinform Adv 2024; 4:vbae086. [PMID: 39027640 PMCID: PMC11256938 DOI: 10.1093/bioadv/vbae086] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/01/2024] [Revised: 05/15/2024] [Accepted: 06/12/2024] [Indexed: 07/20/2024]
Abstract
Motivation Single-cell RNA sequencing (scRNA-seq) has become a valuable tool for studying cellular heterogeneity. However, the analysis of scRNA-seq data is challenging because of inherent noise and technical variability. Existing methods often struggle to simultaneously explore heterogeneity across cells, handle dropout events, and account for batch effects. These drawbacks call for a robust and comprehensive method that can address these challenges and provide accurate insights into heterogeneity at the single-cell level. Results In this study, we introduce scVIC, an algorithm designed to account for variational inference, while simultaneously handling biological heterogeneity and batch effects at the single-cell level. scVIC explicitly models both biological heterogeneity and technical variability to learn cellular heterogeneity in a manner free from dropout events and the bias of batch effects. By leveraging variational inference, we provide a robust framework for inferring the parameters of scVIC. To test the performance of scVIC, we employed both simulated and biological scRNA-seq datasets, either including, or not, batch effects. scVIC was found to outperform other approaches because of its superior clustering ability and circumvention of the batch effects problem. Availability and implementation The code of scVIC and replication for this study are available at https://github.com/HiBearME/scVIC/tree/v1.0.
Affiliation(s)
- Jiankang Xiong
- National Center for Mathematics and Interdisciplinary Sciences, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China
- School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing 100049, China
| | - Fuzhou Gong
- National Center for Mathematics and Interdisciplinary Sciences, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China
- School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing 100049, China
| | - Liang Ma
- School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing 100049, China
- Key Laboratory of Zoological Systematics and Evolution, Institute of Zoology, Chinese Academy of Sciences, Beijing 100101, China
| | - Lin Wan
- National Center for Mathematics and Interdisciplinary Sciences, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China
- School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing 100049, China
| |
|
19
|
Zhang T, Ren J, Li L, Wu Z, Zhang Z, Dong G, Wang G. scZAG: Integrating ZINB-Based Autoencoder with Adaptive Data Augmentation Graph Contrastive Learning for scRNA-seq Clustering. Int J Mol Sci 2024; 25:5976. [PMID: 38892162 PMCID: PMC11172799 DOI: 10.3390/ijms25115976] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/24/2024] [Revised: 04/08/2024] [Accepted: 05/28/2024] [Indexed: 06/21/2024] Open
Abstract
Single-cell RNA sequencing (scRNA-seq) is widely used to interpret cellular states, detect cell subpopulations, and study disease mechanisms. In scRNA-seq data analysis, cell clustering is a key step that can identify cell types. However, scRNA-seq data are characterized by high dimensionality and significant sparsity, presenting considerable challenges for clustering. In the high-dimensional gene expression space, cells may form complex topological structures. Many conventional scRNA-seq data analysis methods focus on identifying cell subgroups rather than exploring these potential high-dimensional structures in detail. Although some methods have begun to consider the topological structures within the data, many still overlook the continuity and complex topology present in single-cell data. We propose a deep learning framework that begins by employing a zero-inflated negative binomial (ZINB) model to denoise the highly sparse and over-dispersed scRNA-seq data. Next, scZAG uses an adaptive graph contrastive representation learning approach that combines approximate personalized propagation of neural predictions graph convolution (APPNPGCN) with graph contrastive learning methods. By using APPNPGCN as the encoder for graph contrastive learning, we ensure that each cell's representation reflects not only its own features but also its position in the graph and its relationships with other cells. Graph contrastive learning exploits the relationships between nodes to capture the similarity among cells, better representing the data's underlying continuity and complex topology. Finally, the learned low-dimensional latent representations are clustered using Kullback-Leibler divergence. We validated the superior clustering performance of scZAG on 10 common scRNA-seq datasets in comparison to existing state-of-the-art clustering methods.
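The zero-inflated negative binomial (ZINB) model mentioned in this abstract has a compact likelihood that is easy to write down. The sketch below evaluates the per-count negative log-likelihood under the usual mean/dispersion parameterization; the parameter names `mu`, `theta`, and `pi` are the conventional ones, assumed here rather than taken from the scZAG code, and in an autoencoder these three values would be predicted per gene and per cell by the decoder.

```python
from math import exp, lgamma, log

def nb_log_pmf(x, mu, theta):
    """Log-probability of count x under a negative binomial with mean mu, dispersion theta."""
    return (lgamma(x + theta) - lgamma(theta) - lgamma(x + 1)
            + theta * log(theta / (theta + mu))
            + x * log(mu / (theta + mu)))

def zinb_nll(x, mu, theta, pi):
    """Negative log-likelihood of count x under ZINB with dropout probability pi."""
    if x == 0:
        # A zero is either a dropout (prob. pi) or a genuine NB zero.
        return -log(pi + (1.0 - pi) * exp(nb_log_pmf(0, mu, theta)))
    # A positive count can only come from the NB component.
    return -(log(1.0 - pi) + nb_log_pmf(x, mu, theta))

# Sanity check: the ZINB probabilities over all counts should sum to about 1.
total_prob = sum(exp(-zinb_nll(x, mu=2.0, theta=1.5, pi=0.3)) for x in range(400))
```

Raising `pi` makes observed zeros cheaper to explain, which is how the model absorbs dropout events without distorting the estimated mean expression.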
Affiliation(s)
| | | | | | | | | | | | - Guohua Wang
- College of Computer and Control Engineering, Northeast Forestry University, Harbin 150040, China; (T.Z.); (J.R.); (L.L.); (Z.W.); (Z.Z.); (G.D.)
| |
|
20
|
Qiu Y, Yang L, Jiang H, Zou Q. scTPC: a novel semisupervised deep clustering model for scRNA-seq data. Bioinformatics 2024; 40:btae293. [PMID: 38684178 PMCID: PMC11091743 DOI: 10.1093/bioinformatics/btae293] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Received: 01/04/2024] [Revised: 04/14/2024] [Accepted: 04/26/2024] [Indexed: 05/02/2024] Open
Abstract
MOTIVATION Continuous advancements in single-cell RNA sequencing (scRNA-seq) technology have enabled researchers to further explore the study of cell heterogeneity, trajectory inference, identification of rare cell types, and neurology. Accurate scRNA-seq data clustering is crucial in single-cell sequencing data analysis. However, the high dimensionality, sparsity, and presence of "false" zero values in the data can pose challenges to clustering. Furthermore, current unsupervised clustering algorithms have not effectively leveraged prior biological knowledge, making cell clustering even more challenging. RESULTS This study investigates a semisupervised clustering model called scTPC, which integrates the triplet constraint, pairwise constraint, and cross-entropy constraint based on deep learning. Specifically, the model begins by pretraining a denoising autoencoder based on a zero-inflated negative binomial distribution. Deep clustering is then performed in the learned latent feature space using triplet constraints and pairwise constraints generated from partial labeled cells. Finally, to address imbalanced cell-type datasets, a weighted cross-entropy loss is introduced to optimize the model. A series of experimental results on 10 real scRNA-seq datasets and five simulated datasets demonstrate that scTPC achieves accurate clustering with a well-designed framework. AVAILABILITY AND IMPLEMENTATION scTPC is a Python-based algorithm, and the code is available from https://github.com/LF-Yang/Code or https://zenodo.org/records/10951780.
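Two of the constraints named in this abstract, the triplet constraint and the weighted cross-entropy, can be written in a few lines. The following is a generic sketch of those loss terms, not the scTPC implementation: the margin value and the inverse-frequency weighting scheme are assumptions chosen for illustration.

```python
import math
from collections import Counter

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge loss: the anchor should be closer to the positive than to the negative."""
    return max(0.0, math.dist(anchor, positive) - math.dist(anchor, negative) + margin)

def inverse_frequency_weights(labels):
    """Class weights n / (k * count), so that rare cell types weigh more (assumed scheme)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}

def weighted_cross_entropy(probs, label, class_weights):
    """Cross-entropy for one labeled cell, scaled by the weight of its true class."""
    return -class_weights[label] * math.log(probs[label])

# Latent embeddings of three labeled cells: anchor and positive share a type.
satisfied = triplet_loss([0.0, 0.0], [0.0, 1.0], [0.0, 5.0])   # 0.0
violated = triplet_loss([0.0, 0.0], [0.0, 1.0], [0.0, 1.5])    # 0.5
weights = inverse_frequency_weights([0, 0, 0, 1])              # class 1 is rare
```

With these weights, misclassifying a cell from the rare class costs three times as much as misclassifying one from the common class, which counteracts the imbalance.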
Affiliation(s)
- Yushan Qiu
- School of Mathematical Sciences, Shenzhen University, Shenzhen, Guangdong 518000, China
| | - Lingfei Yang
- School of Mathematical Sciences, Shenzhen University, Shenzhen, Guangdong 518000, China
| | - Hao Jiang
- School of Mathematics, Renmin University of China, Haidian District, Beijing 100872, China
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610056, China
| |
|
21
|
Manousidaki A, Little A, Xie Y. Clustering and visualization of single-cell RNA-seq data using path metrics. PLoS Comput Biol 2024; 20:e1012014. [PMID: 38809943 PMCID: PMC11164391 DOI: 10.1371/journal.pcbi.1012014] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/22/2023] [Revised: 06/10/2024] [Accepted: 03/21/2024] [Indexed: 05/31/2024] Open
Abstract
Recent advances in single-cell technologies have enabled high-resolution characterization of tissue and cancer compositions. Although numerous tools for dimension reduction and clustering are available for single-cell data analyses, these methods often fail to simultaneously preserve local cluster structure and global data geometry. To address these challenges, we developed a novel analysis framework, Single-Cell Path Metrics Profiling (scPMP), using power-weighted path metrics, which measure distances between cells in a data-driven way. Unlike Euclidean distance and other commonly used distance metrics, path metrics are density sensitive and respect the underlying data geometry. By combining path metrics with multidimensional scaling, a low-dimensional embedding of the data is obtained which preserves both the global data geometry and cluster structure. We evaluate the method both for clustering quality and geometric fidelity, and it outperforms current scRNA-seq clustering algorithms on a wide range of benchmarking data sets.
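A power-weighted path metric, as named in this abstract, replaces the straight-line distance between two cells with the cheapest path through the data, where each hop of length d costs d^p. The sketch below computes it exactly on a small point set with the Floyd-Warshall algorithm over the complete graph. This is an assumption-laden illustration: the function name is hypothetical, and a practical implementation for large datasets would likely restrict paths to a sparse neighbor graph.

```python
import math

def power_weighted_path_metric(X, p=2.0):
    """All-pairs path metric: min over paths of (sum of hop_length**p) ** (1/p)."""
    n = len(X)
    # Edge cost between every pair of points is the Euclidean distance raised to p.
    D = [[math.dist(X[i], X[j]) ** p for j in range(n)] for i in range(n)]
    # Floyd-Warshall: allow each point m in turn as an intermediate stop.
    for m in range(n):
        for i in range(n):
            for j in range(n):
                if D[i][m] + D[m][j] < D[i][j]:
                    D[i][j] = D[i][m] + D[m][j]
    return [[d ** (1.0 / p) for d in row] for row in D]

# Three collinear points: with p = 2, two short hops beat one long hop.
points = [[0.0], [1.0], [2.0]]
P2 = power_weighted_path_metric(points, p=2.0)   # P2[0][2] is sqrt(2), not 2
P1 = power_weighted_path_metric(points, p=1.0)   # p = 1 recovers Euclidean distance
```

Because many short hops are cheaper than one long hop when p > 1, points connected through a dense region end up close under the path metric, which is the density sensitivity the abstract refers to.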
Affiliation(s)
- Andriana Manousidaki
- Department of Statistics and Probability, Michigan State University, East Lansing, Michigan, United States of America
| | - Anna Little
- Department of Mathematics, University of Utah, Salt Lake City, Utah, United States of America
| | - Yuying Xie
- Department of Statistics and Probability, Michigan State University, East Lansing, Michigan, United States of America
- Department of Computational Mathematics, Science and Engineering, Michigan State University, East Lansing, Michigan, United States of America
| |
|
22
|
Zhang W, Yu R, Xu Z, Li J, Gao W, Jiang M, Dai Q. scCompressSA: dual-channel self-attention based deep autoencoder model for single-cell clustering by compressing gene-gene interactions. BMC Genomics 2024; 25:423. [PMID: 38684946 PMCID: PMC11059774 DOI: 10.1186/s12864-024-10286-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/28/2023] [Accepted: 04/04/2024] [Indexed: 05/02/2024] Open
Abstract
BACKGROUND Single-cell clustering has played an important role in exploring the molecular mechanisms of cell differentiation and human diseases. Because transcriptomics data are highly stochastic, accurate detection of cell types remains challenging, especially for RNA-sequencing data from human beings. In this case, deep neural networks have been increasingly employed to mine cell type-specific patterns and have outperformed statistical approaches in cell clustering. RESULTS Using cross-correlation to capture gene-gene interactions, this study proposes the scCompressSA method to integrate topological patterns from scRNA-seq data, with the support of a self-attention (SA) based coefficient compression (CC) block. This SA-based CC block is able to extract and employ static gene-gene interactions from scRNA-seq data. The proposed scCompressSA method has enhanced clustering accuracy in multiple benchmark scRNA-seq datasets by integrating topological and temporal features. CONCLUSION Static gene-gene interactions have been extracted as temporal features to boost clustering performance in single-cell clustering. For the scCompressSA method, the dual-channel SA-based CC block is able to integrate topological features and has exhibited extraordinary detection accuracy compared with previous clustering approaches that only employ temporal patterns.
Affiliation(s)
- Wei Zhang
- Zhejiang Sci-Tech University, Second Street 928, Hangzhou, Zhejiang, 310018, China
| | - Ruochen Yu
- Zhejiang Sci-Tech University, Second Street 928, Hangzhou, Zhejiang, 310018, China
| | - Zeqi Xu
- Zhejiang Sci-Tech University, Second Street 928, Hangzhou, Zhejiang, 310018, China
| | - Junnan Li
- Zhejiang Sci-Tech University, Second Street 928, Hangzhou, Zhejiang, 310018, China
| | - Wenhao Gao
- Zhejiang Sci-Tech University, Second Street 928, Hangzhou, Zhejiang, 310018, China
| | - Mingfeng Jiang
- Zhejiang Sci-Tech University, Second Street 928, Hangzhou, Zhejiang, 310018, China.
| | - Qi Dai
- Zhejiang Sci-Tech University, Second Street 928, Hangzhou, Zhejiang, 310018, China.
| |
|
23
|
Lee J, Yun S, Kim Y, Chen T, Kellis M, Park C. Single-cell RNA sequencing data imputation using bi-level feature propagation. Brief Bioinform 2024; 25:bbae209. [PMID: 38706317 PMCID: PMC11070731 DOI: 10.1093/bib/bbae209] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/23/2023] [Revised: 04/08/2024] [Accepted: 04/19/2024] [Indexed: 05/07/2024] Open
Abstract
Single-cell RNA sequencing (scRNA-seq) enables the exploration of cellular heterogeneity by analyzing gene expression profiles in complex tissues. However, scRNA-seq data often suffer from technical noise, dropout events and sparsity, hindering downstream analyses. Although existing works attempt to mitigate these issues by utilizing graph structures for data denoising, they risk propagating noise and fall short of fully leveraging the inherent data relationships, relying mainly on only one of cell-cell or gene-gene associations and on graphs constructed from the initial noisy data. To this end, this study presents single-cell bilevel feature propagation (scBFP), a two-step graph-based feature propagation method. It initially imputes zero values using non-zero values, ensuring that the imputation process does not affect the non-zero values due to dropout. Subsequently, it denoises the entire dataset by leveraging gene-gene and cell-cell relationships in the respective steps. Extensive experimental results on scRNA-seq data demonstrate the effectiveness of scBFP in various downstream tasks, uncovering valuable biological insights.
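The first step described in this abstract, imputing zeros from non-zero values without altering the non-zero values, matches the general pattern of graph feature propagation with clamping. The sketch below shows that generic pattern on a cell-cell graph only; it is not the scBFP algorithm, and the function name, the row-normalized weight matrix `W`, and the fixed iteration count are all assumptions.

```python
def propagate_features(X, W, n_iter=20):
    """Fill zero entries of X by repeated neighbor averaging; observed entries stay fixed.

    X: cells x genes matrix (list of lists), zeros treated as missing.
    W: cells x cells row-normalized graph weights (each row sums to 1).
    """
    n, g = len(X), len(X[0])
    observed = [[X[i][j] != 0 for j in range(g)] for i in range(n)]
    cur = [row[:] for row in X]
    for _ in range(n_iter):
        # One propagation step: each cell takes the weighted mean of its neighbors.
        nxt = [[sum(W[i][m] * cur[m][j] for m in range(n)) for j in range(g)]
               for i in range(n)]
        # Clamp: restore the originally observed (non-zero) values.
        for i in range(n):
            for j in range(g):
                if observed[i][j]:
                    nxt[i][j] = X[i][j]
        cur = nxt
    return cur

# Three cells on a path graph; the middle cell has a dropout for the single gene.
X = [[2.0], [0.0], [4.0]]
W = [[0.0, 1.0, 0.0],
     [0.5, 0.0, 0.5],
     [0.0, 1.0, 0.0]]
imputed = propagate_features(X, W)   # the middle cell converges to 3.0
```

The same routine could be applied to the transposed matrix with a gene-gene graph, which is one plausible reading of the "bi-level" idea, though the abstract does not give enough detail to confirm it.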
Affiliation(s)
- Junseok Lee
- Department of Industrial and Systems Engineering, KAIST, 291 Daehak-ro, Yuseong-gu, Daejeon 34141, Republic of Korea
| | - Sukwon Yun
- Department of Computer Science, 201 S. Columbia St. CB 3175, UNC-Chapel Hill, Chapel Hill, NC 27599, United States
| | - Yeongmin Kim
- School of Computing, KAIST, 291 Daehak-ro, Yuseong-gu, Daejeon 34141, Republic of Korea
| | - Tianlong Chen
- Department of Computer Science, 201 S. Columbia St. CB 3175, UNC-Chapel Hill, Chapel Hill, NC 27599, United States
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, 32 Vassar St, Cambridge, MA 02139, United States
- Broad Institute of MIT and Harvard, Merkin Building, 415 Main St., Cambridge, MA 02142, United States
| | - Manolis Kellis
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, 32 Vassar St, Cambridge, MA 02139, United States
- Broad Institute of MIT and Harvard, Merkin Building, 415 Main St., Cambridge, MA 02142, United States
| | - Chanyoung Park
- Department of Industrial and Systems Engineering, KAIST, 291 Daehak-ro, Yuseong-gu, Daejeon 34141, Republic of Korea
| |
|
24
|
Zhai Y, Chen L, Deng M. scBOL: a universal cell type identification framework for single-cell and spatial transcriptomics data. Brief Bioinform 2024; 25:bbae188. [PMID: 38678389 PMCID: PMC11056022 DOI: 10.1093/bib/bbae188] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/10/2024] [Revised: 03/11/2024] [Accepted: 04/14/2024] [Indexed: 04/30/2024] Open
Abstract
MOTIVATION: Over the past decade, single-cell transcriptomic technologies have advanced remarkably, enabling the simultaneous profiling of gene expression across thousands of individual cells. Cell type identification plays an essential role in exploring tissue heterogeneity and characterizing cell state differences. As more well-annotated reference data become available, numerous automatic identification methods have emerged to simplify the annotation of unlabeled target data by transferring cell type knowledge. However, in practice, the target data often include novel cell types that are absent from the reference data. Most existing works classify these private cells as one generic 'unassigned' group and learn the features of known and novel cell types in a coupled way. They are susceptible to potential batch effects and fail to explore the fine-grained semantic knowledge of novel cell types, thus hurting the model's discrimination ability. Additionally, emerging spatial transcriptomic technologies, such as in situ hybridization, sequencing and multiplexed imaging, present a novel challenge to current cell type identification strategies, which predominantly neglect spatial organization. Consequently, it is imperative to develop a versatile method that can proficiently annotate single-cell transcriptomics data, encompassing both spatial and non-spatial dimensions.
RESULTS: To address these issues, we propose a new, challenging yet realistic task called universal cell type identification for single-cell and spatial transcriptomics data. In this task, we aim to give semantic labels to target cells from known cell types and cluster labels to those from novel ones. To tackle this problem, instead of designing a suboptimal two-stage approach, we propose an end-to-end algorithm called scBOL from the perspective of Bipartite prototype alignment. First, we identify the mutual nearest clusters in the reference and target data as their potential common cell types. On this basis, we mine cycle-consistent semantic anchor cells to build the intrinsic structural association between the two datasets. Second, we design a neighbor-aware prototypical learning paradigm to strengthen inter-cluster separability and intra-cluster compactness within each dataset, thereby encouraging discriminative feature representations. Third, driven by the semantic-aware prototypical learning framework, we align the known cell types across the reference and target data and separate the private cell types from them. The algorithm can be seamlessly applied to various data types modeled by different foundation models that generate embedding features for cells. Specifically, for non-spatial single-cell transcriptomics data, we use an autoencoder neural network to learn latent low-dimensional cell representations, and for spatial single-cell transcriptomics data, we apply a graph convolution network to jointly capture the molecular and spatial similarities of cells. Extensive results on our carefully designed evaluation benchmarks demonstrate the superiority of scBOL over various state-of-the-art cell type identification methods. To our knowledge, we are the first to present this pragmatic annotation task and to devise a comprehensive algorithmic framework for resolving it across varied types of single-cell data. Finally, scBOL is implemented in Python using the PyTorch machine-learning library, and it is freely available at https://github.com/aimeeyaoyao/scBOL.
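The first step, pairing clusters that are each other's nearest neighbour across the reference and target data, can be illustrated with a short sketch. It is not taken from the scBOL code base: representing each cluster by its centroid, using Euclidean distance, and the function names `centroid`, `nearest` and `mutual_nearest_clusters` are assumptions made for illustration.

```python
import math

def centroid(points):
    """Mean vector of a list of equal-length embeddings."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]

def nearest(src, dst):
    """For each centroid in src, the index of the closest centroid in dst."""
    return [min(range(len(dst)), key=lambda j: math.dist(c, dst[j])) for c in src]

def mutual_nearest_clusters(ref_centroids, tgt_centroids):
    """Pairs (i, j) where reference cluster i and target cluster j pick each other."""
    ref_to_tgt = nearest(ref_centroids, tgt_centroids)
    tgt_to_ref = nearest(tgt_centroids, ref_centroids)
    return [(i, j) for i, j in enumerate(ref_to_tgt) if tgt_to_ref[j] == i]

# Toy example: two reference clusters, three target clusters.
ref = [centroid([[0.0, 0.0], [0.0, 0.0]]), centroid([[10.0, 10.0]])]
tgt = [[0.5, 0.0], [9.0, 10.0], [30.0, 30.0]]
print(mutual_nearest_clusters(ref, tgt))  # [(0, 0), (1, 1)]
```

In the toy example the third target cluster is left unpaired: its nearest reference cluster already has a closer partner, which is the situation in which a cluster would be treated as a candidate private (novel) cell type.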
Affiliation(s)
- Yuyao Zhai
  - School of Mathematical Sciences, Peking University, Beijing, China
- Liang Chen
  - Huawei Technologies Co., Ltd., Beijing, China
- Minghua Deng
  - School of Mathematical Sciences, Peking University, Beijing, China
  - Center for Statistical Science, Peking University, Beijing, China
  - Center for Quantitative Biology, Peking University, Beijing, China
25
Ren L, Wang J, Li W, Guo M, Yu G. Single-cell RNA-seq data clustering by deep information fusion. Brief Funct Genomics 2024; 23:128-137. [PMID: 37208992 DOI: 10.1093/bfgp/elad017] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 08/26/2022] [Revised: 02/13/2023] [Indexed: 05/21/2023] Open
Abstract
Determining cell types from single-cell transcriptomics data is fundamental for downstream analysis. However, cell clustering and data imputation still face computational challenges due to the high dropout rate, sparsity and dimensionality of single-cell data. Although some deep learning-based solutions have been proposed to handle these challenges, they still cannot leverage gene attribute information and cell topology in a sensible way to obtain a consistent clustering. In this paper, we present scDeepFC, a deep information fusion-based method for cell clustering and data imputation on single-cell data. Specifically, scDeepFC uses a deep auto-encoder (DAE) network and a deep graph convolution network to embed high-dimensional gene attribute information and high-order cell-cell topological information into different low-dimensional representations, and then fuses them via a deep information fusion network to generate a more comprehensive and accurate consensus representation. In addition, scDeepFC integrates the zero-inflated negative binomial (ZINB) model into the DAE to model dropout events. By jointly optimizing the ZINB loss and the cell graph reconstruction loss, scDeepFC generates a salient embedding representation for clustering cells and imputing missing data. Extensive experiments on real single-cell datasets show that scDeepFC outperforms other popular single-cell analysis methods, and that both gene attribute and cell topology information improve cell clustering.
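For readers unfamiliar with the ZINB loss mentioned above, the following sketch writes out the standard zero-inflated negative binomial negative log-likelihood for a single count. It is a textbook formulation rather than code from scDeepFC: the parameter names `mu` (mean), `theta` (dispersion) and `pi` (dropout probability) are a common convention assumed here, and a real model would predict them per gene and per cell with a neural network.

```python
import math

def nb_log_pmf(x, mu, theta):
    """Log-probability of count x under a negative binomial with mean mu, dispersion theta."""
    return (math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1)
            + theta * math.log(theta / (theta + mu))
            + x * math.log(mu / (theta + mu)))

def zinb_nll(x, mu, theta, pi):
    """Negative log-likelihood of count x under a zero-inflated negative binomial.

    With probability pi the count is a structural zero (dropout);
    otherwise it is drawn from the negative binomial.
    Requires mu > 0, theta > 0 and 0 <= pi < 1.
    """
    nb = math.exp(nb_log_pmf(x, mu, theta))
    if x == 0:
        p = pi + (1.0 - pi) * nb
    else:
        p = (1.0 - pi) * nb
    return -math.log(p)

# A zero count is less surprising when the dropout probability is higher.
print(zinb_nll(0, mu=3.0, theta=2.0, pi=0.3) < zinb_nll(0, mu=3.0, theta=2.0, pi=0.0))  # True
```

Summing `zinb_nll` over all entries of the count matrix gives the reconstruction term that an autoencoder of this kind would minimize alongside its other losses.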
Affiliation(s)
- Liangrui Ren
  - School of Software, Shandong University, 250101 Ji'nan, China
- Jun Wang
  - Joint SDU-NTU Centre for Artificial Intelligence Research, Shandong University, 250101 Ji'nan, China
- Wei Li
  - School of Control Science and Engineering, Shandong University, 250061 Ji'nan, China
- Maozu Guo
  - College of Electrical and Information Engineering, Beijing University of Civil Engineering and Architecture, 100044 Beijing, China
- Guoxian Yu
  - School of Software, Shandong University, 250101 Ji'nan, China
26
Hu D, Liang K, Dong Z, Wang J, Zhao Y, He K. Effective multi-modal clustering method via skip aggregation network for parallel scRNA-seq and scATAC-seq data. Brief Bioinform 2024; 25:bbae102. [PMID: 38493338 PMCID: PMC10944573 DOI: 10.1093/bib/bbae102] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/13/2023] [Revised: 01/06/2024] [Accepted: 02/16/2024] [Indexed: 03/18/2024] Open
Abstract
In recent years, there has been a growing trend toward parallel clustering analysis of single-cell RNA-seq (scRNA) and single-cell Assay for Transposase-Accessible Chromatin (scATAC) data. However, prevailing methods often treat these two data modalities as equals, neglecting the fact that the scRNA modality holds significantly richer information than the scATAC modality. This disregard prevents the model from benefiting from the insights derived from multiple modalities, compromising the overall clustering performance. To this end, we propose scEMC, an effective multi-modal clustering model for parallel scRNA and scATAC data. Concretely, we have devised a skip aggregation network to simultaneously learn global structural information among cells and integrate data from diverse modalities. To safeguard the quality of the integrated cell representation against the influence of sparse scATAC data, we connect the scRNA data with the aggregated representation via a skip connection. Moreover, to effectively fit the real distribution of cells, we introduce a zero-inflated negative binomial-based denoising autoencoder that accommodates corrupted data containing synthetic noise, together with a joint optimization module that employs multiple losses. Extensive experiments underscore the effectiveness of our model. This work contributes significantly to the ongoing exploration of cell subpopulations and tumor microenvironments, and the code of our work will be public at https://github.com/DayuHuu/scEMC.
Affiliation(s)
- Dayu Hu
  - School of Computer, National University of Defense Technology, No. 109 Deya Road, 410073 Changsha, Hunan, China
- Ke Liang
  - School of Computer, National University of Defense Technology, No. 109 Deya Road, 410073 Changsha, Hunan, China
- Zhibin Dong
  - School of Computer, National University of Defense Technology, No. 109 Deya Road, 410073 Changsha, Hunan, China
- Jun Wang
  - School of Computer, National University of Defense Technology, No. 109 Deya Road, 410073 Changsha, Hunan, China
- Yawei Zhao
  - Medical Big Data Research Center, Chinese PLA General Hospital, No. 28 Fuxing Road, 100853 Beijing, China
- Kunlun He
  - Medical Big Data Research Center, Chinese PLA General Hospital, No. 28 Fuxing Road, 100853 Beijing, China
27
Fang Z, Zheng R, Li M. scMAE: a masked autoencoder for single-cell RNA-seq clustering. Bioinformatics 2024; 40:btae020. [PMID: 38230824 PMCID: PMC10832357 DOI: 10.1093/bioinformatics/btae020] [Citation(s) in RCA: 11] [Impact Index Per Article: 11.0] [Received: 09/14/2023] [Revised: 01/07/2024] [Accepted: 01/12/2024] [Indexed: 01/18/2024] Open
Abstract
MOTIVATION: Single-cell RNA sequencing has emerged as a powerful technology for studying gene expression at the individual cell level. Clustering individual cells into distinct subpopulations is fundamental in scRNA-seq data analysis, facilitating the identification of cell types and exploration of cellular heterogeneity. Despite the recent development of many deep learning-based single-cell clustering methods, few have effectively exploited the correlations among genes, resulting in suboptimal clustering outcomes.
RESULTS: Here, we propose a novel masked autoencoder-based method, scMAE, for cell clustering. scMAE perturbs gene expression and employs a masked autoencoder to reconstruct the original data, learning robust and informative cell representations. The masked autoencoder introduces a masking predictor, which captures relationships among genes by predicting whether gene expression values are masked. By integrating this masking mechanism, scMAE effectively captures latent structures and dependencies in the data, enhancing clustering performance. We conducted extensive comparative experiments using various clustering evaluation metrics on 15 scRNA-seq datasets from different sequencing platforms. Experimental results indicate that scMAE outperforms other state-of-the-art methods on these datasets. In addition, scMAE accurately identifies rare cell types, which are challenging to detect due to their low abundance. Furthermore, biological analyses confirm the biological significance of the identified cell subpopulations.
AVAILABILITY AND IMPLEMENTATION: The source code of scMAE is available at: https://zenodo.org/records/10465991.
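The masking mechanism described above needs two things as training inputs: a perturbed copy of the expression matrix and a binary record of which entries were perturbed, which the masking predictor then tries to recover. The sketch below prepares both. It is an illustration only: setting masked entries to zero, the `mask_rate` value and the function name `mask_expression` are assumptions, and the perturbation actually used by scMAE may differ.

```python
import random

def mask_expression(matrix, mask_rate=0.3, seed=0):
    """Return (corrupted, mask) for a cells-by-genes matrix given as a list of lists.

    mask[i][j] is 1 where the entry was perturbed and 0 elsewhere.
    Perturbation here means zeroing the value (an assumption for this sketch).
    """
    rng = random.Random(seed)
    corrupted, mask = [], []
    for row in matrix:
        flags = [rng.random() < mask_rate for _ in row]
        corrupted.append([0.0 if flagged else value for value, flagged in zip(row, flags)])
        mask.append([1 if flagged else 0 for flagged in flags])
    return corrupted, mask

expression = [[1.0, 2.0, 0.0], [3.0, 0.0, 4.0]]
corrupted, mask = mask_expression(expression, mask_rate=0.5, seed=1)
# An autoencoder would take `corrupted` as input, reconstruct `expression`,
# and a separate prediction head would be trained to output `mask`.
```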
Affiliation(s)
- Zhaoyu Fang
  - School of Computer Science and Engineering, Central South University, 932 South Lushan Road, Yuelu District, Changsha 410083, China
- Ruiqing Zheng
  - School of Computer Science and Engineering, Central South University, 932 South Lushan Road, Yuelu District, Changsha 410083, China
- Min Li
  - School of Computer Science and Engineering, Central South University, 932 South Lushan Road, Yuelu District, Changsha 410083, China
28
Wang Z, Xie X, Liu S, Ji Z. scFseCluster: a feature selection-enhanced clustering for single-cell RNA-seq data. Life Sci Alliance 2023; 6:e202302103. [PMID: 37788907 PMCID: PMC10547911 DOI: 10.26508/lsa.202302103] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/20/2023] [Revised: 09/21/2023] [Accepted: 09/22/2023] [Indexed: 10/05/2023] Open
Abstract
Single-cell RNA sequencing (scRNA-seq) enables researchers to reveal previously unknown cell heterogeneity and functional diversity, which is impossible with bulk RNA sequencing. Clustering approaches are widely used for analyzing scRNA-seq data and identifying cell types and states. In the past few years, various advanced computational strategies have emerged. However, low generalizability and high computational cost remain the main bottlenecks of existing methods. In this study, we established a novel computational framework, scFseCluster, for scRNA-seq clustering analysis. scFseCluster incorporates a metaheuristic algorithm (Feature Selection based on Quantum Squirrel Search Algorithm) to extract the optimal gene set, which largely guarantees the performance of cell clustering. We conducted simulation experiments in several aspects to verify the performance of the proposed approach. scFseCluster performed very well on eight benchmark scRNA-seq datasets because of the optimal gene sets obtained using the Feature Selection based on Quantum Squirrel Search Algorithm. The comparative study demonstrated the significant advantages of scFseCluster over seven state-of-the-art algorithms. In addition, our analysis shows that feature selection on highly variable genes can significantly improve clustering performance. In conclusion, our study demonstrates that scFseCluster is a highly versatile tool for enhancing scRNA-seq data clustering analysis.
Affiliation(s)
- Zongqin Wang
  - College of Artificial Intelligence, Nanjing Agricultural University, Nanjing, China
- Xiaojun Xie
  - College of Artificial Intelligence, Nanjing Agricultural University, Nanjing, China
  - Center for Data Science and Intelligent Computing, Nanjing Agricultural University, Nanjing, China
- Shouyang Liu
  - Academy for Advanced Interdisciplinary Studies, Nanjing Agricultural University, Nanjing, China
- Zhiwei Ji
  - College of Artificial Intelligence, Nanjing Agricultural University, Nanjing, China
  - Center for Data Science and Intelligent Computing, Nanjing Agricultural University, Nanjing, China
29
Tian SW, Ni JC, Wang YT, Zheng CH, Ji CM. scGCC: Graph Contrastive Clustering With Neighborhood Augmentations for scRNA-Seq Data Analysis. IEEE J Biomed Health Inform 2023; 27:6133-6143. [PMID: 37751336 DOI: 10.1109/jbhi.2023.3319551] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/28/2023]
Abstract
Single-cell RNA sequencing (scRNA-seq) has rapidly emerged as a powerful technique for analyzing cellular heterogeneity at the individual cell level. In the analysis of scRNA-seq data, cell clustering is a critical step in downstream analysis, as it enables the identification of cell types and the discovery of novel cell subtypes. However, the characteristics of scRNA-seq data, such as high dimensionality and sparsity, dropout events and batch effects, present significant computational challenges for clustering analysis. In this study, we propose scGCC, a novel graph self-supervised contrastive learning model, to address the challenges faced in scRNA-seq data analysis. scGCC comprises two main components: a representation learning module and a clustering module. The scRNA-seq data are first fed into the representation learning module for training, and the learned representations are then grouped by the clustering module. scGCC can learn low-dimensional denoised embeddings, which is advantageous for the clustering task. We introduce Graph Attention Networks (GAT) for cell representation learning, which enables better feature extraction and improved clustering accuracy. Additionally, we propose five data augmentation methods to improve clustering performance by increasing data diversity and reducing overfitting. These methods enhance the robustness of clustering results. Our experimental study on 14 real-world datasets demonstrates that our model achieves extraordinary accuracy and robustness. We also perform downstream tasks, including batch effect removal, trajectory inference, and marker gene analysis, to verify the biological effectiveness of our model.
30
Zhan Y, Liu J, Ou-Yang L. scMIC: A Deep Multi-Level Information Fusion Framework for Clustering Single-Cell Multi-Omics Data. IEEE J Biomed Health Inform 2023; 27:6121-6132. [PMID: 37725723 DOI: 10.1109/jbhi.2023.3317272] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 09/21/2023]
Abstract
Cell type identification is a crucial step towards the study of cellular heterogeneity and biological processes. Advances in single-cell sequencing technology have enabled the development of a variety of clustering methods for cell type identification. However, most existing methods are designed for clustering single-omics data such as single-cell RNA-sequencing (scRNA-seq) data. The accumulation of single-cell multi-omics data provides a great opportunity to integrate different omics data for cell clustering, but also raises new computational challenges for existing methods. How to integrate multi-omics data and leverage their consensus and complementary information to improve the accuracy of cell clustering remains a challenge. In this study, we propose a new deep multi-level information fusion framework, named scMIC, for clustering single-cell multi-omics data. Our model can integrate the attribute information of cells and the potential structural relationships among cells at the local and global levels, and reduce redundant information between different omics at the cell and feature levels, leading to more discriminative representations. Moreover, the proposed multiple collaborative supervised clustering strategy guides the learning process of the core encoding part by learning a high-confidence target distribution. This facilitates the interaction between the clustering part and the representation learning part, as well as the information exchange between omics, and ultimately yields more robust clustering results. Experiments on seven single-cell multi-omics datasets show the superiority of scMIC over existing state-of-the-art methods.
31
Liu J, Zeng W, Kan S, Li M, Zheng R. CAKE: a flexible self-supervised framework for enhancing cell visualization, clustering and rare cell identification. Brief Bioinform 2023; 25:bbad475. [PMID: 38145950 PMCID: PMC10749894 DOI: 10.1093/bib/bbad475] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/07/2023] [Revised: 11/13/2023] [Accepted: 11/30/2023] [Indexed: 12/27/2023] Open
Abstract
Single-cell sequencing technology has provided unprecedented opportunities for comprehensively deciphering cell heterogeneity. Nevertheless, the high dimensionality and intricate nature of cell heterogeneity have presented substantial challenges to computational methods. Numerous novel clustering methods have been proposed to address this issue. However, none of these methods achieves consistently better performance across different biological scenarios. In this study, we developed CAKE, a novel and scalable self-supervised clustering method, which consists of a contrastive learning model with a mixture neighborhood augmentation for cell representation learning, and a self-Knowledge Distiller model for the refinement of clustering results. These designs provide more condensed and cluster-friendly cell representations and improve clustering performance in terms of accuracy and robustness. Furthermore, in addition to accurately identifying the major cell types, CAKE can also find more biologically meaningful cell subgroups and rare cell types. Comprehensive experiments on real single-cell RNA sequencing datasets demonstrated the superiority of CAKE in visualization and clustering over other comparison methods, and indicated its broad applicability in the field of cell heterogeneity analysis. Contact: Ruiqing Zheng (rqzheng@csu.edu.cn).
Affiliation(s)
- Jin Liu
  - School of Computer Science and Engineering, Central South University, Changsha, Hunan 410083, P.R. China
- Weixing Zeng
  - School of Computer Science and Engineering, Central South University, Changsha, Hunan 410083, P.R. China
- Shichao Kan
  - School of Computer Science and Engineering, Central South University, Changsha, Hunan 410083, P.R. China
- Min Li
  - School of Computer Science and Engineering, Central South University, Changsha, Hunan 410083, P.R. China
- Ruiqing Zheng
  - School of Computer Science and Engineering, Central South University, Changsha, Hunan 410083, P.R. China
32
Wang L, Li W, Xie W, Wang R, Yu K. Dual-GCN-based deep clustering with triplet contrast for ScRNA-seq data analysis. Comput Biol Chem 2023; 106:107924. [PMID: 37487251 DOI: 10.1016/j.compbiolchem.2023.107924] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/01/2023] [Revised: 06/08/2023] [Accepted: 07/12/2023] [Indexed: 07/26/2023]
Abstract
Single-cell RNA sequencing (scRNA-seq) technology reveals gene expression information at the cellular level. The critical tasks in scRNA-seq data analysis are clustering and dimensionality reduction. Recent deep clustering algorithms optimize the two tasks jointly, and their variants, graph-based deep clustering algorithms, capture and preserve topological information in the process. However, existing graph-based deep clustering algorithms ignore the distribution information of nodes when constructing cell graphs, which leads to incomplete information in the embedding representation. In addition, graph convolutional networks (GCN), which are most commonly used, often suffer from over-smoothing, which leads to high sample similarity in the embedding representation and, in turn, poor clustering performance. Here, dual-GCN-based deep clustering with triplet contrast (scDGDC) is proposed for dimensionality reduction and clustering of scRNA-seq data. Its two critical components are a dual-GCN-based encoder for capturing more comprehensive topological information and a triplet contrast for reducing GCN over-smoothing. The two components improve the dimensionality reduction and clustering performance of scDGDC in terms of information acquisition and model optimization, respectively. Experiments on eight real scRNA-seq datasets showed that scDGDC achieves excellent performance on both clustering and dimensionality reduction tasks and is highly robust to parameters.
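The abstract does not spell out the triplet contrast, so the following is only a sketch of the generic triplet margin loss on which such objectives are commonly based. The Euclidean distance, the `margin` default and the function name `triplet_loss` are assumptions, not details taken from scDGDC.

```python
import math

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss on embedding vectors.

    The loss is zero once the negative is at least `margin` farther from
    the anchor than the positive is; otherwise it grows linearly.
    """
    d_pos = math.dist(anchor, positive)
    d_neg = math.dist(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)

# A well-separated negative contributes no loss ...
print(triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0]))  # 0.0
# ... while a negative that sits close to the anchor is penalized.
print(triplet_loss([0.0, 0.0], [0.0, 1.0], [0.0, 1.5]))  # 0.5
```

Because the loss explicitly pushes dissimilar cells apart in the embedding space, it counteracts the tendency of stacked graph convolutions to make all node embeddings look alike.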
Affiliation(s)
- LinJie Wang
  - School of Computer Science and Engineering, Northeastern University, Shenyang 110819, China
- Wei Li
  - Key Laboratory of Intelligent Computing in Medical Image (MIIC), Northeastern University, Ministry of Education, Shenyang 110000, China
- WeiDong Xie
  - School of Computer Science and Engineering, Northeastern University, Shenyang 110819, China
- Rui Wang
  - School of Computer Science and Engineering, Northeastern University, Shenyang 110819, China
- Kun Yu
  - College of Medicine and Bioinformation Engineering, Northeastern University, Shenyang 110819, China
33
Lei T, Chen R, Zhang S, Chen Y. Self-supervised deep clustering of single-cell RNA-seq data to hierarchically detect rare cell populations. Brief Bioinform 2023; 24:bbad335. [PMID: 37769630 PMCID: PMC10539043 DOI: 10.1093/bib/bbad335] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/09/2023] [Revised: 09/05/2023] [Accepted: 09/06/2023] [Indexed: 10/02/2023] Open
Abstract
Single-cell RNA sequencing (scRNA-seq) is a widely used technique for characterizing individual cells and studying gene expression at the single-cell level. Clustering plays a vital role in grouping similar cells together for various downstream analyses. However, the high sparsity and dimensionality of large scRNA-seq data pose challenges to clustering performance. Although several deep learning-based clustering algorithms have been proposed, most existing clustering methods have limitations in capturing the precise distribution types of the data or fully utilizing the relationships between cells, leaving a considerable scope for improving the clustering performance, particularly in detecting rare cell populations from large scRNA-seq data. We introduce DeepScena, a novel single-cell hierarchical clustering tool that fully incorporates nonlinear dimension reduction, negative binomial-based convolutional autoencoder for data fitting, and a self-supervision model for cell similarity enhancement. In comprehensive evaluation using multiple large-scale scRNA-seq datasets, DeepScena consistently outperformed seven popular clustering tools in terms of accuracy. Notably, DeepScena exhibits high proficiency in identifying rare cell populations within large datasets that contain large numbers of clusters. When applied to scRNA-seq data of multiple myeloma cells, DeepScena successfully identified not only previously labeled large cell types but also subpopulations in CD14 monocytes, T cells and natural killer cells, respectively.
Affiliation(s)
- Tianyuan Lei
  - College of Computer and Information Engineering, Tianjin Normal University, Tianjin 300387, China
- Ruoyu Chen
  - Moorestown High School, Moorestown, NJ 08057, USA
- Shaoqiang Zhang
  - College of Computer and Information Engineering, Tianjin Normal University, Tianjin 300387, China
- Yong Chen
  - Department of Biological and Biomedical Sciences, Rowan University, NJ 08028, USA
34
Pan W, Long F, Pan J. ScInfoVAE: interpretable dimensional reduction of single cell transcription data with variational autoencoders and extended mutual information regularization. BioData Min 2023; 16:17. [PMID: 37301826 DOI: 10.1186/s13040-023-00333-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/23/2022] [Accepted: 06/05/2023] [Indexed: 06/12/2023] Open
Abstract
Single-cell RNA-sequencing (scRNA-seq) data can serve as a good indicator of cell-to-cell heterogeneity and can aid in the study of cell growth by identifying cell types. Recently, advances in variational autoencoders (VAEs) have demonstrated their ability to learn robust feature representations for scRNA-seq data. However, it has been observed that VAEs tend to ignore the latent variables when combined with a decoding distribution that is too flexible. In this paper, we introduce ScInfoVAE, a dimensionality reduction method based on the mutual information variational autoencoder (InfoVAE), which can more effectively identify various cell types in scRNA-seq data of complex tissues. ScInfoVAE combines an InfoVAE deep model with a zero-inflated negative binomial model and reformulates the objective function to denoise scRNA-seq data and learn an efficient low-dimensional representation of it. We use ScInfoVAE to analyze the clustering performance on 15 real scRNA-seq datasets and demonstrate that our method provides high clustering performance. In addition, we use simulated data to investigate the interpretability of feature extraction, and visualization results show that the low-dimensional representation learned by ScInfoVAE retains the local and global neighborhood structure of the data well. Our model can also significantly improve the quality of the variational posterior.
Affiliation(s)
- Weiquan Pan
  - School of Mathematics and Statistics, Yulin Normal University, Yulin, China
- Faning Long
  - School of Computer Science and Engineering, Yulin Normal University, Yulin, China
- Jian Pan
  - School of Mathematics and Statistics, Yulin Normal University, Yulin, China
35
Zhang S, Li X, Lin J, Lin Q, Wong KC. Review of single-cell RNA-seq data clustering for cell-type identification and characterization. RNA (New York, N.Y.) 2023; 29:517-530. [PMID: 36737104 PMCID: PMC10158997 DOI: 10.1261/rna.078965.121] [Citation(s) in RCA: 38] [Impact Index Per Article: 19.0] [Received: 03/27/2022] [Accepted: 01/03/2023] [Indexed: 05/06/2023]
Abstract
In recent years, the advances in single-cell RNA-seq techniques have enabled us to perform large-scale transcriptomic profiling at single-cell resolution in a high-throughput manner. Unsupervised learning such as data clustering has become the central component to identify and characterize novel cell types and gene expression patterns. In this study, we review the existing single-cell RNA-seq data clustering methods with critical insights into the related advantages and limitations. In addition, we also review the upstream single-cell RNA-seq data processing techniques such as quality control, normalization, and dimension reduction. We conduct performance comparison experiments to evaluate several popular single-cell RNA-seq clustering approaches on simulated and multiple single-cell transcriptomic data sets.
Affiliation(s)
- Shixiong Zhang
  - School of Computer Science and Technology, Xidian University, Xi'an 710071, China
  - Department of Computer Science, City University of Hong Kong, Hong Kong SAR, China
- Xiangtao Li
  - School of Artificial Intelligence, Jilin University, Jilin 130012, China
- Jiecong Lin
  - Department of Computer Science, City University of Hong Kong, Hong Kong SAR, China
- Qiuzhen Lin
  - College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China
- Ka-Chun Wong
  - Department of Computer Science, City University of Hong Kong, Hong Kong SAR, China
36
Feng X, Zhang H, Lin H, Long H. Single-cell RNA-seq data analysis based on directed graph neural network. Methods 2023; 211:48-60. [PMID: 36804214 DOI: 10.1016/j.ymeth.2023.02.008] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/13/2022] [Revised: 12/09/2022] [Accepted: 02/13/2023] [Indexed: 02/17/2023] Open
Abstract
The scale of single-cell RNA sequencing (scRNA-seq) data is surging with the development of high-throughput sequencing technology. However, although single-cell data analysis is a powerful tool, various issues have been reported, such as sequencing sparsity and complex differential patterns in gene expression. Statistical or traditional machine learning methods are inefficient, and their accuracy needs to be improved. Methods based on deep learning cannot directly process non-Euclidean spatial data, such as cell graphs. In this study, we developed scDGAE, a framework for scRNA-seq analysis that combines graph autoencoders and a graph attention network and is built on a directed graph neural network. Directed graph neural networks can not only retain the connection properties of the directed graph but also expand the receptive field of the convolution operation. Cosine similarity, median L1 distance, and root-mean-squared error are used to compare the gene imputation performance of scDGAE with that of other methods. Furthermore, adjusted mutual information, normalized mutual information, completeness score, and Silhouette coefficient score are used to compare their cell clustering performance. Experimental results show that the scDGAE model achieves promising performance in gene imputation and cell clustering prediction on four scRNA-seq data sets with gold-standard cell labels. Furthermore, it is a robust framework that can be applied to general scRNA-seq analyses.
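The three imputation metrics named above have simple closed forms, written out here for two flattened expression vectors (ground truth and imputed values). This is a plain-Python sketch using the standard definitions; the function names are chosen for the example and are not taken from the scDGAE code.

```python
import math
from statistics import median

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def median_l1_distance(a, b):
    """Median of the absolute element-wise differences."""
    return median(abs(x - y) for x, y in zip(a, b))

def rmse(a, b):
    """Root-mean-squared error between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

truth = [1.0, 2.0, 3.0, 4.0]
imputed = [1.0, 2.0, 3.0, 6.0]
print(median_l1_distance(truth, imputed))  # 0.0
print(rmse(truth, imputed))                # 1.0
```

Higher cosine similarity is better, whereas lower median L1 distance and RMSE are better. The toy example also shows why the median is reported alongside RMSE: a single badly imputed value leaves the median at zero but dominates the RMSE.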
Affiliation(s)
- Xiang Feng
- College of Information Science Technology, Hainan Normal University, Haikou, Hainan 571158, China
| | - Hongqi Zhang
- College of Information Science Technology, Hainan Normal University, Haikou, Hainan 571158, China
| | - Hao Lin
- School of Mathematics and Statistics, Hainan Normal University, Haikou, Hainan 571158, China; Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China.
| | - Haixia Long
- College of Information Science Technology, Hainan Normal University, Haikou, Hainan 571158, China.
| |
|
37
|
Yu X, Xu X, Zhang J, Li X. Batch alignment of single-cell transcriptomics data using deep metric learning. Nat Commun 2023; 14:960. [PMID: 36810607 PMCID: PMC9944958 DOI: 10.1038/s41467-023-36635-5] [Citation(s) in RCA: 13] [Impact Index Per Article: 6.5] [Received: 09/09/2022] [Accepted: 02/10/2023] [Indexed: 02/24/2023]
Abstract
scRNA-seq has uncovered previously unappreciated levels of heterogeneity. With the increasing scale of scRNA-seq studies, the major challenge is correcting batch effect and accurately detecting the number of cell types, which is inevitable in human studies. The majority of scRNA-seq algorithms have been specifically designed to first remove batch effect and then conduct clustering, which may miss some rare cell types. Here we develop scDML, a deep metric learning model to remove batch effect in scRNA-seq data, guided by the initial clusters and the nearest neighbor information within and between batches. Comprehensive evaluations spanning different species and tissues demonstrated that scDML can remove batch effect, improve clustering performance, accurately recover true cell types and consistently outperform popular methods such as Seurat 3, scVI, Scanorama, BBKNN and Harmony. Most importantly, scDML preserves subtle cell types in raw data and enables discovery of new cell subtypes that are hard to extract by analyzing each batch individually. We also show that scDML is scalable to large datasets with lower peak memory usage, and we believe that scDML offers a valuable tool to study complex cellular heterogeneity.
Affiliation(s)
- Xiaokang Yu
- Center for Applied Statistics, School of Statistics, Renmin University of China, 100872, Beijing, China
| | - Xinyi Xu
- School of Statistics and Mathematics, Central University of Finance and Economics, 100081, Beijing, China
| | - Jingxiao Zhang
- Center for Applied Statistics, School of Statistics, Renmin University of China, 100872, Beijing, China.
| | - Xiangjie Li
- Changping Laboratory, 102206, Beijing, China.
| |
|
38
|
Wang Y, Yu Z, Li S, Bian C, Liang Y, Wong KC, Li X. scBGEDA: deep single-cell clustering analysis via a dual denoising autoencoder with bipartite graph ensemble clustering. Bioinformatics 2023; 39:7025496. [PMID: 36734596 PMCID: PMC9925104 DOI: 10.1093/bioinformatics/btad075] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 06/07/2022] [Revised: 12/08/2022] [Accepted: 02/02/2023] [Indexed: 02/04/2023]
Abstract
MOTIVATION Single-cell RNA sequencing (scRNA-seq) is an increasingly popular technique for transcriptomic analysis of gene expression at the single-cell level. Cell-type clustering is the first crucial task in the analysis of scRNA-seq data that facilitates accurate identification of cell types and the study of the characteristics of their transcripts. Recently, several computational models based on a deep autoencoder and the ensemble clustering have been developed to analyze scRNA-seq data. However, current deep autoencoders are not sufficient to learn the latent representations of scRNA-seq data, and obtaining consensus partitions from these feature representations remains under-explored. RESULTS To address this challenge, we propose a single-cell deep clustering model via a dual denoising autoencoder with bipartite graph ensemble clustering called scBGEDA, to identify specific cell populations in single-cell transcriptome profiles. First, a single-cell dual denoising autoencoder network is proposed to project the data into a compressed low-dimensional space and that can learn feature representation via explicit modeling of synergistic optimization of the zero-inflated negative binomial reconstruction loss and denoising reconstruction loss. Then, a bipartite graph ensemble clustering algorithm is designed to exploit the relationships between cells and the learned latent embedded space by means of a graph-based consensus function. Multiple comparison experiments were conducted on 20 scRNA-seq datasets from different sequencing platforms using a variety of clustering metrics. The experimental results indicated that scBGEDA outperforms other state-of-the-art methods on these datasets, and also demonstrated its scalability to large-scale scRNA-seq datasets. 
Moreover, scBGEDA was able to identify cell-type specific marker genes and provide functional genomic analysis by quantifying the influence of genes on cell clusters, bringing new insights into identifying cell types and characterizing the scRNA-seq data from different perspectives. AVAILABILITY AND IMPLEMENTATION The source code of scBGEDA is available at https://github.com/wangyh082/scBGEDA. The software and the supporting data can be downloaded from https://figshare.com/articles/software/scBGEDA/19657911. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
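The zero-inflated negative binomial (ZINB) reconstruction loss mentioned in this abstract can be sketched for a single count as follows. This is the textbook ZINB negative log-likelihood with mean `mu`, dispersion `theta`, and dropout probability `pi`; the parameterization and function names are assumptions, and the scBGEDA implementation may differ (for example, in how it combines this term with the denoising loss).

```python
import math


def nb_log_pmf(x, mu, theta):
    """Log-probability of count x under a negative binomial (mean mu, dispersion theta)."""
    return (
        math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1)
        + theta * math.log(theta / (theta + mu))
        + x * math.log(mu / (theta + mu))
    )


def zinb_nll(x, mu, theta, pi):
    """Negative log-likelihood of count x under a zero-inflated negative binomial.

    With probability pi the count is a structural zero (dropout);
    otherwise it is drawn from the negative binomial.
    """
    nb = math.exp(nb_log_pmf(x, mu, theta))
    if x == 0:
        likelihood = pi + (1.0 - pi) * nb
    else:
        likelihood = (1.0 - pi) * nb
    return -math.log(likelihood)


# Hypothetical parameters for one gene in one cell.
loss_zero = zinb_nll(0, mu=2.0, theta=1.5, pi=0.3)
loss_three = zinb_nll(3, mu=2.0, theta=1.5, pi=0.3)
```

In an autoencoder setting, `mu`, `theta`, and `pi` would be decoder outputs, and the loss would be averaged over all genes and cells.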
Affiliation(s)
- Yunhe Wang
- School of Artificial Intelligence, Hebei University of Technology, Tianjin, China
| | - Zhuohan Yu
- School of Artificial Intelligence, Jilin University, Jilin, China
| | - Shaochuan Li
- School of Artificial Intelligence, Jilin University, Jilin, China
| | - Chuang Bian
- School of Artificial Intelligence, Jilin University, Jilin, China
| | - Yanchun Liang
- Zhuhai Laboratory of Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, Zhuhai College of Science and Technology, Zhuhai, China
| | - Ka-Chun Wong
- Department of Computer Science, City University of Hong Kong, Kowloon Tong, Hong Kong SAR
| | - Xiangtao Li
- School of Artificial Intelligence, Jilin University, Jilin, China
| |
|
39
|
Yu Z, Su Y, Lu Y, Yang Y, Wang F, Zhang S, Chang Y, Wong KC, Li X. Topological identification and interpretation for single-cell gene regulation elucidation across multiple platforms using scMGCA. Nat Commun 2023; 14:400. [PMID: 36697410 PMCID: PMC9877026 DOI: 10.1038/s41467-023-36134-7] [Citation(s) in RCA: 18] [Impact Index Per Article: 9.0] [Received: 04/21/2022] [Accepted: 01/16/2023] [Indexed: 01/26/2023]
Abstract
Single-cell RNA sequencing provides high-throughput gene expression information to explore cellular heterogeneity at the individual cell level. A major challenge in characterizing high-throughput gene expression data arises from its high dimensionality and the prevalence of dropout events. To address these concerns, we develop a deep graph learning method, scMGCA, for single-cell data analysis. scMGCA is based on a graph-embedding autoencoder that simultaneously learns cell-cell topology representation and cluster assignments. We show that scMGCA is accurate and effective for cell segregation and batch effect correction, outperforming other state-of-the-art models across multiple platforms. In addition, we perform genomic interpretation on the key compressed transcriptomic space of the graph-embedding autoencoder to demonstrate the underlying gene regulation mechanism. We demonstrate that in a pancreatic ductal adenocarcinoma dataset, scMGCA successfully provides annotations on the specific cell types and reveals differential gene expression levels across multiple tumor-associated and cell signalling pathways.
Affiliation(s)
- Zhuohan Yu
- School of Artificial Intelligence, Jilin University, Jilin, China
| | - Yanchi Su
- School of Artificial Intelligence, Jilin University, Jilin, China
| | - Yifu Lu
- School of Artificial Intelligence, Jilin University, Jilin, China
| | - Yuning Yang
- Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, ON, Canada
| | - Fuzhou Wang
- Department of Computer Science, City University of Hong Kong, Hong Kong SAR, China
| | - Shixiong Zhang
- Department of Computer Science, City University of Hong Kong, Hong Kong SAR, China
| | - Yi Chang
- School of Artificial Intelligence, Jilin University, Jilin, China
| | - Ka-Chun Wong
- Department of Computer Science, City University of Hong Kong, Hong Kong SAR, China.
| | - Xiangtao Li
- School of Artificial Intelligence, Jilin University, Jilin, China.
| |
|
40
|
Wang J, Xia J, Wang H, Su Y, Zheng CH. scDCCA: deep contrastive clustering for single-cell RNA-seq data based on auto-encoder network. Brief Bioinform 2023; 24:6984787. [PMID: 36631401 DOI: 10.1093/bib/bbac625] [Citation(s) in RCA: 21] [Impact Index Per Article: 10.5] [Received: 10/29/2022] [Revised: 12/12/2022] [Accepted: 12/19/2022] [Indexed: 01/13/2023]
Abstract
The advances in single-cell ribonucleic acid sequencing (scRNA-seq) allow researchers to explore cellular heterogeneity and human diseases at cell resolution. Cell clustering is a prerequisite in scRNA-seq analysis since it can recognize cell identities. However, the high dimensionality, noise and significant sparsity of scRNA-seq data make it a major challenge. Although many methods have emerged, they still fail to fully explore the intrinsic properties of cells and the relationship among cells, which seriously affects the downstream clustering performance. Here, we propose a new deep contrastive clustering algorithm called scDCCA. It integrates a denoising auto-encoder and a dual contrastive learning module into a deep clustering framework to extract valuable features and realize cell clustering. Specifically, to better characterize and learn data representations robustly, scDCCA utilizes a denoising Zero-Inflated Negative Binomial model-based auto-encoder to extract low-dimensional features. Meanwhile, scDCCA incorporates a dual contrastive learning module to capture the pairwise proximity of cells. By increasing the similarities between positive pairs and the differences between negative ones, the contrasts at both the instance and the cluster level help the model learn more discriminative features and achieve better cell segregation. Furthermore, scDCCA joins feature learning with clustering, which realizes representation learning and cell clustering in an end-to-end manner. Experimental results on 14 real datasets validate that scDCCA outperforms eight state-of-the-art methods in terms of accuracy, generalizability, scalability and efficiency. Cell visualization and biological analysis demonstrate that scDCCA significantly improves clustering and facilitates downstream analysis for scRNA-seq data. The code is available at https://github.com/WJ319/scDCCA.
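The idea of increasing similarity between positive pairs while pushing negative pairs apart can be sketched with a generic instance-level contrastive loss of the InfoNCE type. This is an assumed, commonly used form and not necessarily the loss scDCCA uses; the function names, the temperature value, and the two-dimensional toy embeddings are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.5):
    """Contrastive loss for one anchor: low when the anchor is closer to its
    positive (e.g. an augmented view of the same cell) than to the negatives."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


# Hypothetical 2-D embeddings of cells.
anchor = [1.0, 0.0]
near = [0.9, 0.1]
far = [0.0, 1.0]
opposite = [-1.0, 0.0]

loss_good = info_nce(anchor, positive=near, negatives=[far, opposite])
loss_bad = info_nce(anchor, positive=far, negatives=[near, opposite])
```

Here `loss_good` is smaller than `loss_bad` because the positive pair in the first call is already well aligned; a cluster-level variant would apply the same form to columns of a soft assignment matrix rather than to cell embeddings.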
Affiliation(s)
- Jing Wang
- Anhui Provincial Key Laboratory of Multimodal Cognitive Computation, School of Computer Science and Technology, Anhui University, Hefei, China
| | - Junfeng Xia
- Institutes of Physical Science and Information Technology, Anhui University, Hefei, China
| | - Haiyun Wang
- School of Mathematics and Systems Science, Xinjiang University, Urumqi, China
| | - Yansen Su
- School of Artificial Intelligence, Anhui University, Hefei, China
| | - Chun-Hou Zheng
- School of Artificial Intelligence, Anhui University, Hefei, China
| |
|
41
|
Lin X, Tian T, Wei Z, Hakonarson H. Clustering of single-cell multi-omics data with a multimodal deep learning method. Nat Commun 2022; 13:7705. [PMID: 36513636 PMCID: PMC9748135 DOI: 10.1038/s41467-022-35031-9] [Citation(s) in RCA: 30] [Impact Index Per Article: 10.0] [Received: 06/28/2021] [Accepted: 11/16/2022] [Indexed: 12/15/2022]
Abstract
Single-cell multimodal sequencing technologies are developed to simultaneously profile different modalities of data in the same cell. It provides a unique opportunity to jointly analyze multimodal data at the single-cell level for the identification of distinct cell types. A correct clustering result is essential for the downstream complex biological functional studies. However, combining different data sources for clustering analysis of single-cell multimodal data remains a statistical and computational challenge. Here, we develop a novel multimodal deep learning method, scMDC, for single-cell multi-omics data clustering analysis. scMDC is an end-to-end deep model that explicitly characterizes different data sources and jointly learns latent features of deep embedding for clustering analysis. Extensive simulation and real-data experiments reveal that scMDC outperforms existing single-cell single-modal and multimodal clustering methods on different single-cell multimodal datasets. The linear scalability of running time makes scMDC a promising method for analyzing large multimodal datasets.
Affiliation(s)
- Xiang Lin
- Department of Computer Science, New Jersey Institute of Technology, Newark, NJ, USA
| | - Tian Tian
- Center of Applied Genomics, Children's Hospital of Philadelphia, Philadelphia, PA, USA
| | - Zhi Wei
- Department of Computer Science, New Jersey Institute of Technology, Newark, NJ, USA.
| | - Hakon Hakonarson
- Center of Applied Genomics, Children's Hospital of Philadelphia, Philadelphia, PA, USA
- Division of Human Genetics, Department of Pediatrics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
| |
|
42
|
Feng X, Fang F, Long H, Zeng R, Yao Y. Single-cell RNA-seq data analysis using graph autoencoders and graph attention networks. Front Genet 2022; 13:1003711. [PMID: 36568390 PMCID: PMC9780469 DOI: 10.3389/fgene.2022.1003711] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/26/2022] [Accepted: 11/21/2022] [Indexed: 12/13/2022]
Abstract
With the development of high-throughput sequencing technology, the scale of single-cell RNA sequencing (scRNA-seq) data has surged. Its data are typically high-dimensional, with high dropout noise and high sparsity. Therefore, gene imputation and cell clustering analysis of scRNA-seq data is increasingly important. Statistical or traditional machine learning methods are inefficient, and improved accuracy is needed. The methods based on deep learning cannot directly process non-Euclidean spatial data, such as cell diagrams. In this study, we developed scGAEGAT, a multi-modal model with graph autoencoders and graph attention networks for scRNA-seq analysis based on graph neural networks. Cosine similarity, median L1 distance, and root-mean-squared error were used to measure the gene imputation performance of different methods for comparison with scGAEGAT. Furthermore, adjusted mutual information, normalized mutual information, completeness score, and Silhouette coefficient score were used to measure the cell clustering performance of different methods for comparison with scGAEGAT. Experimental results demonstrated promising performance of the scGAEGAT model in gene imputation and cell clustering prediction on four scRNA-seq data sets with gold-standard cell labels.
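Two of the clustering metrics listed in this abstract, normalized mutual information (NMI) and completeness, can be computed from label counts alone, as in the sketch below. It assumes the arithmetic-mean normalization for NMI; the abstract does not state which normalization the authors used, and the label vectors are toy data. Adjusted mutual information and the Silhouette coefficient are omitted for brevity.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information between two label assignments of the same cells."""
    n = len(a)
    count_a, count_b, joint = Counter(a), Counter(b), Counter(zip(a, b))
    return sum(
        (nij / n) * math.log(n * nij / (count_a[i] * count_b[j]))
        for (i, j), nij in joint.items()
    )


def nmi(truth, pred):
    """Mutual information divided by the mean of the two entropies."""
    h_t, h_p = entropy(truth), entropy(pred)
    if h_t == 0 and h_p == 0:
        return 1.0
    return mutual_information(truth, pred) / ((h_t + h_p) / 2)


def completeness(truth, pred):
    """1 - H(pred | truth) / H(pred): is each true type kept within one cluster?"""
    h_p = entropy(pred)
    if h_p == 0:
        return 1.0
    return mutual_information(truth, pred) / h_p


# Hypothetical gold-standard labels and two candidate clusterings.
truth = [0, 0, 1, 1, 2, 2]
perfect = ["a", "a", "b", "b", "c", "c"]
merged = [0, 0, 0, 0, 1, 1]  # two true cell types merged into one cluster

nmi_perfect = nmi(truth, perfect)
nmi_merged = nmi(truth, merged)
completeness_merged = completeness(truth, merged)
```

The merged clustering is still fully complete, because no true cell type is split, but its NMI drops below 1. This is one reason several metrics are usually reported together.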
Affiliation(s)
- Xiang Feng
- College of Information Science Technology, Hainan Normal University, Haikou, Hainan, China
| | - Fang Fang
- College of Information Engineering, Hainan Vocational University of Science and Technology, Haikou, Hainan, China
| | - Haixia Long
- College of Information Science Technology, Hainan Normal University, Haikou, Hainan, China
| | - Rao Zeng
- College of Information Science Technology, Hainan Normal University, Haikou, Hainan, China
| | - Yuhua Yao
- College of Mathematics and Statistics, Hainan Normal University, Haikou, Hainan, China
| |
|
43
|
Shan Y, Yang J, Li X, Zhong X, Chang Y. GLAE: A Graph-learnable Auto-encoder for Single-cell RNA-seq Analysis. Inf Sci (N Y) 2022. [DOI: 10.1016/j.ins.2022.11.049] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/18/2022]
|
44
|
Brendel M, Su C, Bai Z, Zhang H, Elemento O, Wang F. Application of Deep Learning on Single-cell RNA Sequencing Data Analysis: A Review. Genomics Proteomics Bioinformatics 2022; 20:814-835. [PMID: 36528240 PMCID: PMC10025684 DOI: 10.1016/j.gpb.2022.11.011] [Citation(s) in RCA: 33] [Impact Index Per Article: 11.0] [Received: 01/23/2022] [Revised: 08/17/2022] [Accepted: 11/24/2022] [Indexed: 12/23/2022]
Abstract
Single-cell RNA sequencing (scRNA-seq) has become a routinely used technique to quantify the gene expression profile of thousands of single cells simultaneously. Analysis of scRNA-seq data plays an important role in the study of cell states and phenotypes, and has helped elucidate biological processes, such as those occurring during the development of complex organisms, and improved our understanding of disease states, such as cancer, diabetes, and coronavirus disease 2019 (COVID-19). Deep learning, a recent advance of artificial intelligence that has been used to address many problems involving large datasets, has also emerged as a promising tool for scRNA-seq data analysis, as it has a capacity to extract informative and compact features from noisy, heterogeneous, and high-dimensional scRNA-seq data to improve downstream analysis. The present review aims at surveying recently developed deep learning techniques in scRNA-seq data analysis, identifying key steps within the scRNA-seq data analysis pipeline that have been advanced by deep learning, and explaining the benefits of deep learning over more conventional analytic tools. Finally, we summarize the challenges in current deep learning approaches faced within scRNA-seq data and discuss potential directions for improvements in deep learning algorithms for scRNA-seq data analysis.
Affiliation(s)
- Matthew Brendel
- Department of Population Health Sciences, Weill Cornell Medicine, Cornell University, New York, NY 10065, USA; Institute for Computational Biomedicine, Caryl and Israel Englander Institute for Precision Medicine, Department of Physiology and Biophysics, Weill Cornell Medicine, Cornell University, New York, NY 10065, USA
| | - Chang Su
- Department of Health Service Administration and Policy, Temple University, Philadelphia, PA 19122, USA.
| | - Zilong Bai
- Department of Population Health Sciences, Weill Cornell Medicine, Cornell University, New York, NY 10065, USA
| | - Hao Zhang
- Department of Population Health Sciences, Weill Cornell Medicine, Cornell University, New York, NY 10065, USA
| | - Olivier Elemento
- Institute for Computational Biomedicine, Caryl and Israel Englander Institute for Precision Medicine, Department of Physiology and Biophysics, Weill Cornell Medicine, Cornell University, New York, NY 10065, USA
| | - Fei Wang
- Department of Population Health Sciences, Weill Cornell Medicine, Cornell University, New York, NY 10065, USA.
| |
|
45
|
Mondal AK, Asnani H, Singla P, Ap P. scRAE: Deterministic Regularized Autoencoders With Flexible Priors for Clustering Single-Cell Gene Expression Data. IEEE/ACM Trans Comput Biol Bioinform 2022; 19:2996-3007. [PMID: 34288873 DOI: 10.1109/tcbb.2021.3098394] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 06/13/2023]
Abstract
Clustering single-cell RNA sequence (scRNA-seq) data poses statistical and computational challenges due to their high-dimensionality and data-sparsity, also known as 'dropout' events. Recently, Regularized Auto-Encoder (RAE) based deep neural network models have achieved remarkable success in learning robust low-dimensional representations. The basic idea in RAEs is to learn a non-linear mapping from the high-dimensional data space to a low-dimensional latent space and vice-versa, simultaneously imposing a distributional prior on the latent space, which brings in a regularization effect. This paper argues that RAEs suffer from the infamous problem of bias-variance trade-off in their naive formulation. While a simple AE without latent regularization results in data over-fitting, a very strong prior leads to under-representation and thus bad clustering. To address the above issues, we propose a modified RAE framework (called the scRAE) for effective clustering of the single-cell RNA sequencing data. scRAE consists of a deterministic AE with a flexibly learnable prior generator network, which is jointly trained with the AE. This allows scRAE to trade off bias and variance in the latent space more effectively. We demonstrate the efficacy of the proposed method through extensive experimentation on several real-world single-cell gene expression datasets. The code for our work is available at https://github.com/arnabkmondal/scRAE.
|
46
|
Li Z, Zhou X. BASS: multi-scale and multi-sample analysis enables accurate cell type clustering and spatial domain detection in spatial transcriptomic studies. Genome Biol 2022; 23:168. [PMID: 35927760 PMCID: PMC9351148 DOI: 10.1186/s13059-022-02734-7] [Citation(s) in RCA: 55] [Impact Index Per Article: 18.3] [Received: 01/31/2022] [Accepted: 07/21/2022] [Indexed: 02/08/2023]
Abstract
Spatial transcriptomic studies are reaching single-cell spatial resolution, with data often collected from multiple tissue sections. Here, we present a computational method, BASS, that enables multi-scale and multi-sample analysis for single-cell resolution spatial transcriptomics. BASS performs cell type clustering at the single-cell scale and spatial domain detection at the tissue regional scale, with the two tasks carried out simultaneously within a Bayesian hierarchical modeling framework. We illustrate the benefits of BASS through comprehensive simulations and applications to three datasets. The substantial power gain brought by BASS allows us to reveal accurate transcriptomic and cellular landscape in both cortex and hypothalamus.
Affiliation(s)
- Zheng Li
- Department of Biostatistics, University of Michigan, Ann Arbor, MI, 48109, USA
- Center for Statistical Genetics, University of Michigan, Ann Arbor, MI, 48109, USA
| | - Xiang Zhou
- Department of Biostatistics, University of Michigan, Ann Arbor, MI, 48109, USA.
- Center for Statistical Genetics, University of Michigan, Ann Arbor, MI, 48109, USA.
| |
|
47
|
Ding Q, Yang W, Luo M, Xu C, Xu Z, Pang F, Cai Y, Anashkina AA, Su X, Chen N, Jiang Q. CBLRR: a cauchy-based bounded constraint low-rank representation method to cluster single-cell RNA-seq data. Brief Bioinform 2022; 23:6649282. [DOI: 10.1093/bib/bbac300] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/31/2022] [Revised: 06/17/2022] [Accepted: 07/02/2022] [Indexed: 11/14/2022]
Abstract
The rapid development of single-cell RNA sequencing (scRNA-seq) technology provides unprecedented opportunities for exploring biological phenomena at the single-cell level. The discovery of cell types is one of the major applications for researchers to explore the heterogeneity of cells. Some computational methods have been proposed to solve the problem of scRNA-seq data clustering. However, the unavoidable technical noise and notorious dropouts also reduce the accuracy of clustering methods. Here, we propose the cauchy-based bounded constraint low-rank representation (CBLRR), which is a low-rank representation-based method by introducing cauchy loss function (CLF) and bounded nuclear norm regulation, aiming to alleviate the above issue. Specifically, as an effective loss function, the CLF is proven to enhance the robustness of the identification of cell types. Then, we adopt the bounded constraint to ensure the entry values of single-cell data within the restricted interval. Finally, the performance of CBLRR is evaluated on 15 scRNA-seq datasets, and compared with other state-of-the-art methods. The experimental results demonstrate that CBLRR performs accurately and robustly on clustering scRNA-seq data. Furthermore, CBLRR is an effective tool to cluster cells, and provides great potential for downstream analysis of single-cell data. The source code of CBLRR is available online at https://github.com/Ginnay/CBLRR.
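Why a Cauchy loss is more robust to noise and dropouts than a squared loss can be seen from a small numerical sketch. The form below, (c²/2)·log(1 + (r/c)²), is one common definition of the Cauchy loss; the scale parameter `c` and the function names are assumptions, and the exact form used in CBLRR may differ.

```python
import math


def cauchy_loss(residual, c=1.0):
    """Cauchy (Lorentzian) loss: grows only logarithmically for large residuals."""
    return 0.5 * c * c * math.log1p((residual / c) ** 2)


def squared_loss(residual):
    """Ordinary least-squares loss for comparison."""
    return 0.5 * residual ** 2


# Small residuals are penalized almost identically by both losses,
# while a large outlier dominates the squared loss but not the Cauchy loss.
small_gap = abs(cauchy_loss(0.01) - squared_loss(0.01))
outlier_ratio = cauchy_loss(100.0) / squared_loss(100.0)
```

Because an extreme residual contributes so little, a single corrupted entry has limited influence on the fitted low-rank representation.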
Affiliation(s)
- Qian Ding
- School of Life Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, China
| | - Wenyi Yang
- School of Life Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, China
| | - Meng Luo
- School of Life Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, China
| | - Chang Xu
- School of Life Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, China
| | - Zhaochun Xu
- School of Life Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, China
| | - Fenglan Pang
- School of Life Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, China
| | - Yideng Cai
- School of Life Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, China
| | - Anastasia A Anashkina
- Engelhardt Institute of Molecular Biology, Russian Academy of Sciences, Moscow, Russia
| | - Xi Su
- Foshan Maternity & Child Healthcare Hospital, Southern Medical University, Foshan, Guangdong, China
| | - Na Chen
- Department of Hematology, Shandong Provincial Hospital Affiliated to Shandong First Medical University, Jinan, Shandong, China
| | - Qinghua Jiang
- School of Life Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, China
| |
|
48
|
Liu Q, Luo X, Li J, Wang G. scESI: evolutionary sparse imputation for single-cell transcriptomes from nearest neighbor cells. Brief Bioinform 2022; 23:6580519. [PMID: 35512331 DOI: 10.1093/bib/bbac144] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 01/24/2022] [Revised: 03/14/2022] [Accepted: 03/31/2022] [Indexed: 02/01/2023]
Abstract
The ubiquitous dropout problem in single-cell RNA sequencing technology causes a large amount of data noise in the gene expression profile. For this reason, we propose an evolutionary sparse imputation (ESI) algorithm for single-cell transcriptomes, which constructs a sparse representation model based on gene regulation relationships between cells. To solve this model, we design an optimization framework based on nondominated sorting genetics. This framework takes into account the topological relationship between cells and the variety of gene expression to iteratively search for the global optimal solution, thereby learning the Pareto optimal cell-cell affinity matrix. Finally, we use the learned sparse relationship model between cells to improve data quality and reduce data noise. In simulated datasets, scESI performed significantly better than benchmark methods on various metrics. By applying scESI to real scRNA-seq datasets, we found that scESI can not only further classify the cell types and separate cells in visualization successfully but also improve the performance in reconstructing differentiation trajectories and identifying differentially expressed genes. In addition, scESI successfully recovered the expression trends of marker genes in stem cell differentiation and can discover new cell types and putative pathways regulating biological processes.
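The notion of a Pareto optimal solution used in this abstract can be illustrated with a minimal nondominated filter over candidate solutions scored on two objectives. This shows only the dominance test at the core of nondominated sorting, not the scESI optimizer; the objective values are hypothetical and both objectives are assumed to be minimized.

```python
def dominates(a, b):
    """True if a is no worse than b on every objective and strictly better on one."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(points):
    """Keep the candidates that no other candidate dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (objective_1, objective_2) scores of five candidate solutions.
candidates = [(1, 5), (2, 2), (5, 1), (3, 3), (4, 4)]
front = pareto_front(candidates)
```

For these toy scores the front is `[(1, 5), (2, 2), (5, 1)]`, since `(3, 3)` and `(4, 4)` are both dominated by `(2, 2)`. A genetic algorithm built on nondominated sorting would rank the remaining candidates into successive fronts and evolve the population from the best-ranked ones.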
Affiliation(s)
- Qiaoming Liu
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
| | - Ximei Luo
- School of Electronic and Communication Engineering, Shenzhen Polytechnic, Shenzhen, China; Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
| | - Jie Li
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
| | - Guohua Wang
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
| |
|
49
|
Wan H, Chen L, Deng M. scNAME: neighborhood contrastive clustering with ancillary mask estimation for scRNA-seq data. Bioinformatics 2022; 38:1575-1583. [PMID: 34999761 DOI: 10.1093/bioinformatics/btac011] [Citation(s) in RCA: 20] [Impact Index Per Article: 6.7] [Received: 10/07/2021] [Revised: 11/28/2021] [Accepted: 01/05/2022] [Indexed: 02/03/2023]
Abstract
MOTIVATION The rapid development of single-cell RNA sequencing (scRNA-seq) makes it possible to study the heterogeneity of individual cell characteristics. Cell clustering is a vital procedure in scRNA-seq analysis, providing insight into complex biological phenomena. However, the noisy, high-dimensional and large-scale nature of scRNA-seq data introduces challenges in clustering analysis. Up to now, many deep learning-based methods have emerged to learn underlying feature representations while clustering. However, these methods are inefficient when it comes to rare cell type identification and barely able to fully utilize gene dependencies or cell similarity integrally. As a result, they cannot detect a clear cell type structure which is required for clustering accuracy as well as downstream analysis. RESULTS Here, we propose a novel scRNA-seq clustering algorithm called scNAME which incorporates a mask estimation task for gene pertinence mining and a neighborhood contrastive learning framework for cell intrinsic structure exploitation. The learned pattern through mask estimation helps reveal uncorrupted data structure and denoise the original single-cell data. In addition, the randomly created augmented data introduced in contrastive learning not only helps improve robustness of clustering, but also increases sample size in each cluster for better data capacity. Beyond this, we also introduce a neighborhood contrastive paradigm with an offline memory bank, global in scope, which can inspire discriminative feature representation and achieve intra-cluster compactness, yet inter-cluster separation. The combination of mask estimation task, neighborhood contrastive learning and global memory bank designed in scNAME is conducive to rare cell type detection. The experimental results of both simulations and real data confirm that our method is accurate, robust and scalable.
We also implement biological analysis, including marker gene identification, gene ontology and pathway enrichment analysis, to validate the biological significance of our method. To the best of our knowledge, we are among the first to introduce a gene relationship exploration strategy, as well as a global cellular similarity repository, in the single-cell field. AVAILABILITY AND IMPLEMENTATION An implementation of scNAME is available from https://github.com/aster-ww/scNAME. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Hui Wan
- School of Mathematical Sciences, Peking University, Beijing 100871, China
- Liang Chen
- School of Mathematical Sciences, Peking University, Beijing 100871, China
- Minghua Deng
- School of Mathematical Sciences, Peking University, Beijing 100871, China; Center for Quantitative Biology, Peking University, Beijing 100871, China; Center for Statistical Science, Peking University, Beijing 100871, China
50
Ciortan M, Defrance M. GNN-based embedding for clustering scRNA-seq data. Bioinformatics 2022; 38:1037-1044. [PMID: 34850828 DOI: 10.1093/bioinformatics/btab787] [Citation(s) in RCA: 29] [Impact Index Per Article: 9.7] [Received: 06/16/2021] [Revised: 10/15/2021] [Accepted: 11/15/2021] [Indexed: 02/03/2023]
Abstract
MOTIVATION: Single-cell RNA sequencing (scRNA-seq) provides transcriptomic profiling for individual cells, allowing researchers to study the heterogeneity of tissues, recognize rare cell identities and discover new cellular subtypes. Clustering analysis is usually used to predict cell class assignments and infer cell identities. However, the high sparsity of scRNA-seq data, accentuated by dropout events, generates challenges that have motivated the development of numerous dedicated clustering methods. Nevertheless, there is still no consensus on the best-performing method.
RESULTS: graph-sc is a new method leveraging a graph autoencoder network to create embeddings for scRNA-seq cell data. While this work analyzes the performance of clustering the embeddings with various clustering algorithms, other downstream tasks can also be performed. A broad experimental study has been performed on both simulated and scRNA-seq datasets. The results indicate that although there is no consistently best method across all the analyzed datasets, graph-sc compares favorably to competing techniques across all types of datasets. Furthermore, the proposed method is stable across consecutive runs, robust to input down-sampling, generally insensitive to changes in the network architecture or training parameters, and more computationally efficient than other competing methods based on neural networks. Modeling the data as a graph provides increased flexibility to define custom features characterizing the genes, the cells and their interactions. Moreover, external data (e.g. a gene network) can easily be integrated into the graph and used seamlessly under the same optimization task.
AVAILABILITY AND IMPLEMENTATION: https://github.com/ciortanmadalina/graph-sc.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Madalina Ciortan
- Interuniversity Institute of Bioinformatics in Brussels, Université Libre de Bruxelles, Brussels, Belgium
- Matthieu Defrance
- Interuniversity Institute of Bioinformatics in Brussels, Université Libre de Bruxelles, Brussels, Belgium