1
Jiang S, Wang C, Sun Q, Zhang Z. A robust multi-scale clustering framework for single-cell RNA-seq data analysis. Sci Rep 2025; 15:18543. [PMID: 40425750 PMCID: PMC12116994 DOI: 10.1038/s41598-025-03603-6] [Received: 01/11/2025] [Accepted: 05/21/2025] [Indexed: 05/29/2025]
Abstract
Recent advancements in single-cell RNA sequencing (scRNA-seq) technology have unlocked novel opportunities for deep exploration of gene expression patterns. However, the inherent high dimensionality, sparsity, and noise in scRNA-seq data pose significant challenges for existing clustering methods, especially in accurately identifying and classifying diverse cell types. To address these challenges, we introduce a new method, single-cell Multi-Scale Clustering Framework (scMSCF), which combines multi-dimensional PCA for dimensionality reduction, K-means clustering, and a weighted ensemble meta-clustering approach, enhanced by a self-attention-driven Transformer model to optimize clustering performance. scMSCF constructs an initial clustering framework using a multi-layer dimensionality reduction strategy to establish a robust consensus on clustering structure. A voting mechanism within the meta-clustering process selects high-confidence cells from the initial clustering results to provide precise training labels for the Transformer model. This approach enables the model to capture complex dependencies in gene expression data, thereby enhancing clustering accuracy. Comprehensive testing across eight single-cell RNA sequencing datasets demonstrates that scMSCF surpasses existing methods, achieving on average 10-15% higher ARI, NMI, and ACC scores. For example, on the PBMC5k dataset, scMSCF improves ARI from 0.72 to 0.86, demonstrating its ability to accurately identify diverse cell populations. The source code for our algorithm is publicly available at https://github.com/DEREKJ24/scMSCF .
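The voting step described in this abstract can be illustrated with a minimal sketch. It is not the scMSCF implementation: the function name `select_high_confidence`, the `min_agreement` threshold, and the assumption that cluster IDs are already aligned across runs are all hypothetical choices made for illustration.

```python
from collections import Counter


def select_high_confidence(label_runs, min_agreement=0.8):
    """Return {cell_index: consensus_label} for cells whose majority label
    is shared by at least `min_agreement` of the clustering runs.

    label_runs: list of label lists, one per clustering run, all of the
    same length. Assumption: cluster IDs are already aligned across runs.
    """
    n_runs = len(label_runs)
    n_cells = len(label_runs[0])
    selected = {}
    for cell in range(n_cells):
        # Count how many runs assigned each label to this cell.
        votes = Counter(run[cell] for run in label_runs)
        label, count = votes.most_common(1)[0]
        if count / n_runs >= min_agreement:
            selected[cell] = label
    return selected


# Three hypothetical clustering runs over five cells.
runs = [
    [0, 0, 1, 1, 2],
    [0, 0, 1, 2, 2],
    [0, 1, 1, 1, 2],
]
confident = select_high_confidence(runs, min_agreement=1.0)
print(confident)
```

Cells kept by a filter of this kind could then serve as pseudo-labelled training examples for a downstream classifier, which is the role the abstract assigns to the Transformer model.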
Affiliation(s)
- Songrun Jiang
  - College of Computer Science and Technology, Changchun Normal University, Changchun, 130000, China
- Chunyan Wang
  - College of Computer Science and Technology, Changchun Normal University, Changchun, 130000, China
- Qiucheng Sun
  - College of Computer Science and Technology, Changchun Normal University, Changchun, 130000, China
- Zhi Zhang
  - College of Computer Science and Technology, Changchun Normal University, Changchun, 130000, China
2
Yuan L, Xu Z, Meng B, Ye L. scAMZI: attention-based deep autoencoder with zero-inflated layer for clustering scRNA-seq data. BMC Genomics 2025; 26:350. [PMID: 40197174 PMCID: PMC11974017 DOI: 10.1186/s12864-025-11511-2] [Received: 09/03/2024] [Accepted: 03/20/2025] [Indexed: 04/10/2025]
Abstract
BACKGROUND: Clustering scRNA-seq data plays a vital role in scRNA-seq data analysis and downstream analyses. Many computational methods have been proposed and have achieved remarkable results. However, these methods have several limitations. First, they do not fully exploit cellular features. Second, they are developed based on gene expression information and lack flexibility in integrating intercellular relationships. Finally, their performance is affected by dropout events. RESULTS: We propose a novel deep learning (DL) model based on an attention autoencoder and a zero-inflated (ZI) layer, namely scAMZI, to cluster scRNA-seq data. scAMZI is mainly composed of SimAM (a Simple, parameter-free Attention Module), an autoencoder, a ZINB (Zero-Inflated Negative Binomial) model, and a ZI layer. Based on the ZINB model, we introduce the autoencoder and SimAM to reduce the dimensionality of the data and learn feature representations of cells and relationships between cells. Meanwhile, the ZI layer is used to handle zero values in the data. We compare the performance of scAMZI with nine methods (three shallow learning algorithms and six state-of-the-art DL-based methods) on fourteen benchmark scRNA-seq datasets of various sizes (from hundreds to tens of thousands of cells) with known cell types. Experimental results demonstrate that scAMZI outperforms competing methods. CONCLUSIONS: scAMZI outperforms competing methods and can facilitate downstream analyses such as cell annotation, marker gene discovery, and cell trajectory inference. The scAMZI package is freely available at https://doi.org/10.5281/zenodo.13131559 .
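For readers unfamiliar with the ZINB model named in this abstract, the sketch below writes out the standard zero-inflated negative binomial probability mass function and its negative log-likelihood. It uses the common mean/dispersion parameterization and is not taken from the scAMZI code; the names `nb_log_pmf`, `zinb_pmf`, and `zinb_nll` are illustrative, and a single shared `mu`, `theta`, and `pi` per call is a simplifying assumption.

```python
import math


def nb_log_pmf(x, mu, theta):
    """Log pmf of a negative binomial with mean mu > 0 and dispersion theta > 0.

    x is a non-negative integer count.
    """
    return (
        math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1)
        + theta * math.log(theta / (theta + mu))
        + x * math.log(mu / (theta + mu))
    )


def zinb_pmf(x, mu, theta, pi):
    """ZINB pmf: with probability pi the count is a structural zero (dropout),
    otherwise it is drawn from the negative binomial."""
    nb = math.exp(nb_log_pmf(x, mu, theta))
    if x == 0:
        return pi + (1.0 - pi) * nb
    return (1.0 - pi) * nb


def zinb_nll(counts, mu, theta, pi):
    """Negative log-likelihood of a list of counts under one shared ZINB."""
    return -sum(math.log(zinb_pmf(x, mu, theta, pi)) for x in counts)


# Hypothetical counts for one gene across six cells.
counts = [0, 0, 3, 1, 0, 7]
print(zinb_nll(counts, mu=2.0, theta=1.5, pi=0.3))
```

In autoencoder-based methods of this family, the decoder typically outputs per-gene, per-cell values of `mu`, `theta`, and `pi`, and a loss of this form replaces the usual reconstruction error.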
Affiliation(s)
- Lin Yuan
  - Key Laboratory of Computing Power Network and Information Security, Ministry of Education, Shandong Computer Science Center, Qilu University of Technology (Shandong Academy of Sciences), 3501 Daxue Road, Jinan, 250353, China
  - Shandong Engineering Research Center of Big Data Applied Technology, Faculty of Computer Science and Technology, Qilu University of Technology (Shandong Academy of Sciences), 3501 Daxue Road, Jinan, 250353, China
  - Shandong Provincial Key Laboratory of Industrial Network and Information System Security, Shandong Fundamental Research Center for Computer Science, 3501 Daxue Road, Jinan, 250353, China
- Zhijie Xu
  - Key Laboratory of Computing Power Network and Information Security, Ministry of Education, Shandong Computer Science Center, Qilu University of Technology (Shandong Academy of Sciences), 3501 Daxue Road, Jinan, 250353, China
  - Shandong Engineering Research Center of Big Data Applied Technology, Faculty of Computer Science and Technology, Qilu University of Technology (Shandong Academy of Sciences), 3501 Daxue Road, Jinan, 250353, China
  - Shandong Provincial Key Laboratory of Industrial Network and Information System Security, Shandong Fundamental Research Center for Computer Science, 3501 Daxue Road, Jinan, 250353, China
- Boyuan Meng
  - Key Laboratory of Computing Power Network and Information Security, Ministry of Education, Shandong Computer Science Center, Qilu University of Technology (Shandong Academy of Sciences), 3501 Daxue Road, Jinan, 250353, China
  - Shandong Engineering Research Center of Big Data Applied Technology, Faculty of Computer Science and Technology, Qilu University of Technology (Shandong Academy of Sciences), 3501 Daxue Road, Jinan, 250353, China
  - Shandong Provincial Key Laboratory of Industrial Network and Information System Security, Shandong Fundamental Research Center for Computer Science, 3501 Daxue Road, Jinan, 250353, China
- Lan Ye
  - Cancer Center, The Second Hospital of Shandong University, 247 Beiyuan Street, Jinan, 250033, China
3
Liang DM, Du PF. scMUG: deep clustering analysis of single-cell RNA-seq data on multiple gene functional modules. Brief Bioinform 2025; 26:bbaf138. [PMID: 40188497 PMCID: PMC11972635 DOI: 10.1093/bib/bbaf138] [Received: 11/08/2024] [Revised: 02/11/2025] [Accepted: 03/09/2025] [Indexed: 04/08/2025]
Abstract
Single-cell RNA sequencing (scRNA-seq) has revolutionized our understanding of cellular heterogeneity by providing gene expression data at the single-cell level. Unlike bulk RNA-seq, scRNA-seq allows identification of different cell types within a given tissue, leading to a more nuanced comprehension of cell functions. However, the analysis of scRNA-seq data presents challenges due to its sparsity and high dimensionality. Since bioinformatics plays an important role in the analysis of big data and its utility for the welfare of living beings, it has been widely applied in analyzing scRNA-seq data. To address these challenges, we introduce the scMUG computational pipeline, which incorporates gene functional module information to enhance scRNA-seq clustering analysis. The pipeline includes data preprocessing, cell representation generation, cell-cell similarity matrix construction, and clustering analysis. The scMUG pipeline also introduces a novel similarity measure that combines local density and global distribution in the latent cell representation space. As far as we can tell, this is the first attempt to integrate gene functional associations into scRNA-seq clustering analysis. We curated nine human scRNA-seq datasets to evaluate our scMUG pipeline. With the help of gene functional information and the novel similarity measure, the clustering results from scMUG pipeline present deep insights into functional relationships between gene expression patterns and cellular heterogeneity. In addition, our scMUG pipeline also presents comparable or better clustering performances than other state-of-the-art methods. All source codes of scMUG have been deposited in a GitHub repository with instructions for reproducing all results (https://github.com/degiminnal/scMUG).
Affiliation(s)
- De-Min Liang
  - College of Intelligence and Computing, Tianjin University, Tianjin 300350, China
- Pu-Feng Du
  - College of Intelligence and Computing, Tianjin University, Tianjin 300350, China
4
Cui X, Wu R, Liu Y, Chen P, Chang Q, Liang P, He C. scSMD: a deep learning method for accurate clustering of single cells based on auto-encoder. BMC Bioinformatics 2025; 26:33. [PMID: 39881248 PMCID: PMC11780796 DOI: 10.1186/s12859-025-06047-x] [Received: 04/12/2024] [Accepted: 01/13/2025] [Indexed: 01/31/2025]
Abstract
BACKGROUND: Single-cell RNA sequencing (scRNA-seq) has transformed biological research by offering new insights into cellular heterogeneity, developmental processes, and disease mechanisms. As scRNA-seq technology advances, its role in modern biology has become increasingly vital. This study explores the application of deep learning to single-cell data clustering, with a particular focus on managing sparse, high-dimensional data. RESULTS: We propose the SMD deep learning model, which integrates nonlinear dimensionality reduction techniques with a porous dilated attention gate component. Built upon a convolutional autoencoder and informed by the negative binomial distribution, the SMD model efficiently captures essential cell clustering features and dynamically adjusts feature weights. Comprehensive evaluation on both public datasets and proprietary osteosarcoma data highlights the SMD model's efficacy in achieving precise classifications for single-cell data clustering, showcasing its potential for advanced transcriptomic analysis. CONCLUSION: This study underscores the potential of deep learning, specifically the SMD model, in advancing single-cell RNA sequencing data analysis. By integrating innovative computational techniques, the SMD model provides a powerful framework for unraveling cellular complexities, enhancing our understanding of biological processes, and elucidating disease mechanisms. The code is available from https://github.com/xiaoxuc/scSMD .
Affiliation(s)
- Xiaoxu Cui
  - School of Health Science and Engineering, University of Shanghai for Science and Technology, Shanghai, China
  - Shanghai University of Medicine & Health Sciences, Shanghai, China
  - Department of Surgery, Shanghai Key Laboratory of Gastric Neoplasms, Shanghai Institute of Digestive Surgery, Ruijin Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China
- Renkai Wu
  - School of Microelectronics, Shanghai University, Shanghai, China
  - Department of Surgery, Shanghai Key Laboratory of Gastric Neoplasms, Shanghai Institute of Digestive Surgery, Ruijin Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China
- Yinghao Liu
  - School of Health Science and Engineering, University of Shanghai for Science and Technology, Shanghai, China
  - Shanghai University of Medicine & Health Sciences, Shanghai, China
  - Department of Surgery, Shanghai Key Laboratory of Gastric Neoplasms, Shanghai Institute of Digestive Surgery, Ruijin Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China
- Peizhan Chen
  - Clinical Research Center, Ruijin Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China
- Qing Chang
  - Department of Surgery, Shanghai Key Laboratory of Gastric Neoplasms, Shanghai Institute of Digestive Surgery, Ruijin Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China
- Pengchen Liang
  - School of Microelectronics, Shanghai University, Shanghai, China
- Changyu He
  - Department of Surgery, Shanghai Key Laboratory of Gastric Neoplasms, Shanghai Institute of Digestive Surgery, Ruijin Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China
5
Wang Y, Li K, Zhang R, Fan Y, Huang L, Zhou F. GraCEImpute: A novel graph clustering autoencoder approach for imputation of single-cell RNA-seq data. Comput Biol Med 2025; 184:109400. [PMID: 39561511 DOI: 10.1016/j.compbiomed.2024.109400] [Received: 04/17/2024] [Revised: 10/14/2024] [Accepted: 11/07/2024] [Indexed: 11/21/2024]
Abstract
Single-cell RNA sequencing (scRNA-seq) technology establishes a unique view for elucidating cellular heterogeneity in various biological systems. Yet the scRNA-seq data is compromised by a high dropout rate due to the technological limitation, and the substantial data loss poses computational challenges on subsequent analyses. This study introduces a novel graph clustering autoencoder (GCAE)-based imputation approach (GraCEImpute) to address the challenge of missing data in scRNA-seq data. Our comprehensive evaluation demonstrates that the GraCEImpute model outperforms existing approaches in accurately imputing dropout zeros within scRNA-seq data. The proposed GraCEImpute model also demonstrates the significantly enhanced quality of downstream scRNA-seq data analyses, including clustering, differential gene expression (DEG) analysis, and cell trajectory inference. These improvements underscore the GraCEImpute model's potential to facilitate a deeper understanding of cellular processes and heterogeneity through the scRNA-seq data analyses. The source code is released at https://www.healthinformaticslab.org/supp/.
Affiliation(s)
- Yueying Wang
  - College of Computer Science and Technology, Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, Jilin, 130012, China
- Kewei Li
  - College of Computer Science and Technology, Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, Jilin, 130012, China
- Ruochi Zhang
  - School of Artificial Intelligence, Jilin University, Changchun, 130012, China
- Yusi Fan
  - College of Computer Science and Technology, Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, Jilin, 130012, China
- Lan Huang
  - College of Computer Science and Technology, Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, Jilin, 130012, China
- Fengfeng Zhou
  - College of Computer Science and Technology, Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, Jilin, 130012, China
  - School of Biology and Engineering, Guizhou Medical University, Guiyang, 550025, Guizhou, China
6
Wang J, Qiao TJ, Zheng CH, Liu JX, Shang JL. A New Graph Autoencoder-Based Multi-Level Kernel Subspace Fusion Framework for Single-Cell Type Identification. IEEE/ACM Trans Comput Biol Bioinform 2024; 21:2292-2303. [PMID: 39264790 DOI: 10.1109/tcbb.2024.3459960] [Indexed: 09/14/2024]
Abstract
The advent of single-cell RNA sequencing (scRNA-seq) technology offers the opportunity to conduct biological research at the cellular level. Single-cell type identification based on unsupervised clustering is one of the fundamental tasks of scRNA-seq data analysis. Although many single-cell clustering methods have been developed recently, few can fully exploit the deep potential relationships between cells, resulting in suboptimal clustering. In this paper, we propose scGAMF, a graph autoencoder-based multi-level kernel subspace fusion framework for scRNA-seq data analysis. Based on multiple top feature sets, scGAMF unifies deep feature embedding and kernel space analysis into a single framework to learn an accurate clustering affinity matrix. First, we construct multiple top feature sets to avoid the high variability caused by single feature set learning. Second, scGAMF uses graph autoencoders (GAEs) to extract deep information embedded in the data and learn embeddings that include gene expression patterns and cell-cell relationships. Third, to fully explore the deep potential relationships between cells, we design a multi-level kernel space fusion strategy. This strategy uses a kernel expression model with adaptive similarity preservation to learn a self-expression matrix shared by all embedding spaces of a given feature set, and a consensus affinity matrix across multiple top feature sets. Finally, the consensus affinity matrix is used for spectral clustering, visualization, and identification of gene markers. Extensive validation on real datasets shows that scGAMF achieves higher clustering accuracy than many popular single-cell analysis methods.
7
Zhang Z, Liu Y, Xiao M, Wang K, Huang Y, Bian J, Yang R, Li F. Graph contrastive learning as a versatile foundation for advanced scRNA-seq data analysis. Brief Bioinform 2024; 25:bbae558. [PMID: 39487083 PMCID: PMC11530284 DOI: 10.1093/bib/bbae558] [Received: 04/28/2024] [Revised: 09/24/2024] [Accepted: 10/16/2024] [Indexed: 11/04/2024]
Abstract
Single-cell RNA sequencing (scRNA-seq) offers unprecedented insights into transcriptome-wide gene expression at the single-cell level. Cell clustering has been long established in the analysis of scRNA-seq data to identify the groups of cells with similar expression profiles. However, cell clustering is technically challenging, as raw scRNA-seq data have various analytical issues, including high dimensionality and dropout values. Existing research has developed deep learning models, such as graph machine learning models and contrastive learning-based models, for cell clustering using scRNA-seq data and has summarized the unsupervised learning of cell clustering into a human-interpretable format. While advances in cell clustering have been profound, we are no closer to finding a simple yet effective framework for learning high-quality representations necessary for robust clustering. In this study, we propose scSimGCL, a novel framework based on the graph contrastive learning paradigm for self-supervised pretraining of graph neural networks. This framework facilitates the generation of high-quality representations crucial for cell clustering. Our scSimGCL incorporates cell-cell graph structure and contrastive learning to enhance the performance of cell clustering. Extensive experimental results on simulated and real scRNA-seq datasets suggest the superiority of the proposed scSimGCL. Moreover, clustering assignment analysis confirms the general applicability of scSimGCL, including state-of-the-art clustering algorithms. Further, ablation study and hyperparameter analysis suggest the efficacy of our network architecture with the robustness of decisions in the self-supervised learning setting. The proposed scSimGCL can serve as a robust framework for practitioners developing tools for cell clustering. The source code of scSimGCL is publicly available at https://github.com/zhangzh1328/scSimGCL.
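The contrastive learning idea mentioned in this abstract can be sketched with a generic InfoNCE-style loss: each cell's embedding in one augmented view should be closer to its own embedding in the other view than to any other cell's. This is a textbook formulation, not the scSimGCL objective; the function names, the cosine similarity choice, and the `temperature` value are assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(view_a, view_b, temperature=0.5):
    """Mean InfoNCE loss; row i of view_b is the positive for row i of view_a,
    and all other rows of view_b act as negatives."""
    losses = []
    for i, anchor in enumerate(view_a):
        logits = [cosine(anchor, other) / temperature for other in view_b]
        # Numerically stable log-sum-exp over all candidates.
        max_logit = max(logits)
        log_denominator = max_logit + math.log(
            sum(math.exp(value - max_logit) for value in logits)
        )
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Two toy cell embeddings; the second call pairs each cell with the wrong partner.
cells = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce(cells, cells)
swapped = info_nce(cells, cells[::-1])
print(aligned, swapped)
```

The loss is small when matching views agree and large when they are mismatched, which is the signal a graph encoder would be trained on during self-supervised pretraining.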
Affiliation(s)
- Zhenhao Zhang
  - College of Life Sciences, Northwest A&F University, Yangling, 712100 Shaanxi, China
  - College of Information Engineering, Northwest A&F University, Yangling, 712100 Shaanxi, China
- Yuxi Liu
  - College of Medicine, University of Florida, Gainesville, FL 32610, USA
- Meichen Xiao
  - College of Life Sciences, Northwest A&F University, Yangling, 712100 Shaanxi, China
- Kun Wang
  - College of Life Sciences, Northwest A&F University, Yangling, 712100 Shaanxi, China
- Yu Huang
  - College of Medicine, University of Florida, Gainesville, FL 32610, USA
- Jiang Bian
  - College of Medicine, University of Florida, Gainesville, FL 32610, USA
- Ruolin Yang
  - College of Life Sciences, Northwest A&F University, Yangling, 712100 Shaanxi, China
- Fuyi Li
  - College of Information Engineering, Northwest A&F University, Yangling, 712100 Shaanxi, China
8
Liu T, Jia C, Bi Y, Guo X, Zou Q, Li F. scDFN: enhancing single-cell RNA-seq clustering with deep fusion networks. Brief Bioinform 2024; 25:bbae486. [PMID: 39373051 PMCID: PMC11456827 DOI: 10.1093/bib/bbae486] [Received: 05/29/2024] [Revised: 08/07/2024] [Accepted: 09/17/2024] [Indexed: 10/08/2024]
Abstract
Single-cell ribonucleic acid sequencing (scRNA-seq) technology can be used to perform high-resolution analysis of the transcriptomes of individual cells. Therefore, its application has gained popularity for accurately analyzing the ever-increasing content of heterogeneous single-cell datasets. Central to interpreting scRNA-seq data is the clustering of cells to decipher transcriptomic diversity and infer cell behavior patterns. However, its complexity necessitates the application of advanced methodologies capable of resolving the inherent heterogeneity and limited gene expression characteristics of single-cell data. Herein, we introduce a novel deep learning-based algorithm for single-cell clustering, designated scDFN, which can significantly enhance the clustering of scRNA-seq data through a fusion network strategy. The scDFN algorithm applies a dual mechanism involving an autoencoder to extract attribute information and an improved graph autoencoder to capture topological nuances, integrated via a cross-network information fusion mechanism complemented by a triple self-supervision strategy. This fusion is optimized through a holistic consideration of four distinct loss functions. A comparative analysis with five leading scRNA-seq clustering methodologies across multiple datasets revealed the superiority of scDFN, as determined by better Normalized Mutual Information (NMI) and Adjusted Rand Index (ARI) scores. Additionally, scDFN demonstrated robust multi-cluster dataset performance and exceptional resilience to batch effects. Ablation studies highlighted the key roles of the autoencoder and the improved graph autoencoder components, along with the critical contribution of the four joint loss functions to the overall efficacy of the algorithm. Through these advancements, scDFN set a new benchmark in single-cell clustering and can be used as an effective tool for the nuanced analysis of single-cell transcriptomics.
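Several entries in this list, including this one, report the Adjusted Rand Index (ARI). As a reference point, the sketch below computes ARI from its standard contingency-table definition using only the Python standard library. The function name `adjusted_rand_index` is illustrative, the input is assumed to contain at least two cells, and the code is not taken from any of the cited tools.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """ARI between two labelings of the same cells (requires at least 2 cells)."""
    n = len(labels_true)
    # Contingency counts n_ij plus the row and column marginals.
    pair_counts = Counter(zip(labels_true, labels_pred))
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)

    sum_pairs = sum(comb(c, 2) for c in pair_counts.values())
    sum_true = sum(comb(c, 2) for c in true_counts.values())
    sum_pred = sum(comb(c, 2) for c in pred_counts.values())

    expected = sum_true * sum_pred / comb(n, 2)
    max_index = 0.5 * (sum_true + sum_pred)
    if max_index == expected:
        # Degenerate case, e.g. both labelings put all cells in one cluster.
        return 1.0
    return (sum_pairs - expected) / (max_index - expected)


# Same partition with renamed cluster IDs, then a partition that cuts across it.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))
```

Because ARI is invariant to how cluster IDs are named and is corrected for chance agreement, it is a common choice for comparing predicted clusters against annotated cell types.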
Affiliation(s)
- Tianxiang Liu
  - School of Science, Dalian Maritime University, 1 Linghai Road, Dalian 116026, China
- Cangzhi Jia
  - School of Science, Dalian Maritime University, 1 Linghai Road, Dalian 116026, China
- Yue Bi
  - Department of Biochemistry and Molecular Biology, Biomedicine Discovery Institute, Monash University, Melbourne, VIC 3800, Australia
- Xudong Guo
  - College of Information Engineering, Northwest A&F University, No. 3 Taicheng Road, Yangling, Shaanxi, China
- Quan Zou
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, No. 2006, Xiyuan Ave, West Hi-Tech Zone, 611731, Chengdu, Sichuan, China
- Fuyi Li
  - College of Information Engineering, Northwest A&F University, No. 3 Taicheng Road, Yangling, Shaanxi, China
  - South Australian Immunogenomics Cancer Institute, The University of Adelaide, 4 North Terrace, SA 5000, Australia
9
Wang L, Li W, Zhou F, Yu K, Feng C, Zhao D. nsDCC: dual-level contrastive clustering with nonuniform sampling for scRNA-seq data analysis. Brief Bioinform 2024; 25:bbae477. [PMID: 39327063 PMCID: PMC11427072 DOI: 10.1093/bib/bbae477] [Received: 04/08/2024] [Revised: 07/16/2024] [Accepted: 09/18/2024] [Indexed: 09/28/2024]
Abstract
Dimensionality reduction and clustering are crucial tasks in single-cell RNA sequencing (scRNA-seq) data analysis, treated independently in the current process, hindering their mutual benefits. The latest methods jointly optimize these tasks through deep clustering. However, contrastive learning, with powerful representation capability, can bridge the gap that common deep clustering methods face, which requires pre-defined cluster centers. Therefore, a dual-level contrastive clustering method with nonuniform sampling (nsDCC) is proposed for scRNA-seq data analysis. Dual-level contrastive clustering, which combines instance-level contrast and cluster-level contrast, jointly optimizes dimensionality reduction and clustering. Multi-positive contrastive learning and unit matrix constraint are introduced in instance- and cluster-level contrast, respectively. Furthermore, the attention mechanism is introduced to capture inter-cellular information, which is beneficial for clustering. The nsDCC focuses on important samples at category boundaries and in minority categories by the proposed nearest boundary sparsest density weight assignment algorithm, making it capable of capturing comprehensive characteristics against imbalanced datasets. Experimental results show that nsDCC outperforms the six other state-of-the-art methods on both real and simulated scRNA-seq data, validating its performance on dimensionality reduction and clustering of scRNA-seq data, especially for imbalanced data. Simulation experiments demonstrate that nsDCC is insensitive to "dropout events" in scRNA-seq. Finally, cluster differential expressed gene analysis confirms the meaningfulness of results from nsDCC. In summary, nsDCC is a new way of analyzing and understanding scRNA-seq data.
Affiliation(s)
- Linjie Wang
  - School of Computer Science and Engineering, No. 195 Chuangxin Road, Hunnan District, Northeastern University, Shenyang 110819, China
- Wei Li
  - Key Laboratory of Intelligent Computing in Medical Image (MIIC), Northeastern University, No. 195 Chuangxin Road, Hunnan District, Shenyang 110000, China
  - National Frontiers Science Center for Industrial Intelligence and Systems Optimization, No. 3-11 Wenhua Road, Heping District, Northeastern University, Shenyang 110819, China
- Fanghui Zhou
  - School of Computer Science and Engineering, No. 195 Chuangxin Road, Hunnan District, Northeastern University, Shenyang 110819, China
- Kun Yu
  - College of Medicine and Bioinformation Engineering, Northeastern University, No. 195 Chuangxin Road, Hunnan District, Shenyang 110819, China
- Chaolu Feng
  - Key Laboratory of Intelligent Computing in Medical Image (MIIC), Northeastern University, No. 195 Chuangxin Road, Hunnan District, Shenyang 110000, China
- Dazhe Zhao
  - Key Laboratory of Intelligent Computing in Medical Image (MIIC), Northeastern University, No. 195 Chuangxin Road, Hunnan District, Shenyang 110000, China
10
Liu L, Wu X, Yu J, Zhang Y, Niu K, Yu A. scVGATAE: A Variational Graph Attentional Autoencoder Model for Clustering Single-Cell RNA-seq Data. Biology 2024; 13:713. [PMID: 39336140 PMCID: PMC11428844 DOI: 10.3390/biology13090713] [Received: 08/02/2024] [Revised: 09/06/2024] [Accepted: 09/07/2024] [Indexed: 09/30/2024]
Abstract
Single-cell RNA sequencing (scRNA-seq) is now a successful technology for identifying cell heterogeneity, revealing new cell subpopulations, and predicting developmental trajectories. A crucial component in scRNA-seq is the precise identification of cell subsets. Although many unsupervised clustering methods have been developed for clustering cell subpopulations, the performance of these methods is prone to be affected by dropout, high dimensionality, and technical noise. Additionally, most existing methods are time-consuming and fail to fully consider the potential correlations between cells. In this paper, we propose a novel unsupervised clustering method called scVGATAE (Single-cell Variational Graph Attention Autoencoder) for scRNA-seq data. This method constructs a reliable cell graph through network denoising, utilizes a novel variational graph autoencoder model integrated with graph attention networks to aggregate neighbor information and learn the distribution of the low-dimensional representations of cells, and adaptively determines the model training iterations for various datasets. Finally, the obtained low-dimensional representations of cells are clustered using kmeans. Experiments on nine public datasets show that scVGATAE outperforms classical and state-of-the-art clustering methods.
Affiliation(s)
- Lijun Liu
  - School of Science, Dalian Minzu University, Dalian 116600, China
- Xiaoyang Wu
  - School of Science, Dalian Minzu University, Dalian 116600, China
- Jun Yu
  - School of Science, Dalian Minzu University, Dalian 116600, China
- Yuduo Zhang
  - School of Science, Dalian Minzu University, Dalian 116600, China
- Kaixing Niu
  - School of Science, Dalian Minzu University, Dalian 116600, China
- Anli Yu
  - School of Science, Dalian Minzu University, Dalian 116600, China
11
Mittal S, Jena MK, Pathak B. Machine learning empowered next generation DNA sequencing: perspective and prospectus. Chem Sci 2024; 15:12169-12188. [PMID: 39118630 PMCID: PMC11304540 DOI: 10.1039/d4sc01714e] [Received: 03/13/2024] [Accepted: 07/07/2024] [Indexed: 08/10/2024]
Abstract
The pursuit of ultra-rapid, cost-effective, and accurate DNA sequencing is a highly sought after aspect of personalized medicine development. With recent advancements, mainstream machine learning (ML) algorithms hold immense promise for high throughput DNA sequencing at the single nucleotide level. While ML has revolutionized multiple domains of nanoscience and nanotechnology, its implementation in DNA sequencing is still in its preliminary stages. ML-aided DNA sequencing is especially appealing, as ML has the potential to decipher complex patterns and extract knowledge from complex datasets. Herein, we present a holistic framework of ML-aided next-generation DNA sequencing with domain knowledge to set directions toward the development of artificially intelligent DNA sequencers. This perspective focuses on the current state-of-the-art ML-aided DNA sequencing, exploring the opportunities as well as the future challenges in this field. In addition, we provide our personal viewpoints on the critical issues that require attention in the context of ML-aided DNA sequencing.
Affiliation(s)
- Sneha Mittal
- Department of Chemistry, Indian Institute of Technology (IIT) Indore, Indore, Madhya Pradesh 453552, India
| | - Milan Kumar Jena
- Department of Chemistry, Indian Institute of Technology (IIT) Indore, Indore, Madhya Pradesh 453552, India
| | - Biswarup Pathak
- Department of Chemistry, Indian Institute of Technology (IIT) Indore, Indore, Madhya Pradesh 453552, India
|
12
|
Xiong G, LeRoy NJ, Bekiranov S, Sheffield NC, Zhang A. DeepGSEA: explainable deep gene set enrichment analysis for single-cell transcriptomic data. Bioinformatics 2024; 40:btae434. [PMID: 38950178 PMCID: PMC11236288 DOI: 10.1093/bioinformatics/btae434] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/21/2024] [Revised: 05/28/2024] [Accepted: 06/28/2024] [Indexed: 07/03/2024] Open
Abstract
MOTIVATION Gene set enrichment (GSE) analysis allows for an interpretation of gene expression through pre-defined gene set databases and is a critical step in understanding different phenotypes. With the rapid development of single-cell RNA sequencing (scRNA-seq) technology, GSE analysis can be performed on fine-grained gene expression data to gain a nuanced understanding of phenotypes of interest. However, with the cellular heterogeneity in single-cell gene profiles, current statistical GSE analysis methods sometimes fail to identify enriched gene sets. Meanwhile, deep learning has gained traction in applications like clustering and trajectory inference in single-cell studies due to its prowess in capturing complex data patterns. However, its use in GSE analysis remains limited, due to interpretability challenges. RESULTS In this paper, we present DeepGSEA, an explainable deep gene set enrichment analysis approach which leverages the expressiveness of interpretable, prototype-based neural networks to provide an in-depth analysis of GSE. DeepGSEA learns the ability to capture GSE information through our designed classification tasks, and significance tests can be performed on each gene set, enabling the identification of enriched sets. The underlying distribution of a gene set learned by DeepGSEA can be explicitly visualized using the encoded cell and cellular prototype embeddings. We demonstrate the performance of DeepGSEA over commonly used GSE analysis methods by examining their sensitivity and specificity with four simulation studies. In addition, we test our model on three real scRNA-seq datasets and illustrate the interpretability of DeepGSEA by showing how its results can be explained. AVAILABILITY AND IMPLEMENTATION https://github.com/Teddy-XiongGZ/DeepGSEA.
Affiliation(s)
- Guangzhi Xiong
- Department of Computer Science, University of Virginia, Charlottesville, VA, 22904, United States
| | - Nathan J LeRoy
- Center for Public Health Genomics, University of Virginia, Charlottesville, VA, 22904, United States
| | - Stefan Bekiranov
- Department of Biochemistry and Molecular Genetics, University of Virginia, Charlottesville, VA, 22908, United States
| | - Nathan C Sheffield
- Center for Public Health Genomics, University of Virginia, Charlottesville, VA, 22904, United States
| | - Aidong Zhang
- Department of Computer Science, University of Virginia, Charlottesville, VA, 22904, United States
|
13
|
Xie J, Ruan S, Tu M, Yuan Z, Hu J, Li H, Li S. Clustering single-cell RNA sequencing data via iterative smoothing and self-supervised discriminative embedding. Oncogene 2024; 43:2279-2292. [PMID: 38834657 DOI: 10.1038/s41388-024-03074-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/21/2024] [Revised: 05/22/2024] [Accepted: 05/28/2024] [Indexed: 06/06/2024]
Abstract
Single-cell transcriptome sequencing (scRNA-seq) is a high-throughput technique used to study gene expression at the single-cell level. Clustering analysis is a commonly used method in scRNA-seq data analysis, helping researchers identify cell types and uncover interactions between cells. However, the choice of a robust similarity metric in the clustering procedure is still an open challenge due to the complex underlying structures of the data and the inherent noise in data acquisition. Here, we propose a deep clustering method for scRNA-seq data called scRISE (scRNA-seq Iterative Smoothing and self-supervised discriminative Embedding model) to resolve this challenge. The model consists of two main modules: an iterative smoothing module based on graph autoencoders designed to denoise the data and refine the pairwise similarity in turn to gradually incorporate cell structural features and enrich the data information; and a self-supervised discriminative embedding module with adaptive similarity threshold for partitioning samples into correct clusters. Our approach has shown improved quality of data representation and clustering on seventeen scRNA-seq datasets against a number of state-of-the-art deep learning clustering methods. Furthermore, utilizing the scRISE method in biological analysis against the HNSCC dataset has unveiled 62 informative genes, highlighting their potential roles as therapeutic targets and biomarkers.
Affiliation(s)
- Jinxin Xie
- Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, Shanghai, 200237, China
| | - Shanshan Ruan
- Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, Shanghai, 200237, China
| | - Mingyan Tu
- Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, Shanghai, 200237, China
| | - Zhen Yuan
- Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, Shanghai, 200237, China
| | - Jianguo Hu
- Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, Shanghai, 200237, China
| | - Honglin Li
- Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, Shanghai, 200237, China.
- Innovation Center for AI and Drug Discovery, School of Pharmacy, East China Normal University, Shanghai, 200062, China.
- Lingang Laboratory, Shanghai, 200031, China.
| | - Shiliang Li
- Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, Shanghai, 200237, China.
- Innovation Center for AI and Drug Discovery, School of Pharmacy, East China Normal University, Shanghai, 200062, China.
|
14
|
Gao Q, Ai Q. DCRELM: dual correlation reduction network-based extreme learning machine for single-cell RNA-seq data clustering. Sci Rep 2024; 14:13541. [PMID: 38866896 PMCID: PMC11169517 DOI: 10.1038/s41598-024-64217-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/26/2024] [Accepted: 06/06/2024] [Indexed: 06/14/2024] Open
Abstract
Single-cell ribonucleic acid sequencing (scRNA-seq) is a high-throughput genomic technique that is utilized to investigate single-cell transcriptomes. Cluster analysis can effectively reveal the heterogeneity and diversity of cells in scRNA-seq data, but existing clustering algorithms struggle with the inherent high dimensionality, noise, and sparsity of scRNA-seq data. To overcome these limitations, we propose a clustering algorithm: the Dual Correlation Reduction network-based Extreme Learning Machine (DCRELM). First, DCRELM obtains the low-dimensional and dense result features of scRNA-seq data in an extreme learning machine (ELM) random mapping space. Second, the ELM graph distortion module is employed to obtain a dual view of the resulting features, effectively enhancing their robustness. Third, the autoencoder fusion module is employed to learn the attributes and structural information of the resulting features, and merge these two types of information to generate consistent latent representations of these features. Fourth, the dual information reduction network is used to filter the redundant information and noise in the dual consistent latent representations. Last, a triplet self-supervised learning mechanism is utilized to further improve the clustering performance. Extensive experiments show that the DCRELM performs well in terms of clustering performance and robustness. The code is available at https://github.com/gaoqingyun-lucky/awesome-DCRELM .
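The first step of the abstract above, projecting sparse expression profiles into an ELM random mapping space, can be illustrated with a minimal sketch. It assumes a generic extreme learning machine hidden layer (fixed random weights followed by a sigmoid); the function name `elm_random_mapping`, the uniform weight range, and the toy matrix are illustrative choices, not taken from the DCRELM code.

```python
import math
import random


def elm_random_mapping(X, n_hidden, seed=0):
    """Map each row of X to n_hidden dense features with an untrained random layer.

    Assumption: a plain ELM hidden layer, H = sigmoid(X W + b), where W and b
    are sampled once and never updated.
    """
    rng = random.Random(seed)
    n_features = len(X[0])
    # Random input weights (n_features x n_hidden) and biases (n_hidden).
    W = [[rng.uniform(-1.0, 1.0) for _ in range(n_hidden)] for _ in range(n_features)]
    b = [rng.uniform(-1.0, 1.0) for _ in range(n_hidden)]
    H = []
    for row in X:
        hidden = []
        for j in range(n_hidden):
            z = sum(row[i] * W[i][j] for i in range(n_features)) + b[j]
            hidden.append(1.0 / (1.0 + math.exp(-z)))  # sigmoid activation
        H.append(hidden)
    return H


# Toy cells-by-genes matrix with many zeros (hypothetical values).
cells = [[0.0, 3.0, 0.0, 1.0], [2.0, 0.0, 0.0, 5.0], [0.0, 0.0, 4.0, 0.0]]
dense = elm_random_mapping(cells, n_hidden=3, seed=1)
print(dense)
```

Because the weights are never trained, the mapping costs one matrix product, which is the usual argument for using ELM as a fast feature extractor.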
Affiliation(s)
- Qingyun Gao
- School of Computer Science and Software Engineering, University of Science and Technology Liaoning, Anshan, 114051, China
| | - Qing Ai
- School of Computer Science and Software Engineering, University of Science and Technology Liaoning, Anshan, 114051, China.
|
15
|
Qiao TJ, Li F, Yuan SS, Dai LY, Wang J. A Fusion Learning Model Based on Deep Learning for Single-Cell RNA Sequencing Data Clustering. J Comput Biol 2024; 31:576-588. [PMID: 38758925 DOI: 10.1089/cmb.2024.0512] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/19/2024] Open
Abstract
Single-cell RNA sequencing (scRNA-seq) technology provides a means for studying biology from a cellular perspective. The fundamental goal of scRNA-seq data analysis is to discriminate single-cell types using unsupervised clustering. Few single-cell clustering algorithms have taken into account both deep and surface information, despite the recent slew of suggestions. Consequently, this article constructs a fusion learning framework based on deep learning, namely scGASI. For learning a clustering similarity matrix, scGASI integrates data affinity recovery and deep feature embedding in a unified scheme based on various top feature sets. Next, scGASI learns the low-dimensional latent representation underlying the data using a graph autoencoder to mine the hidden information residing in the data. To efficiently merge the surface information from raw area and the deeper potential information from underlying area, we then construct a fusion learning model based on self-expression. scGASI uses this fusion learning model to learn the similarity matrix of an individual feature set as well as the clustering similarity matrix of all feature sets. Lastly, gene marker identification, visualization, and clustering are accomplished using the clustering similarity matrix. Extensive verification on actual data sets demonstrates that scGASI outperforms many widely used clustering techniques in terms of clustering accuracy.
Affiliation(s)
- Tian-Jing Qiao
- School of Computer Science, Qufu Normal University, Rizhao, China
| | - Feng Li
- School of Computer Science, Qufu Normal University, Rizhao, China
| | - Sha-Sha Yuan
- School of Computer Science, Qufu Normal University, Rizhao, China
| | - Ling-Yun Dai
- School of Computer Science, Qufu Normal University, Rizhao, China
| | - Juan Wang
- School of Computer Science, Qufu Normal University, Rizhao, China
|
16
|
Cai X, Zhang W, Zheng X, Xu Y, Li Y. scEM: A New Ensemble Framework for Predicting Cell Type Composition Based on scRNA-Seq Data. Interdiscip Sci 2024; 16:304-317. [PMID: 38368575 DOI: 10.1007/s12539-023-00601-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/12/2023] [Revised: 12/22/2023] [Accepted: 12/24/2023] [Indexed: 02/19/2024]
Abstract
With the advent of single-cell RNA sequencing (scRNA-seq) technology, many scRNA-seq datasets have become available, providing an unprecedented opportunity to explore cellular composition and heterogeneity. Recently, many computational algorithms for predicting cell type composition have been developed, and these methods are typically evaluated on different datasets and performance metrics using diverse techniques. Consequently, the lack of comprehensive and standardized comparative analysis makes it difficult to gain a clear understanding of the strengths and weaknesses of these methods. To address this gap, we reviewed 20 cutting-edge unsupervised cell type identification methods and evaluated these methods comprehensively using 24 real scRNA-seq datasets of varying scales. In addition, we proposed a new ensemble cell-type identification method, named scEM, which learns the consensus similarity matrix by applying the entropy weight method to four selected representative methods. The Louvain algorithm is adopted to obtain the final classification of individual cells based on the consensus matrix. Extensive evaluation and comparison with 11 other similarity-based methods under real scRNA-seq datasets demonstrate that the newly developed ensemble algorithm scEM is effective in predicting cellular type composition.
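The consensus step described above can be sketched as follows. This is a minimal illustration of the textbook entropy weight method applied to similarity matrices, under the assumption that each method's flattened similarity matrix is treated as one indicator column; the exact formulation used by scEM may differ, and the function names and toy matrices are hypothetical. The Louvain clustering step is not shown.

```python
import math


def entropy_weights(columns):
    """Entropy weight method: columns[j] holds the non-negative scores of method j."""
    n = len(columns[0])
    diversities = []
    for col in columns:
        total = sum(col)
        p = [v / total for v in col]  # normalise the column to a distribution
        # Shannon entropy scaled to [0, 1]; zero entries contribute nothing.
        entropy = -sum(q * math.log(q) for q in p if q > 0) / math.log(n)
        diversities.append(1.0 - entropy)
    s = sum(diversities)
    if s == 0:  # every column is uniform: fall back to equal weights
        return [1.0 / len(columns)] * len(columns)
    return [d / s for d in diversities]


def consensus_similarity(matrices):
    """Entropy-weighted average of cell-by-cell similarity matrices."""
    columns = [[v for row in m for v in row] for m in matrices]
    w = entropy_weights(columns)
    n = len(matrices[0])
    consensus = [
        [sum(w[k] * matrices[k][i][j] for k in range(len(matrices))) for j in range(n)]
        for i in range(n)
    ]
    return w, consensus


# Two hypothetical base methods scoring the similarity of three cells.
sharp = [[1.0, 0.9, 0.1], [0.9, 1.0, 0.1], [0.1, 0.1, 1.0]]
flat = [[1.0, 0.6, 0.5], [0.6, 1.0, 0.5], [0.5, 0.5, 1.0]]
weights, consensus = consensus_similarity([sharp, flat])
print(weights)
print(consensus)
```

In this sketch the method whose similarities are more contrasted (lower entropy) receives the larger weight, which is the usual rationale for entropy weighting: a near-uniform matrix carries little information for separating cells.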
Affiliation(s)
- Xianxian Cai
- School of Sciences, East China Jiaotong University, Nanchang, 330013, China
| | - Wei Zhang
- School of Sciences, East China Jiaotong University, Nanchang, 330013, China.
| | - Xiaoying Zheng
- Operations research and planning department, Naval University of Engineering, Wuhan, 430033, China
| | - Yaxin Xu
- School of Sciences, East China Jiaotong University, Nanchang, 330013, China
| | - Yuanyuan Li
- School of Mathematics and Physics, Wuhan Institute of Technology, Wuhan, China
|
17
|
An S, Shi J, Liu R, Chen Y, Wang J, Hu S, Xia X, Dong G, Bo X, He Z, Ying X. scDAC: deep adaptive clustering of single-cell transcriptomic data with coupled autoencoder and Dirichlet process mixture model. Bioinformatics 2024; 40:btae198. [PMID: 38603616 PMCID: PMC11256937 DOI: 10.1093/bioinformatics/btae198] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/04/2023] [Revised: 03/20/2024] [Accepted: 04/10/2024] [Indexed: 04/13/2024] Open
Abstract
MOTIVATION Clustering analysis for single-cell RNA sequencing (scRNA-seq) data is an important step in revealing cellular heterogeneity. Many clustering methods have been proposed to discover heterogenous cell types from scRNA-seq data. However, adaptive clustering with accurate cluster number reflecting intrinsic biology nature from large-scale scRNA-seq data remains quite challenging. RESULTS Here, we propose a single-cell Deep Adaptive Clustering (scDAC) model by coupling the Autoencoder (AE) and the Dirichlet Process Mixture Model (DPMM). By jointly optimizing the model parameters of AE and DPMM, scDAC achieves adaptive clustering with accurate cluster numbers on scRNA-seq data. We verify the performance of scDAC on five subsampled datasets with different numbers of cell types and compare it with 15 widely used clustering methods across nine scRNA-seq datasets. Our results demonstrate that scDAC can adaptively find accurate numbers of cell types or subtypes and outperforms other methods. Moreover, the performance of scDAC is robust to hyperparameter changes. AVAILABILITY AND IMPLEMENTATION The scDAC is implemented in Python. The source code is available at https://github.com/labomics/scDAC.
Affiliation(s)
- Sijing An
- Center for Computational Biology, Beijing Institute of Basic Medical Sciences, Beijing 100850, China
| | - Jinhui Shi
- Center for Computational Biology, Beijing Institute of Basic Medical Sciences, Beijing 100850, China
| | - Runyan Liu
- Center for Computational Biology, Beijing Institute of Basic Medical Sciences, Beijing 100850, China
| | - Yaowen Chen
- Center for Computational Biology, Beijing Institute of Basic Medical Sciences, Beijing 100850, China
| | - Jing Wang
- Center for Computational Biology, Beijing Institute of Basic Medical Sciences, Beijing 100850, China
| | - Shuofeng Hu
- Center for Computational Biology, Beijing Institute of Basic Medical Sciences, Beijing 100850, China
| | - Xinyu Xia
- Center for Computational Biology, Beijing Institute of Basic Medical Sciences, Beijing 100850, China
| | - Guohua Dong
- Center for Computational Biology, Beijing Institute of Basic Medical Sciences, Beijing 100850, China
| | - Xiaochen Bo
- Department of Bioinformatics, Institute of Health Service and Transfusion Medicine, Beijing 100850, China
| | - Zhen He
- Center for Computational Biology, Beijing Institute of Basic Medical Sciences, Beijing 100850, China
| | - Xiaomin Ying
- Center for Computational Biology, Beijing Institute of Basic Medical Sciences, Beijing 100850, China
|
18
|
Feng X, Xiu YH, Long HX, Wang ZT, Bilal A, Yang LM. Advancing single-cell RNA-seq data analysis through the fusion of multi-layer perceptron and graph neural network. Brief Bioinform 2023; 25:bbad481. [PMID: 38171931 PMCID: PMC10764207 DOI: 10.1093/bib/bbad481] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/25/2023] [Revised: 11/18/2023] [Accepted: 12/03/2023] [Indexed: 01/05/2024] Open
Abstract
The advancement of single-cell sequencing technology has smoothed the ability to do biological studies at the cellular level. Nevertheless, single-cell RNA sequencing (scRNA-seq) data presents several obstacles due to the considerable heterogeneity, sparsity and complexity. Although many machine-learning models have been devised to tackle these difficulties, there is still a need to enhance their efficiency and accuracy. Current deep learning methods often fail to fully exploit the intrinsic interconnections within cells, resulting in unsatisfactory results. Given these obstacles, we propose a unique approach for analyzing scRNA-seq data called scMPN. This methodology integrates multi-layer perceptron and graph neural network, including attention network, to execute gene imputation and cell clustering tasks. In order to evaluate the gene imputation performance of scMPN, several metrics like cosine similarity, median L1 distance and root mean square error are used. These metrics are utilized to compare the efficacy of scMPN with other existing approaches. This research utilizes criteria such as adjusted mutual information, normalized mutual information and integrity score to assess the efficacy of cell clustering across different approaches. The superiority of scMPN over current single-cell data processing techniques in cell clustering and gene imputation investigations is shown by the experimental findings obtained from four datasets with gold-standard cell labels. This observation demonstrates the efficacy of our suggested methodology in using deep learning methodologies to enhance the interpretation of scRNA-seq data.
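The three imputation metrics named in the abstract above have standard definitions, sketched below for a pair of expression vectors. How scMPN selects the entries to compare (for example, only artificially masked values) is not stated here, so the inputs are simply assumed to be aligned lists of true and imputed values; the toy numbers are hypothetical.

```python
import math
from statistics import median


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means identical direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def median_l1_distance(a, b):
    """Median of the element-wise absolute errors."""
    return median(abs(x - y) for x, y in zip(a, b))


def rmse(a, b):
    """Root mean square error between two equally long vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


truth = [0.0, 2.0, 4.0, 1.0]
imputed = [0.5, 2.0, 3.0, 1.0]
print(cosine_similarity(truth, imputed))
print(median_l1_distance(truth, imputed))  # 0.25
print(rmse(truth, imputed))
```

Higher cosine similarity and lower median L1 distance and RMSE indicate imputed values closer to the reference.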
Affiliation(s)
- Xiang Feng
- Department of Information Science Technology, Hainan Normal University, 99 Longkun Road, Haikou, Hainan 571158, China
| | - Yu-Han Xiu
- Department of Information Science Technology, Hainan Normal University, 99 Longkun Road, Haikou, Hainan 571158, China
| | - Hai-Xia Long
- Department of Information Science Technology, Hainan Normal University, 99 Longkun Road, Haikou, Hainan 571158, China
| | - Zi-Tong Wang
- Department of Pathophysiology, School of Basic Medical Sciences, Harbin Medical University, Harbin 150081, China
| | - Anas Bilal
- Department of Information Science Technology, Hainan Normal University, 99 Longkun Road, Haikou, Hainan 571158, China
| | - Li-Ming Yang
- Department of Pathophysiology, School of Basic Medical Sciences, Harbin Medical University, Harbin 150081, China
|
19
|
Lei T, Chen R, Zhang S, Chen Y. Self-supervised deep clustering of single-cell RNA-seq data to hierarchically detect rare cell populations. Brief Bioinform 2023; 24:bbad335. [PMID: 37769630 PMCID: PMC10539043 DOI: 10.1093/bib/bbad335] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/09/2023] [Revised: 09/05/2023] [Accepted: 09/06/2023] [Indexed: 10/02/2023] Open
Abstract
Single-cell RNA sequencing (scRNA-seq) is a widely used technique for characterizing individual cells and studying gene expression at the single-cell level. Clustering plays a vital role in grouping similar cells together for various downstream analyses. However, the high sparsity and dimensionality of large scRNA-seq data pose challenges to clustering performance. Although several deep learning-based clustering algorithms have been proposed, most existing clustering methods have limitations in capturing the precise distribution types of the data or fully utilizing the relationships between cells, leaving a considerable scope for improving the clustering performance, particularly in detecting rare cell populations from large scRNA-seq data. We introduce DeepScena, a novel single-cell hierarchical clustering tool that fully incorporates nonlinear dimension reduction, negative binomial-based convolutional autoencoder for data fitting, and a self-supervision model for cell similarity enhancement. In comprehensive evaluation using multiple large-scale scRNA-seq datasets, DeepScena consistently outperformed seven popular clustering tools in terms of accuracy. Notably, DeepScena exhibits high proficiency in identifying rare cell populations within large datasets that contain large numbers of clusters. When applied to scRNA-seq data of multiple myeloma cells, DeepScena successfully identified not only previously labeled large cell types but also subpopulations in CD14 monocytes, T cells and natural killer cells, respectively.
Affiliation(s)
- Tianyuan Lei
- College of Computer and Information Engineering, Tianjin Normal University, Tianjin 300387, China
| | - Ruoyu Chen
- Moorestown High School, Moorestown, NJ 08057, USA
| | - Shaoqiang Zhang
- College of Computer and Information Engineering, Tianjin Normal University, Tianjin 300387, China
| | - Yong Chen
- Department of Biological and Biomedical Sciences, Rowan University, NJ 08028, USA
|
20
|
Wang S, Zhang Y, Zhang Y, Wu W, Ye L, Li Y, Su J, Pang S. scASGC: An adaptive simplified graph convolution model for clustering single-cell RNA-seq data. Comput Biol Med 2023; 163:107152. [PMID: 37364529 DOI: 10.1016/j.compbiomed.2023.107152] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/24/2023] [Revised: 05/24/2023] [Accepted: 06/07/2023] [Indexed: 06/28/2023]
Abstract
Single-cell RNA sequencing (scRNA-seq) is now a successful technique for identifying cellular heterogeneity, revealing novel cell subpopulations, and forecasting developmental trajectories. A crucial component of the processing of scRNA-seq data is the precise identification of cell subpopulations. Although many unsupervised clustering methods have been developed to cluster cell subpopulations, the performance of these methods is vulnerable to dropouts and high dimensionality. In addition, most existing methods are time-consuming and fail to adequately account for potential associations between cells. In the manuscript, we present an unsupervised clustering method based on an adaptive simplified graph convolution model called scASGC. The proposed method builds plausible cell graphs, aggregates neighbor information using a simplified graph convolution model, and adaptively determines the most optimal number of convolution layers for various graphs. Experiments on 12 public datasets show that scASGC outperforms both classical and state-of-the-art clustering methods. In addition, in a study of mouse intestinal muscle containing 15,983 cells, we identified distinct marker genes based on the clustering results of scASGC. The source code of scASGC is available at https://github.com/ZzzOctopus/scASGC.
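The neighbor aggregation mentioned in the abstract above can be illustrated with the generic simplified graph convolution, which propagates features as S^k X with S = D^{-1/2}(A + I)D^{-1/2}. This is a sketch under that assumption only: how scASGC builds its cell graph and adaptively chooses the number of layers k is not reproduced, and the function names and toy graph are hypothetical.

```python
import math


def normalized_adjacency(A):
    """Return S = D^{-1/2} (A + I) D^{-1/2} for a symmetric 0/1 adjacency matrix."""
    n = len(A)
    # Add self-loops so each cell keeps part of its own signal.
    A_tilde = [[A[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in A_tilde]
    return [[A_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def matmul(P, Q):
    """Dense matrix product of two list-of-lists matrices."""
    return [
        [sum(P[i][k] * Q[k][j] for k in range(len(Q))) for j in range(len(Q[0]))]
        for i in range(len(P))
    ]


def simplified_graph_convolution(A, X, k):
    """Smooth cell features X over the graph A by applying S a total of k times."""
    S = normalized_adjacency(A)
    H = X
    for _ in range(k):
        H = matmul(S, H)
    return H


# Four cells; cells 0-1 and cells 2-3 are neighbours in the toy graph.
A = [[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]]
X = [[1, 0], [3, 0], [0, 2], [0, 4]]
print(simplified_graph_convolution(A, X, k=2))
# [[2.0, 0.0], [2.0, 0.0], [0.0, 3.0], [0.0, 3.0]]
```

Larger k smooths features over wider neighborhoods, which is why the number of layers is a quantity worth tuning per graph.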
Affiliation(s)
- Shudong Wang
- College of Computer Science and Technology, Qingdao Institute of Software, China University of Petroleum, Qingdao, 266580, China.
| | - Yu Zhang
- College of Computer Science and Technology, Qingdao Institute of Software, China University of Petroleum, Qingdao, 266580, China.
| | - Yulin Zhang
- College of Mathematics and Systems Science, Shandong University of Science and Technology, Qingdao, 266590, China.
| | - Wenhao Wu
- College of Computer Science and Technology, Qingdao Institute of Software, China University of Petroleum, Qingdao, 266580, China.
| | - Lan Ye
- Cancer Center, the Second Hospital of Shandong University, Jinan, 250033, China.
| | - YunYin Li
- College of Computer Science and Technology, Qingdao Institute of Software, China University of Petroleum, Qingdao, 266580, China.
| | - Jionglong Su
- School of AI and Advanced Computing, XJTLU Entrepreneur College (Taicang), Xi'an Jiaotong-Liverpool University, Suzhou, 215123, China.
| | - Shanchen Pang
- College of Computer Science and Technology, Qingdao Institute of Software, China University of Petroleum, Qingdao, 266580, China.
|
21
|
Sun Y, Shim WJ, Shen S, Sinniah E, Pham D, Su Z, Mizikovsky D, White MD, Ho JK, Nguyen Q, Bodén M, Palpant N. Inferring cell diversity in single cell data using consortium-scale epigenetic data as a biological anchor for cell identity. Nucleic Acids Res 2023; 51:e62. [PMID: 37125641 PMCID: PMC10287941 DOI: 10.1093/nar/gkad307] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 10/26/2022] [Accepted: 04/28/2023] [Indexed: 05/02/2023] Open
Abstract
Methods for cell clustering and gene expression from single-cell RNA sequencing (scRNA-seq) data are essential for biological interpretation of cell processes. Here, we present TRIAGE-Cluster which uses genome-wide epigenetic data from diverse bio-samples to identify genes demarcating cell diversity in scRNA-seq data. By integrating patterns of repressive chromatin deposited across diverse cell types with weighted density estimation, TRIAGE-Cluster determines cell type clusters in a 2D UMAP space. We then present TRIAGE-ParseR, a machine learning method which evaluates gene expression rank lists to define gene groups governing the identity and function of cell types. We demonstrate the utility of this two-step approach using atlases of in vivo and in vitro cell diversification and organogenesis. We also provide a web accessible dashboard for analysis and download of data and software. Collectively, genome-wide epigenetic repression provides a versatile strategy to define cell diversity and study gene regulation of scRNA-seq data.
Affiliation(s)
- Yuliangzi Sun
- Institute for Molecular Bioscience, The University of Queensland, Brisbane, QLD, Australia
| | - Woo Jun Shim
- Institute for Molecular Bioscience, The University of Queensland, Brisbane, QLD, Australia
| | - Sophie Shen
- Institute for Molecular Bioscience, The University of Queensland, Brisbane, QLD, Australia
| | - Enakshi Sinniah
- Institute for Molecular Bioscience, The University of Queensland, Brisbane, QLD, Australia
| | - Duy Pham
- Institute for Molecular Bioscience, The University of Queensland, Brisbane, QLD, Australia
| | - Zezhuo Su
- School of Biomedical Sciences, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Pokfulam, Hong Kong SAR, China
- Laboratory of Data Discovery for Health Limited (D24H), Hong Kong Science Park, Hong Kong SAR, China
| | - Dalia Mizikovsky
- Institute for Molecular Bioscience, The University of Queensland, Brisbane, QLD, Australia
| | - Melanie D White
- Institute for Molecular Bioscience, The University of Queensland, Brisbane, QLD, Australia
| | - Joshua W K Ho
- School of Biomedical Sciences, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Pokfulam, Hong Kong SAR, China
- Laboratory of Data Discovery for Health Limited (D24H), Hong Kong Science Park, Hong Kong SAR, China
| | - Quan Nguyen
- Institute for Molecular Bioscience, The University of Queensland, Brisbane, QLD, Australia
| | - Mikael Bodén
- School of Chemistry and Molecular Biosciences, The University of Queensland, Brisbane, QLD, Australia
| | - Nathan J Palpant
- Institute for Molecular Bioscience, The University of Queensland, Brisbane, QLD, Australia
|
22
|
Su Y, Lin R, Wang J, Tan D, Zheng C. Denoising adaptive deep clustering with self-attention mechanism on single-cell sequencing data. Brief Bioinform 2023; 24:7008799. [PMID: 36715275 DOI: 10.1093/bib/bbad021] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/20/2022] [Revised: 12/20/2022] [Accepted: 01/05/2023] [Indexed: 01/31/2023] Open
Abstract
A large number of works have used single-cell RNA sequencing (scRNA-seq) to study the diversity and biological functions of cells at the single-cell level. Clustering identifies unknown cell types, which is essential for downstream analysis of scRNA-seq samples. However, the high dimensionality, high noise and pervasive dropout rate of scRNA-seq samples pose a significant challenge to the cluster analysis of scRNA-seq samples. Herein, we propose a new adaptive fuzzy clustering model based on the denoising autoencoder and self-attention mechanism called scDASFK. It implements comparative learning to integrate cell similarity information into the clustering method and uses a deep denoising network module to denoise the data. scDASFK consists of a self-attention mechanism for further denoising where an adaptive clustering optimization function for iterative clustering is implemented. In order to make the denoised latent features better reflect the cell structure, we introduce a new adaptive feedback mechanism to supervise the denoising process through the clustering results. Experiments on 16 real scRNA-seq datasets show that scDASFK performs well in terms of clustering accuracy, scalability and stability. Overall, scDASFK is an effective clustering model with great potential for the analysis of scRNA-seq samples. Our scDASFK model codes are freely available at https://github.com/LRX2022/scDASFK.
Affiliation(s)
- Yansen Su
- Key Lab of Intelligent Computing and Signal Processing of Ministry of Education, School of Artificial Intelligence, Anhui University, Hefei, 230601, China
| | - Rongxin Lin
- School of Computer Science and Technology, Anhui University, Hefei, 230601, China
| | - Jing Wang
- School of Computer Science and Technology, Anhui University, Hefei, 230601, China
| | - Dayu Tan
- Institutes of Physical Science and Information Technology, Anhui University, Hefei, 230601, China
| | - Chunhou Zheng
- Key Lab of Intelligent Computing and Signal Processing of Ministry of Education, School of Artificial Intelligence, Anhui University, Hefei, 230601, China
|
23
|
Wang J, Xia J, Wang H, Su Y, Zheng CH. scDCCA: deep contrastive clustering for single-cell RNA-seq data based on auto-encoder network. Brief Bioinform 2023; 24:6984787. [PMID: 36631401 DOI: 10.1093/bib/bbac625] [Citation(s) in RCA: 21] [Impact Index Per Article: 10.5] [Received: 10/29/2022] [Revised: 12/12/2022] [Accepted: 12/19/2022] [Indexed: 01/13/2023] Open
Abstract
Advances in single-cell ribonucleic acid sequencing (scRNA-seq) allow researchers to explore cellular heterogeneity and human diseases at cell resolution. Cell clustering is a prerequisite in scRNA-seq analysis since it can recognize cell identities. However, the high dimensionality, noise and significant sparsity of scRNA-seq data make this a major challenge. Although many methods have emerged, they still fail to fully explore the intrinsic properties of cells and the relationships among cells, which seriously affects downstream clustering performance. Here, we propose a new deep contrastive clustering algorithm called scDCCA. It integrates a denoising auto-encoder and a dual contrastive learning module into a deep clustering framework to extract valuable features and realize cell clustering. Specifically, to characterize and learn data representations robustly, scDCCA uses a denoising Zero-Inflated Negative Binomial model-based auto-encoder to extract low-dimensional features. Meanwhile, scDCCA incorporates a dual contrastive learning module to capture the pairwise proximity of cells. By increasing the similarities between positive pairs and the differences between negative ones, the contrasts at both the instance and the cluster level help the model learn more discriminative features and achieve better cell segregation. Furthermore, scDCCA joins feature learning with clustering, which realizes representation learning and cell clustering in an end-to-end manner. Experimental results on 14 real datasets validate that scDCCA outperforms eight state-of-the-art methods in terms of accuracy, generalizability, scalability and efficiency. Cell visualization and biological analysis demonstrate that scDCCA significantly improves clustering and facilitates downstream analysis for scRNA-seq data. The code is available at https://github.com/WJ319/scDCCA.
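The instance-level contrast described in this abstract (pulling positive pairs together, pushing negative pairs apart) can be illustrated with a generic InfoNCE-style loss. This is a minimal sketch, not the scDCCA implementation: the function names, the cosine similarity, and the `temperature` value are assumptions made for illustration, and all vectors are assumed to be non-zero.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def instance_contrastive_loss(anchor, positive, negatives, temperature=0.5):
    """InfoNCE-style loss for one anchor cell embedding (hypothetical helper).

    The loss shrinks as the anchor becomes more similar to its positive
    and less similar to its negatives.
    """
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


anchor = [1.0, 0.0]
# A well-matched positive pair gives a smaller loss than a mismatched one.
loss_good = instance_contrastive_loss(anchor, [0.9, 0.1], [[0.0, 1.0], [-1.0, 0.0]])
loss_bad = instance_contrastive_loss(anchor, [0.0, 1.0], [[0.9, 0.1], [-1.0, 0.0]])
print(loss_good < loss_bad)
```

In practice the positive is usually an augmented view of the same cell; the abstract does not specify how pairs are built, so that choice is left open here.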
Affiliation(s)
- Jing Wang
- Anhui Provincial Key Laboratory of Multimodal Cognitive Computation, School of Computer Science and Technology, Anhui University, Hefei, China
| | - Junfeng Xia
- Institutes of Physical Science and Information Technology, Anhui University, Hefei, China
| | - Haiyun Wang
- School of Mathematics and Systems Science, Xinjiang University, Urumqi, China
| | - Yansen Su
- School of Artificial Intelligence, Anhui University, Hefei, China
| | - Chun-Hou Zheng
- School of Artificial Intelligence, Anhui University, Hefei, China
| |
|
24
|
Duan H, Li F, Shang J, Liu J, Li Y, Liu X. scVAEBGM: Clustering Analysis of Single-Cell ATAC-seq Data Using a Deep Generative Model. Interdiscip Sci 2022; 14:917-928. [PMID: 35939233 DOI: 10.1007/s12539-022-00536-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/13/2022] [Revised: 07/15/2022] [Accepted: 07/20/2022] [Indexed: 06/15/2023]
Abstract
Recent developments in single-cell technologies have led to a surge in research. In particular, the single-cell Assay for Transposase-Accessible Chromatin with high-throughput sequencing (scATAC-seq) is a popular approach for analyzing chromatin accessibility differences at the single-cell level, either within or between groups. It makes it possible to examine cell heterogeneity at a previously unseen level and to identify both recognized and unknown cell types. However, the ever-increasing number of cells produced by technological development, together with the characteristics of the data, such as high noise, sparsity and dimensionality, makes distinguishing cell types challenging. We propose scVAEBGM, which integrates a Variational Autoencoder (VAE) with a Bayesian Gaussian-mixture model (BGM) to process and analyze scATAC-seq data. The method takes advantage of the Bayesian Gaussian mixture model to estimate the number of cell types without fixing the cluster number beforehand. In other words, the number of clusters is inferred from the data, thus avoiding biases introduced by subjective assessments when the number of clusters is set manually. Additionally, the method is more robust to noise and can better represent single-cell data in lower dimensions. We also create a further clustering strategy. Experiments indicate that a further round of clustering on top of the completed clustering can improve clustering accuracy again. We test on six public datasets, and scVAEBGM outperforms various dimension reduction baselines. In downstream applications, scVAEBGM can reveal biological cell types.
Affiliation(s)
- Hongyu Duan
- School of Computer Science, Qufu Normal University, Rizhao, 276826, China
| | - Feng Li
- School of Computer Science, Qufu Normal University, Rizhao, 276826, China.
| | - Junliang Shang
- School of Computer Science, Qufu Normal University, Rizhao, 276826, China
| | - Jinxing Liu
- School of Computer Science, Qufu Normal University, Rizhao, 276826, China
| | - Yan Li
- Department of Electrical Engineering and Information Technology, Shandong University of Science and Technology, Jinan, 250031, Shandong, China
| | - Xikui Liu
- Department of Electrical Engineering and Information Technology, Shandong University of Science and Technology, Jinan, 250031, Shandong, China
| |
|
25
|
Brendel M, Su C, Bai Z, Zhang H, Elemento O, Wang F. Application of Deep Learning on Single-cell RNA Sequencing Data Analysis: A Review. Genomics, Proteomics & Bioinformatics 2022; 20:814-835. [PMID: 36528240 PMCID: PMC10025684 DOI: 10.1016/j.gpb.2022.11.011] [Citation(s) in RCA: 30] [Impact Index Per Article: 10.0] [Received: 01/23/2022] [Revised: 08/17/2022] [Accepted: 11/24/2022] [Indexed: 12/23/2022]
Abstract
Single-cell RNA sequencing (scRNA-seq) has become a routinely used technique to quantify the gene expression profile of thousands of single cells simultaneously. Analysis of scRNA-seq data plays an important role in the study of cell states and phenotypes, and has helped elucidate biological processes, such as those occurring during the development of complex organisms, and improved our understanding of disease states, such as cancer, diabetes, and coronavirus disease 2019 (COVID-19). Deep learning, a recent advance of artificial intelligence that has been used to address many problems involving large datasets, has also emerged as a promising tool for scRNA-seq data analysis, as it has a capacity to extract informative and compact features from noisy, heterogeneous, and high-dimensional scRNA-seq data to improve downstream analysis. The present review aims at surveying recently developed deep learning techniques in scRNA-seq data analysis, identifying key steps within the scRNA-seq data analysis pipeline that have been advanced by deep learning, and explaining the benefits of deep learning over more conventional analytic tools. Finally, we summarize the challenges in current deep learning approaches faced within scRNA-seq data and discuss potential directions for improvements in deep learning algorithms for scRNA-seq data analysis.
Affiliation(s)
- Matthew Brendel
- Department of Population Health Sciences, Weill Cornell Medicine, Cornell University, New York, NY 10065, USA; Institute for Computational Biomedicine, Caryl and Israel Englander Institute for Precision Medicine, Department of Physiology and Biophysics, Weill Cornell Medicine, Cornell University, New York, NY 10065, USA
| | - Chang Su
- Department of Health Service Administration and Policy, Temple University, Philadelphia, PA 19122, USA.
| | - Zilong Bai
- Department of Population Health Sciences, Weill Cornell Medicine, Cornell University, New York, NY 10065, USA
| | - Hao Zhang
- Department of Population Health Sciences, Weill Cornell Medicine, Cornell University, New York, NY 10065, USA
| | - Olivier Elemento
- Institute for Computational Biomedicine, Caryl and Israel Englander Institute for Precision Medicine, Department of Physiology and Biophysics, Weill Cornell Medicine, Cornell University, New York, NY 10065, USA
| | - Fei Wang
- Department of Population Health Sciences, Weill Cornell Medicine, Cornell University, New York, NY 10065, USA.
| |
|
26
|
Liang Z, Zheng R, Chen S, Yan X, Li M. A deep matrix factorization based approach for single-cell RNA-seq data clustering. Methods 2022; 205:114-122. [PMID: 35777719 DOI: 10.1016/j.ymeth.2022.06.010] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/20/2022] [Revised: 05/28/2022] [Accepted: 06/24/2022] [Indexed: 11/17/2022] Open
Abstract
The rapid development of single-cell sequencing technologies makes it possible to analyze cellular heterogeneity at the single-cell level. Cell clustering is one of the most fundamental and common steps in heterogeneity analysis. However, due to the high noise level, high dimensionality and high sparsity, accurate cell clustering is still challenging. Here, we present DeepCI, a new clustering approach for scRNA-seq data. Using two autoencoders to obtain cell embeddings and gene embeddings, DeepCI can simultaneously learn low-dimensional cell representations and clustering. In addition, the recovered gene expression matrix can be obtained by the matrix multiplication of the cell and gene embeddings. To evaluate the performance of DeepCI, we applied it to several real scRNA-seq datasets for clustering and visualization analysis. The experimental results show that DeepCI achieves better overall performance than several popular single-cell analysis methods. We also evaluated the imputation performance of DeepCI in a dedicated experiment. The corresponding results show that the imputed gene expression of known marker genes can greatly improve the accuracy of cell type classification.
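The reconstruction step mentioned in this abstract, recovering the expression matrix as the product of cell and gene embeddings, can be sketched in a few lines. This is a toy illustration under assumed shapes (cells × k and genes × k); the function name and the example values are hypothetical and are not taken from DeepCI.

```python
def reconstruct_expression(cell_emb, gene_emb):
    """Return X_hat with X_hat[i][j] = dot(cell_emb[i], gene_emb[j]).

    cell_emb: list of per-cell embedding vectors (n_cells x k)
    gene_emb: list of per-gene embedding vectors (n_genes x k)
    """
    return [
        [sum(c * g for c, g in zip(cell, gene)) for gene in gene_emb]
        for cell in cell_emb
    ]


# Three cells and two genes embedded in a shared 2-dimensional space.
cell_emb = [[1, 0], [0, 2], [1, 1]]
gene_emb = [[2, 1], [0, 3]]
x_hat = reconstruct_expression(cell_emb, gene_emb)
print(x_hat)  # [[2, 0], [2, 6], [3, 3]]
```

Because every entry of the product is defined, including positions that were zero in the observed counts, a low-rank product of this kind can act as an imputed matrix.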
Affiliation(s)
- Zhenlan Liang
- School of Computer Science and Engineering, Central South University, Changsha 410083, China
| | - Ruiqing Zheng
- School of Computer Science and Engineering, Central South University, Changsha 410083, China.
| | - Siqi Chen
- School of Computer Science and Engineering, Central South University, Changsha 410083, China
| | - Xuhua Yan
- School of Computer Science and Engineering, Central South University, Changsha 410083, China
| | - Min Li
- School of Computer Science and Engineering, Central South University, Changsha 410083, China.
| |
|
27
|
Tran B, Tran D, Nguyen H, Ro S, Nguyen T. scCAN: single-cell clustering using autoencoder and network fusion. Sci Rep 2022; 12:10267. [PMID: 35715568 PMCID: PMC9206025 DOI: 10.1038/s41598-022-14218-6] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 11/09/2021] [Accepted: 06/02/2022] [Indexed: 11/30/2022] Open
Abstract
Unsupervised clustering of single-cell RNA sequencing data (scRNA-seq) is important because it allows us to identify putative cell types. However, the large number of cells (up to millions), the high-dimensionality of the data (tens of thousands of genes), and the high dropout rates all present substantial challenges in single-cell analysis. Here we introduce a new method, named single-cell Clustering using Autoencoder and Network fusion (scCAN), that can overcome these challenges to accurately segregate different cell types in large and sparse scRNA-seq data. In an extensive analysis using 28 real scRNA-seq datasets (more than three million cells) and 243 simulated datasets, we validate that scCAN: (1) correctly estimates the number of true cell types, (2) accurately segregates cells of different types, (3) is robust against dropouts, and (4) is fast and memory efficient. We also compare scCAN with CIDR, SEURAT3, Monocle3, SHARP, and SCANPY. scCAN outperforms these state-of-the-art methods in terms of both accuracy and scalability. The scCAN package is available at https://cran.r-project.org/package=scCAN . Data and R scripts are available at http://sccan.tinnguyen-lab.com/.
Affiliation(s)
- Bang Tran
- Department of Computer Science and Engineering, University of Nevada, Reno, NV, 89557, USA
| | - Duc Tran
- Department of Computer Science and Engineering, University of Nevada, Reno, NV, 89557, USA
| | - Hung Nguyen
- Department of Computer Science and Engineering, University of Nevada, Reno, NV, 89557, USA
| | - Seungil Ro
- Department of Physiology and Cell Biology, University of Nevada School of Medicine, Reno, NV, 89557, USA
| | - Tin Nguyen
- Department of Computer Science and Engineering, University of Nevada, Reno, NV, 89557, USA.
| |
|
28
|
Li Y, Xu S, Ma S, Wu M. Network-based cancer heterogeneity analysis incorporating multi-view of prior information. Bioinformatics 2022; 38:2855-2862. [PMID: 35561185 PMCID: PMC9113254 DOI: 10.1093/bioinformatics/btac183] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 12/10/2021] [Revised: 02/22/2022] [Accepted: 03/22/2022] [Indexed: 02/03/2023] Open
Abstract
MOTIVATION Cancer genetic heterogeneity analysis has critical implications for tumour classification, response to therapy and choice of biomarkers to guide personalized cancer medicine. However, existing heterogeneity analysis based solely on molecular profiling data usually suffers from a lack of information and has limited effectiveness. Many biomedical and life sciences databases have accumulated a substantial volume of meaningful biological information. They can provide additional information beyond molecular profiling data, yet pose challenges arising from potential noise and uncertainty. RESULTS In this study, we aim to develop a more effective heterogeneity analysis method with the help of prior information. A network-based penalization technique is proposed to innovatively incorporate a multi-view of prior information from multiple databases, which accommodates heterogeneity attributed to both differential genes and gene relationships. To account for the fact that the prior information might not be fully credible, we propose a weighted strategy, where the weight is determined dependent on the data and can ensure that the present model is not excessively disturbed by incorrect information. Simulation and analysis of The Cancer Genome Atlas glioblastoma multiforme data demonstrate the practical applicability of the proposed method. AVAILABILITY AND IMPLEMENTATION R code implementing the proposed method is available at https://github.com/mengyunwu2020/PECM. The data that support the findings in this paper are openly available in TCGA (The Cancer Genome Atlas) at https://portal.gdc.cancer.gov/. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Yang Li
- Center for Applied Statistics, School of Statistics, Statistical Consulting Center, and RSS and China-Re Life Joint Lab on Public Health and Risk Management, Renmin University of China, Beijing 100872, China
| | - Shaodong Xu
- Center for Applied Statistics, School of Statistics, Statistical Consulting Center, and RSS and China-Re Life Joint Lab on Public Health and Risk Management, Renmin University of China, Beijing 100872, China
| | - Shuangge Ma
- Department of Biostatistics, Yale School of Public Health, New Haven, CT 06520, USA
| | - Mengyun Wu
- School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai 200433, China
| |
|
29
|
Zeng Y, Wei Z, Zhong F, Pan Z, Lu Y, Yang Y. A parameter-free deep embedded clustering method for single-cell RNA-seq data. Brief Bioinform 2022; 23:6582003. [PMID: 35524494 DOI: 10.1093/bib/bbac172] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 12/21/2021] [Revised: 03/25/2022] [Accepted: 04/18/2022] [Indexed: 11/12/2022] Open
Abstract
Clustering analysis is widely used in single-cell ribonucleic acid (RNA)-sequencing (scRNA-seq) data to discover cell heterogeneity and cell states. While many clustering methods have been developed for scRNA-seq analysis, most of them require the number of clusters to be provided. However, it is not easy to know the exact number of cell types in advance, and experience-based determination is not always reliable. Here, we have developed ADClust, an automatic deep embedding clustering method for scRNA-seq data, which can accurately cluster cells without requiring a predefined number of clusters. Specifically, ADClust first obtains low-dimensional representations through a pre-trained autoencoder and uses the representations to cluster cells into initial micro-clusters. The micro-clusters are then compared pairwise with a statistical test, and similar micro-clusters are merged into larger clusters. According to the clustering, cell representations are updated so that each cell is pulled toward the centers of its assigned cluster and similar clusters, while cells are separated to keep distances between clusters. This is accomplished by jointly optimizing the carefully designed clustering and autoencoder loss functions. The merging process continues until convergence. ADClust was tested on 11 real scRNA-seq datasets and was shown to outperform existing methods in terms of both clustering performance and the accuracy of the determined number of clusters. More importantly, our model provides high speed and scalability for large datasets.
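The micro-cluster merging idea in this abstract can be illustrated with a single merging pass. This is a sketch under loud assumptions: a plain centroid-distance threshold stands in for the statistical test (which the abstract does not specify), the representation update and the repeat-until-convergence loop are omitted, and the function names and `threshold` parameter are hypothetical.

```python
import math


def centroid(points):
    """Mean of a non-empty list of equal-length points."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]


def merge_micro_clusters(micro_clusters, threshold):
    """Merge micro-clusters whose centroids are closer than `threshold`.

    The distance threshold is a stand-in for a statistical similarity test.
    Returns a list of merged clusters, each a list of points.
    """
    centers = [centroid(c) for c in micro_clusters]
    parent = list(range(len(micro_clusters)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(centers)):
        for j in range(i + 1, len(centers)):
            if math.dist(centers[i], centers[j]) < threshold:
                parent[find(j)] = find(i)

    merged = {}
    for idx, cluster in enumerate(micro_clusters):
        merged.setdefault(find(idx), []).extend(cluster)
    return list(merged.values())


micro = [[(0, 0), (0, 1)], [(0.5, 0.5)], [(10, 10), (10, 11)]]
clusters = merge_micro_clusters(micro, threshold=2.0)
print(len(clusters))  # 2
```

The number of final clusters falls out of the merging rather than being supplied up front, which is the property the abstract emphasizes.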
Affiliation(s)
- Yuansong Zeng
- School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 510000, China
| | - Zhuoyi Wei
- School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 510000, China
| | - Fengqi Zhong
- School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 510000, China
| | - Zixiang Pan
- School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 510000, China
| | - Yutong Lu
- School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 510000, China
| | - Yuedong Yang
- School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 510000, China; Key Laboratory of Machine Intelligence and Advanced Computing (MOE), Guangzhou 510000, China
| |
|
30
|
Lin L, Shi W, Ye J, Li J. Multi-source single-cell data integration by MAW barycenter for Gaussian mixture models. Biometrics 2022. [DOI: 10.1111/biom.13630] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/06/2021] [Accepted: 01/29/2022] [Indexed: 11/26/2022]
Affiliation(s)
- Lin Lin
- Department of Biostatistics and Bioinformatics, Duke University, Durham, NC 27710, USA
| | - Wei Shi
- Department of Statistics and Data Science, National University of Singapore, 117546, Singapore
| | - Jianbo Ye
- Amazon Lab126, Sunnyvale, CA 94089, USA
| | - Jia Li
- Department of Statistics, Pennsylvania State University, University Park, PA 16802, USA
| |
|
31
|
Wang J, Xia J, Tan D, Lin R, Su Y, Zheng CH. scHFC: a hybrid fuzzy clustering method for single-cell RNA-seq data optimized by natural computation. Brief Bioinform 2022; 23:6523126. [PMID: 35136924 DOI: 10.1093/bib/bbab588] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 10/09/2021] [Revised: 12/08/2021] [Accepted: 12/22/2021] [Indexed: 12/13/2022] Open
Abstract
The rapid development of single-cell RNA sequencing (scRNA-seq) technology has allowed researchers to explore biological phenomena at the cellular scale. Clustering is a crucial and helpful step for studying the heterogeneity of cells. Although many clustering methods have been proposed, massive dropout events and the curse of dimensionality still make scRNA-seq data difficult to analyze, because they reduce the accuracy of clustering methods and lead to misidentification of cell types. In this work, we propose scHFC, a hybrid fuzzy clustering method optimized by natural computation and based on the Fuzzy C-Means (FCM) and Gath-Geva (GG) algorithms. Specifically, principal component analysis is used to reduce the dimensions of the scRNA-seq data after preprocessing. Then, the FCM algorithm, optimized by a simulated annealing algorithm and a genetic algorithm, is applied to cluster the data and output a membership matrix, which represents the initial clustering result and is taken as the input to the GG algorithm to obtain the final clustering results. We also develop a cluster number estimation method called multi-index comprehensive estimation, which can estimate the number of clusters well by combining four clustering effectiveness indexes. The performance of the scHFC method is evaluated on 17 scRNA-seq datasets and compared with six state-of-the-art methods. Experimental results validate the better performance of scHFC in terms of clustering accuracy and stability. In short, scHFC is an effective method for clustering cells in scRNA-seq data, and it shows great potential for downstream analysis of scRNA-seq data. The source code is available at https://github.com/WJ319/scHFC.
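The membership matrix that FCM produces in this pipeline can be illustrated with the standard Fuzzy C-Means updates. This is a minimal sketch of plain FCM only: the simulated annealing and genetic algorithm optimization, the PCA step, and the Gath-Geva refinement described in the abstract are not shown, and the function name, the fixed initial centers, and the parameter defaults are assumptions for illustration.

```python
import math


def fcm(points, init_centers, m=2.0, n_iter=50, eps=1e-12):
    """Plain Fuzzy C-Means with fuzzifier m > 1.

    Returns (memberships, centers); memberships[k][i] is the degree to
    which point k belongs to cluster i, and each row sums to 1.
    """
    centers = [list(c) for c in init_centers]
    dim = len(points[0])
    exponent = 2.0 / (m - 1.0)
    memberships = []
    for _ in range(n_iter):
        # Membership update: u_ki = 1 / sum_j (d_ki / d_kj)^(2 / (m - 1)).
        memberships = []
        for p in points:
            dists = [max(math.dist(p, c), eps) for c in centers]
            row = [1.0 / sum((d_i / d_j) ** exponent for d_j in dists) for d_i in dists]
            memberships.append(row)
        # Center update: mean of the points weighted by u_ki^m.
        for i in range(len(centers)):
            weights = [row[i] ** m for row in memberships]
            total = sum(weights)
            centers[i] = [
                sum(w * p[d] for w, p in zip(weights, points)) / total
                for d in range(dim)
            ]
    return memberships, centers


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
memberships, centers = fcm(points, init_centers=[(0, 0), (10, 10)])
# Hard labels can be read off as the cluster with the largest membership.
labels = [row.index(max(row)) for row in memberships]
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Unlike a hard assignment, each row of `memberships` keeps graded degrees of belonging, which is what allows a second algorithm to refine the result.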
Affiliation(s)
- Jing Wang
- Key Lab of Intelligent Computing and Signal Processing of Ministry of Education, School of Computer Science and Technology, Anhui University, Hefei, China
| | - Junfeng Xia
- Institutes of Physical Science and Information Technology, Anhui University, Hefei, China
| | - Dayu Tan
- Institutes of Physical Science and Information Technology, Anhui University, Hefei, China
| | - Rongxin Lin
- Key Lab of Intelligent Computing and Signal Processing of Ministry of Education, School of Computer Science and Technology, Anhui University, Hefei, China
| | - Yansen Su
- Key Lab of Intelligent Computing and Signal Processing of Ministry of Education, School of Artificial Intelligence, Anhui University, Hefei, China
| | - Chun-Hou Zheng
- Key Lab of Intelligent Computing and Signal Processing of Ministry of Education, School of Artificial Intelligence, Anhui University, Hefei, China
| |
|
32
|
Flores M, Liu Z, Zhang T, Hasib MM, Chiu YC, Ye Z, Paniagua K, Jo S, Zhang J, Gao SJ, Jin YF, Chen Y, Huang Y. Deep learning tackles single-cell analysis-a survey of deep learning for scRNA-seq analysis. Brief Bioinform 2022; 23:bbab531. [PMID: 34929734 PMCID: PMC8769926 DOI: 10.1093/bib/bbab531] [Citation(s) in RCA: 29] [Impact Index Per Article: 9.7] [Received: 09/16/2021] [Revised: 11/15/2021] [Accepted: 11/16/2021] [Indexed: 12/17/2022] Open
Abstract
Since their selection as Method of the Year in 2013, single-cell technologies have become mature enough to provide answers to complex research questions. With the growth of single-cell profiling technologies, there has also been a significant increase in the data collected from single-cell profiling, resulting in computational challenges in processing these massive and complicated datasets. To address these challenges, deep learning (DL) is positioned as a competitive alternative to traditional machine learning approaches for single-cell analyses. Here, we survey a total of 25 DL algorithms and their applicability to specific steps in the single-cell RNA-seq processing pipeline. Specifically, we establish a unified mathematical representation of variational autoencoder, autoencoder, generative adversarial network and supervised DL models, compare the training strategies and loss functions for these models, and relate the loss functions of these models to specific objectives of the data processing step. Such a presentation will allow readers to choose suitable algorithms for their particular objective at each step in the pipeline. We envision that this survey will serve as an important information portal for learning the application of DL for scRNA-seq analysis and inspire innovative uses of DL to address a broader range of new challenges in emerging multi-omics and spatial single-cell sequencing.
Affiliation(s)
- Mario Flores
- Department of Electrical and Computer Engineering, the University of Texas at San Antonio, San Antonio, TX 78249, USA
| | - Zhentao Liu
- Department of Electrical and Computer Engineering, the University of Texas at San Antonio, San Antonio, TX 78249, USA
| | - Tinghe Zhang
- Department of Electrical and Computer Engineering, the University of Texas at San Antonio, San Antonio, TX 78249, USA
| | - Md Musaddaqui Hasib
- Department of Electrical and Computer Engineering, the University of Texas at San Antonio, San Antonio, TX 78249, USA
| | - Yu-Chiao Chiu
- Greehey Children’s Cancer Research Institute, University of Texas Health San Antonio, San Antonio, TX 78229, USA
| | - Zhenqing Ye
- Greehey Children’s Cancer Research Institute, University of Texas Health San Antonio, San Antonio, TX 78229, USA
- Department of Population Health Sciences, University of Texas Health San Antonio, San Antonio, TX 78229, USA
| | - Karla Paniagua
- Department of Electrical and Computer Engineering, the University of Texas at San Antonio, San Antonio, TX 78249, USA
| | - Sumin Jo
- Department of Electrical and Computer Engineering, the University of Texas at San Antonio, San Antonio, TX 78249, USA
| | - Jianqiu Zhang
- Department of Electrical and Computer Engineering, the University of Texas at San Antonio, San Antonio, TX 78249, USA
| | - Shou-Jiang Gao
- Department of Microbiology and Molecular Genetics, University of Pittsburgh, Pittsburgh, Pennsylvania, PA 15232, USA
- UPMC Hillman Cancer Center, University of Pittsburgh, PA 15232, USA
| | - Yu-Fang Jin
- Department of Electrical and Computer Engineering, the University of Texas at San Antonio, San Antonio, TX 78249, USA
| | - Yidong Chen
- Greehey Children’s Cancer Research Institute, University of Texas Health San Antonio, San Antonio, TX 78229, USA
- Department of Population Health Sciences, University of Texas Health San Antonio, San Antonio, TX 78229, USA
| | - Yufei Huang
- Department of Medicine, School of Medicine, University of Pittsburgh, PA 15232, USA
- UPMC Hillman Cancer Center, University of Pittsburgh, PA 15232, USA
| |
|
33
|
Bao S, Li K, Yan C, Zhang Z, Qu J, Zhou M. Deep learning-based advances and applications for single-cell RNA-sequencing data analysis. Brief Bioinform 2021; 23:6444320. [PMID: 34849562 DOI: 10.1093/bib/bbab473] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Received: 07/06/2021] [Revised: 09/24/2021] [Accepted: 10/15/2021] [Indexed: 11/14/2022] Open
Abstract
The rapid development of single-cell RNA-sequencing (scRNA-seq) technology has raised significant computational and analytical challenges. The application of deep learning to scRNA-seq data analysis is rapidly evolving and can overcome the unique challenges in upstream (quality control and normalization) and downstream (cell-, gene- and pathway-level) analysis of scRNA-seq data. In the present study, recent advances and applications of deep learning-based methods, together with specific tools for scRNA-seq data analysis, were summarized. Moreover, the future perspectives and challenges of deep-learning techniques regarding the appropriate analysis and interpretation of scRNA-seq data were investigated. The present study aimed to provide evidence supporting the biomedical application of deep learning-based tools and may aid biologists and bioinformaticians in navigating this exciting and fast-moving area.
Affiliation(s)
- Siqi Bao
- School of Information and Communication Engineering, Hainan University, Haikou 570228, P. R. China; School of Biomedical Engineering, School of Ophthalmology & Optometry and Eye Hospital, Wenzhou Medical University, Wenzhou 325027, P. R. China; Hainan Institute of Real World Data, Haikou 570228, P. R. China
| | - Ke Li
- School of Biomedical Engineering, School of Ophthalmology & Optometry and Eye Hospital, Wenzhou Medical University, Wenzhou 325027, P. R. China
| | - Congcong Yan
- School of Biomedical Engineering, School of Ophthalmology & Optometry and Eye Hospital, Wenzhou Medical University, Wenzhou 325027, P. R. China
| | - Zicheng Zhang
- School of Biomedical Engineering, School of Ophthalmology & Optometry and Eye Hospital, Wenzhou Medical University, Wenzhou 325027, P. R. China
| | - Jia Qu
- School of Information and Communication Engineering, Hainan University, Haikou 570228, P. R. China; School of Biomedical Engineering, School of Ophthalmology & Optometry and Eye Hospital, Wenzhou Medical University, Wenzhou 325027, P. R. China; Hainan Institute of Real World Data, Haikou 570228, P. R. China
| | - Meng Zhou
- School of Biomedical Engineering, School of Ophthalmology & Optometry and Eye Hospital, Wenzhou Medical University, Wenzhou 325027, P. R. China
| |
|
34
|
Li X, Zhang S, Wong KC. Deep embedded clustering with multiple objectives on scRNA-seq data. Brief Bioinform 2021; 22:6209682. [PMID: 33822877 DOI: 10.1093/bib/bbab090] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 12/10/2020] [Revised: 02/17/2021] [Accepted: 02/25/2021] [Indexed: 12/19/2022] Open
Abstract
In recent years, single-cell RNA sequencing (scRNA-seq) technologies have been widely adopted to interrogate the gene expression of individual cells; they bring opportunities to understand the underlying processes in a high-throughput manner. Deep embedded clustering (DEC) has proved successful on high-dimensional sparse scRNA-seq data by jointly performing feature learning and cluster assignment to identify cell types. However, the deep network architecture for embedded clustering is not trivial to optimize. Therefore, we propose an evolutionary multiobjective DEC that uses multiobjective evolutionary optimization to simultaneously evolve the hyperparameters and architectures of DEC in an automatic manner. First, a denoising autoencoder is integrated into the DEC to project the high-dimensional sparse scRNA-seq data into a low-dimensional space. After that, to guide the evolution, three objective functions are formulated to balance the model's generality and clustering performance for robustness. Meanwhile, migration and mutation operators are proposed to optimize the objective functions and select suitable hyperparameters and architectures of DEC in the multiobjective framework. Multiple comparison analyses are conducted on twenty synthetic datasets and eight real datasets from different representative single-cell sequencing platforms to validate the effectiveness. The experimental results reveal that the proposed algorithm outperforms other state-of-the-art clustering methods under different metrics. Meanwhile, marker gene identification, gene ontology enrichment and pathology analysis are conducted to reveal novel insights into cell type identification and characterization mechanisms.
Affiliation(s)
- Xiangtao Li
- School of Artificial Intelligence, Jilin University, Jilin, China
| | - Shixiong Zhang
- Department of Computer Science, City University of Hong Kong, Hong Kong SAR
| | - Ka-Chun Wong
- Department of Computer Science, City University of Hong Kong, Hong Kong SAR
| |
|