1
Lan J, Zhuo X, Ye S, Deng J. A semi-supervised non-negative matrix factorization model for scRNA-seq data analysis. Appl Soft Comput 2025; 174:112982. [DOI: 10.1016/j.asoc.2025.112982] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/02/2025]
2
Zhou L, Peng X, Chen M, He X, Tian G, Yang J, Peng L. Unveiling patterns in spatial transcriptomics data: a novel approach utilizing graph attention autoencoder and multiscale deep subspace clustering network. Gigascience 2025; 14:giae103. [PMID: 39804726 PMCID: PMC11727722 DOI: 10.1093/gigascience/giae103] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/07/2024] [Revised: 07/06/2024] [Accepted: 11/21/2024] [Indexed: 01/16/2025] Open
Abstract
BACKGROUND The accurate deciphering of spatial domains, along with the identification of differentially expressed genes and the inference of cellular trajectories from spatial transcriptomic (ST) data, holds significant potential for enhancing our understanding of tissue organization and biological functions. However, most spatial clustering methods can neither decipher complex structures in ST data nor fully exploit the features embedded in different layers. RESULTS This article introduces STMSGAL, a novel framework for analyzing ST data that incorporates a graph attention autoencoder and multiscale deep subspace clustering. First, STMSGAL constructs ctaSNN, a cell type-aware shared nearest neighbor graph, using Louvain clustering based exclusively on gene expression profiles. Subsequently, it integrates expression profiles and ctaSNN to generate spot latent representations using a graph attention autoencoder and multiscale deep subspace clustering. Lastly, STMSGAL implements spatial clustering, differential expression analysis, and trajectory inference, providing comprehensive capabilities for thorough data exploration and interpretation. STMSGAL was evaluated against 7 methods (SCANPY, SEDR, CCST, DeepST, GraphST, STAGATE, and SiGra) using four 10x Genomics Visium datasets, 1 mouse visual cortex STARmap dataset, and 2 Stereo-seq mouse embryo datasets. In this comparison, STMSGAL performed strongly on the Davies-Bouldin, Calinski-Harabasz, S_Dbw, and ARI metrics. STMSGAL significantly enhanced the identification of layer structures across ST data with different spatial resolutions and accurately delineated spatial domains in 2 breast cancer tissues, adult mouse brain (FFPE), and mouse embryos. CONCLUSIONS STMSGAL can serve as an essential tool for bridging the analysis of cellular spatial organization and disease pathology, offering valuable insights for researchers in the field.
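The abstract does not spell out how ctaSNN is built, so the following is only a minimal sketch of one plausible reading: a shared nearest neighbor graph over spots, pruned so that edges survive only between spots with the same pre-cluster label. The function names, the use of Euclidean distance on spot coordinates, and the toy labels (standing in for the output of a Louvain pre-clustering) are all assumptions made for illustration, not details taken from STMSGAL.

```python
import math

def knn_indices(points, k):
    """Return, for each point, the set of indices of its k nearest neighbors."""
    neighbors = []
    for i, p in enumerate(points):
        dists = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        neighbors.append({j for _, j in dists[:k]})
    return neighbors

def cell_type_aware_snn(points, labels, k):
    """Build an SNN graph and keep only edges between same-label spots.

    The weight of edge (i, j) is the number of k-nearest neighbors that
    spots i and j have in common. Pruning by label is an assumed reading
    of "cell type-aware".
    """
    nn = knn_indices(points, k)
    edges = {}
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            shared = len(nn[i] & nn[j])
            if shared > 0 and labels[i] == labels[j]:
                edges[(i, j)] = shared
    return edges

# Toy input: two well-separated groups of three spots in 2-D.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
labels = [0, 0, 0, 1, 1, 1]  # hypothetical pre-cluster labels
graph = cell_type_aware_snn(points, labels, k=2)
```

The pairwise loop is quadratic in the number of spots, which is fine for a toy example but would need a spatial index or an approximate neighbor search on real ST data.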
Affiliation(s)
- Liqian Zhou
  - School of Computer Science, Hunan University of Technology, Zhuzhou 412007, Hunan, China
- Xinhuai Peng
  - School of Computer Science, Hunan University of Technology, Zhuzhou 412007, Hunan, China
- Min Chen
  - School of Computer Science, Hunan Institute of Technology, Hengyang 421002, Hunan, China
- Xianzhi He
  - School of Computer Science, Hunan University of Technology, Zhuzhou 412007, Hunan, China
- Geng Tian
  - Geneis (Beijing) Co. Ltd., Beijing 100102, China
- Lihong Peng
  - School of Computer Science, Hunan University of Technology, Zhuzhou 412007, Hunan, China
  - College of Life Science and Chemistry, Hunan University of Technology, Zhuzhou 412007, Hunan, China
3
Cui L, Guo G, Ng MK, Zou Q, Qiu Y. GSTRPCA: irregular tensor singular value decomposition for single-cell multi-omics data clustering. Brief Bioinform 2024; 26:bbae649. [PMID: 39680741 DOI: 10.1093/bib/bbae649] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/25/2024] [Revised: 10/29/2024] [Accepted: 12/01/2024] [Indexed: 12/18/2024] Open
Abstract
Single-cell multi-omics refers to the various types of biological data measured at the single-cell level. These data have provided high-resolution insight into cellular phenotypes, biological processes, and developmental stages. Current advances hold high potential for breakthroughs by integrating multiple different omics layers. However, single-cell multi-omics data usually have different feature dimensions and direct or indirect relationships. Preserving the structure of these heterogeneous data while extracting hidden relationships is a major challenge for omics data integration, and effective integration models are urgently needed. In this paper, we propose an irregular tensor decomposition model (GSTRPCA) based on tensor robust principal component analysis (TRPCA). We developed a weighted threshold model for the decomposition of irregular tensor data by combining low-rank and sparsity constraints, which requires that the low-dimensional embeddings of the data remain low-rank and sparse. The major advantage of the GSTRPCA algorithm is its ability to keep the original data structure and explore hidden related features among omics data. For GSTRPCA, we also designed an effective algorithm that theoretically guarantees global convergence for the tensor decomposition. The computational experiments on irregular tensor datasets demonstrate that GSTRPCA significantly outperformed the state-of-the-art methods and hence confirm the superiority of GSTRPCA in clustering single-cell multi-omics data. To our knowledge, this is the first tensor decomposition method for irregular tensor data to keep the data structure and hence improve the clustering performance for single-cell multi-omics data. GSTRPCA is a MATLAB-based algorithm, and the code is available from https://github.com/GGL-B/GSTRPCA.
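Robust PCA methods typically enforce sparsity through a soft-thresholding (shrinkage) step, and a weighted variant scales the threshold per entry. The sketch below shows that generic operator only. It is not the GSTRPCA update, whose exact weighting scheme the abstract does not give; the function names and the example weights are assumptions.

```python
def soft_threshold(x, tau):
    """Shrink x toward zero by tau (the proximal operator of tau * |x|)."""
    if x > tau:
        return x - tau
    if x < -tau:
        return x + tau
    return 0.0

def weighted_soft_threshold(values, weights, tau):
    """Apply soft-thresholding with a per-entry threshold weights[i] * tau."""
    return [soft_threshold(v, w * tau) for v, w in zip(values, weights)]

# Entries with a larger weight are penalized more and vanish sooner.
shrunk = weighted_soft_threshold([3.0, -0.5, 1.0], [1.0, 1.0, 2.0], tau=0.5)
```

Applied to the entries of a residual, this operator promotes sparsity; applied to singular values instead, the same shrinkage promotes low rank, which is how the two constraints named in the abstract are commonly handled in the robust PCA family.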
Affiliation(s)
- Lubin Cui
  - School of Mathematics and Statistics, Henan Normal University, Xinxiang 453007, China
- Guiliang Guo
  - School of Mathematics and Statistics, Henan Normal University, Xinxiang 453007, China
- Michael K Ng
  - Department of Mathematics, Hong Kong Baptist University, Hong Kong 999077, China
- Quan Zou
  - Institute of Fundamental and Frontier Sciences, Electronic Science and Technology University, Chengdu 611731, China
- Yushan Qiu
  - School of Mathematical Sciences, Shenzhen University, Guangdong 518000, China
4
Ding Q, Yang W, Xue G, Liu H, Cai Y, Que J, Jin X, Luo M, Pang F, Yang Y, Lin Y, Liu Y, Sun H, Tan R, Wang P, Xu Z, Jiang Q. Dimension reduction, cell clustering, and cell-cell communication inference for single-cell transcriptomics with DcjComm. Genome Biol 2024; 25:241. [PMID: 39252099 PMCID: PMC11382422 DOI: 10.1186/s13059-024-03385-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/09/2024] [Accepted: 08/30/2024] [Indexed: 09/11/2024] Open
Abstract
Advances in single-cell transcriptomics provide an unprecedented opportunity to explore complex biological processes. However, computational methods for analyzing single-cell transcriptomics still have room for improvement especially in dimension reduction, cell clustering, and cell-cell communication inference. Herein, we propose a versatile method, named DcjComm, for comprehensive analysis of single-cell transcriptomics. DcjComm detects functional modules to explore expression patterns and performs dimension reduction and clustering to discover cellular identities by the non-negative matrix factorization-based joint learning model. DcjComm then infers cell-cell communication by integrating ligand-receptor pairs, transcription factors, and target genes. DcjComm demonstrates superior performance compared to state-of-the-art methods.
Affiliation(s)
- Qian Ding
  - Center for Bioinformatics, School of Life Science and Technology, Harbin Institute of Technology, Harbin, 150000, China
- Wenyi Yang
  - Center for Bioinformatics, School of Life Science and Technology, Harbin Institute of Technology, Harbin, 150000, China
- Guangfu Xue
  - Center for Bioinformatics, School of Life Science and Technology, Harbin Institute of Technology, Harbin, 150000, China
- Hongxin Liu
  - Center for Bioinformatics, School of Life Science and Technology, Harbin Institute of Technology, Harbin, 150000, China
- Yideng Cai
  - Center for Bioinformatics, School of Life Science and Technology, Harbin Institute of Technology, Harbin, 150000, China
- Jinhao Que
  - Center for Bioinformatics, School of Life Science and Technology, Harbin Institute of Technology, Harbin, 150000, China
- Xiyun Jin
  - School of Interdisciplinary Medicine and Engineering, Harbin Medical University, Harbin, 150076, China
- Meng Luo
  - Center for Bioinformatics, School of Life Science and Technology, Harbin Institute of Technology, Harbin, 150000, China
- Fenglan Pang
  - Center for Bioinformatics, School of Life Science and Technology, Harbin Institute of Technology, Harbin, 150000, China
- Yuexin Yang
  - Center for Bioinformatics, School of Life Science and Technology, Harbin Institute of Technology, Harbin, 150000, China
- Yi Lin
  - School of Interdisciplinary Medicine and Engineering, Harbin Medical University, Harbin, 150076, China
- Yusong Liu
  - School of Interdisciplinary Medicine and Engineering, Harbin Medical University, Harbin, 150076, China
- Haoxiu Sun
  - School of Interdisciplinary Medicine and Engineering, Harbin Medical University, Harbin, 150076, China
- Renjie Tan
  - School of Interdisciplinary Medicine and Engineering, Harbin Medical University, Harbin, 150076, China
- Pingping Wang
  - School of Interdisciplinary Medicine and Engineering, Harbin Medical University, Harbin, 150076, China.
- Zhaochun Xu
  - School of Interdisciplinary Medicine and Engineering, Harbin Medical University, Harbin, 150076, China.
- Qinghua Jiang
  - Center for Bioinformatics, School of Life Science and Technology, Harbin Institute of Technology, Harbin, 150000, China.
  - School of Interdisciplinary Medicine and Engineering, Harbin Medical University, Harbin, 150076, China.
  - State Key Laboratory of Frigid Zone Cardiovascular Diseases (SKLFZCD), Harbin Medical University, Harbin, 150076, China.
5
Xu L, Li Z, Ren J, Liu S, Xu Y. Single-cell RNA sequencing data analysis utilizing multi-type graph neural networks. Comput Biol Med 2024; 179:108921. [PMID: 39059210 DOI: 10.1016/j.compbiomed.2024.108921] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/05/2024] [Revised: 07/08/2024] [Accepted: 07/16/2024] [Indexed: 07/28/2024]
Abstract
Single-cell RNA sequencing (scRNA-seq) sequences individual cells, whose expression profiles reflect the overall characteristics of each cell, facilitating research at the cellular level. However, problems such as dimensionality reduction of massive data, technical noise in the data, and visualization of single-cell type clustering make scRNA-seq data difficult to analyze and process. In this paper, we propose a new single-cell data analysis model using a denoising autoencoder and multi-type graph neural networks (scDMG), which learns cell-cell topology information and a latent representation of scRNA-seq data. scDMG introduces the zero-inflated negative binomial (ZINB) model into a denoising autoencoder (DAE) to perform dimensionality reduction and denoising on the raw data. scDMG integrates multiple-type graph neural networks as the encoder to further train the preprocessed data, which better deals with various types of scRNA-seq datasets, resolves dropout events in scRNA-seq data, and enables preliminary classification of scRNA-seq data. By applying the t-SNE and PCA algorithms to the trained data and invoking the Louvain algorithm, scDMG achieves better dimensionality reduction and clustering. Compared with other mainstream scRNA-seq clustering algorithms, scDMG performs better on various clustering performance metrics and shows better scalability and shorter runtime.
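For readers unfamiliar with the ZINB model mentioned above, the following is a minimal sketch of its log-probability for a single count, using the standard mean/dispersion parameterization of the negative binomial. The parameter names (`mu`, `theta`, `pi`) are conventional choices rather than names from scDMG, and how scDMG wires this likelihood into its autoencoder loss is not described in the abstract.

```python
import math

def nb_log_pmf(x, mu, theta):
    """Log-probability of count x under a negative binomial with
    mean mu > 0 and dispersion theta > 0."""
    return (
        math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1)
        + theta * math.log(theta / (theta + mu))
        + x * math.log(mu / (theta + mu))
    )

def zinb_log_pmf(x, mu, theta, pi):
    """Log-probability of count x under a zero-inflated negative binomial.

    With probability pi (0 < pi < 1) the count is a structural zero
    (a dropout); otherwise it is drawn from the negative binomial.
    """
    if x == 0:
        nb_zero = math.exp(nb_log_pmf(0, mu, theta))
        return math.log(pi + (1.0 - pi) * nb_zero)
    return math.log(1.0 - pi) + nb_log_pmf(x, mu, theta)

# Sanity check on a toy setting: the probabilities should sum to about 1.
total = sum(math.exp(zinb_log_pmf(x, mu=2.0, theta=1.5, pi=0.3))
            for x in range(200))
```

In a ZINB autoencoder, a decoder would usually output `mu`, `theta`, and `pi` per gene and cell, and the negative of this log-probability, summed over entries, would serve as the reconstruction loss.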
Affiliation(s)
- Li Xu
  - College of Computer Science and Technology, Harbin Engineering University, Harbin, 150001, Heilongjiang, China
- Zhenpeng Li
  - College of Computer Science and Technology, Harbin Engineering University, Harbin, 150001, Heilongjiang, China.
- Jiaxu Ren
  - College of Computer Science and Technology, Harbin Engineering University, Harbin, 150001, Heilongjiang, China
- Shuaipeng Liu
  - College of Computer Science and Technology, Harbin Engineering University, Harbin, 150001, Heilongjiang, China
- Yiming Xu
  - College of Engineering, Tokyo Institute of Technology, Tokyo, 226-0026, Tokyo, Japan
6
Jiang H, Wang MN, Huang YA, Huang Y. Graph-Regularized Non-Negative Matrix Factorization for Single-Cell Clustering in scRNA-Seq Data. IEEE J Biomed Health Inform 2024; 28:4986-4994. [PMID: 38787664 DOI: 10.1109/jbhi.2024.3400050] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/26/2024]
Abstract
The advent of single-cell RNA sequencing (scRNA-seq) has brought forth fresh perspectives on intricate biological processes, revealing the nuances and divergences present among distinct cells. Accurate single-cell analysis is a crucial prerequisite for in-depth investigation into the underlying mechanisms of heterogeneity. Due to various technical noises, like the impact of dropout values, scRNA-seq data remains challenging to interpret. In this work, we propose an unsupervised learning framework for scRNA-seq data analysis (aka Sc-GNNMF). Based on the non-negativity and sparsity of scRNA-seq data, we propose employing graph-regularized non-negative matrix factorization (GNNMF) algorithm for the analysis of scRNA-seq data, which involves estimating cell-cell sparse similarity and gene-gene sparse similarity through Laplacian kernels and p-nearest neighbor graphs (p-NNG). By assuming intrinsic geometric local invariance, we use a weighted p-nearest known neighbors (p-NKN) to optimize the scRNA-seq data. The optimized scRNA-seq data then participates in the matrix decomposition process, promoting the closeness of cells with similar types in cell-gene data space and determining a more suitable embedding space for clustering. Sc-GNNMF demonstrates superior performance compared to other methods and maintains satisfactory compatibility and robustness, as evidenced by experiments on 11 real scRNA-seq datasets. Furthermore, Sc-GNNMF yields excellent results in clustering tasks, extracting useful gene markers, and pseudo-temporal analysis.
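To make the idea of graph-regularized NMF concrete, here is a minimal NumPy sketch of the classic single-graph formulation with multiplicative updates, minimizing ||X - U Vᵀ||² + λ·Tr(Vᵀ L V). It is a swapped-in textbook variant, not the Sc-GNNMF algorithm: it regularizes only the cell factor, uses a hand-built affinity matrix instead of Laplacian kernels and p-NNG, and the names `gnmf`, `lam`, and the toy data are assumptions.

```python
import numpy as np

def gnmf(X, W, rank, lam=1.0, n_iter=200, eps=1e-10, seed=0):
    """Graph-regularized NMF via multiplicative updates.

    X: non-negative genes x cells matrix.
    W: symmetric non-negative cells x cells affinity matrix.
    Returns U (genes x rank), V (cells x rank) and the objective per iteration.
    """
    rng = np.random.default_rng(seed)
    U = rng.random((X.shape[0], rank))
    V = rng.random((X.shape[1], rank))
    D = np.diag(W.sum(axis=1))   # degree matrix
    L = D - W                    # graph Laplacian
    history = []
    for _ in range(n_iter):
        U *= (X @ V) / (U @ (V.T @ V) + eps)
        V *= (X.T @ U + lam * (W @ V)) / (V @ (U.T @ U) + lam * (D @ V) + eps)
        objective = np.linalg.norm(X - U @ V.T) ** 2 + lam * np.trace(V.T @ L @ V)
        history.append(objective)
    return U, V, history

# Toy data: 6 genes x 8 cells, with cells 0-3 and 4-7 linked in the graph.
rng = np.random.default_rng(1)
X = rng.random((6, 8))
W = np.kron(np.eye(2), np.ones((4, 4))) - np.eye(8)
U, V, history = gnmf(X, W, rank=2)
```

The rows of `V` are the low-dimensional cell embeddings; the graph term pulls embeddings of connected cells together, which is the "closeness of cells with similar types" that the abstract refers to.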
7
Gao Q, Ai Q. DCRELM: dual correlation reduction network-based extreme learning machine for single-cell RNA-seq data clustering. Sci Rep 2024; 14:13541. [PMID: 38866896 PMCID: PMC11169517 DOI: 10.1038/s41598-024-64217-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/26/2024] [Accepted: 06/06/2024] [Indexed: 06/14/2024] Open
Abstract
Single-cell ribonucleic acid sequencing (scRNA-seq) is a high-throughput genomic technique that is utilized to investigate single-cell transcriptomes. Cluster analysis can effectively reveal the heterogeneity and diversity of cells in scRNA-seq data, but existing clustering algorithms struggle with the inherent high dimensionality, noise, and sparsity of scRNA-seq data. To overcome these limitations, we propose a clustering algorithm: the Dual Correlation Reduction network-based Extreme Learning Machine (DCRELM). First, DCRELM obtains the low-dimensional and dense result features of scRNA-seq data in an extreme learning machine (ELM) random mapping space. Second, the ELM graph distortion module is employed to obtain a dual view of the resulting features, effectively enhancing their robustness. Third, the autoencoder fusion module is employed to learn the attributes and structural information of the resulting features, and merge these two types of information to generate consistent latent representations of these features. Fourth, the dual information reduction network is used to filter the redundant information and noise in the dual consistent latent representations. Last, a triplet self-supervised learning mechanism is utilized to further improve the clustering performance. Extensive experiments show that the DCRELM performs well in terms of clustering performance and robustness. The code is available at https://github.com/gaoqingyun-lucky/awesome-DCRELM.
Affiliation(s)
- Qingyun Gao
  - School of Computer Science and Software Engineering, University of Science and Technology Liaoning, Anshan, 114051, China
- Qing Ai
  - School of Computer Science and Software Engineering, University of Science and Technology Liaoning, Anshan, 114051, China.
8
Qiu Y, Guo D, Zhao P, Zou Q. scMNMF: a novel method for single-cell multi-omics clustering based on matrix factorization. Brief Bioinform 2024; 25:bbae228. [PMID: 38754408 PMCID: PMC11097994 DOI: 10.1093/bib/bbae228] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/31/2024] [Revised: 04/02/2024] [Accepted: 04/22/2024] [Indexed: 05/18/2024] Open
Abstract
MOTIVATION The technology for analyzing single-cell multi-omics data has advanced rapidly and has provided comprehensive and accurate cellular information by exploring cell heterogeneity in genomics, transcriptomics, epigenomics, metabolomics and proteomics data. However, because of the high-dimensional and sparse characteristics of single-cell multi-omics data, as well as the limitations of various analysis algorithms, the clustering performance is generally poor. Matrix factorization is an unsupervised, dimensionality reduction-based method that can cluster individuals and discover related omics variables from different blocks. Here, we present a novel algorithm, named scMNMF, that performs joint dimensionality reduction learning and cell clustering analysis on single-cell multi-omics data using non-negative matrix factorization. We formulate the objective function of joint learning as a constrained optimization problem and derive the corresponding iterative formulas through alternating iterative algorithms. The major advantage of the scMNMF algorithm is its capability to explore hidden related features among omics data. Additionally, the feature selection for dimensionality reduction and cell clustering mutually influence each other iteratively, leading to a more effective discovery of cell types. We validated the performance of the scMNMF algorithm using two simulated and five real datasets. The results show that scMNMF outperformed seven other state-of-the-art algorithms in various measurements. AVAILABILITY AND IMPLEMENTATION scMNMF code can be found at https://github.com/yushanqiu/scMNMF.
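One common way to factorize several omics layers jointly is to give each layer its own feature factor while sharing a single cell factor across layers. The NumPy sketch below illustrates that generic shared-factor NMF with alternating multiplicative updates. The scMNMF objective includes constraints that the abstract does not detail, so this is an assumed simplification, and the names `joint_nmf`, `rna`, and `atac` are hypothetical.

```python
import numpy as np

def joint_nmf(views, rank, n_iter=200, eps=1e-10, seed=0):
    """Factorize each view X_i (features_i x cells) as W_i @ H with a shared H.

    Minimizes sum_i ||X_i - W_i H||^2 by alternating multiplicative updates.
    Returns the per-view factors, the shared cell factor, and the loss trace.
    """
    rng = np.random.default_rng(seed)
    n_cells = views[0].shape[1]
    Ws = [rng.random((X.shape[0], rank)) for X in views]
    H = rng.random((rank, n_cells))
    losses = []
    for _ in range(n_iter):
        # Update each view-specific factor with H held fixed.
        for X, W in zip(views, Ws):
            W *= (X @ H.T) / (W @ (H @ H.T) + eps)
        # Update the shared cell factor using evidence from all views.
        numer = sum(W.T @ X for X, W in zip(views, Ws))
        denom = sum(W.T @ W @ H for W in Ws) + eps
        H *= numer / denom
        losses.append(sum(np.linalg.norm(X - W @ H) ** 2
                          for X, W in zip(views, Ws)))
    return Ws, H, losses

# Toy data: two omics layers measured on the same 10 cells.
rng = np.random.default_rng(1)
rna = rng.random((5, 10))
atac = rng.random((7, 10))
Ws, H, losses = joint_nmf([rna, atac], rank=2)
cluster_labels = H.argmax(axis=0)  # assign each cell to its dominant factor
```

Because `H` is updated from all layers at once, the dimensionality reduction and the cluster assignment read off `H` inform each other across iterations, which is the kind of mutual influence the abstract describes.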
Affiliation(s)
- Yushan Qiu
  - School of Mathematical Sciences, Shenzhen University, 518000, Guangdong, China
- Dong Guo
  - School of Mathematical Sciences, Shenzhen University, 518000, Guangdong, China
- Pu Zhao
  - College of Life and Health Sciences, Northeastern University, Shenyang, 110169, China
- Quan Zou
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, 610056, China
9
Peng L, Gao P, Xiong W, Li Z, Chen X. Identifying potential ligand-receptor interactions based on gradient boosted neural network and interpretable boosting machine for intercellular communication analysis. Comput Biol Med 2024; 171:108110. [PMID: 38367445 DOI: 10.1016/j.compbiomed.2024.108110] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/30/2023] [Revised: 01/24/2024] [Accepted: 02/04/2024] [Indexed: 02/19/2024]
Abstract
Cell-cell communication is essential to many key biological processes. Intercellular communication is generally mediated by ligand-receptor interactions (LRIs). Thus, building a comprehensive and high-quality LRI resource can significantly improve intercellular communication analysis. Meanwhile, due to the lack of a "gold standard" dataset, it remains a challenge to evaluate LRI-mediated intercellular communication results. Here, we introduce CellGiQ, a high-confidence LRI prediction framework for intercellular communication analysis. High-confidence LRIs are first inferred by LRI feature extraction with BioTriangle, LRI selection using LightGBM, and LRI classification based on an ensemble of a gradient boosted neural network and an interpretable boosting machine. Subsequently, known and identified high-confidence LRIs are filtered by combining single-cell RNA sequencing (scRNA-seq) data and further applied to intercellular communication inference through a quartile scoring strategy. To validate the predictions, CellGiQ exploited several evaluation strategies: in terms of AUC and AUPR, it surpassed six competing LRI prediction models on four LRI datasets; through Venn diagrams and molecular docking, its predicted LRIs were validated by five other popular intercellular communication inference methods; based on the overlapping LRIs, it achieved a high Jaccard index with six other state-of-the-art intercellular communication prediction tools within human HNSCC tissues; and by comparison with classical models and literature retrieval, its inferred HNSCC-related intercellular communication results were further validated. The novelty of this study is to identify high-confidence LRIs based on machine learning and to design several LRI validation strategies, providing a reference for computational LRI prediction. CellGiQ provides an open-source and useful tool to decompose LRI-mediated intercellular communication at single-cell resolution. CellGiQ is freely available at https://github.com/plhhnu/CellGiQ.
Affiliation(s)
- Lihong Peng
  - College of Life Science and Chemistry, Hunan University of Technology, Zhuzhou, 412007, Hunan, China
- Pengfei Gao
  - College of Life Science and Chemistry, Hunan University of Technology, Zhuzhou, 412007, Hunan, China
- Wei Xiong
  - College of Life Science and Chemistry, Hunan University of Technology, Zhuzhou, 412007, Hunan, China
- Zejun Li
  - School of Computer Science and Engineering, Hunan Institute of Technology, Hengyang, 421002, Hunan, China.
- Xing Chen
  - School of Science, Jiangnan University, Wuxi, 214122, Jiangsu, China.
10
Zhu X, Meng S, Li G, Wang J, Peng X. AGImpute: imputation of scRNA-seq data based on a hybrid GAN with dropouts identification. Bioinformatics 2024; 40:btae068. [PMID: 38317025 PMCID: PMC10877090 DOI: 10.1093/bioinformatics/btae068] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/19/2024] [Revised: 02/19/2024] [Accepted: 02/19/2024] [Indexed: 02/07/2024] Open
Abstract
MOTIVATION Dropout events bring challenges in analyzing single-cell RNA sequencing data as they introduce noise and distort the true distributions of gene expression profiles. Recent studies focus on estimating dropout probability and imputing dropout events by leveraging information from similar cells or genes. However, the number of dropout events differs between cells due to complex factors, such as different sequencing protocols, cell types, and batch effects. These differences are not fully considered when assessing the similarities between cells and genes, which compromises the reliability of downstream analysis. RESULTS This work proposes a hybrid Generative Adversarial Network with dropouts identification to impute single-cell RNA sequencing data, named AGImpute. First, the numbers of dropout events in different cells in scRNA-seq data are differentially estimated by using a dynamic threshold estimation strategy. Next, the identified dropout events are imputed by a hybrid deep learning model, combining an Autoencoder with a Generative Adversarial Network. To validate the efficiency of AGImpute, it is compared with seven state-of-the-art dropout imputation methods on two simulated datasets and seven real single-cell RNA sequencing datasets. The results show that AGImpute imputes fewer dropout events than the other methods. Moreover, AGImpute enhances the performance of downstream analysis, including clustering performance, identifying cell-specific marker genes, and inferring trajectory in the time-course dataset. AVAILABILITY AND IMPLEMENTATION The source code can be obtained from https://github.com/xszhu-lab/AGImpute.
Affiliation(s)
- Xiaoshu Zhu
  - School of Computer and Information Security, Guangxi Key Laboratory of Trusted Software, Guilin University of Electronic Technology, Guilin 541004, China
- Shuang Meng
  - School of Computer Science and Engineering, Guangxi Normal University, Guilin 541006, China
- Gaoshi Li
  - School of Computer Science and Engineering, Guangxi Normal University, Guilin 541006, China
- Jianxin Wang
  - School of Computer Science and Engineering, Hunan Provincial Key Lab on Bioinformatics, Central South University, Changsha 400083, China
- Xiaoqing Peng
  - School of Life Sciences, Center for Medical Genetics, Central South University, Changsha 400083, China
11
Fang Z, Zheng R, Li M. scMAE: a masked autoencoder for single-cell RNA-seq clustering. Bioinformatics 2024; 40:btae020. [PMID: 38230824 PMCID: PMC10832357 DOI: 10.1093/bioinformatics/btae020] [Citation(s) in RCA: 8] [Impact Index Per Article: 8.0] [Received: 09/14/2023] [Revised: 01/07/2024] [Accepted: 01/12/2024] [Indexed: 01/18/2024] Open
Abstract
MOTIVATION Single-cell RNA sequencing has emerged as a powerful technology for studying gene expression at the individual cell level. Clustering individual cells into distinct subpopulations is fundamental in scRNA-seq data analysis, facilitating the identification of cell types and exploration of cellular heterogeneity. Despite the recent development of many deep learning-based single-cell clustering methods, few have effectively exploited the correlations among genes, resulting in suboptimal clustering outcomes. RESULTS Here, we propose a novel masked autoencoder-based method, scMAE, for cell clustering. scMAE perturbs gene expression and employs a masked autoencoder to reconstruct the original data, learning robust and informative cell representations. The masked autoencoder introduces a masking predictor, which captures relationships among genes by predicting whether gene expression values are masked. By integrating this masking mechanism, scMAE effectively captures latent structures and dependencies in the data, enhancing clustering performance. We conducted extensive comparative experiments using various clustering evaluation metrics on 15 scRNA-seq datasets from different sequencing platforms. Experimental results indicate that scMAE outperforms other state-of-the-art methods on these datasets. In addition, scMAE accurately identifies rare cell types, which are challenging to detect due to their low abundance. Furthermore, biological analyses confirm the biological significance of the identified cell subpopulations. AVAILABILITY AND IMPLEMENTATION The source code of scMAE is available at: https://zenodo.org/records/10465991.
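The perturb-then-predict idea can be illustrated with the data corruption step alone. In the sketch below, each entry of a cells x genes matrix is, with some probability, replaced by the same gene's value from a randomly chosen cell, and a binary mask records which entries were touched. That mask is what a masking predictor would be trained to recover. The replacement rule, the function name, and the `mask_rate` parameter are assumptions made for illustration; the abstract does not specify how scMAE perturbs expression.

```python
import random

def perturb_expression(matrix, mask_rate, rng):
    """Corrupt a cells x genes matrix and return (corrupted, mask).

    mask[i][g] is 1 where entry (i, g) was replaced by the value of gene g
    in a randomly drawn donor cell, and 0 where it was left untouched.
    The input matrix itself is not modified.
    """
    n_cells, n_genes = len(matrix), len(matrix[0])
    corrupted = [row[:] for row in matrix]
    mask = [[0] * n_genes for _ in range(n_cells)]
    for i in range(n_cells):
        for g in range(n_genes):
            if rng.random() < mask_rate:
                donor = rng.randrange(n_cells)
                corrupted[i][g] = matrix[donor][g]
                mask[i][g] = 1
    return corrupted, mask

# Toy expression matrix: 4 cells x 3 genes.
expr = [[0.0, 2.0, 5.0],
        [1.0, 0.0, 4.0],
        [0.0, 3.0, 0.0],
        [2.0, 1.0, 6.0]]
corrupted, mask = perturb_expression(expr, mask_rate=0.3, rng=random.Random(0))
```

An autoencoder would then take `corrupted` as input, with one head reconstructing `expr` and another predicting `mask`; drawing replacements from the same gene keeps the corrupted values on a realistic scale, so the predictor has to rely on relationships among genes rather than on obvious outliers.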
Affiliation(s)
- Zhaoyu Fang
  - School of Computer Science and Engineering, Central South University, 932 South Lushan Road, Yuelu District, Changsha 410083, China
- Ruiqing Zheng
  - School of Computer Science and Engineering, Central South University, 932 South Lushan Road, Yuelu District, Changsha 410083, China
- Min Li
  - School of Computer Science and Engineering, Central South University, 932 South Lushan Road, Yuelu District, Changsha 410083, China