1. Li D, Mei Q, Li G. scQA: A dual-perspective cell type identification model for single cell transcriptome data. Comput Struct Biotechnol J 2024; 23:520-536. [PMID: 38235363 PMCID: PMC10791572 DOI: 10.1016/j.csbj.2023.12.021] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/13/2023] [Revised: 12/16/2023] [Accepted: 12/18/2023] [Indexed: 01/19/2024] Open
Abstract
Single-cell RNA sequencing technologies have been pivotal in advancing the development of algorithms for clustering heterogeneous cell populations. Existing methods for utilizing scRNA-seq data to identify cell types tend to neglect the beneficial impact of dropout events and perform clustering focusing solely on a quantitative perspective. Here, we introduce a novel method named scQA, notable for its ability to concurrently identify cell types and cell type-specific key genes from both qualitative and quantitative perspectives. In contrast to other methods, scQA not only identifies cell types but also extracts key genes associated with these cell types, enabling bidirectional clustering for scRNA-seq data. Through an iterative process, our approach aims to minimize the number of landmarks to approximately a dozen while maximizing the inclusion of quasi-trend-preserved genes with dropouts both qualitatively and quantitatively. It then clusters cells by employing an ingenious label propagation strategy, obviating the requirement for a predetermined number of cell types. Validated on 20 publicly available scRNA-seq datasets, scQA consistently outperforms other salient tools. Furthermore, we confirm the effectiveness and potential biological significance of the identified key genes through both external and internal validation. In conclusion, scQA emerges as a valuable tool for investigating cell heterogeneity due to its distinctive fusion of qualitative and quantitative facets, along with bidirectional clustering capabilities. Furthermore, it can be seamlessly integrated into broader scRNA-seq analyses. The source code is publicly available at https://github.com/LD-Lyndee/scQA.
Affiliation(s)
- Di Li
  - Research Center for Mathematics and Interdisciplinary Sciences, Shandong University, Qingdao 266237, China
- Qinglin Mei
  - MOE Key Laboratory of Bioinformatics, BNRIST Bioinformatics Division, Department of Automation, Tsinghua University, Beijing 100084, China
- Guojun Li
  - Research Center for Mathematics and Interdisciplinary Sciences, Shandong University, Qingdao 266237, China
2. Guo ZH, Wang YB, Wang S, Zhang Q, Huang DS. scCorrector: a robust method for integrating multi-study single-cell data. Brief Bioinform 2024; 25:bbad525. [PMID: 38271483 PMCID: PMC10810333 DOI: 10.1093/bib/bbad525] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/15/2023] [Revised: 11/12/2023] [Accepted: 12/19/2023] [Indexed: 01/27/2024] Open
Abstract
The advent of single-cell sequencing technologies has revolutionized cell biology studies. However, integrative analyses of diverse single-cell data face serious challenges, including technological noise, sample heterogeneity, and different modalities and species. To address these problems, we propose scCorrector, a variational autoencoder-based model that can integrate single-cell data from different studies and map them into a common space. Specifically, we designed a Study Specific Adaptive Normalization for each study in the decoder to implement these features. scCorrector achieves competitive and robust performance compared with state-of-the-art methods and brings novel insights under various circumstances (e.g. various batches, multi-omics, cross-species, and development stages). In addition, the integration of single-cell data and spatial data makes it possible to transfer information between different studies, which greatly expands the narrow range of genes covered by MERFISH technology. In summary, scCorrector can efficiently integrate multi-study single-cell datasets, thereby providing broad opportunities to tackle challenges emerging from noisy resources.
Affiliation(s)
- Zhen-Hao Guo
  - College of Electronics and Information Engineering, Tongji University, Shanghai 200000, China
- Yan-Bin Wang
  - College of Computer Science and Technology, Zhejiang University 310027, China
- Siguo Wang
  - Eastern Institute for Advanced Study, Eastern Institute of Technology, Tongxin Road No.568, Ningbo, Zhejiang 315201, China
- Qinhu Zhang
  - Eastern Institute for Advanced Study, Eastern Institute of Technology, Tongxin Road No.568, Ningbo, Zhejiang 315201, China
- De-Shuang Huang
  - Eastern Institute for Advanced Study, Eastern Institute of Technology, Tongxin Road No.568, Ningbo, Zhejiang 315201, China
3. Xu J, Xu J, Meng Y, Lu C, Cai L, Zeng X, Nussinov R, Cheng F. Graph embedding and Gaussian mixture variational autoencoder network for end-to-end analysis of single-cell RNA sequencing data. Cell Rep Methods 2023; 3:100382. [PMID: 36814845 PMCID: PMC9939381 DOI: 10.1016/j.crmeth.2022.100382] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 05/26/2022] [Revised: 10/31/2022] [Accepted: 12/08/2022] [Indexed: 05/25/2023]
Abstract
Single-cell RNA sequencing (scRNA-seq) is a revolutionary technology to determine the precise gene expression of individual cells and identify cell heterogeneity and subpopulations. However, technical limitations of scRNA-seq lead to heterogeneous and sparse data. Here, we present autoCell, a deep-learning approach for scRNA-seq dropout imputation and feature extraction. autoCell is a variational autoencoding network that combines graph embedding and a probabilistic depth Gaussian mixture model to infer the distribution of high-dimensional, sparse scRNA-seq data. We validate autoCell on simulated datasets and biologically relevant scRNA-seq. We show that interpolation of autoCell improves the performance of existing tools in identifying cell developmental trajectories of human preimplantation embryos. We identify disease-associated astrocytes (DAAs) and reconstruct DAA-specific molecular networks and ligand-receptor interactions involved in cell-cell communications using Alzheimer's disease as a prototypical example. autoCell provides a toolbox for end-to-end analysis of scRNA-seq data, including visualization, clustering, imputation, and disease-specific gene network identification.
Affiliation(s)
- Junlin Xu
  - College of Computer Science and Electronic Engineering, Hunan University, Changsha, Hunan 410082, China
- Jielin Xu
  - Genomic Medicine Institute, Lerner Research Institute, Cleveland Clinic, Cleveland, OH 44195, USA
- Yajie Meng
  - College of Computer Science and Electronic Engineering, Hunan University, Changsha, Hunan 410082, China
- Changcheng Lu
  - College of Computer Science and Electronic Engineering, Hunan University, Changsha, Hunan 410082, China
- Lijun Cai
  - College of Computer Science and Electronic Engineering, Hunan University, Changsha, Hunan 410082, China
- Xiangxiang Zeng
  - College of Computer Science and Electronic Engineering, Hunan University, Changsha, Hunan 410082, China
- Ruth Nussinov
  - Computational Structural Biology Section, Basic Science Program, Frederick National Laboratory for Cancer Research, National Cancer Institute at Frederick, Frederick, MD 21702, USA
  - Department of Human Molecular Genetics and Biochemistry, Sackler School of Medicine, Tel Aviv University, Tel Aviv 69978, Israel
- Feixiong Cheng
  - Genomic Medicine Institute, Lerner Research Institute, Cleveland Clinic, Cleveland, OH 44195, USA
  - Department of Molecular Medicine, Cleveland Clinic Lerner College of Medicine, Case Western Reserve University, Cleveland, OH 44195, USA
  - Case Comprehensive Cancer Center, Case Western Reserve University School of Medicine, Cleveland, OH 44106, USA
4. Wang J, Xia J, Wang H, Su Y, Zheng CH. scDCCA: deep contrastive clustering for single-cell RNA-seq data based on auto-encoder network. Brief Bioinform 2023; 24:6984787. [PMID: 36631401 DOI: 10.1093/bib/bbac625] [Citation(s) in RCA: 7] [Impact Index Per Article: 7.0] [Received: 10/29/2022] [Revised: 12/12/2022] [Accepted: 12/19/2022] [Indexed: 01/13/2023] Open
Abstract
The advances in single-cell ribonucleic acid sequencing (scRNA-seq) allow researchers to explore cellular heterogeneity and human diseases at cell resolution. Cell clustering is a prerequisite in scRNA-seq analysis since it can recognize cell identities. However, the high dimensionality, noise and significant sparsity of scRNA-seq data have made it a big challenge. Although many methods have emerged, they still fail to fully explore the intrinsic properties of cells and the relationship among cells, which seriously affects the downstream clustering performance. Here, we propose a new deep contrastive clustering algorithm called scDCCA. It integrates a denoising auto-encoder and a dual contrastive learning module into a deep clustering framework to extract valuable features and realize cell clustering. Specifically, to better characterize and learn data representations robustly, scDCCA utilizes a denoising Zero-Inflated Negative Binomial model-based auto-encoder to extract low-dimensional features. Meanwhile, scDCCA incorporates a dual contrastive learning module to capture the pairwise proximity of cells. By increasing the similarities between positive pairs and the differences between negative ones, the contrasts at both the instance and the cluster level help the model learn more discriminative features and achieve better cell segregation. Furthermore, scDCCA joins feature learning with clustering, which realizes representation learning and cell clustering in an end-to-end manner. Experimental results on 14 real datasets validate that scDCCA outperforms eight state-of-the-art methods in terms of accuracy, generalizability, scalability and efficiency. Cell visualization and biological analysis demonstrate that scDCCA significantly improves clustering and facilitates downstream analysis for scRNA-seq data. The code is available at https://github.com/WJ319/scDCCA.
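As a rough illustration of the Zero-Inflated Negative Binomial (ZINB) model that the abstract names as the basis of the denoising auto-encoder, the sketch below evaluates the standard ZINB likelihood for a single gene in plain Python. The function names (`nb_log_pmf`, `zinb_pmf`, `zinb_nll`) and the scalar parameters `mu` (mean), `theta` (dispersion) and `pi` (dropout probability) are assumptions made for illustration and are not taken from the scDCCA code base; in a ZINB auto-encoder these parameters would typically be predicted per cell and per gene by the decoder.

```python
import math


def nb_log_pmf(x, mu, theta):
    """Log-probability of count x under a negative binomial
    with mean mu and dispersion theta (both > 0)."""
    return (
        math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1)
        + theta * math.log(theta / (theta + mu))
        + x * math.log(mu / (theta + mu))
    )


def zinb_pmf(x, mu, theta, pi):
    """ZINB probability: with probability pi the count is a structural
    zero (dropout); otherwise it is drawn from the negative binomial."""
    nb = math.exp(nb_log_pmf(x, mu, theta))
    zero_mass = pi if x == 0 else 0.0
    return zero_mass + (1.0 - pi) * nb


def zinb_nll(counts, mu, theta, pi):
    """Negative log-likelihood of a list of counts, the quantity a
    ZINB-based reconstruction loss would minimize."""
    return -sum(math.log(zinb_pmf(x, mu, theta, pi)) for x in counts)


# Hypothetical example: a sparse gene with many zeros.
counts = [0, 0, 0, 1, 4, 0, 2, 0]
loss_with_dropout = zinb_nll(counts, mu=3.0, theta=2.0, pi=0.3)
loss_without_dropout = zinb_nll(counts, mu=3.0, theta=2.0, pi=0.0)
```

Setting `pi=0.0` reduces the model to an ordinary negative binomial, which makes it easy to compare how much of the zero mass the dropout component absorbs for a given gene.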
Affiliation(s)
- Jing Wang
  - Anhui Provincial Key Laboratory of Multimodal Cognitive Computation, School of Computer Science and Technology, Anhui University, Hefei, China
- Junfeng Xia
  - Institutes of Physical Science and Information Technology, Anhui University, Hefei, China
- Haiyun Wang
  - School of Mathematics and Systems Science, Xinjiang University, Urumqi, China
- Yansen Su
  - School of Artificial Intelligence, Anhui University, Hefei, China
- Chun-Hou Zheng
  - School of Artificial Intelligence, Anhui University, Hefei, China
5. Liu G, Li M, Wang H, Lin S, Xu J, Li R, Tang M, Li C. D3K: The Dissimilarity-Density-Dynamic Radius K-means Clustering Algorithm for scRNA-Seq Data. Front Genet 2022; 13:912711. [PMID: 35846121 PMCID: PMC9284269 DOI: 10.3389/fgene.2022.912711] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/04/2022] [Accepted: 04/25/2022] [Indexed: 12/02/2022] Open
Abstract
Single-cell sequencing data sets have always been challenging to cluster because of their high dimensionality and numerous noise points. The traditional K-means algorithm is not suitable for this type of data. Therefore, this study proposes a Dissimilarity-Density-Dynamic Radius-K-means clustering algorithm. The algorithm adds a dynamic radius parameter to the calculation. It flexibly adjusts the active radius according to the data characteristics, which can eliminate the influence of noise points and optimize the clustering results. At the same time, the algorithm calculates a weight from the dissimilarity density of the data set, the average contrast of candidate clusters, and the dissimilarity of candidate clusters. It thereby obtains a set of high-quality initial center points, which removes the randomness of the K-means algorithm in selecting the center points. Finally, compared with similar algorithms, this algorithm shows a better clustering effect on single-cell data. Each clustering index is higher than those of other single-cell clustering algorithms, which overcomes the shortcomings of the traditional K-means algorithm.
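The abstract does not give the D3K weighting formula, so the following is only a simplified stand-in for the general idea of replacing random K-means initialization with density-aware seeding. It scores each point by its local density (the number of neighbors within a fixed `radius`) multiplied by its distance to the nearest already chosen center, so that isolated noise points with zero density are never selected. The names `local_density` and `density_seeds`, the fixed radius, and the scoring rule are all assumptions for illustration; D3K itself uses a dynamic radius and a dissimilarity-based weight that are not reproduced here.

```python
import math


def local_density(points, radius):
    """Number of other points within `radius` of each point."""
    n = len(points)
    return [
        sum(
            1
            for j in range(n)
            if j != i and math.dist(points[i], points[j]) <= radius
        )
        for i in range(n)
    ]


def density_seeds(points, k, radius):
    """Pick k initial centers deterministically: start from the densest
    point, then repeatedly add the point that maximizes
    density * distance-to-nearest-chosen-center."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    density = local_density(points, radius)
    chosen = [max(range(len(points)), key=lambda i: density[i])]
    while len(chosen) < k:
        def score(i):
            nearest = min(math.dist(points[i], points[j]) for j in chosen)
            return density[i] * nearest

        candidates = [i for i in range(len(points)) if i not in chosen]
        chosen.append(max(candidates, key=score))
    return [points[i] for i in chosen]


# Hypothetical toy data: two tight groups and one isolated noise point.
points = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
    (20.0, 20.0),
]
seeds = density_seeds(points, k=2, radius=0.5)
```

On this toy data the two seeds fall in the two dense groups and the isolated point is skipped, whereas uniformly random initialization could pick the noise point as a center.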
Affiliation(s)
- Guoyun Liu
  - School of Mathematics and Statistics, Hainan Normal University, Haikou, China
- Manzhi Li
  - School of Mathematics and Statistics, Hainan Normal University, Haikou, China
  - Key Laboratory of Data Science and Smart Education, Ministry of Education, Hainan Normal University, Haikou, China
  - *Correspondence: Manzhi Li,
- Hongtao Wang
  - School of Mathematics and Statistics, Hainan Normal University, Haikou, China
- Shijun Lin
  - School of Mathematics and Statistics, Hainan Normal University, Haikou, China
- Junlin Xu
  - College of Information Science and Engineering, Hunan University, Changsha, China
- Ruixi Li
  - Geneis Beijing Co., Ltd., Beijing, China
- Min Tang
  - School of Life Sciences, Jiangsu University, Zhenjiang, China
- Chun Li
  - School of Mathematics and Statistics, Hainan Normal University, Haikou, China