1
|
Dai H, Meng X, Pan Z, Yang Q, Song H, Gao Y, Wang X. scSwinTNet: A Cell Type Annotation Method for Large-Scale Single-Cell RNA-Seq Data Based on Shifted Window Attention. IEEE J Biomed Health Inform 2025; 29:3035-3044. [PMID: 39466872 DOI: 10.1109/jbhi.2024.3487174] [Indexed: 10/30/2024]
Abstract
The annotation of cell types based on single-cell RNA sequencing (scRNA-seq) data is a critical downstream task in single-cell analysis, with significant implications for a deeper understanding of biological processes. Most analytical methods group cells by unsupervised clustering, which requires manual annotation for cell type determination. This procedure is time-consuming and not reproducible. To accommodate the exponential growth in the number of sequenced cells, reduce the impact of data bias, and integrate large-scale datasets for further improvement of type annotation accuracy, we proposed scSwinTNet. It is a pre-trained tool for annotating cell types in scRNA-seq data, which uses self-attention based on shifted windows and enables intelligent information extraction from gene data. We demonstrated the effectiveness and robustness of scSwinTNet by using 399 760 cells from human and mouse tissues. To the best of our knowledge, scSwinTNet is the first model to annotate cell types in scRNA-seq data using a pre-trained shifted window attention-based model. It does not require a priori knowledge and accurately annotates cell types without manual annotation.
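Illustrative note: the abstract refers to self-attention over shifted windows. The sketch below shows only the generic windowing idea on a 1D sequence of gene-token positions: a regular partition, then a partition after a cyclic shift of half a window so that neighbouring windows exchange information in the next layer. The function names, the 1D layout, and the window size are assumptions made for illustration; this is not the scSwinTNet architecture, and the attention computation itself (including the masking that Swin-style models apply across the wrap-around boundary) is omitted.

```python
def window_partition(tokens, window):
    """Split a sequence into consecutive, non-overlapping windows."""
    if window <= 0 or len(tokens) % window != 0:
        raise ValueError("sequence length must be a positive multiple of window")
    return [tokens[i:i + window] for i in range(0, len(tokens), window)]


def shifted_window_partition(tokens, window):
    """Cyclically shift the sequence by half a window, then partition it."""
    shift = window // 2
    rolled = tokens[shift:] + tokens[:shift]  # same effect as rolling left by `shift`
    return window_partition(rolled, window)


# Hypothetical example: 8 gene-token positions, window size 4.
genes = list(range(8))
regular = window_partition(genes, 4)
shifted = shifted_window_partition(genes, 4)
```

In the regular partition, positions 3 and 4 fall in different windows; after the shift they share a window, which is what lets information cross the original window boundaries.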
|
2
|
Zhang Y, Wang Y, Liu X, Feng X. PbImpute: Precise Zero Discrimination and Balanced Imputation in Single-Cell RNA Sequencing Data. J Chem Inf Model 2025; 65:2670-2684. [PMID: 39957720 PMCID: PMC11898086 DOI: 10.1021/acs.jcim.4c02125] [Received: 11/15/2024] [Revised: 01/31/2025] [Accepted: 02/03/2025] [Indexed: 02/18/2025]
Abstract
Single-cell RNA sequencing (scRNA-seq) has emerged as a transformative technology for elucidating cellular heterogeneity at unprecedented resolution. However, technical limitations such as limited sequencing depth and mRNA capture efficiency often result in zero counts, commonly referred to as "dropout zeros" in scRNA-seq data. These zeros pose significant challenges to downstream analysis, as they can distort the interpretation of cellular transcriptomes. While numerous computational methods have been developed to address this challenge, existing approaches frequently suffer from either insufficient imputation of zeros (under-imputation) or excessive modification of zeros (over-imputation). Here, we propose a precisely balanced imputation (PbImpute) method designed to achieve optimal equilibrium between dropout recovery and biological zero preservation in scRNA-seq data. PbImpute employs a multistage approach: (1) Initial discrimination between technical dropouts and biological zeros through parameter optimization of a new zero-inflated negative binomial (ZINB) distribution model, followed by initial imputation; (2) Application of a uniquely designed static repair algorithm to enhance data fidelity; (3) Secondary dropout identification based on gene expression frequency and partition-specific coefficient of variation; (4) Graph-embedding neural network-based imputation; and (5) Implementation of a uniquely designed dynamic repair mechanism to mitigate over-imputation effects. PbImpute distinguishes itself by uniquely integrating ZINB modeling with static and dynamic repair. This advantageous combined approach achieves a balance between over- and under-imputation, while simultaneously preserving true biological zeros and reducing signal distortion. 
Comprehensive evaluation using both simulated and real scRNA-seq data sets demonstrated that PbImpute achieves superior performance (F1 Score = 0.88 at 83% dropout rate, ARI = 0.78 on PBMC) in discriminating between technical dropouts and biological zeros compared to state-of-the-art methods. The method significantly improves gene-gene and cell-cell correlation structures, enhances differential expression analysis sensitivity, optimizes clustering resolution and dimensional reduction visualization, and facilitates more accurate trajectory inference. Ablation studies confirmed the essential contribution of both the imputation and repair modules to the method's performance. The code is available at https://github.com/WyBioTeam/PbImpute. By enhancing the accuracy of scRNA-seq data imputation, PbImpute can improve the identification of cell subpopulations and the detection of differentially expressed genes, thereby facilitating more precise analyses of cellular heterogeneity and advancing disease research.
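Illustrative note: step (1) of the abstract separates technical dropouts from biological zeros with a zero-inflated negative binomial (ZINB) model. The sketch below uses the textbook ZINB mixture, not the authors' new ZINB variant or their parameter optimization. It computes the posterior probability that an observed zero comes from the zero-inflation (dropout) component, given a dropout weight `pi`, an NB mean `mu`, and an NB dispersion `r`. All names and parameter values are assumptions chosen for illustration.

```python
def nb_zero_prob(mu, r):
    """P(X = 0) under a negative binomial with mean mu and dispersion r."""
    if mu < 0 or r <= 0:
        raise ValueError("mu must be >= 0 and r must be > 0")
    return (r / (r + mu)) ** r


def dropout_posterior(pi, mu, r):
    """P(zero is a dropout | observed zero) under a standard ZINB mixture.

    The mixture puts weight pi on a point mass at zero (dropout) and
    weight 1 - pi on the negative binomial count distribution.
    """
    if not 0.0 < pi < 1.0:
        raise ValueError("pi must lie strictly between 0 and 1")
    nb0 = nb_zero_prob(mu, r)
    return pi / (pi + (1.0 - pi) * nb0)


# A gene with high expected expression: a zero is almost surely technical.
high_expr = dropout_posterior(pi=0.3, mu=100.0, r=2.0)
# A gene with no expected expression: the zero carries no extra evidence,
# so the posterior equals the prior weight pi.
no_expr = dropout_posterior(pi=0.3, mu=0.0, r=2.0)
```

Thresholding this posterior is one simple way to label zeros; the multistage procedure described in the abstract goes well beyond it.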
Affiliation(s)
- Yi Zhang
- School of Computer Science and Engineering, Guilin University of Technology, 12 Jiangan Road, Qixing District, Guilin 541004, China
- Guangxi Key Laboratory of Embedded Technology and Intelligent System, Guilin University of Technology, 12 Jiangan Road, Qixing District, Guilin 541004, China
| | - Yin Wang
- School of Computer Science and Engineering, Guilin University of Technology, 12 Jiangan Road, Qixing District, Guilin 541004, China
- Guangxi Key Laboratory of Embedded Technology and Intelligent System, Guilin University of Technology, 12 Jiangan Road, Qixing District, Guilin 541004, China
| | - Xinyuan Liu
- School of Computer Science and Engineering, Guilin University of Technology, 12 Jiangan Road, Qixing District, Guilin 541004, China
- Guangxi Key Laboratory of Embedded Technology and Intelligent System, Guilin University of Technology, 12 Jiangan Road, Qixing District, Guilin 541004, China
| | - Xi Feng
- School of Computer Science and Engineering, Guilin University of Technology, 12 Jiangan Road, Qixing District, Guilin 541004, China
- Guangxi Key Laboratory of Embedded Technology and Intelligent System, Guilin University of Technology, 12 Jiangan Road, Qixing District, Guilin 541004, China
|
3
|
Li S, Hua H, Chen S. Graph neural networks for single-cell omics data: a review of approaches and applications. Brief Bioinform 2025; 26:bbaf109. [PMID: 40091193 PMCID: PMC11911123 DOI: 10.1093/bib/bbaf109] [Received: 12/04/2024] [Revised: 02/09/2025] [Accepted: 02/25/2025] [Indexed: 03/19/2025] Open
Abstract
Rapid advancement of sequencing technologies now allows for the utilization of precise signals at single-cell resolution in various omics studies. However, the massive volume, ultra-high dimensionality, and high sparsity nature of single-cell data have introduced substantial difficulties to traditional computational methods. The intricate non-Euclidean networks of intracellular and intercellular signaling molecules within single-cell datasets, coupled with the complex, multimodal structures arising from multi-omics joint analysis, pose significant challenges to conventional deep learning operations reliant on Euclidean geometries. Graph neural networks (GNNs) have extended deep learning to non-Euclidean data, allowing cells and their features in single-cell datasets to be modeled as nodes within a graph structure. GNNs have been successfully applied across a broad range of tasks in single-cell data analysis. In this survey, we systematically review 107 successful applications of GNNs and their six variants in various single-cell omics tasks. We begin by outlining the fundamental principles of GNNs and their six variants, followed by a systematic review of GNN-based models applied in single-cell epigenomics, transcriptomics, spatial transcriptomics, proteomics, and multi-omics. In each section dedicated to a specific omics type, we have summarized the publicly available single-cell datasets commonly utilized in the articles reviewed in that section, totaling 77 datasets. Finally, we summarize the potential shortcomings of current research and explore directions for future studies. We anticipate that this review will serve as a guiding resource for researchers to deepen the application of GNNs in single-cell omics.
Affiliation(s)
- Sijie Li
- School of Mathematical Sciences and The Key Laboratory of Pure Mathematics and Combinatorics, Ministry of Education (LPMC), Nankai University, No. 94 Weijin Road, Nankai District, Tianjin 300071, China
| | - Heyang Hua
- School of Mathematical Sciences and The Key Laboratory of Pure Mathematics and Combinatorics, Ministry of Education (LPMC), Nankai University, No. 94 Weijin Road, Nankai District, Tianjin 300071, China
| | - Shengquan Chen
- School of Mathematical Sciences and The Key Laboratory of Pure Mathematics and Combinatorics, Ministry of Education (LPMC), Nankai University, No. 94 Weijin Road, Nankai District, Tianjin 300071, China
|
4
|
Schumann Y, Gocke A, Neumann JE. Computational Methods for Data Integration and Imputation of Missing Values in Omics Datasets. Proteomics 2025; 25:e202400100. [PMID: 39740174 DOI: 10.1002/pmic.202400100] [Received: 06/28/2024] [Revised: 11/08/2024] [Accepted: 11/26/2024] [Indexed: 01/02/2025]
Abstract
Molecular profiling of different omic-modalities (e.g., DNA methylomics, transcriptomics, proteomics) in biological systems represents the basis for research and clinical decision-making. Measurement-specific biases, so-called batch effects, often hinder the integration of independently acquired datasets, and missing values further hamper the applicability of typical data processing algorithms. In addition to careful experimental design, well-defined standards in data acquisition and data exchange, the alleviation of these phenomena particularly requires a dedicated data integration and preprocessing pipeline. This review aims to give a comprehensive overview of computational methods for data integration and missing value imputation for omic data analyses. We provide formal definitions for missing value mechanisms and propose a novel statistical taxonomy for batch effects, especially in the presence of missing data. Based on an automated document search and systematic literature review, we describe 32 distinct data integration methods from five main methodological categories, as well as 37 algorithms for missing value imputation from five separate categories. Additionally, this review highlights multiple quantitative evaluation methods to aid researchers in selecting a suitable set of methods for their work. Finally, this work provides an integrated discussion of the relevance of batch effects and missing values in omics with corresponding method recommendations. We then propose a comprehensive three-step workflow from the study conception to final data analysis and deduce perspectives for future research. Eventually, we present a comprehensive flow chart as well as exemplary decision trees to aid practitioners in the selection of specific approaches for imputation and data integration in their studies.
Affiliation(s)
- Yannis Schumann
- IT-Department, Deutsches Elektronen-Synchrotron DESY, Hamburg, Germany
| | - Antonia Gocke
- Center for Molecular Neurobiology (ZMNH), University Medical Center Hamburg-Eppendorf (UKE), Hamburg, Germany
- Core Facility Mass Spectrometric Proteomics, University Medical Center Hamburg-Eppendorf (UKE), Hamburg, Germany
| | - Julia E Neumann
- Center for Molecular Neurobiology (ZMNH), University Medical Center Hamburg-Eppendorf (UKE), Hamburg, Germany
- Institute of Neuropathology, University Medical Center Hamburg-Eppendorf (UKE), Hamburg, Germany
|
5
|
Cui W, Long Q, Liu W, Fang C, Wang X, Wang P, Zhou Y. Hierarchical Graph Transformer With Contrastive Learning for Gene Regulatory Network Inference. IEEE J Biomed Health Inform 2025; 29:690-699. [PMID: 39401117 DOI: 10.1109/jbhi.2024.3476490] [Indexed: 03/05/2025]
Abstract
Gene regulatory networks (GRNs) are crucial for understanding gene regulation and cellular processes. Inferring GRNs helps uncover regulatory pathways, shedding light on the regulation and development of cellular processes. With the rise of high-throughput sequencing and advancements in computational technology, computational models have emerged as cost-effective alternatives to traditional experimental studies. Moreover, the surge in ChIP-seq data for TF-DNA binding has catalyzed the development of graph neural network (GNN)-based methods, greatly advancing GRN inference capabilities. However, most existing GNN-based methods suffer from the inability to capture long-distance structural semantic correlations due to transitive interactions. In this paper, we introduce a novel GNN-based model named Hierarchical Graph Transformer with Contrastive Learning for GRN (HGTCGRN) inference. HGTCGRN excels at capturing structural semantics using a hierarchical graph Transformer, which introduces a series of gene family nodes representing gene functions as virtual nodes to interact with nodes in the GRNs. These semantic-aware virtual-node embeddings are aggregated to produce node representations with varying emphasis. Additionally, we leverage gene ontology information to construct gene interaction networks for contrastive learning optimization of GRNs. Experimental results demonstrate that HGTCGRN achieves superior performance in GRN inference.
|
6
|
Wang Y, Li K, Zhang R, Fan Y, Huang L, Zhou F. GraCEImpute: A novel graph clustering autoencoder approach for imputation of single-cell RNA-seq data. Comput Biol Med 2025; 184:109400. [PMID: 39561511 DOI: 10.1016/j.compbiomed.2024.109400] [Received: 04/17/2024] [Revised: 10/14/2024] [Accepted: 11/07/2024] [Indexed: 11/21/2024]
Abstract
Single-cell RNA sequencing (scRNA-seq) technology establishes a unique view for elucidating cellular heterogeneity in various biological systems. Yet the scRNA-seq data is compromised by a high dropout rate due to the technological limitation, and the substantial data loss poses computational challenges on subsequent analyses. This study introduces a novel graph clustering autoencoder (GCAE)-based imputation approach (GraCEImpute) to address the challenge of missing data in scRNA-seq data. Our comprehensive evaluation demonstrates that the GraCEImpute model outperforms existing approaches in accurately imputing dropout zeros within scRNA-seq data. The proposed GraCEImpute model also demonstrates the significantly enhanced quality of downstream scRNA-seq data analyses, including clustering, differential gene expression (DEG) analysis, and cell trajectory inference. These improvements underscore the GraCEImpute model's potential to facilitate a deeper understanding of cellular processes and heterogeneity through the scRNA-seq data analyses. The source code is released at https://www.healthinformaticslab.org/supp/.
Affiliation(s)
- Yueying Wang
- College of Computer Science and Technology, Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, Jilin, 130012, China
| | - Kewei Li
- College of Computer Science and Technology, Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, Jilin, 130012, China
| | - Ruochi Zhang
- School of Artificial Intelligence, Jilin University, Changchun, 130012, China
| | - Yusi Fan
- College of Computer Science and Technology, Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, Jilin, 130012, China.
| | - Lan Huang
- College of Computer Science and Technology, Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, Jilin, 130012, China
| | - Fengfeng Zhou
- College of Computer Science and Technology, Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, Jilin, 130012, China; School of Biology and Engineering, Guizhou Medical University, Guiyang, 550025, Guizhou, China.
|
7
|
Sun Z, Song K. GEMimp: An Accurate and Robust Imputation Method for Microbiome Data Using Graph Embedding Neural Network. J Mol Biol 2024; 436:168841. [PMID: 39490678 DOI: 10.1016/j.jmb.2024.168841] [Received: 06/05/2024] [Revised: 10/23/2024] [Accepted: 10/23/2024] [Indexed: 11/05/2024]
Abstract
Microbiome research has increasingly underscored the profound link between microbial compositions and human health, with numerous studies establishing a strong correlation between microbiome characteristics and various diseases. However, the analysis of microbiome data is frequently compromised by inherent sparsity issues, characterized by a substantial presence of observed zeros. These zeros not only skew the abundance distribution of microbial species but also undermine the reliability of scientific conclusions drawn from such data. Addressing this challenge, we introduce GEMimp, an innovative imputation method designed to infuse robustness into microbiome data analysis. GEMimp leverages the node2vec algorithm, which incorporates both Breadth-First Search (BFS) and Depth-First Search (DFS) strategies in its random-walk sampling process. This approach enables GEMimp to learn nuanced, low-dimensional representations of each taxonomic unit, facilitating the reconstruction of their similarity networks with unprecedented accuracy. Our comparative analysis pits GEMimp against state-of-the-art imputation methods including SAVER, MAGIC and mbImpute. The results unequivocally demonstrate that GEMimp outperforms its counterparts by achieving the highest Pearson correlation coefficient when compared to the original raw dataset. Furthermore, GEMimp shows notable proficiency in identifying significant taxa, enhancing the detection of disease-related taxa and effectively mitigating the impact of sparsity on both simulated and real-world datasets, such as those pertaining to Type 2 Diabetes (T2D) and Colorectal Cancer (CRC). These findings collectively highlight the strong effectiveness of GEMimp, allowing for better analysis of microbial data. By alleviating sparsity issues, GEMimp could greatly facilitate downstream analyses and microbiology research more broadly.
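Illustrative note: the abstract mentions node2vec random walks that blend BFS-like and DFS-like exploration. The sketch below is a minimal second-order biased walk on an unweighted graph, following the general node2vec scheme with a return parameter `p` and an in-out parameter `q`. The toy graph, the function name, and the parameter values are assumptions for illustration; GEMimp's graph construction and its embedding training are not shown.

```python
import random


def node2vec_walk(adj, start, length, p=1.0, q=1.0, seed=0):
    """Sample one biased walk; adj maps each node to a set of neighbours."""
    rng = random.Random(seed)
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        nbrs = sorted(adj[cur])  # sorted for reproducible sampling
        if not nbrs:
            break  # dead end: stop early
        if len(walk) == 1:
            walk.append(rng.choice(nbrs))  # first step is unbiased
            continue
        prev = walk[-2]
        weights = []
        for nxt in nbrs:
            if nxt == prev:
                weights.append(1.0 / p)  # step back to the previous node
            elif nxt in adj[prev]:
                weights.append(1.0)      # stay near the previous node (BFS-like)
            else:
                weights.append(1.0 / q)  # move further away (DFS-like)
        walk.append(rng.choices(nbrs, weights=weights, k=1)[0])
    return walk


# Hypothetical similarity graph over four taxa.
toy_graph = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b"},
    "d": {"b"},
}
walk = node2vec_walk(toy_graph, "a", 10, p=0.5, q=2.0, seed=1)
```

A large `q` keeps walks close to the start node (BFS-like), while a small `q` pushes them outward (DFS-like). The sampled walks would then be fed to a skip-gram style model to learn embeddings.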
Affiliation(s)
- Ziwei Sun
- School of Mathematics and Statistics, Qingdao University, Qingdao, China.
| | - Kai Song
- School of Mathematics and Statistics, Qingdao University, Qingdao, China.
|
8
|
Zhang Y, Wang Y, Liu X, Feng X. CPARI: a novel approach combining cell partitioning with absolute and relative imputation to address dropout in single-cell RNA-seq data. Brief Bioinform 2024; 26:bbae668. [PMID: 39715686 DOI: 10.1093/bib/bbae668] [Received: 07/26/2024] [Revised: 10/06/2024] [Accepted: 12/06/2024] [Indexed: 12/25/2024] Open
Abstract
A key challenge in analyzing single-cell RNA sequencing data is the large number of false zeros, known as "dropout zeros", which are caused by technical limitations such as shallow sequencing depth or inefficient mRNA capture. To address this challenge, we propose a novel imputation model called CPARI, which combines cell partitioning with our designed absolute and relative imputation methods. Initially, CPARI employs a new approach to select highly variable genes and constructs an average consensus matrix using C-mean fuzzy clustering-based blockchain technology to obtain results at different resolutions. Hierarchical clustering is then applied to further refine these blocks, resulting in well-defined cellular partitions. Subsequently, CPARI identifies dropout events and determines the imputation positions of these identified zeros. An autoencoder is trained within each cellular block to learn gene features and reconstruct data. Our uniquely defined absolute imputation technique is first applied to the identified positions, followed by our relative imputation technique to address remaining dropout zeros, ensuring that both global consistency and local variation are maintained. Through comprehensive analyses conducted on simulated and real scRNA-seq datasets, including quantitative assessment, differential expression analysis, cell clustering, cell trajectory inference, robustness evaluation, and large-scale data imputation, CPARI demonstrates superior performance compared to 12 other state-of-the-art imputation models. Additionally, ablation experiments further confirm the significance and necessity of both the cell partitioning and relative imputation components of CPARI. Notably, as a new denoising approach, CPARI can distinguish between real biological zeros and dropout zeros, minimizing false positives and maximizing the accuracy of imputation.
Affiliation(s)
- Yi Zhang
- School of Computer Science and Engineering, Guilin University of Technology, 12 Jiangan Road, Qixing District, Guilin 541004, China
- Guangxi Key Laboratory of Embedded Technology and Intelligent System, Guilin University of Technology, 12 Jiangan Road, Qixing District, Guilin 541004, China
| | - Yin Wang
- School of Computer Science and Engineering, Guilin University of Technology, 12 Jiangan Road, Qixing District, Guilin 541004, China
- Guangxi Key Laboratory of Embedded Technology and Intelligent System, Guilin University of Technology, 12 Jiangan Road, Qixing District, Guilin 541004, China
| | - Xinyuan Liu
- School of Computer Science and Engineering, Guilin University of Technology, 12 Jiangan Road, Qixing District, Guilin 541004, China
- Guangxi Key Laboratory of Embedded Technology and Intelligent System, Guilin University of Technology, 12 Jiangan Road, Qixing District, Guilin 541004, China
| | - Xi Feng
- School of Computer Science and Engineering, Guilin University of Technology, 12 Jiangan Road, Qixing District, Guilin 541004, China
- Guangxi Key Laboratory of Embedded Technology and Intelligent System, Guilin University of Technology, 12 Jiangan Road, Qixing District, Guilin 541004, China
|
9
|
Yu Z, Liu F, Li Y. scTCA: a hybrid Transformer-CNN architecture for imputation and denoising of scDNA-seq data. Brief Bioinform 2024; 25:bbae577. [PMID: 39523623 PMCID: PMC11551055 DOI: 10.1093/bib/bbae577] [Received: 06/19/2024] [Revised: 10/05/2024] [Accepted: 10/29/2024] [Indexed: 11/16/2024] Open
Abstract
Single-cell DNA sequencing (scDNA-seq) has been widely used to unmask tumor copy number alterations (CNAs) at single-cell resolution. Although arm-level CNAs can be accurately detected from single-cell read counts, it is difficult to precisely identify focal CNAs because the read counts are characterized by high dimensionality, high sparsity and low signal-to-noise ratio. This gives rise to a pressing need for reconstructing high-quality scDNA-seq data. We develop a new method called scTCA for imputation and denoising of single-cell read counts, thus aiding in downstream analysis of both arm-level and focal CNAs. scTCA employs hybrid Transformer-CNN architectures to identify local and non-local correlations between genes for precise recovery of the read counts. Unlike conventional Transformers, the Transformer block in scTCA is a two-stage attention module containing a stepwise self-attention layer and a window Transformer, and can efficiently deal with the high-dimensional read count data. We showcase the superior performance of scTCA through comparison with state-of-the-art methods on both synthetic and real datasets. The results indicate it is highly effective in imputation and denoising of scDNA-seq data.
Affiliation(s)
- Zhenhua Yu
- School of Information Engineering, Ningxia University, 750021 Ningxia, China
- Ningxia Key Laboratory of Artificial Intelligence and Information Security for Channeling Computing Resources from the East to the West, Ningxia University, 750021 Ningxia, China
| | - Furui Liu
- School of Information Engineering, Ningxia University, 750021 Ningxia, China
| | - Yang Li
- School of Information Engineering, Ningxia University, 750021 Ningxia, China
|
10
|
Zhang Z, Liu Y, Xiao M, Wang K, Huang Y, Bian J, Yang R, Li F. Graph contrastive learning as a versatile foundation for advanced scRNA-seq data analysis. Brief Bioinform 2024; 25:bbae558. [PMID: 39487083 PMCID: PMC11530284 DOI: 10.1093/bib/bbae558] [Received: 04/28/2024] [Revised: 09/24/2024] [Accepted: 10/16/2024] [Indexed: 11/04/2024] Open
Abstract
Single-cell RNA sequencing (scRNA-seq) offers unprecedented insights into transcriptome-wide gene expression at the single-cell level. Cell clustering has been long established in the analysis of scRNA-seq data to identify the groups of cells with similar expression profiles. However, cell clustering is technically challenging, as raw scRNA-seq data have various analytical issues, including high dimensionality and dropout values. Existing research has developed deep learning models, such as graph machine learning models and contrastive learning-based models, for cell clustering using scRNA-seq data and has summarized the unsupervised learning of cell clustering into a human-interpretable format. While advances in cell clustering have been profound, we are no closer to finding a simple yet effective framework for learning high-quality representations necessary for robust clustering. In this study, we propose scSimGCL, a novel framework based on the graph contrastive learning paradigm for self-supervised pretraining of graph neural networks. This framework facilitates the generation of high-quality representations crucial for cell clustering. Our scSimGCL incorporates cell-cell graph structure and contrastive learning to enhance the performance of cell clustering. Extensive experimental results on simulated and real scRNA-seq datasets suggest the superiority of the proposed scSimGCL. Moreover, clustering assignment analysis confirms the general applicability of scSimGCL, including state-of-the-art clustering algorithms. Further, ablation study and hyperparameter analysis suggest the efficacy of our network architecture with the robustness of decisions in the self-supervised learning setting. The proposed scSimGCL can serve as a robust framework for practitioners developing tools for cell clustering. The source code of scSimGCL is publicly available at https://github.com/zhangzh1328/scSimGCL.
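Illustrative note: the abstract describes self-supervised pretraining with graph contrastive learning. The sketch below shows a generic one-directional InfoNCE objective over two augmented views of the same cells, where row `i` of each view is the positive pair and all other rows act as negatives. It is a plain-Python illustration under assumed names and an assumed temperature; it is not the scSimGCL loss, and the graph encoder and augmentations that would produce the embeddings are omitted.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(view_a, view_b, temperature=0.5):
    """Mean InfoNCE loss; view_a[i] and view_b[i] embed the same cell."""
    n = len(view_a)
    total = 0.0
    for i in range(n):
        logits = [cosine(view_a[i], view_b[j]) / temperature for j in range(n)]
        top = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = top + math.log(sum(math.exp(x - top) for x in logits))
        total += log_denom - logits[i]
    return total / n


# Hypothetical 2D embeddings for two cells under two views.
cells_view_a = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce(cells_view_a, [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce(cells_view_a, [[0.0, 1.0], [1.0, 0.0]])
```

The loss is small when each cell's two views agree and large when they are mismatched, which is the signal that drives the encoder toward view-invariant cell representations.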
Affiliation(s)
- Zhenhao Zhang
- College of Life Sciences, Northwest A&F University, Yangling, 712100 Shaanxi, China
- College of Information Engineering, Northwest A&F University, Yangling, 712100 Shaanxi, China
| | - Yuxi Liu
- College of Medicine, University of Florida, Gainesville, FL 32610, USA
| | - Meichen Xiao
- College of Life Sciences, Northwest A&F University, Yangling, 712100 Shaanxi, China
| | - Kun Wang
- College of Life Sciences, Northwest A&F University, Yangling, 712100 Shaanxi, China
| | - Yu Huang
- College of Medicine, University of Florida, Gainesville, FL 32610, USA
| | - Jiang Bian
- College of Medicine, University of Florida, Gainesville, FL 32610, USA
| | - Ruolin Yang
- College of Life Sciences, Northwest A&F University, Yangling, 712100 Shaanxi, China
| | - Fuyi Li
- College of Information Engineering, Northwest A&F University, Yangling, 712100 Shaanxi, China
|
11
|
Zhao L, Jiang L, Xie Y, Huang J, Xie H, Tian J, Zhang D. scDTL: enhancing single-cell RNA-seq imputation through deep transfer learning with bulk cell information. Brief Bioinform 2024; 25:bbae555. [PMID: 39504481 PMCID: PMC11540133 DOI: 10.1093/bib/bbae555] [Received: 05/27/2024] [Revised: 08/30/2024] [Accepted: 10/16/2024] [Indexed: 11/08/2024] Open
Abstract
The growing volume of single-cell RNA sequencing (scRNA-seq) data enables researchers to explore cellular heterogeneity and gene expression profiles, offering a high-resolution view of the transcriptome at the single-cell level. However, the dropout events that are often present in scRNA-seq data remain a challenge for downstream analysis. Although a number of methods have been developed to recover single-cell expression profiles, their performance may be hindered because they do not fully explore the inherent relations between genes. To address the issue, we propose scDTL, a deep transfer learning based approach for scRNA-seq data imputation that harnesses bulk RNA-sequencing information. We first employ a denoising autoencoder trained on bulk RNA-seq data as the initial imputation model, and then leverage a domain adaptation framework that transfers the knowledge learned by the bulk imputation model to the scRNA-seq learning task. In addition, scDTL employs a parallel operation with a 1D U-Net denoising model to provide gene representations of varying granularity, capturing both coarse and fine features of the scRNA-seq data. Finally, we utilize a cross-channel attention mechanism to fuse the features learned from the transferred bulk imputation model and the U-Net model. In the evaluation, we conduct extensive experiments to demonstrate that scDTL could outperform other state-of-the-art methods in the quantitative comparison and downstream analyses.
Affiliation(s)
- Liuyang Zhao
  - College of Computer Science and Software Engineering, Shenzhen University, Guangdong 518057, China
- Landu Jiang
  - College of Future Technology, HKUST(GZ), Guangdong 510641, China
- Yufeng Xie
  - Shenzhen Hospital of Guangzhou University of Chinese Medicine (Futian), Guangdong 518034, China
- JianHao Huang
  - Shenzhen Hospital of Guangzhou University of Chinese Medicine (Futian), Guangdong 518034, China
- Haoran Xie
  - Department of Computing and Decision Sciences, Lingnan University, Hong Kong Special Administrative Region 999077, China
- Jun Tian
  - Department of Biochemistry, School of Medicine, Southern University of Science and Technology, Guangdong 518055, China
  - Key University Laboratory of Metabolism and Health of Guangdong, Southern University of Science and Technology, Shenzhen 518055, China
- Dian Zhang
  - College of Computer Science and Software Engineering, Shenzhen University, Guangdong 518057, China
|
12
|
Liu W, Pan Y, Teng Z, Xu J. scDMAE: A Generative Denoising Model Adopted Mask Strategy for scRNA-Seq Data Recovery. IEEE J Biomed Health Inform 2024; 28:3772-3780. [PMID: 38568766 DOI: 10.1109/jbhi.2024.3383921] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/05/2024]
Abstract
The advent of single-cell RNA sequencing (scRNA-seq) technology has revolutionized gene expression studies at the single-cell level. However, the presence of technical noise and data sparsity in scRNA-seq often undermines the accuracy of subsequent analyses. Existing methods for denoising and imputing scRNA-seq data often rely on stringent assumptions about data distribution, limiting the effectiveness of data recovery. In this study, we propose the scDMAE model for denoising and recovery of scRNA-seq data. First, the model fuses gene expression features and topological features to discern the primary expression patterns of genes in cells. Then, an autoencoder with a masking strategy is used to model dropout events and separate potential noise in the data. Finally, the model incorporates the original raw data to recover the true biological expression values. In experiments on various types of scRNA-seq datasets, scDMAE demonstrates superior performance compared with competing methods on six distinct evaluation metrics in downstream analysis. The scDMAE method can accurately cluster similar cell populations, identify differential genes and infer cell trajectories.
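The masking strategy mentioned in the abstract can be illustrated with a small sketch. It is not the scDMAE implementation: the function names `mask_nonzero` and `masked_mse`, the masking rate, and the toy matrix are all assumptions made for illustration. The idea shown is the generic one: hide a random subset of observed (nonzero) entries to simulate dropout, then score a reconstruction only on the hidden positions.

```python
import random


def mask_nonzero(expr, mask_rate, rng):
    """Zero out a random subset of nonzero entries to simulate dropout.

    Returns the corrupted copy and the list of hidden (cell, gene) positions.
    """
    nonzero = [(i, j) for i, row in enumerate(expr)
               for j, v in enumerate(row) if v != 0]
    n_mask = int(len(nonzero) * mask_rate)
    hidden = rng.sample(nonzero, n_mask)
    corrupted = [list(row) for row in expr]  # copy, leave the input untouched
    for i, j in hidden:
        corrupted[i][j] = 0.0
    return corrupted, hidden


def masked_mse(reconstructed, original, hidden):
    """Mean squared error restricted to the hidden positions."""
    if not hidden:
        return 0.0
    return sum((reconstructed[i][j] - original[i][j]) ** 2
               for i, j in hidden) / len(hidden)


# Hypothetical toy matrix: rows are cells, columns are genes.
rng = random.Random(0)
expr = [[0.0, 3.0, 1.0, 0.0],
        [2.0, 0.0, 4.0, 1.0],
        [0.0, 5.0, 0.0, 2.0]]
corrupted, hidden = mask_nonzero(expr, mask_rate=0.5, rng=rng)
# An autoencoder would be trained to map `corrupted` back to `expr`;
# here the untrained "reconstruction" is simply the corrupted input.
baseline_loss = masked_mse(corrupted, expr, hidden)
```

Restricting the loss to hidden positions means the model is judged on values whose ground truth is known, which is what makes simulated dropout usable as a training signal.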
|
13
|
Han Y, Zhou Q, Liu L, Li J, Zhou Y. DNI-MDCAP: improvement of causal MiRNA-disease association prediction based on deep network imputation. BMC Bioinformatics 2024; 25:22. [PMID: 38216907 PMCID: PMC10785389 DOI: 10.1186/s12859-024-05644-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/07/2023] [Accepted: 01/08/2024] [Indexed: 01/14/2024] Open
Abstract
BACKGROUND MiRNAs are involved in the occurrence and development of many diseases. Extensive literature studies have demonstrated that miRNA-disease associations are stratified and encompass ~20% causal associations. Computational models that predict causal miRNA-disease associations provide effective guidance in identifying novel interpretations of disease mechanisms and potential therapeutic targets. Although several predictive models for miRNA-disease associations exist, it is still challenging to discriminate causal miRNA-disease associations from non-causal ones. Hence, there is a pressing need to develop an efficient prediction model for causal miRNA-disease association prediction. RESULTS We developed DNI-MDCAP, an improved computational model that incorporated additional miRNA similarity metrics, deep graph embedding learning-based network imputation and a semi-supervised learning framework. Through extensive predictive performance evaluation, including tenfold cross-validation and an independent test, DNI-MDCAP showed excellent performance in identifying causal miRNA-disease associations, achieving an area under the receiver operating characteristic curve (AUROC) of 0.896 and 0.889, respectively. Regarding the challenge of discriminating causal miRNA-disease associations from non-causal ones, DNI-MDCAP exhibited superior predictive performance compared to existing models MDCAP and LE-MDCAP, reaching an AUROC of 0.870. Wilcoxon test also indicated significantly higher prediction scores for causal associations than for non-causal ones. Finally, the potential causal miRNA-disease associations predicted by DNI-MDCAP, exemplified by diabetic nephropathies and hsa-miR-193a, have been validated by recently published literature, further supporting the reliability of the prediction model. CONCLUSIONS DNI-MDCAP is a dedicated tool to specifically distinguish causal miRNA-disease associations with substantially improved accuracy.
DNI-MDCAP is freely accessible at http://www.rnanut.net/DNIMDCAP/.
Affiliation(s)
- Yu Han
  - Department of Biomedical Informatics, School of Basic Medical Sciences, Peking University, Beijing, China
- Qiong Zhou
  - Department of Biomedical Informatics, School of Basic Medical Sciences, Peking University, Beijing, China
- Leibo Liu
  - Department of Biomedical Informatics, School of Basic Medical Sciences, Peking University, Beijing, China
- Jianwei Li
  - Institute of Computational Medicine, School of Artificial Intelligence, Hebei University of Technology, Tianjin, China
- Yuan Zhou
  - Department of Biomedical Informatics, School of Basic Medical Sciences, Peking University, Beijing, China.
  - State Key Laboratory of Vascular Homeostasis and Remodeling, Peking University, Beijing, China.
|
14
|
Mao G, Pang Z, Zuo K, Wang Q, Pei X, Chen X, Liu J. Predicting gene regulatory links from single-cell RNA-seq data using graph neural networks. Brief Bioinform 2023; 24:bbad414. [PMID: 37985457 PMCID: PMC10661972 DOI: 10.1093/bib/bbad414] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/03/2023] [Revised: 10/25/2023] [Accepted: 10/26/2023] [Indexed: 11/22/2023] Open
Abstract
Single-cell RNA-sequencing (scRNA-seq) has emerged as a powerful technique for studying gene expression patterns at the single-cell level. Inferring gene regulatory networks (GRNs) from scRNA-seq data provides insight into cellular phenotypes from the genomic level. However, the high sparsity, noise and dropout events inherent in scRNA-seq data present challenges for GRN inference. In recent years, the dramatic increase in data on experimentally validated transcription factors binding to DNA has made it possible to infer GRNs by supervised methods. In this study, we address the problem of GRN inference by framing it as a graph link prediction task. We propose a novel framework called GNNLink, which leverages known GRNs to deduce the potential regulatory interdependencies between genes. First, we preprocess the raw scRNA-seq data. Then, we introduce a graph convolutional network-based interaction graph encoder to effectively refine gene features by capturing interdependencies between nodes in the network. Finally, the GRN is inferred by performing a matrix completion operation on the node features. The features obtained from model training can be applied to downstream tasks such as measuring similarity and inferring causality between gene pairs. To evaluate the performance of GNNLink, we compare it with six existing GRN reconstruction methods using seven scRNA-seq datasets. These datasets encompass diverse ground truth networks, including functional interaction networks, Loss of Function/Gain of Function data, non-specific ChIP-seq data and cell-type-specific ChIP-seq data. Our experimental results demonstrate that GNNLink achieves comparable or superior performance across these datasets, showcasing its robustness and accuracy. Furthermore, we observe consistent performance across datasets of varying scales. For reproducibility, we provide the data and source code of GNNLink on our GitHub repository: https://github.com/sdesignates/GNNLink.
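To make the "graph encoder followed by link scoring" idea concrete, here is a minimal pure-Python sketch. It is not the GNNLink code: it assumes the standard symmetric-normalized graph convolution with self-loops and a sigmoid inner-product decoder, which is a common simplification (and symmetric, whereas real regulatory links are directed). All function names, the toy adjacency matrix, features and weights are hypothetical.

```python
import math


def matmul(a, b):
    """Plain dense matrix product for lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def normalize_adjacency(adj):
    """Return D^-1/2 (A + I) D^-1/2, the usual GCN propagation matrix."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]  # self-loops guarantee deg > 0
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def gcn_layer(adj_norm, features, weights):
    """One graph convolution: ReLU(A_norm @ X @ W)."""
    h = matmul(matmul(adj_norm, features), weights)
    return [[max(0.0, v) for v in row] for row in h]


def link_scores(embeddings):
    """Score every gene pair with a sigmoid of the embedding inner product."""
    return [[1.0 / (1.0 + math.exp(-sum(x * y for x, y in zip(u, v))))
             for v in embeddings] for u in embeddings]


# Toy example: three genes, one known link between gene 0 and gene 1.
adj = [[0.0, 1.0, 0.0],
       [1.0, 0.0, 0.0],
       [0.0, 0.0, 0.0]]
features = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0]]  # stand-in expression features
weights = [[1.0, 0.0], [0.0, 1.0]]               # identity instead of trained W
embeddings = gcn_layer(normalize_adjacency(adj), features, weights)
scores = link_scores(embeddings)
```

In this toy case the linked genes 0 and 1 end up with identical embeddings after propagation, so their pair score is higher than the score between gene 0 and the isolated gene 2. In a trained model the weights would be learned from known regulatory links rather than fixed.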
Affiliation(s)
- Guo Mao
  - Science and Technology on Parallel and Distributed Processing Laboratory, National University of Defense Technology, deya, 410073 Changsha, China
- Zhengbin Pang
  - Science and Technology on Parallel and Distributed Processing Laboratory, National University of Defense Technology, deya, 410073 Changsha, China
- Ke Zuo
  - Science and Technology on Parallel and Distributed Processing Laboratory, National University of Defense Technology, deya, 410073 Changsha, China
- Qinglin Wang
  - Science and Technology on Parallel and Distributed Processing Laboratory, National University of Defense Technology, deya, 410073 Changsha, China
- Xiangdong Pei
  - Science and Technology on Parallel and Distributed Processing Laboratory, National University of Defense Technology, deya, 410073 Changsha, China
- Xinhai Chen
  - Science and Technology on Parallel and Distributed Processing Laboratory, National University of Defense Technology, deya, 410073 Changsha, China
- Jie Liu
  - Science and Technology on Parallel and Distributed Processing Laboratory, National University of Defense Technology, deya, 410073 Changsha, China
  - Laboratory of Software Engineering for Complex System, National University of Defense Technology, deya, 410073 Changsha, China
|
15
|
Shi Y, Wan J, Zhang X, Yin Y. CL-Impute: A contrastive learning-based imputation for dropout single-cell RNA-seq data. Comput Biol Med 2023; 164:107263. [PMID: 37531858 DOI: 10.1016/j.compbiomed.2023.107263] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/09/2023] [Revised: 06/27/2023] [Accepted: 07/16/2023] [Indexed: 08/04/2023]
Abstract
BACKGROUND Single-cell RNA-sequencing (scRNA-seq) technology has revolutionized the study of cell heterogeneity and biological interpretation at the single-cell level. However, the dropout events commonly present in scRNA-seq data can markedly reduce the reliability of downstream analysis. Existing imputation methods often overlook the discrepancy between the cell relationships established from noisy dropout data and reality, which limits their performance because the learned cell representations are untrustworthy. METHOD Here, we propose a novel approach called the CL-Impute (Contrastive Learning-based Impute) model for estimating missing genes without relying on preconstructed cell relationships. CL-Impute utilizes contrastive learning and a self-attention network to address this challenge. Specifically, the proposed CL-Impute model leverages contrastive learning to learn cell representations from the self-perspective of dropout events, whereas the self-attention network captures cell relationships from the global perspective. RESULTS Experimental results on four benchmark datasets, including quantitative assessment, cell clustering, gene identification, and trajectory inference, demonstrate the superior performance of CL-Impute compared with that of existing state-of-the-art imputation methods. Furthermore, our experiments reveal that combining contrastive learning and masking cell augmentation enables the model to learn actual latent features from noisy data with a high rate of dropout events, enhancing the reliability of imputed values. CONCLUSIONS CL-Impute is a novel contrastive learning-based method to impute scRNA-seq data in the context of high dropout rates. The source code of CL-Impute is available at https://github.com/yuchen21-web/Imputation-for-scRNA-seq.
Affiliation(s)
- Yuchen Shi
  - School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou, 310018, China; Key Laboratory of Complex Systems Modeling and Simulation Ministry of Education, Ministry of Education, China
- Jian Wan
  - School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou, 310018, China; School of Information and Electronic Engineering, Zhejiang University of Science and Technology, Hangzhou, 310023, China
- Xin Zhang
  - School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou, 310018, China; Key Laboratory of Complex Systems Modeling and Simulation Ministry of Education, Ministry of Education, China.
- Yuyu Yin
  - School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou, 310018, China; Key Laboratory of Complex Systems Modeling and Simulation Ministry of Education, Ministry of Education, China.
|
16
|
Pandey D, Onkara PP. Improved downstream functional analysis of single-cell RNA-sequence data using DGAN. Sci Rep 2023; 13:1618. [PMID: 36709340 PMCID: PMC9884242 DOI: 10.1038/s41598-023-28952-y] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 08/10/2022] [Accepted: 01/27/2023] [Indexed: 01/29/2023] Open
Abstract
The dramatic increase in the number of single-cell RNA-sequence (scRNA-seq) investigations reflects the new capabilities of next-generation sequencing technologies, which enable accurate measurement of tens of thousands of RNA expression levels at cellular resolution. Nevertheless, missing values arising from RNA amplification persist and remain a significant computational challenge, as these data omissions introduce further noise into the cellular data and ultimately impede downstream functional analysis of scRNA-seq data. Consequently, it is imperative to develop robust and efficient scRNA-seq data imputation methods for improved downstream functional analysis. To this end, we have designed an imputation framework, the deep generative autoencoder network (DGAN). In essence, DGAN is an evolved variational autoencoder designed to robustly impute data dropouts in scRNA-seq data, which manifest as a sparse gene expression matrix. DGAN principally accounts for the count distribution and data sparsity using a Gaussian model, whereby cell dependencies are exploited to detect and exclude outlier cells via imputation. When tested on five publicly available scRNA-seq datasets, DGAN outperformed every baseline method it was compared with in downstream functional analysis, including cell data visualization, clustering, classification and differential expression analysis. DGAN is implemented in Python and is accessible at https://github.com/dikshap11/DGAN.
Affiliation(s)
- Diksha Pandey
  - Department of Biotechnology, National Institute of Technology, Warangal, India
- Perumal P Onkara
  - Department of Biotechnology, National Institute of Technology, Warangal, India.
|
17
|
Qi Y, Han S, Tang L, Liu L. Imputation method for single-cell RNA-seq data using neural topic model. Gigascience 2022; 12:giad098. [PMID: 38000911 PMCID: PMC10673642 DOI: 10.1093/gigascience/giad098] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/30/2023] [Revised: 09/02/2023] [Accepted: 10/23/2023] [Indexed: 11/26/2023] Open
Abstract
Single-cell RNA sequencing (scRNA-seq) technology studies the transcriptome and cell-to-cell differences at higher single-cell resolution and from different perspectives. Despite the advantage of high capture efficiency, downstream functional analysis of scRNA-seq data is made difficult by the excess of zero values (i.e., the dropout phenomenon). To effectively address this problem, we introduce scNTImpute, an imputation framework based on a neural topic model. A neural network encoder is used to extract underlying topic features of single-cell transcriptome data to infer high-quality cell similarity. At the same time, we determine which transcriptome data are affected by the dropout phenomenon using a mixture model learned by the neural network. On the basis of stable cell similarity, information on the same gene in other similar cells is borrowed to impute only the missing expression values. In evaluations on real data, scNTImpute identifies dropout values accurately and efficiently and imputes them accurately. In addition, the clustering of cell subsets is improved, and the original biological information that was obscured by technical noise is recovered. The source code for the scNTImpute module is available as open source at https://github.com/qiyueyang-7/scNTImpute.git.
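The "borrow the same gene from similar cells, and impute only flagged values" step can be sketched as follows. This is an illustration under stated assumptions, not the scNTImpute code: the dropout flags and the cell similarity matrix are taken as given inputs (in the paper they come from the mixture model and the topic features), and the function name `impute_from_neighbors`, the choice of a similarity-weighted mean, and the parameter `k` are hypothetical.

```python
def impute_from_neighbors(expr, is_dropout, similarity, k=2):
    """Replace flagged entries with a similarity-weighted mean over the
    k most similar other cells; unflagged entries are left untouched."""
    n_cells = len(expr)
    imputed = [list(row) for row in expr]  # copy, keep the input intact
    for c in range(n_cells):
        # Rank the other cells by similarity to cell c and keep the top k.
        neighbors = sorted((o for o in range(n_cells) if o != c),
                           key=lambda o: similarity[c][o], reverse=True)[:k]
        total_w = sum(similarity[c][o] for o in neighbors)
        for g, flagged in enumerate(is_dropout[c]):
            if flagged and total_w > 0:
                imputed[c][g] = sum(similarity[c][o] * expr[o][g]
                                    for o in neighbors) / total_w
    return imputed


# Toy data: rows are cells, columns are genes. Only cell 0 / gene 1 is
# flagged as a dropout; the zero in cell 2 / gene 0 is treated as real.
expr = [[4.0, 0.0],
        [5.0, 2.0],
        [0.0, 6.0]]
is_dropout = [[False, True],
              [False, False],
              [False, False]]
similarity = [[1.0, 0.9, 0.1],
              [0.9, 1.0, 0.2],
              [0.1, 0.2, 1.0]]
imputed = impute_from_neighbors(expr, is_dropout, similarity, k=1)
```

With `k=1`, cell 0 borrows gene 1 from its most similar neighbor (cell 1), while the unflagged zero in cell 2 stays zero. Keeping unflagged zeros is the point of distinguishing dropouts from true biological zeros before imputing.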
Affiliation(s)
- Yueyang Qi
  - Yunnan Normal University, School of Information, Kunming 650500, China
- Shuangkai Han
  - Yunnan Normal University, School of Information, Kunming 650500, China
- Lin Tang
  - Yunnan Normal University, Faculty of Education, Kunming 650500, China
- Lin Liu
  - Yunnan Normal University, School of Information, Kunming 650500, China
|