1. Ge S, Sun S, Xu H, Cheng Q, Ren Z. Deep learning in single-cell and spatial transcriptomics data analysis: advances and challenges from a data science perspective. Brief Bioinform 2025; 26:bbaf136. [PMID: 40185158 PMCID: PMC11970898 DOI: 10.1093/bib/bbaf136] [Received: 12/16/2024] [Revised: 02/17/2025] [Accepted: 03/05/2025] [Indexed: 04/07/2025] Open
Abstract
The development of single-cell and spatial transcriptomics has revolutionized our capacity to investigate cellular properties, functions, and interactions in both cellular and spatial contexts. Despite this progress, the analysis of single-cell and spatial omics data remains challenging. First, single-cell sequencing data are high-dimensional and sparse, and are often contaminated by noise and uncertainty, obscuring the underlying biological signal. Second, these data often encompass multiple modalities, including gene expression, epigenetic modifications, metabolite levels, and spatial locations. Integrating these diverse data modalities is crucial for enhancing prediction accuracy and biological interpretability. Third, while the scale of single-cell sequencing has expanded to millions of cells, high-quality annotated datasets are still limited. Fourth, the complex correlations of biological tissues make it difficult to accurately reconstruct cellular states and spatial contexts. Traditional feature engineering approaches struggle with the complexity of biological networks, while deep learning, with its ability to handle high-dimensional data and automatically identify meaningful patterns, has shown great promise in overcoming these challenges. Besides systematically reviewing the strengths and weaknesses of advanced deep learning methods, we have curated 21 datasets from nine benchmarks to evaluate the performance of 58 computational methods. Our analysis reveals that model performance can vary significantly across different benchmark datasets and evaluation metrics, providing a useful perspective for selecting the most appropriate approach based on a specific application scenario. We highlight three key areas for future development, offering valuable insights into how deep learning can be effectively applied to transcriptomic data analysis in biological, medical, and clinical settings.
Affiliation(s)
- Shuang Ge
  - Shenzhen International Graduate School, Tsinghua University, 2279 Lishui Road, Nanshan District, Shenzhen 518055, Guangdong, China
  - Pengcheng Laboratory, 6001 Shahe West Road, Nanshan District, Shenzhen 518055, Guangdong, China
- Shuqing Sun
  - Shenzhen International Graduate School, Tsinghua University, 2279 Lishui Road, Nanshan District, Shenzhen 518055, Guangdong, China
- Huan Xu
  - School of Public Health, Anhui University of Science and Technology, 15 Fengxia Road, Changfeng County, Hefei 231131, Anhui, China
- Qiang Cheng
  - Department of Computer Science, University of Kentucky, 329 Rose Street, Lexington 40506, Kentucky, USA
  - Institute for Biomedical Informatics, University of Kentucky, 800 Rose Street, Lexington 40506, Kentucky, USA
- Zhixiang Ren
  - Pengcheng Laboratory, 6001 Shahe West Road, Nanshan District, Shenzhen 518055, Guangdong, China
2. Li CY, Hong YJ, Li B, Zhang XF. Benchmarking single-cell cross-omics imputation methods for surface protein expression. Genome Biol 2025; 26:46. [PMID: 40038818 PMCID: PMC11881419 DOI: 10.1186/s13059-025-03514-9] [Received: 08/01/2024] [Accepted: 02/24/2025] [Indexed: 03/06/2025] Open
Abstract
BACKGROUND: Recent advances in single-cell multimodal omics sequencing have facilitated the simultaneous profiling of transcriptomes and surface proteomes within individual cells, offering insights into cellular functions and heterogeneity. However, the high costs and technical complexity of protocols like CITE-seq and REAP-seq constrain large-scale dataset generation. To overcome this limitation, surface protein data imputation methods have emerged to predict protein abundances from scRNA-seq data. RESULTS: We present a comprehensive benchmark of twelve state-of-the-art imputation methods across eleven datasets and six scenarios. Our analysis evaluates the methods' accuracy, sensitivity to training data size, robustness across experiments, and usability in terms of running time, memory usage, popularity, and user-friendliness. With benchmark experiments in diverse scenarios and a comprehensive evaluation framework of the results, our study offers valuable insights into the performance and applicability of surface protein data imputation methods in single-cell omics research. CONCLUSIONS: Based on our results, Seurat v4 (PCA) and Seurat v3 (PCA) demonstrate exceptional performance, offering promising avenues for further research in single-cell omics.
Affiliation(s)
- Chen-Yang Li
  - School of Mathematics and Statistics, and Hubei Key Lab-Math. Sci., Central China Normal University, Wuhan, 430079, China
- Yong-Jia Hong
  - School of Mathematics and Statistics, and Hubei Key Lab-Math. Sci., Central China Normal University, Wuhan, 430079, China
- Bo Li
  - School of Mathematics and Statistics, and Hubei Key Lab-Math. Sci., Central China Normal University, Wuhan, 430079, China
  - Key Laboratory of Nonlinear Analysis & Applications (Ministry of Education), Central China Normal University, Wuhan, 430079, China
- Xiao-Fei Zhang
  - School of Mathematics and Statistics, and Hubei Key Lab-Math. Sci., Central China Normal University, Wuhan, 430079, China.
  - Key Laboratory of Nonlinear Analysis & Applications (Ministry of Education), Central China Normal University, Wuhan, 430079, China.
3. Hu Y, Wan S, Luo Y, Li Y, Wu T, Deng W, Jiang C, Jiang S, Zhang Y, Liu N, Yang Z, Chen F, Li B, Qu K. Benchmarking algorithms for single-cell multi-omics prediction and integration. Nat Methods 2024; 21:2182-2194. [PMID: 39322753 DOI: 10.1038/s41592-024-02429-w] [Received: 04/14/2023] [Accepted: 08/19/2024] [Indexed: 09/27/2024]
Abstract
The development of single-cell multi-omics technology has greatly enhanced our understanding of biology, and in parallel, numerous algorithms have been proposed to predict the protein abundance and/or chromatin accessibility of cells from single-cell transcriptomic information and to integrate various types of single-cell multi-omics data. However, few studies have systematically compared and evaluated the performance of these algorithms. Here, we present a benchmark study of 14 protein abundance/chromatin accessibility prediction algorithms and 18 single-cell multi-omics integration algorithms using 47 single-cell multi-omics datasets. Our benchmark study showed that, overall, totalVI and scArches outperformed the other algorithms for predicting protein abundance, and LS_Lab was the top-performing algorithm for the prediction of chromatin accessibility in most cases. Seurat, MOJITOO and scAI emerge as leading algorithms for vertical integration, whereas totalVI and UINMF excel beyond their counterparts in both horizontal and mosaic integration scenarios. Additionally, we provide a pipeline to assist researchers in selecting the optimal multi-omics prediction and integration algorithm.
Affiliation(s)
- Yinlei Hu
  - Department of Oncology, The First Affiliated Hospital of USTC, School of Basic Medical Sciences, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, China
  - Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Hefei, China
  - School of Mathematical Science, University of Science and Technology of China, Hefei, China
- Siyuan Wan
  - Department of Oncology, The First Affiliated Hospital of USTC, School of Basic Medical Sciences, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, China
  - Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Hefei, China
  - School of Artificial Intelligence and Data Science, University of Science and Technology of China, Hefei, China
- Yuanhanyu Luo
  - Tsinghua Institute of Multidisciplinary Biomedical Research, Tsinghua University, Beijing, China
  - National Institute of Biological Sciences, Beijing, China
- Yuanzhe Li
  - Department of Oncology, The First Affiliated Hospital of USTC, School of Basic Medical Sciences, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, China
  - Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Hefei, China
  - School of Artificial Intelligence and Data Science, University of Science and Technology of China, Hefei, China
- Tong Wu
  - National Institute of Biological Sciences, Beijing, China
  - College of Life Sciences, Beijing Normal University, Beijing, China
- Wentao Deng
  - Department of Oncology, The First Affiliated Hospital of USTC, School of Basic Medical Sciences, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, China
  - Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Hefei, China
- Chen Jiang
  - Department of Oncology, The First Affiliated Hospital of USTC, School of Basic Medical Sciences, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, China
  - Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Hefei, China
- Shan Jiang
  - National Institute of Biological Sciences, Beijing, China
- Yueping Zhang
  - School of Artificial Intelligence and Data Science, University of Science and Technology of China, Hefei, China
- Nianping Liu
  - School of Biomedical Engineering, Suzhou Institute for Advanced Research, University of Science and Technology of China, Suzhou, China
- Zongcheng Yang
  - Department of Oncology, The First Affiliated Hospital of USTC, School of Basic Medical Sciences, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, China
- Falai Chen
  - School of Mathematical Science, University of Science and Technology of China, Hefei, China.
  - School of Artificial Intelligence and Data Science, University of Science and Technology of China, Hefei, China.
- Bin Li
  - Tsinghua Institute of Multidisciplinary Biomedical Research, Tsinghua University, Beijing, China.
  - National Institute of Biological Sciences, Beijing, China.
- Kun Qu
  - Department of Oncology, The First Affiliated Hospital of USTC, School of Basic Medical Sciences, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, China.
  - Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Hefei, China.
  - School of Artificial Intelligence and Data Science, University of Science and Technology of China, Hefei, China.
  - School of Biomedical Engineering, Suzhou Institute for Advanced Research, University of Science and Technology of China, Suzhou, China.
4. Zhao G, Wang Y, Zhou J, Ma P, Wang S, Li N. Pan-cancer analysis of polo-like kinase family genes reveals polo-like kinase 1 as a novel oncogene in kidney renal papillary cell carcinoma. Heliyon 2024; 10:e29373. [PMID: 38644836 PMCID: PMC11033160 DOI: 10.1016/j.heliyon.2024.e29373] [Received: 10/08/2023] [Revised: 04/02/2024] [Accepted: 04/07/2024] [Indexed: 04/23/2024] Open
Abstract
BACKGROUND: Polo-like kinases (PLKs) are a class of serine/threonine kinases with five members that play crucial roles in cell cycle regulation. However, their biological functions, regulation, and expression remain unclear. This study revealed the molecular properties, oncogenic role, and clinical significance of PLK genes in pan-cancer, particularly in kidney renal papillary cell carcinoma (KIRP). METHODS: We evaluated the mutation landscape, expression level, and prognostic values of PLK genes using bioinformatics analyses and explored the association between the expression level of PLK genes and tumor microenvironment (TME), immune subtype, cancer immunotherapy, tumor stemness, and drug sensitivity. Finally, we verified the prognostic value in patients with KIRP through univariate and multivariate analyses and nomogram construction. RESULTS: PLK genes are extensively altered in pan-cancer, which may contribute to tumorigenesis. These genes are aberrantly expressed in some types of cancer, with PLK1 being overexpressed in 31 cancers. PLK expression is closely associated with the prognosis of various cancers. The expression level of PLK genes is related to sensitivity to diverse drugs, cancer immunity, and cancer immunotherapy. Importantly, we verified that PLK1 was overexpressed in KIRP tissues and could be an unfavorable prognostic biomarker in patients with KIRP. Hence, PLK1 may serve as an oncogene in KIRP and should be explored in future studies. CONCLUSIONS: Our study comprehensively reports the molecular characteristics and biological functions of PLK family genes across human cancers and recommends further investigation of these genes as potential biomarkers and therapeutic targets, especially in KIRP.
Affiliation(s)
- Jiawei Zhou
  - Clinical Trial Center, National Cancer Center/National Clinical Research Center for Cancer/Cancer Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, 100021, China
- Peiwen Ma
  - Clinical Trial Center, National Cancer Center/National Clinical Research Center for Cancer/Cancer Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, 100021, China
- Shuhang Wang
  - Clinical Trial Center, National Cancer Center/National Clinical Research Center for Cancer/Cancer Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, 100021, China
- Ning Li
  - Clinical Trial Center, National Cancer Center/National Clinical Research Center for Cancer/Cancer Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, 100021, China
5. Wang Y, Chen Q, Shao H, Zhang R, Shen H. Generating bulk RNA-Seq gene expression data based on generative deep learning models and utilizing it for data augmentation. Comput Biol Med 2024; 169:107828. [PMID: 38101117 DOI: 10.1016/j.compbiomed.2023.107828] [Received: 08/19/2023] [Revised: 11/22/2023] [Accepted: 12/04/2023] [Indexed: 12/17/2023]
Abstract
Large-scale high-throughput transcriptome sequencing data holds significant value in biomedical research. However, practical challenges such as difficulty in sample acquisition often limit the availability of large sample sizes, leading to decreased reliability of the analysis results. In practice, generative deep learning models, such as Generative Adversarial Networks (GANs) and Diffusion Models (DMs), have been shown to generate realistic data and may be used to solve this problem. In this study, we utilized bulk RNA-Seq gene expression data to construct different generative models with two data preprocessing methods: Min-Max-GAN, Z-Score-GAN, Min-Max-DM, and Z-Score-DM. We demonstrated that the generated data from the Min-Max-GAN model exhibited high similarity to real data, surpassing the performance of the other models significantly. Furthermore, we trained the models on the largest dataset available to date, achieving a Maximum Mean Discrepancy (MMD) of 0.030 and 0.033 on the training and independent datasets, respectively. Through SHAP (SHapley Additive exPlanations) explanations of our generative model, we also enhanced our model's credibility. Finally, we applied the generated data to data augmentation and observed a significant improvement in the performance of classification models. In summary, this study establishes a GAN-based approach for generating bulk RNA-Seq gene expression data, which contributes to enhancing the performance and reliability of downstream tasks in high-throughput transcriptome analysis.
Affiliation(s)
- Yinglun Wang
  - School of Life Sciences and Biopharmaceutics, Guangdong Pharmaceutical University, Guangzhou, 51006, PR China
- Qiurui Chen
  - School of Life Sciences and Biopharmaceutics, Guangdong Pharmaceutical University, Guangzhou, 51006, PR China
- Hongwei Shao
  - School of Life Sciences and Biopharmaceutics, Guangdong Pharmaceutical University, Guangzhou, 51006, PR China
- Rongxin Zhang
  - School of Life Sciences and Biopharmaceutics, Guangdong Pharmaceutical University, Guangzhou, 51006, PR China.
- Han Shen
  - School of Life Sciences and Biopharmaceutics, Guangdong Pharmaceutical University, Guangzhou, 51006, PR China.