1
Kirchgaessner R, Watson C, Creason A, Keutler K, Goecks J. Imputing single-cell protein abundance in multiplex tissue imaging. Nat Commun 2025; 16:4747. [PMID: 40404617 PMCID: PMC12098973 DOI: 10.1038/s41467-025-59788-x] [Received: 01/11/2024] [Accepted: 05/06/2025] [Indexed: 05/24/2025] Open
Abstract
Multiplex tissue imaging enables single-cell spatial proteomics and transcriptomics but remains limited by incomplete molecular profiling, tissue loss, and probe failure. Here, we apply machine learning to impute single-cell protein abundance using multiplex tissue imaging data from a breast cancer cohort. We evaluate regularized linear regression, gradient-boosted trees, and deep learning autoencoders, incorporating spatial context to enhance imputation accuracy. Our models achieve mean absolute errors between 0.05-0.3 on a [0,1] scale, closely approximating ground truth values. Using imputed data, we classify single cells as pre- or post-treatment, demonstrating their biological relevance. These findings establish the feasibility of imputing missing protein abundance, highlight the advantages of spatial information, and support machine learning as a powerful tool for improving single-cell tissue imaging.
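To make the error figure quoted in the abstract concrete, the following is a minimal sketch of a mean absolute error computed on values rescaled to [0,1]. The min-max scaling, the function names, and the toy intensities are assumptions for illustration only; the abstract does not state how the authors scaled their data.

```python
def min_max_scale(values, lo, hi):
    """Map raw values into [0, 1] using fixed bounds lo < hi (assumed scaling)."""
    return [(v - lo) / (hi - lo) for v in values]


def mean_absolute_error(truth, imputed):
    """Average absolute difference between measured and imputed values."""
    if len(truth) != len(imputed):
        raise ValueError("truth and imputed must have the same length")
    return sum(abs(t - p) for t, p in zip(truth, imputed)) / len(truth)


# Hypothetical marker intensities for four cells (not data from the paper).
measured = [10.0, 20.0, 30.0, 50.0]
imputed = [12.0, 18.0, 35.0, 45.0]
lo, hi = 10.0, 50.0  # bounds taken from the measured values

mae = mean_absolute_error(
    min_max_scale(measured, lo, hi),
    min_max_scale(imputed, lo, hi),
)
print(f"MAE on [0,1] scale: {mae:.4f}")
```

On this scale an error of 0.05 corresponds to 5% of the marker's observed range, which is one way to read the 0.05 to 0.3 interval reported above.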
Affiliation(s)
- Raphael Kirchgaessner
  - Department of Biomedical Engineering, Oregon Health & Science University, Portland, OR, USA
  - The Knight Cancer Institute, Oregon Health & Science University, Portland, OR, USA
- Cameron Watson
  - Department of Biomedical Engineering, Oregon Health & Science University, Portland, OR, USA
  - The Knight Cancer Institute, Oregon Health & Science University, Portland, OR, USA
- Allison Creason
  - Department of Biomedical Engineering, Oregon Health & Science University, Portland, OR, USA
  - The Knight Cancer Institute, Oregon Health & Science University, Portland, OR, USA
- Kaya Keutler
  - Department of Chemical Physiology and Biochemistry, Oregon Health & Science University, Portland, OR, USA
- Jeremy Goecks
  - Department of Biomedical Engineering, Oregon Health & Science University, Portland, OR, USA
  - Department of Machine Learning, Moffitt Cancer Center, Tampa, FL, USA
2
Wu Y, Xu L, Cong X, Li H, Li Y. scMASKGAN: masked multi-scale CNN and attention-enhanced GAN for scRNA-seq dropout imputation. BMC Bioinformatics 2025; 26:130. [PMID: 40394489 PMCID: PMC12093817 DOI: 10.1186/s12859-025-06138-9] [Received: 03/01/2025] [Accepted: 04/08/2025] [Indexed: 05/22/2025] Open
Abstract
Single-cell RNA sequencing (scRNA-seq) enables high-resolution analysis of cellular heterogeneity, but dropout events, where gene expression is undetected in individual cells, present a significant challenge. We propose scMASKGAN, which transforms matrix imputation into a pixel restoration task to improve the recovery of missing gene expression data. Specifically, we integrate masking, convolutional neural networks (CNNs), attention mechanisms, and residual networks (ResNets) to effectively address dropout events in scRNA-seq data. The masking mechanism ensures the preservation of complete cellular information, while convolution and attention mechanisms are employed to capture both global and local features. Residual networks augment feature representation and effectively mitigate the risk of model overfitting. Additionally, cell-type labels are incorporated as constraints to guide the methods in learning more accurate cellular features. Finally, multiple experiments were conducted to evaluate the methods' performance using seven different data types and scRNA-seq data from ten neuroblastoma samples. The results demonstrate that the data imputed by scMASKGAN not only perform excellently across various evaluation metrics but also significantly enhance the effectiveness of downstream analyses, enabling a more comprehensive exploration of underlying biological information.
Affiliation(s)
- You Wu
  - College of Computer Science and Technology, Harbin Engineering University, Harbin, China
  - National Engineering Laboratory for Modeling and Emulation in E-Government, Beijing, China
- Li Xu
  - College of Computer Science and Technology, Harbin Engineering University, Harbin, China
  - National Engineering Laboratory for Modeling and Emulation in E-Government, Beijing, China
- Xiaohong Cong
  - College of Computer Science and Technology, Harbin Engineering University, Harbin, China
  - National Engineering Laboratory for Modeling and Emulation in E-Government, Beijing, China
- Hanxiao Li
  - College of Information Technology, University of New South Wales, Sydney, Australia
- Yanli Li
  - College of Computer Science and Technology, Harbin Engineering University, Harbin, China
  - National Engineering Laboratory for Modeling and Emulation in E-Government, Beijing, China
3
Pandey AC, Bezney J, DeAscanis D, Kirsch EB, Ahmed F, Crinklaw A, Choudhary KS, Mandala T, Deason J, Hamidi JS, Siddique A, Ranganathan S, Brown K, Armstrong J, Head S, Ordoukhanian P, Steinmetz LM, Topol EJ. A CRISPR/Cas9-based enhancement of high-throughput single-cell transcriptomics. Nat Commun 2025; 16:4664. [PMID: 40389438 PMCID: PMC12089397 DOI: 10.1038/s41467-025-59880-2] [Received: 08/01/2024] [Accepted: 05/03/2025] [Indexed: 05/21/2025] Open
Abstract
Single-cell RNA-seq (scRNAseq) struggles to capture the cellular heterogeneity of transcripts within individual cells due to the prevalence of highly abundant and ubiquitous transcripts, which can obscure the detection of biologically distinct transcripts expressed up to several orders of magnitude lower levels. To address this challenge, here we introduce single-cell CRISPRclean (scCLEAN), a molecular method that globally recomposes scRNAseq libraries, providing a benefit that cannot be recapitulated with deeper sequencing. scCLEAN utilizes the programmability of CRISPR/Cas9 to target and remove less than 1% of the transcriptome while redistributing approximately half of reads, shifting the focus toward less abundant transcripts. We experimentally apply scCLEAN to both heterogeneous immune cells and homogenous vascular smooth muscle cells to demonstrate its ability to uncover biological signatures in different biological contexts. We further emphasize scCLEAN's versatility by applying it to a third-generation sequencing method, single-cell MAS-Seq, to increase transcript-level detection and discovery. Here we show the possible utility of scCLEAN across a wide array of human tissues and cell types, indicating which contexts this technology proves beneficial and those in which its application is not advisable.
Affiliation(s)
- Amitabh C Pandey
  - Section of Cardiology, Tulane Heart and Vascular Institute, Department of Medicine, Tulane University School of Medicine, New Orleans, LA, USA
  - Southeast Louisiana Veterans Health Care System, New Orleans, LA, USA
  - Department of Molecular Medicine, Scripps Research Translational Institute, The Scripps Research Institute, La Jolla, CA, USA
- Jon Bezney
  - Genomics Core Facility, The Scripps Research Institute, La Jolla, CA, USA
  - Jumpcode Genomics, San Diego, CA, USA
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA, USA
- Ethan B Kirsch
  - Department of Molecular Medicine, Scripps Research Translational Institute, The Scripps Research Institute, La Jolla, CA, USA
- Farin Ahmed
  - Genomics Core Facility, The Scripps Research Institute, La Jolla, CA, USA
- Tony Mandala
  - Genomics Core Facility, The Scripps Research Institute, La Jolla, CA, USA
- Jasmin S Hamidi
  - Department of Molecular Medicine, Scripps Research Translational Institute, The Scripps Research Institute, La Jolla, CA, USA
- Steven Head
  - Genomics Core Facility, The Scripps Research Institute, La Jolla, CA, USA
- Lars M Steinmetz
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA, USA
  - Stanford Genome Technology Center, Palo Alto, CA, USA
  - European Molecular Biology Laboratory (EMBL), Genome Biology Unit, Heidelberg, Germany
- Eric J Topol
  - Department of Molecular Medicine, Scripps Research Translational Institute, The Scripps Research Institute, La Jolla, CA, USA
4
Arya A, Tripathi P, Dubey N, Aier I, Kumar Varadwaj P. Navigating single-cell RNA-sequencing: protocols, tools, databases, and applications. Genomics Inform 2025; 23:13. [PMID: 40382658 DOI: 10.1186/s44342-025-00044-5] [Received: 01/22/2025] [Accepted: 04/07/2025] [Indexed: 05/20/2025] Open
Abstract
Single-cell RNA-sequencing (scRNA-seq) technology brought about a revolutionary change in the transcriptomic world, paving the way for comprehensive analysis of cellular heterogeneity in complex biological systems. It enabled researchers to see how different cells behaved at single-cell levels, providing new insights into the process. However, despite all these advancements, scRNA-seq also experiences challenges related to the complexity of data analysis, interpretation, and multi-omics data integration. In this review, these complications were discussed in detail, directly pointing at the optimization of scRNA-seq approaches and understanding the world of single-cell and its dynamics. Different protocols and currently functional single-cell databases were also covered. This review highlights different tools for the analysis of scRNA-seq and their methodologies, emphasizing innovative techniques that enhance resolution and accuracy at a single-cell level. Various applications were explored across domains including drug discovery, tumor microenvironment (TME), biomarker discovery, and microbial profiling, and case studies were discussed to explain the importance of scRNA-seq by uncovering novel and rare cell types and their identification. This review underlines a crucial aspect of scRNA-seq in the advancement of personalized medicine and highlights its potential to understand the complexity of biological systems.
Affiliation(s)
- Ankish Arya
  - Department of Applied Sciences, Indian Institute of Information Technology Allahabad, Jhalwa, Prayagraj, 211015, Uttar Pradesh, India
- Prabhat Tripathi
  - Department of Applied Sciences, Indian Institute of Information Technology Allahabad, Jhalwa, Prayagraj, 211015, Uttar Pradesh, India
- Nidhi Dubey
  - Department of Applied Sciences, Indian Institute of Information Technology Allahabad, Jhalwa, Prayagraj, 211015, Uttar Pradesh, India
- Imlimaong Aier
  - Department of Applied Sciences, Indian Institute of Information Technology Allahabad, Jhalwa, Prayagraj, 211015, Uttar Pradesh, India
- Pritish Kumar Varadwaj
  - Department of Applied Sciences, Indian Institute of Information Technology Allahabad, Jhalwa, Prayagraj, 211015, Uttar Pradesh, India
5
Wang J, Ye F, Chai H, Jiang Y, Wang T, Ran X, Xia Q, Xu Z, Fu Y, Zhang G, Wu H, Guo G, Guo H, Ruan Y, Wang Y, Xing D, Xu X, Zhang Z. Advances and applications in single-cell and spatial genomics. Sci China Life Sci 2025; 68:1226-1282. [PMID: 39792333 DOI: 10.1007/s11427-024-2770-x] [Received: 08/20/2024] [Accepted: 10/10/2024] [Indexed: 01/12/2025]
Abstract
The applications of single-cell and spatial technologies in recent times have revolutionized the present understanding of cellular states and the cellular heterogeneity inherent in complex biological systems. These advancements offer unprecedented resolution in the examination of the functional genomics of individual cells and their spatial context within tissues. In this review, we have comprehensively discussed the historical development and recent progress in the field of single-cell and spatial genomics. We have reviewed the breakthroughs in single-cell multi-omics technologies, spatial genomics methods, and the computational strategies employed toward the analyses of single-cell atlas data. Furthermore, we have highlighted the advances made in constructing cellular atlases and their clinical applications, particularly in the context of disease. Finally, we have discussed the emerging trends, challenges, and opportunities in this rapidly evolving field.
Affiliation(s)
- Jingjing Wang
  - Bone Marrow Transplantation Center of the First Affiliated Hospital & Liangzhu Laboratory, Zhejiang University School of Medicine, Hangzhou, 310058, China
- Fang Ye
  - Bone Marrow Transplantation Center of the First Affiliated Hospital & Liangzhu Laboratory, Zhejiang University School of Medicine, Hangzhou, 310058, China
- Haoxi Chai
  - Life Sciences Institute and The Second Affiliated Hospital, Zhejiang University, Hangzhou, 310058, China
- Yujia Jiang
  - BGI Research, Shenzhen, 518083, China
  - BGI Research, Hangzhou, 310030, China
- Teng Wang
  - Biomedical Pioneering Innovation Center (BIOPIC) and School of Life Sciences, Peking University, Beijing, 100871, China
  - Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, 100871, China
- Xia Ran
  - Bone Marrow Transplantation Center of the First Affiliated Hospital & Liangzhu Laboratory, Zhejiang University School of Medicine, Hangzhou, 310058, China
  - Institute of Hematology, Zhejiang University, Hangzhou, 310000, China
- Qimin Xia
  - Biomedical Pioneering Innovation Center (BIOPIC) and School of Life Sciences, Peking University, Beijing, 100871, China
- Ziye Xu
  - Department of Laboratory Medicine of The First Affiliated Hospital & Liangzhu Laboratory, Zhejiang University School of Medicine, Hangzhou, 310058, China
- Yuting Fu
  - Bone Marrow Transplantation Center of the First Affiliated Hospital & Liangzhu Laboratory, Zhejiang University School of Medicine, Hangzhou, 310058, China
  - Center for Stem Cell and Regenerative Medicine, Zhejiang University School of Medicine, Hangzhou, 310058, China
- Guodong Zhang
  - Bone Marrow Transplantation Center of the First Affiliated Hospital & Liangzhu Laboratory, Zhejiang University School of Medicine, Hangzhou, 310058, China
  - Center for Stem Cell and Regenerative Medicine, Zhejiang University School of Medicine, Hangzhou, 310058, China
- Hanyu Wu
  - Bone Marrow Transplantation Center of the First Affiliated Hospital & Liangzhu Laboratory, Zhejiang University School of Medicine, Hangzhou, 310058, China
  - Center for Stem Cell and Regenerative Medicine, Zhejiang University School of Medicine, Hangzhou, 310058, China
- Guoji Guo
  - Bone Marrow Transplantation Center of the First Affiliated Hospital & Liangzhu Laboratory, Zhejiang University School of Medicine, Hangzhou, 310058, China
  - Center for Stem Cell and Regenerative Medicine, Zhejiang University School of Medicine, Hangzhou, 310058, China
  - Zhejiang Provincial Key Lab for Tissue Engineering and Regenerative Medicine, Dr. Li Dak Sum & Yip Yio Chin Center for Stem Cell and Regenerative Medicine, Hangzhou, 310058, China
  - Institute of Hematology, Zhejiang University, Hangzhou, 310000, China
- Hongshan Guo
  - Bone Marrow Transplantation Center of the First Affiliated Hospital & Liangzhu Laboratory, Zhejiang University School of Medicine, Hangzhou, 310058, China
  - Institute of Hematology, Zhejiang University, Hangzhou, 310000, China
- Yijun Ruan
  - Life Sciences Institute and The Second Affiliated Hospital, Zhejiang University, Hangzhou, 310058, China
- Yongcheng Wang
  - Department of Laboratory Medicine of The First Affiliated Hospital & Liangzhu Laboratory, Zhejiang University School of Medicine, Hangzhou, 310058, China
- Dong Xing
  - Biomedical Pioneering Innovation Center (BIOPIC) and School of Life Sciences, Peking University, Beijing, 100871, China
  - Beijing Advanced Innovation Center for Genomics (ICG), Peking University, Beijing, 100871, China
- Xun Xu
  - BGI Research, Shenzhen, 518083, China
  - BGI Research, Hangzhou, 310030, China
  - Guangdong Provincial Key Laboratory of Genome Read and Write, BGI Research, Shenzhen, 518083, China
- Zemin Zhang
  - Biomedical Pioneering Innovation Center (BIOPIC) and School of Life Sciences, Peking University, Beijing, 100871, China
6
Pavel A, Grønberg MG, Clemmensen LH. The impact of dropouts in scRNAseq dense neighborhood analysis. Comput Struct Biotechnol J 2025; 27:1278-1285. [PMID: 40225837 PMCID: PMC11992407 DOI: 10.1016/j.csbj.2025.03.033] [Received: 11/01/2024] [Revised: 03/19/2025] [Accepted: 03/20/2025] [Indexed: 04/15/2025] Open
Abstract
Single cell RNA sequencing (scRNAseq) provides the possibility to investigate transcriptomic profiles on a single cell level. However, the data show unique challenges in comparison to bulk transcriptomic data, one being high dropout rates, which yields high sparsity data. Many classical analysis and preprocessing pipelines are based on the assumption that poor data can be counteracted by quantity and that similar cells (samples) are close to each other in space. Clustering is commonly used to detect clusters (dense local cell neighborhoods) under the assumption that similar cells are close to each other in space (where close is dependent on the (distance) metric used). The most commonly used clustering methodologies to detect dense local neighborhoods are based on graph clustering on a nearest neighbor graph. However, high dropout rates may break this assumption and make it difficult to reliably detect such dense local neighborhoods. We assess the cluster homogeneity and stability under increasing degrees of dropouts in one of the most popular clustering pipelines (dimensionality reduction + graph based clustering), as provided by scRNAseq analyses packages Seurat and Scanpy. Our study showcases that while the default pipeline performs well in terms of cluster homogeneity (i.e., cells in a cluster are of the same type), also with increasing dropout rates, the stability of clusters (i.e., cell pairs consistently being in the same cluster) decreases. This implies that sub-populations within cell types are increasingly difficult to identify under increasing dropout rates because observations are not consistently close. Our results challenge the current practice of using default clustering pipelines and the general assumption of identifiable local neighborhoods on high dropout data. Hence, these results suggest that careful consideration in interpretation and downstream analysis need to be made when relying on local neighborhoods and clusters on scRNAseq data. In addition, these results call for extensive benchmarking, to identify and provide methods robust in their local neighborhood relationships on data containing low to high dropout rates.
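The abstract describes cluster stability as cell pairs consistently landing in the same cluster. One way to make that idea concrete is sketched below: it compares two clustering runs by the Jaccard overlap of their co-clustered cell pairs. The Jaccard formulation, the function names, and the toy labels are assumptions for illustration, not the metric used in the paper.

```python
from itertools import combinations


def co_clustered_pairs(labels):
    """Return the set of index pairs (i, j), i < j, that share a cluster label."""
    return {
        (i, j)
        for i, j in combinations(range(len(labels)), 2)
        if labels[i] == labels[j]
    }


def pair_stability(labels_a, labels_b):
    """Jaccard overlap of co-clustered pairs between two runs (1.0 = identical)."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both runs must label the same cells")
    pairs_a = co_clustered_pairs(labels_a)
    pairs_b = co_clustered_pairs(labels_b)
    union = pairs_a | pairs_b
    if not union:  # every cell is a singleton in both runs
        return 1.0
    return len(pairs_a & pairs_b) / len(union)


# Hypothetical cluster labels for six cells at two dropout levels.
low_dropout = [0, 0, 0, 1, 1, 1]
high_dropout = [0, 0, 1, 1, 1, 2]

print(pair_stability(low_dropout, high_dropout))
```

Because the score depends only on which cells share a label, it is unaffected by clusters being renumbered between runs, which matters when comparing independent clustering outputs.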
Affiliation(s)
- Alisa Pavel
  - Department of Applied Mathematics and Computer Science, Technical University of Denmark, 2800, Kongens Lyngby, Denmark
- Manja Gersholm Grønberg
  - Department of Applied Mathematics and Computer Science, Technical University of Denmark, 2800, Kongens Lyngby, Denmark
- Line H. Clemmensen
  - Department of Applied Mathematics and Computer Science, Technical University of Denmark, 2800, Kongens Lyngby, Denmark
  - Department of Mathematical Sciences, University of Copenhagen, 2100, Copenhagen, Denmark
7
Wu CH, Zhou X, Chen M. Exploring and mitigating shortcomings in single-cell differential expression analysis with a new statistical paradigm. Genome Biol 2025; 26:58. [PMID: 40098192 PMCID: PMC11912664 DOI: 10.1186/s13059-025-03525-6] [Received: 12/18/2023] [Accepted: 03/05/2025] [Indexed: 03/19/2025] Open
Abstract
BACKGROUND Differential expression analysis is pivotal in single-cell transcriptomics for unraveling cell-type-specific responses to stimuli. While numerous methods are available to identify differentially expressed genes in single-cell data, recent evaluations of both single-cell-specific methods and methods adapted from bulk studies have revealed significant shortcomings in performance. In this paper, we dissect the four major challenges in single-cell differential expression analysis: excessive zeros, normalization, donor effects, and cumulative biases. These "curses" underscore the limitations and conceptual pitfalls in existing workflows. RESULTS To address the limitations of current single-cell differential expression analysis methods, we propose GLIMES, a statistical framework that leverages UMI counts and zero proportions within a generalized Poisson/Binomial mixed-effects model to account for batch effects and within-sample variation. We rigorously benchmarked GLIMES against six existing differential expression methods using three case studies and simulations across different experimental scenarios, including comparisons across cell types, tissue regions, and cell states. Our results demonstrate that GLIMES is more adaptable to diverse experimental designs in single-cell studies and effectively mitigates key shortcomings of current approaches, particularly those related to normalization procedures. By preserving biologically meaningful signals, GLIMES offers improved performance in detecting differentially expressed genes. CONCLUSIONS By using absolute RNA expression rather than relative abundance, GLIMES improves sensitivity, reduces false discoveries, and enhances biological interpretability. This paradigm shift challenges existing workflows and highlights the need for careful consideration of normalization strategies, ultimately paving the way for more accurate and robust single-cell transcriptomic analyses.
Affiliation(s)
- Chih-Hsuan Wu
  - Department of Statistics, University of Chicago, Chicago, USA
- Xiang Zhou
  - Department of Biostatistics, University of Michigan, Ann Arbor, USA
- Mengjie Chen
  - Department of Human Genetics and Department of Medicine, University of Chicago, Chicago, USA
8
Ge S, Sun S, Xu H, Cheng Q, Ren Z. Deep learning in single-cell and spatial transcriptomics data analysis: advances and challenges from a data science perspective. Brief Bioinform 2025; 26:bbaf136. [PMID: 40185158 PMCID: PMC11970898 DOI: 10.1093/bib/bbaf136] [Received: 12/16/2024] [Revised: 02/17/2025] [Accepted: 03/05/2025] [Indexed: 04/07/2025] Open
Abstract
The development of single-cell and spatial transcriptomics has revolutionized our capacity to investigate cellular properties, functions, and interactions in both cellular and spatial contexts. Despite this progress, the analysis of single-cell and spatial omics data remains challenging. First, single-cell sequencing data are high-dimensional and sparse, and are often contaminated by noise and uncertainty, obscuring the underlying biological signal. Second, these data often encompass multiple modalities, including gene expression, epigenetic modifications, metabolite levels, and spatial locations. Integrating these diverse data modalities is crucial for enhancing prediction accuracy and biological interpretability. Third, while the scale of single-cell sequencing has expanded to millions of cells, high-quality annotated datasets are still limited. Fourth, the complex correlations of biological tissues make it difficult to accurately reconstruct cellular states and spatial contexts. Traditional feature engineering approaches struggle with the complexity of biological networks, while deep learning, with its ability to handle high-dimensional data and automatically identify meaningful patterns, has shown great promise in overcoming these challenges. Besides systematically reviewing the strengths and weaknesses of advanced deep learning methods, we have curated 21 datasets from nine benchmarks to evaluate the performance of 58 computational methods. Our analysis reveals that model performance can vary significantly across different benchmark datasets and evaluation metrics, providing a useful perspective for selecting the most appropriate approach based on a specific application scenario. We highlight three key areas for future development, offering valuable insights into how deep learning can be effectively applied to transcriptomic data analysis in biological, medical, and clinical settings.
Affiliation(s)
- Shuang Ge
  - Shenzhen International Graduate School, Tsinghua University, 2279 Lishui Road, Nanshan District, Shenzhen 518055, Guangdong, China
  - Pengcheng Laboratory, 6001 Shahe West Road, Nanshan District, Shenzhen 518055, Guangdong, China
- Shuqing Sun
  - Shenzhen International Graduate School, Tsinghua University, 2279 Lishui Road, Nanshan District, Shenzhen 518055, Guangdong, China
- Huan Xu
  - School of Public Health, Anhui University of Science and Technology, 15 Fengxia Road, Changfeng County, Hefei 231131, Anhui, China
- Qiang Cheng
  - Department of Computer Science, University of Kentucky, 329 Rose Street, Lexington 40506, Kentucky, USA
  - Institute for Biomedical Informatics, University of Kentucky, 800 Rose Street, Lexington 40506, Kentucky, USA
- Zhixiang Ren
  - Pengcheng Laboratory, 6001 Shahe West Road, Nanshan District, Shenzhen 518055, Guangdong, China
9
Zhang W, Liu T, Zhang H, Li Y. AcImpute: a constraint-enhancing smooth-based approach for imputing single-cell RNA sequencing data. Bioinformatics 2025; 41:btae711. [PMID: 40037523 PMCID: PMC11890269 DOI: 10.1093/bioinformatics/btae711] [Received: 07/18/2024] [Revised: 10/14/2024] [Accepted: 02/27/2025] [Indexed: 03/06/2025] Open
Abstract
MOTIVATION Single-cell RNA sequencing (scRNA-seq) provides a powerful tool for studying cellular heterogeneity and complexity. However, dropout events in single-cell RNA-seq data severely hinder the effectiveness and accuracy of downstream analysis. Therefore, data preprocessing with imputation methods is crucial to scRNA-seq analysis. RESULTS To address the issue of oversmoothing in smoothing-based imputation methods, we present AcImpute, an unsupervised method that enhances imputation accuracy by constraining the smoothing weights among cells for genes with different expression levels. Compared with nine other imputation methods in cluster analysis and trajectory inference, the experimental results demonstrate that AcImpute effectively restores gene expression and preserves inter-cell variability, preventing oversmoothing and improving clustering and trajectory inference performance. AVAILABILITY AND IMPLEMENTATION The code is available at https://github.com/Liutto/AcImpute.
Affiliation(s)
- Wei Zhang
  - School of Mathematics and Physics, Wuhan Institute of Technology, Wuhan 430205, China
- Tiantian Liu
  - School of Mathematics and Physics, Wuhan Institute of Technology, Wuhan 430205, China
- Han Zhang
  - School of Mathematics and Physics, Wuhan Institute of Technology, Wuhan 430205, China
- Yuanyuan Li
  - School of Mathematics and Physics, Wuhan Institute of Technology, Wuhan 430205, China
10
Xia P, Wu W, Liu Q, Huang B, Wu M, Lin Z, Zhu M, Yu M, Qu Y, Li K, Wu L, Zhang R, Wang Q. SCANER: robust and sensitive identification of malignant cells from the scRNA-seq profiled tumor ecosystem. Brief Bioinform 2025; 26:bbaf175. [PMID: 40253692 PMCID: PMC12009548 DOI: 10.1093/bib/bbaf175] [Received: 08/20/2024] [Revised: 12/25/2024] [Accepted: 03/26/2025] [Indexed: 04/22/2025] Open
Abstract
Single-cell RNA sequencing (scRNA-seq) has enabled the dissection of complex tumor ecosystems. Recognition of malignant cells as an essential step has a profound impact on downstream interpretation. However, most existing computational strategies are based on prior knowledge of canonical cell-type markers. We have developed a marker-free approach, the Seed-Cluster based Approach for NEoplastic cells Recognition (SCANER), to identify malignant cells based on significant gene expression variations caused by genomic instability. Upon analyzing different cancer types, SCANER achieved superior accuracy and robustness in identifying malignant cells, effectively addressing dropout events and tumor purity variations. Besides, SCANER can significantly detect copy number variations (CNVs) in malignant cells compared to nonmalignant cells, which is further confirmed through the paired whole exome sequencing data. In conclusion, SCANER has the potential to facilitate the biological exploration of the tumor ecosystem by accurately identifying malignant cells and it is applicable across various solid cancer types regardless of prior knowledge. SCANER is available at https://github.com/woolingxiang/SCANER.
Affiliation(s)
- Peng Xia
  - School of Biological Science & Medical Engineering, Southeast University, 8 Dongnandaxue Road, Jiangning District, Nanjing 211189, Jiangsu, China
  - Department of Bioinformatics, Nanjing Medical University, 101 Longmian Avenue, Jiangning District, Nanjing 211166, Jiangsu, China
- Wei Wu
  - School of Biological Science & Medical Engineering, Southeast University, 8 Dongnandaxue Road, Jiangning District, Nanjing 211189, Jiangsu, China
  - Department of Bioinformatics, Nanjing Medical University, 101 Longmian Avenue, Jiangning District, Nanjing 211166, Jiangsu, China
- Quanzhong Liu
  - Department of Bioinformatics, Nanjing Medical University, 101 Longmian Avenue, Jiangning District, Nanjing 211166, Jiangsu, China
  - Institute for Brain Tumors, Jiangsu Collaborative Innovation Center for Cancer Personalized Medicine, Nanjing Medical University, 101 Longmian Avenue, Jiangning District, Nanjing 211166, Jiangsu, China
- Bin Huang
  - School of Biological Science & Medical Engineering, Southeast University, 8 Dongnandaxue Road, Jiangning District, Nanjing 211189, Jiangsu, China
  - Department of Bioinformatics, Nanjing Medical University, 101 Longmian Avenue, Jiangning District, Nanjing 211166, Jiangsu, China
- Min Wu
  - Department of Bioinformatics, Nanjing Medical University, 101 Longmian Avenue, Jiangning District, Nanjing 211166, Jiangsu, China
  - Institute for Brain Tumors, Jiangsu Collaborative Innovation Center for Cancer Personalized Medicine, Nanjing Medical University, 101 Longmian Avenue, Jiangning District, Nanjing 211166, Jiangsu, China
- Zihan Lin
  - Department of Bioinformatics, Nanjing Medical University, 101 Longmian Avenue, Jiangning District, Nanjing 211166, Jiangsu, China
  - Institute for Brain Tumors, Jiangsu Collaborative Innovation Center for Cancer Personalized Medicine, Nanjing Medical University, 101 Longmian Avenue, Jiangning District, Nanjing 211166, Jiangsu, China
- Mengyan Zhu
  - Department of Bioinformatics, Nanjing Medical University, 101 Longmian Avenue, Jiangning District, Nanjing 211166, Jiangsu, China
  - Institute for Brain Tumors, Jiangsu Collaborative Innovation Center for Cancer Personalized Medicine, Nanjing Medical University, 101 Longmian Avenue, Jiangning District, Nanjing 211166, Jiangsu, China
- Miao Yu
  - Department of Bioinformatics, Nanjing Medical University, 101 Longmian Avenue, Jiangning District, Nanjing 211166, Jiangsu, China
  - Institute for Brain Tumors, Jiangsu Collaborative Innovation Center for Cancer Personalized Medicine, Nanjing Medical University, 101 Longmian Avenue, Jiangning District, Nanjing 211166, Jiangsu, China
- Ying Qu
  - Department of Bioinformatics, Nanjing Medical University, 101 Longmian Avenue, Jiangning District, Nanjing 211166, Jiangsu, China
  - Institute for Brain Tumors, Jiangsu Collaborative Innovation Center for Cancer Personalized Medicine, Nanjing Medical University, 101 Longmian Avenue, Jiangning District, Nanjing 211166, Jiangsu, China
- Kening Li
  - Department of Bioinformatics, Nanjing Medical University, 101 Longmian Avenue, Jiangning District, Nanjing 211166, Jiangsu, China
  - Institute for Brain Tumors, Jiangsu Collaborative Innovation Center for Cancer Personalized Medicine, Nanjing Medical University, 101 Longmian Avenue, Jiangning District, Nanjing 211166, Jiangsu, China
- Lingxiang Wu
  - Department of Bioinformatics, Nanjing Medical University, 101 Longmian Avenue, Jiangning District, Nanjing 211166, Jiangsu, China
  - Institute for Brain Tumors, Jiangsu Collaborative Innovation Center for Cancer Personalized Medicine, Nanjing Medical University, 101 Longmian Avenue, Jiangning District, Nanjing 211166, Jiangsu, China
- Department of Neurosurgery, Beijing Tiantan Hospital, Capital Medical University, 119 South 4th Ring West Road, Fengtai District, Beijing 100070, China
| | - Ruohan Zhang
- Department of Bioinformatics, Nanjing Medical University, 101 Longmian Avenue, Jiangning District, Nanjing 211166, Jiangsu, China
- Institute for Brain Tumors, Jiangsu Collaborative Innovation Center for Cancer Personalized Medicine, Nanjing Medical University, 101 Longmian Avenue, Jiangning District, Nanjing 211166, Jiangsu, China
| | - Qianghu Wang
- School of Biological Science & Medical Engineering, Southeast University, 8 Dongnandaxue Road, Jiangning District, Nanjing 211189, Jiangsu, China
- Department of Bioinformatics, Nanjing Medical University, 101 Longmian Avenue, Jiangning District, Nanjing 211166, Jiangsu, China
- Institute for Brain Tumors, Jiangsu Collaborative Innovation Center for Cancer Personalized Medicine, Nanjing Medical University, 101 Longmian Avenue, Jiangning District, Nanjing 211166, Jiangsu, China
- The Affiliated Cancer Hospital of Nanjing Medical University, Jiangsu Cancer Hospital, Jiangsu Institute of Cancer Research, 42 Baiziting Road, Xuanwu District, Nanjing 210009, Jiangsu, China
- Department of Pathology, Jiangsu Province Hospital and the First Affiliated Hospital of Nanjing Medical University, 300 Guangzhou Road, Gulou District, Nanjing 210029, Jiangsu, China
| |
|
11
|
Juan W, Ahn KW, Chen YG, Lin CW. CCI: A Consensus Clustering-Based Imputation Method for Addressing Dropout Events in scRNA-Seq Data. Bioengineering (Basel) 2025; 12:31. [PMID: 39851305 PMCID: PMC11763284 DOI: 10.3390/bioengineering12010031] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/22/2024] [Revised: 12/29/2024] [Accepted: 12/30/2024] [Indexed: 01/26/2025] Open
Abstract
Single-cell RNA sequencing (scRNA-seq) is a cutting-edge technique in molecular biology and genomics that reveals cellular heterogeneity. However, scRNA-seq data often suffer from dropout events, meaning that certain genes exhibit very low or even zero expression levels due to technical limitations. Existing imputation methods for dropout events lack comprehensive evaluations in downstream analyses and do not demonstrate robustness across various scenarios. In response to this challenge, we propose a consensus clustering-based imputation (CCI) method. CCI clusters cells on each of several data subsets obtained by sampling across genes and summarizes the clustering outcomes to define cellular similarities. CCI then leverages the information from similar cells, weighted by these similarities, to impute gene expression levels. Our comprehensive evaluations demonstrate that CCI not only reconstructs the original data pattern, but also improves the performance of downstream analyses. CCI outperforms existing methods for data imputation under different scenarios, exhibiting accuracy, robustness, and generalization.
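The idea in this abstract (cluster cells on several gene subsets, turn co-clustering frequency into a cell-cell similarity, then borrow values from similar cells) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the plain k-means step, the caller-supplied `gene_subsets`, and the rule of replacing every zero with a similarity-weighted mean are all assumptions made for illustration.

```python
import random


def kmeans(points, k, rng, n_iter=20):
    """Plain Lloyd's k-means on a list of equal-length numeric lists."""
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assign each point to its nearest center (squared Euclidean distance).
        labels = [
            min(range(k), key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centers[c])))
            for p in points
        ]
        # Move each non-empty cluster's center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def consensus_similarity(cells, k, gene_subsets, seed=0):
    """Fraction of gene subsets on which each pair of cells is co-clustered."""
    rng = random.Random(seed)
    n = len(cells)
    together = [[0] * n for _ in range(n)]
    for genes in gene_subsets:
        # Restrict every cell to the genes of this subset, then cluster.
        sub = [[cell[g] for g in genes] for cell in cells]
        labels = kmeans(sub, k, rng)
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    together[i][j] += 1
    return [[count / len(gene_subsets) for count in row] for row in together]


def impute_zeros(cells, similarity):
    """Replace each zero with the similarity-weighted mean over the other cells."""
    imputed = [list(cell) for cell in cells]
    n = len(cells)
    for i in range(n):
        total = sum(similarity[i][j] for j in range(n) if j != i)
        for g, value in enumerate(cells[i]):
            if value != 0 or total <= 0:
                continue  # keep observed values; skip cells with no similar neighbours
            weighted = sum(similarity[i][j] * cells[j][g] for j in range(n) if j != i)
            imputed[i][g] = weighted / total
    return imputed
```

Here `cells` is a list of per-cell expression vectors and `gene_subsets` is a list of gene-index lists. Because zeros in dissimilar cells receive little weight, a gene that is zero across a whole group of similar cells stays close to zero, while an isolated zero is pulled toward its neighbours' values.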
Affiliation(s)
- Wanlin Juan
- Division of Biostatistics, Data Science Institute, Medical College of Wisconsin (MCW), Milwaukee, WI 53226, USA
| | - Kwang Woo Ahn
- Division of Biostatistics, Data Science Institute, Medical College of Wisconsin (MCW), Milwaukee, WI 53226, USA
| | - Yi-Guang Chen
- Department of Pediatrics, Medical College of Wisconsin (MCW), Milwaukee, WI 53226, USA
| | - Chien-Wei Lin
- Division of Biostatistics, Data Science Institute, Medical College of Wisconsin (MCW), Milwaukee, WI 53226, USA
| |
|
12
|
Lejun G, Like Y, Xinyi W, Shehai Z, Shuhua X. SeqBMC: Single-cell data processing using iterative block matrix completion algorithm based on matrix factorisation. IET Syst Biol 2025; 19:e70003. [PMID: 39943646 PMCID: PMC11821729 DOI: 10.1049/syb2.70003] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/09/2024] [Revised: 12/27/2024] [Accepted: 01/16/2025] [Indexed: 02/16/2025] Open
Abstract
With the development of high-throughput sequencing technology, the analysis of single-cell RNA sequencing data has become a focus of current research, and the processing of gene expression matrices after preprocessing is an active topic. This paper proposes an iterative block matrix completion algorithm, called SeqBMC, based on matrix factorisation. The algorithm completes the missing values in the gene expression matrix that are caused by limitations of sequencing technology. The matrix is partitioned into blocks according to gene frequency, matrix factorisation is applied to complete each smaller block, and the biological zeros that may exist in each block are retained. Experimental results show that the algorithm significantly improves the classification performance of the completed gene expression matrix, reaching an F1 score of 86.81%, which aids the recognition of cell types in sequencing data. Moreover, the completion relies only on machine learning, without requiring much prior biological knowledge, and still performs well. Compared with ALRA, SeqBMC improved accuracy by 5.47% and F1 score by 5.03%, indicating that SeqBMC has significant advantages in the matrix completion of single-cell RNA sequencing data.
Affiliation(s)
- Gong Lejun
- Jiangsu Key Lab of Big Data Security & Intelligent Processing, School of Computer Science, Nanjing University of Posts and Telecommunications, Nanjing, China
| | - Yu Like
- Jiangsu Key Lab of Big Data Security & Intelligent Processing, School of Computer Science, Nanjing University of Posts and Telecommunications, Nanjing, China
| | - Wei Xinyi
- Jiangsu Key Lab of Big Data Security & Intelligent Processing, School of Computer Science, Nanjing University of Posts and Telecommunications, Nanjing, China
| | - Zhou Shehai
- Jiangsu Key Lab of Big Data Security & Intelligent Processing, School of Computer Science, Nanjing University of Posts and Telecommunications, Nanjing, China
| | - Xu Shuhua
- School of Data Science and Artificial Intelligence, Wenzhou University of Technology, Wenzhou, China
| |
|
13
|
Zhang W, Zhang X, Teng F, Yang Q, Wang J, Sun B, Liu J, Zhang J, Sun X, Zhao H, Xie Y, Liao K, Wang X. Research progress and the prospect of using single-cell sequencing technology to explore the characteristics of the tumor microenvironment. Genes Dis 2025; 12:101239. [PMID: 39552788 PMCID: PMC11566696 DOI: 10.1016/j.gendis.2024.101239] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/09/2023] [Revised: 11/23/2023] [Accepted: 12/01/2023] [Indexed: 11/19/2024] Open
Abstract
In precision cancer therapy, addressing intra-tumor heterogeneity poses a significant obstacle. Due to the heterogeneity of each cell subtype and between cells within the tumor, the sensitivity and resistance of different patients to targeted drugs, chemotherapy, etc., are inconsistent. Concerning a specific tumor type, many feasible treatments or combinations can be used by specifically targeting the tumor microenvironment. To solve this problem, it is necessary to further study the tumor microenvironment. Single-cell sequencing techniques can dissect distinct tumor cell populations by isolating cells and using statistical computational methods. This technology may assist in the selection of targeted combination therapy, and the obtained cell subset information is crucial for the rational application of targeted therapy. In this review, we summarized the research and application advances of single-cell sequencing technology in the tumor microenvironment, including the most commonly used single-cell genomic and transcriptomic sequencing, and their future development direction was proposed. The application of single-cell sequencing technology has been expanded to include epigenomics, proteomics, metabolomics, and microbiome analysis. The integration of these different omics approaches has significantly advanced the development of single-cell multiomics sequencing technology. This innovative approach holds immense potential for various fields, such as biological research and medical investigations. Finally, we discussed the advantages and disadvantages of using single-cell sequencing to explore the tumor microenvironment.
Affiliation(s)
- Wenyige Zhang
- Department of Clinical Laboratory, The 2nd Affiliated Hospital, Jiangxi Medical College, Nanchang University, Nanchang, Jiangxi 330006, China
- Queen Mary College, Jiangxi Medical College, Nanchang University, Nanchang, Jiangxi 330006, China
| | - Xue Zhang
- Queen Mary College, Jiangxi Medical College, Nanchang University, Nanchang, Jiangxi 330006, China
| | - Feifei Teng
- School of Public Health, Jiangxi Medical College, Nanchang University, Nanchang, Jiangxi 330006, China
| | - Qijun Yang
- Queen Mary College, Jiangxi Medical College, Nanchang University, Nanchang, Jiangxi 330006, China
| | - Jiayi Wang
- Queen Mary College, Jiangxi Medical College, Nanchang University, Nanchang, Jiangxi 330006, China
| | - Bing Sun
- Queen Mary College, Jiangxi Medical College, Nanchang University, Nanchang, Jiangxi 330006, China
| | - Jie Liu
- School of Public Health, Jiangxi Medical College, Nanchang University, Nanchang, Jiangxi 330006, China
| | - Jingyan Zhang
- Queen Mary College, Jiangxi Medical College, Nanchang University, Nanchang, Jiangxi 330006, China
| | - Xiaomeng Sun
- Queen Mary College, Jiangxi Medical College, Nanchang University, Nanchang, Jiangxi 330006, China
| | - Hanqing Zhao
- Queen Mary College, Jiangxi Medical College, Nanchang University, Nanchang, Jiangxi 330006, China
| | - Yuxuan Xie
- The Second Clinical Medical School, Jiangxi Medical College, Nanchang University, Nanchang, Jiangxi 330006, China
| | - Kaili Liao
- Department of Clinical Laboratory, The 2nd Affiliated Hospital, Jiangxi Medical College, Nanchang University, Nanchang, Jiangxi 330006, China
| | - Xiaozhong Wang
- Department of Clinical Laboratory, The 2nd Affiliated Hospital, Jiangxi Medical College, Nanchang University, Nanchang, Jiangxi 330006, China
| |
|
14
|
Yin D, Cao Y, Chen J, Mak CLY, Yu KHO, Zhang J, Li J, Lin Y, Ho JWK, Yang JYH. Scope+: an open source generalizable architecture for single-cell RNA-seq atlases at sample and cell levels. Bioinformatics 2024; 41:btae727. [PMID: 39705183 PMCID: PMC11755096 DOI: 10.1093/bioinformatics/btae727] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/23/2024] [Revised: 11/18/2024] [Accepted: 12/13/2024] [Indexed: 12/22/2024] Open
Abstract
SUMMARY With recent advances in single-cell RNA-sequencing technologies and the increased availability of integrative tools, challenges arise in providing easy and fast access to large collections of cell atlases. Existing cell atlas portals are rarely open source and adaptable, and do not support meta-analysis at the cell level. Here, we present an open source, highly optimized and scalable architecture, named Scope+, to allow quick access, meta-analysis and cell-level selection of the atlas data. We applied this architecture to our well-curated collection of 5 million COVID-19 blood and immune cells, as a portal called Covidscope. We achieved efficient access to atlas-scale data via three strategies: cell-as-unit data modelling, novel database optimization techniques and innovative software architectural design. Scope+ serves as an open source architecture for researchers to build on with their own atlases. AVAILABILITY AND IMPLEMENTATION The COVID-19 web portal, data and meta-analysis are available on Covidscope (https://covidsc.d24h.hk/). User tutorials on how to implement Scope+ architecture with their atlases can be found at https://hiyin.github.io/scopeplus-user-tutorial/. Scope+ source code can be found at https://doi.org/10.5281/zenodo.14174632 and https://github.com/hiyin/scopeplus.
Affiliation(s)
- Danqing Yin
- Laboratory of Data Discovery for Health Limited (D24H), Pak Shek Kok, Hong Kong SAR, 999077, China
- School of Biomedical Sciences, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Pokfulam, Hong Kong SAR, 999077, China
| | - Yue Cao
- Laboratory of Data Discovery for Health Limited (D24H), Pak Shek Kok, Hong Kong SAR, 999077, China
- Charles Perkins Centre, University of Sydney, Camperdown, NSW, 2006, Australia
- School of Mathematics and Statistics, University of Sydney, Camperdown, NSW, 2006, Australia
- Sydney Precision Data Science Centre, University of Sydney, Camperdown, NSW, 2006, Australia
| | - Junyi Chen
- Laboratory of Data Discovery for Health Limited (D24H), Pak Shek Kok, Hong Kong SAR, 999077, China
- School of Biomedical Sciences, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Pokfulam, Hong Kong SAR, 999077, China
| | - Candice L Y Mak
- Laboratory of Data Discovery for Health Limited (D24H), Pak Shek Kok, Hong Kong SAR, 999077, China
| | - Ken H O Yu
- Laboratory of Data Discovery for Health Limited (D24H), Pak Shek Kok, Hong Kong SAR, 999077, China
- School of Biomedical Sciences, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Pokfulam, Hong Kong SAR, 999077, China
| | - Jiaxuan Zhang
- Guangzhou National Laboratory, Guangzhou International Bio Island, Guangzhou, Guangdong Province 510005, China
| | - Jia Li
- Guangzhou National Laboratory, Guangzhou International Bio Island, Guangzhou, Guangdong Province 510005, China
- State Key Laboratory of Respiratory Disease, National Clinical Research Center for Respiratory Disease, Guangzhou Institute of Respiratory Health, The First Affiliated Hospital of Guangzhou Medical University, Guangzhou, Guangdong Province, 510005, China
| | - Yingxin Lin
- Laboratory of Data Discovery for Health Limited (D24H), Pak Shek Kok, Hong Kong SAR, 999077, China
- Charles Perkins Centre, University of Sydney, Camperdown, NSW, 2006, Australia
- School of Mathematics and Statistics, University of Sydney, Camperdown, NSW, 2006, Australia
- Sydney Precision Data Science Centre, University of Sydney, Camperdown, NSW, 2006, Australia
| | - Joshua W K Ho
- Laboratory of Data Discovery for Health Limited (D24H), Pak Shek Kok, Hong Kong SAR, 999077, China
- School of Biomedical Sciences, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Pokfulam, Hong Kong SAR, 999077, China
| | - Jean Y H Yang
- Laboratory of Data Discovery for Health Limited (D24H), Pak Shek Kok, Hong Kong SAR, 999077, China
- Charles Perkins Centre, University of Sydney, Camperdown, NSW, 2006, Australia
- School of Mathematics and Statistics, University of Sydney, Camperdown, NSW, 2006, Australia
- Sydney Precision Data Science Centre, University of Sydney, Camperdown, NSW, 2006, Australia
| |
|
15
|
Sun Y, Kong L, Huang J, Deng H, Bian X, Li X, Cui F, Dou L, Cao C, Zou Q, Zhang Z. A comprehensive survey of dimensionality reduction and clustering methods for single-cell and spatial transcriptomics data. Brief Funct Genomics 2024; 23:733-744. [PMID: 38860675 DOI: 10.1093/bfgp/elae023] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/26/2023] [Revised: 02/29/2024] [Accepted: 05/27/2024] [Indexed: 06/12/2024] Open
Abstract
In recent years, the application of single-cell transcriptomics and spatial transcriptomics analysis techniques has become increasingly widespread. Whether dealing with single-cell transcriptomic or spatial transcriptomic data, dimensionality reduction and clustering are indispensable. Both single-cell and spatial transcriptomic data are often high-dimensional, making the analysis and visualization of such data challenging. Through dimensionality reduction, it becomes possible to visualize the data in a lower-dimensional space, allowing for the observation of relationships and differences between cell subpopulations. Clustering enables the grouping of similar cells into the same cluster, aiding in the identification of distinct cell subpopulations and revealing cellular diversity, providing guidance for downstream analyses. In this review, we systematically summarized the most widely recognized algorithms employed for the dimensionality reduction and clustering analysis of single-cell transcriptomic and spatial transcriptomic data. This endeavor provides valuable insights and ideas that can contribute to the development of novel tools in this rapidly evolving field.
Affiliation(s)
- Yidi Sun
- School of Computer Science and Technology, Hainan University, Haikou 570228, China
| | - Lingling Kong
- School of Computer Science and Technology, Hainan University, Haikou 570228, China
| | - Jiayi Huang
- School of Computer Science and Technology, Hainan University, Haikou 570228, China
| | - Hongyan Deng
- School of Computer Science and Technology, Hainan University, Haikou 570228, China
| | - Xinling Bian
- School of Computer Science and Technology, Hainan University, Haikou 570228, China
| | - Xingfeng Li
- School of Computer Science and Technology, Hainan University, Haikou 570228, China
| | - Feifei Cui
- School of Computer Science and Technology, Hainan University, Haikou 570228, China
| | - Lijun Dou
- Genomic Medicine Institute, Lerner Research Institute, Cleveland, OH 44106, United States
| | - Chen Cao
- School of Biomedical Engineering and Informatics, Nanjing Medical University, Nanjing 210029, China
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610054, China
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324000, China
| | - Zilong Zhang
- School of Computer Science and Technology, Hainan University, Haikou 570228, China
| |
|
16
|
Sun Z, Song K. GEMimp: An Accurate and Robust Imputation Method for Microbiome Data Using Graph Embedding Neural Network. J Mol Biol 2024; 436:168841. [PMID: 39490678 DOI: 10.1016/j.jmb.2024.168841] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/05/2024] [Revised: 10/23/2024] [Accepted: 10/23/2024] [Indexed: 11/05/2024]
Abstract
Microbiome research has increasingly underscored the profound link between microbial compositions and human health, with numerous studies establishing a strong correlation between microbiome characteristics and various diseases. However, the analysis of microbiome data is frequently compromised by inherent sparsity issues, characterized by a substantial presence of observed zeros. These zeros not only skew the abundance distribution of microbial species but also undermine the reliability of scientific conclusions drawn from such data. Addressing this challenge, we introduce GEMimp, an innovative imputation method designed to infuse robustness into microbiome data analysis. GEMimp leverages the node2vec algorithm, which incorporates both Breadth-First Search (BFS) and Depth-First Search (DFS) strategies in its random-walk sampling process. This approach enables GEMimp to learn nuanced, low-dimensional representations of each taxonomic unit, facilitating the reconstruction of their similarity networks with unprecedented accuracy. Our comparative analysis pits GEMimp against state-of-the-art imputation methods including SAVER, MAGIC and mbImpute. The results demonstrate that GEMimp outperforms its counterparts by achieving the highest Pearson correlation coefficient when compared to the original raw dataset. Furthermore, GEMimp shows notable proficiency in identifying significant taxa, enhancing the detection of disease-related taxa and effectively mitigating the impact of sparsity on both simulated and real-world datasets, such as those pertaining to Type 2 Diabetes (T2D) and Colorectal Cancer (CRC). These findings collectively highlight the effectiveness of GEMimp, allowing for better analysis of microbial data. By alleviating sparsity, GEMimp could greatly facilitate downstream analyses and microbiology research more broadly.
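The BFS/DFS trade-off mentioned in this abstract comes from node2vec's second-order random walk, which can be sketched as follows. This is a generic illustration of the node2vec walk on an unweighted graph, not GEMimp's code: the function name, the dict-of-sets graph representation, and the default parameters are assumptions. The return parameter `p` and in-out parameter `q` follow the usual node2vec convention: a step back to the previous node gets weight 1/p, a step to a node adjacent to the previous node gets weight 1, and a step further away gets weight 1/q.

```python
import random


def node2vec_walk(graph, start, length, p=1.0, q=1.0, rng=None):
    """Sample one biased random walk of up to `length` nodes.

    `graph` maps each node to the set of its neighbours (unweighted, undirected).
    """
    rng = rng or random.Random(0)
    walk = [start]
    while len(walk) < length:
        current = walk[-1]
        neighbors = sorted(graph[current])  # sorted for reproducibility
        if not neighbors:
            break  # dead end: stop early
        if len(walk) == 1:
            # The first step has no previous node, so it is uniform.
            walk.append(rng.choice(neighbors))
            continue
        previous = walk[-2]
        weights = []
        for nxt in neighbors:
            if nxt == previous:
                weights.append(1.0 / p)   # return to where we came from
            elif nxt in graph[previous]:
                weights.append(1.0)       # stay near the previous node (BFS-like)
            else:
                weights.append(1.0 / q)   # move outward (DFS-like)
        walk.append(rng.choices(neighbors, weights=weights, k=1)[0])
    return walk
```

A small `q` favours outward, DFS-like exploration, while a large `q` keeps the walk local and BFS-like. In a setting like the one described, the nodes would be taxa and the sampled walks would feed a skip-gram style embedding model; that downstream step is not shown here.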
Affiliation(s)
- Ziwei Sun
- School of Mathematics and Statistics, Qingdao University, Qingdao, China.
| | - Kai Song
- School of Mathematics and Statistics, Qingdao University, Qingdao, China.
| |
|
17
|
Sharifitabar M, Kazempour S, Razavian J, Sajedi S, Solhjoo S, Zare H. A deep neural network to de-noise single-cell RNA sequencing data. bioRxiv 2024:2024.11.20.624552. [PMID: 39605470 PMCID: PMC11601639 DOI: 10.1101/2024.11.20.624552] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2024]
Abstract
Single-cell RNA sequencing (scRNA-seq), a powerful technique for investigating the transcriptome of individual cells, enables the discovery of heterogeneous cell populations, rare cell types, and transcriptional dynamics in separate cells. Yet, scRNA-seq data analysis is limited by the problem of measurement dropouts, i.e., genes displaying zero expression levels. We introduce ZiPo, a deep artificial neural network for rate estimation and library size prediction in scRNA-seq data which incorporates adjustable zero inflation in the distribution to capture the dropouts. ZiPo builds upon established concepts, including using deep autoencoders and adopting the Poisson and negative binomial distributions, by taking advantage of novel strategies, including library size prediction and residual connections, to improve the overall performance. A significant innovation of ZiPo is the introduction of a scale-invariant loss term, making the weights sparse and, hence, the model biologically more interpretable. ZiPo quickly handles vast singular and mixed datasets, with the processing time directly proportional to the number of cells. In this paper, we demonstrate the power of ZiPo on three datasets and show its advantages over other current techniques. The code used to produce the results in this manuscript is available at https://bitbucket.org/habilzare/alzheimer/src/master/code/deep/ZiPo/.
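The zero-inflated count model referred to in this abstract can be made concrete with a small sketch of a zero-inflated Poisson (ZIP) negative log-likelihood, the kind of reconstruction loss an autoencoder of this family might minimise. This is not ZiPo's actual loss, which per the abstract also involves library size prediction and a scale-invariant regularisation term; the function names and the per-entry `rate` and `zero_prob` parameters are assumptions for illustration. Under ZIP, a zero is observed with probability π + (1 − π)e^(−λ), and a count k > 0 with probability (1 − π)·λ^k·e^(−λ)/k!.

```python
import math


def zip_log_prob(count, rate, zero_prob):
    """Log-probability of a non-negative integer `count` under a zero-inflated Poisson.

    `rate` is the Poisson mean (lambda > 0); `zero_prob` is the extra
    probability of a dropout zero (0 <= pi < 1).
    """
    if rate <= 0.0 or not 0.0 <= zero_prob < 1.0:
        raise ValueError("need rate > 0 and 0 <= zero_prob < 1")
    if count == 0:
        # A zero comes either from dropout or from the Poisson itself.
        return math.log(zero_prob + (1.0 - zero_prob) * math.exp(-rate))
    # Poisson log-pmf: k*log(lambda) - lambda - log(k!)
    log_poisson = count * math.log(rate) - rate - math.lgamma(count + 1)
    return math.log(1.0 - zero_prob) + log_poisson


def zip_nll(counts, rates, zero_probs):
    """Mean negative log-likelihood over paired counts and predicted parameters."""
    log_probs = [zip_log_prob(k, r, z) for k, r, z in zip(counts, rates, zero_probs)]
    return -sum(log_probs) / len(log_probs)
```

In a network, `rates` and `zero_probs` would be decoder outputs for each cell-gene entry. Setting `zero_prob` to 0 recovers the ordinary Poisson likelihood, which is one way to read the "adjustable" zero inflation.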
|
18
|
Lall S, Ray S, Bandyopadhyay S. Enhancing Single-Cell RNA-seq Data Completeness with a Graph Learning Framework. IEEE/ACM Trans Comput Biol Bioinform 2024; PP:64-72. [PMID: 39504287 DOI: 10.1109/tcbb.2024.3492384] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/08/2024]
Abstract
Single-cell RNA sequencing (scRNA-seq) is a powerful tool to capture gene expression snapshots in individual cells. However, the low amount of RNA in individual cells results in dropout events, which introduce a large number of zero counts in the single-cell expression matrix. We have developed VAImpute, a variational graph autoencoder-based imputation technique that learns the inherent distribution of a large network/graph constructed from the scRNA-seq data, leveraging copula correlation among cells/genes. The trained model is used to predict dropout events by computing the probability of all non-edges (cell-gene) in the network. We devise an algorithm to impute the missing expression values of the detected dropouts. The performance of the proposed model is assessed on both simulated and real scRNA-seq datasets, comparing it to established single-cell imputation methods. VAImpute yields significant improvements in detecting dropouts, thereby achieving superior performance in cell clustering, detecting rare cells, and differential expression. All code and datasets are available on GitHub: https://github.com/sumantaray/VAImpute.
|
19
|
Tang Z, Chen G, Chen S, Yao J, You L, Chen CYC. Modal-nexus auto-encoder for multi-modality cellular data integration and imputation. Nat Commun 2024; 15:9021. [PMID: 39424861 PMCID: PMC11489673 DOI: 10.1038/s41467-024-53355-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/23/2024] [Accepted: 10/02/2024] [Indexed: 10/21/2024] Open
Abstract
Heterogeneous feature spaces and technical noise hinder the cellular data integration and imputation. The high cost of obtaining matched data across modalities further restricts analysis. Thus, there's a critical need for deep learning approaches to effectively integrate and impute unpaired multi-modality single-cell data, enabling deeper insights into cellular behaviors. To address these issues, we introduce the Modal-Nexus Auto-Encoder (Monae). Leveraging regulatory relationships between modalities and employing contrastive learning within modality-specific auto-encoders, Monae enhances cell representations in the unified space. The integration capability of Monae furnishes it with modality-complementary cellular representations, enabling the generation of precise intra-modal and cross-modal imputation counts for extensive and complex downstream tasks. In addition, we develop Monae-E (Monae-Extension), a variant of Monae that can converge rapidly and support biological discoveries. Evaluations on various datasets have validated Monae and Monae-E's accuracy and robustness in multi-modality cellular data integration and imputation.
Affiliation(s)
- Zhenchao Tang
- Artificial Intelligence Medical Research Center, School of Intelligent Systems Engineering, Shenzhen Campus of Sun Yat-sen University, Shenzhen, 518107, China
- AI for Science (AI4S)-Preferred Program, School of Electronic and Computer Engineering, Peking University Shenzhen Graduate School, Shenzhen, 518055, China
| | - Guanxing Chen
- Artificial Intelligence Medical Research Center, School of Intelligent Systems Engineering, Shenzhen Campus of Sun Yat-sen University, Shenzhen, 518107, China
- AI for Science (AI4S)-Preferred Program, School of Electronic and Computer Engineering, Peking University Shenzhen Graduate School, Shenzhen, 518055, China
| | - Shouzhi Chen
- Artificial Intelligence Medical Research Center, School of Intelligent Systems Engineering, Shenzhen Campus of Sun Yat-sen University, Shenzhen, 518107, China
- AI for Science (AI4S)-Preferred Program, School of Electronic and Computer Engineering, Peking University Shenzhen Graduate School, Shenzhen, 518055, China
| | | | - Linlin You
- Artificial Intelligence Medical Research Center, School of Intelligent Systems Engineering, Shenzhen Campus of Sun Yat-sen University, Shenzhen, 518107, China.
| | - Calvin Yu-Chian Chen
- AI for Science (AI4S)-Preferred Program, School of Electronic and Computer Engineering, Peking University Shenzhen Graduate School, Shenzhen, 518055, China.
- State Key Laboratory of Chemical Oncogenomics, Key Laboratory of Chemical Genomics, School of Chemical Biology and Biotechnology, Peking University Shenzhen Graduate School, Shenzhen, 518055, China.
- Department of Medical Research, China Medical University Hospital, Taichung, 40447, Taiwan.
- Department of Bioinformatics and Medical Engineering, Asia University, Taichung, 41354, Taiwan.
- Guangdong L-Med Biotechnology Co., Ltd., Meizhou, 514699, China.
| |
|
20
|
Bai L, Ji B, Wang S. SAE-Impute: imputation for single-cell data via subspace regression and auto-encoders. BMC Bioinformatics 2024; 25:317. [PMID: 39354334 PMCID: PMC11443887 DOI: 10.1186/s12859-024-05944-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/01/2024] [Accepted: 09/23/2024] [Indexed: 10/03/2024] Open
Abstract
BACKGROUND Single-cell RNA sequencing (scRNA-seq) technology has emerged as a crucial tool for studying cellular heterogeneity. However, missing values, known as dropout events, are inherent to the sequencing process and pose challenges for downstream analysis and interpretation. Imputing dropout data has therefore become a critical concern in scRNA-seq data analysis. Present imputation methods predominantly rely on statistical or machine learning approaches, often overlooking inter-sample correlations. RESULTS To address this limitation, we introduce SAE-Impute, a new computational method for imputing single-cell data that combines subspace regression and auto-encoders to enhance the accuracy and reliability of the imputation process. Specifically, SAE-Impute assesses sample correlations via subspace regression, predicts potential dropout values, and then leverages these predictions within an autoencoder framework for interpolation. To validate the performance of SAE-Impute, we systematically conducted experiments on both simulated and real scRNA-seq datasets. The results show that SAE-Impute effectively reduces false negative signals in single-cell data and enhances the retrieval of dropout values, gene-gene and cell-cell correlations. Finally, we conducted several downstream analyses on the imputed scRNA-seq data, including the identification of differential gene expression, cell clustering and visualization, and cell trajectory construction. CONCLUSIONS These results again demonstrate that SAE-Impute effectively reduces the dropouts in single-cell datasets, thereby improving the functional interpretability of the data.
Affiliation(s)
- Liang Bai
- College of Computer Science and Electronic Engineering, Hunan University, Changsha, 410082, China
| | - Boya Ji
- College of Computer Science and Electronic Engineering, Hunan University, Changsha, 410082, China
| | - Shulin Wang
- College of Computer Science and Electronic Engineering, Hunan University, Changsha, 410082, China
| |
|
21
|
Zhao L, Jiang L, Xie Y, Huang J, Xie H, Tian J, Zhang D. scDTL: enhancing single-cell RNA-seq imputation through deep transfer learning with bulk cell information. Brief Bioinform 2024; 25:bbae555. [PMID: 39504481 PMCID: PMC11540133 DOI: 10.1093/bib/bbae555] [Received: 05/27/2024] [Revised: 08/30/2024] [Accepted: 10/16/2024] [Indexed: 11/08/2024]
Abstract
The growing volume of single-cell RNA sequencing (scRNA-seq) data enables researchers to explore cellular heterogeneity and gene expression profiles, offering a high-resolution view of the transcriptome at the single-cell level. However, the dropout events that are often present in scRNA-seq data remain a challenge for downstream analysis. Although a number of methods have been developed to recover single-cell expression profiles, their performance may be hindered because they do not fully explore the inherent relations between genes. To address this issue, we propose scDTL, a deep transfer learning based approach for scRNA-seq data imputation that harnesses bulk RNA-sequencing information. We first employ a denoising autoencoder trained on bulk RNA-seq data as the initial imputation model, and then leverage a domain adaptation framework that transfers the knowledge learned by the bulk imputation model to the scRNA-seq learning task. In addition, scDTL runs a 1D U-Net denoising model in parallel to provide gene representations of varying granularity, capturing both coarse and fine features of the scRNA-seq data. Finally, we use a cross-channel attention mechanism to fuse the features learned by the transferred bulk imputation model and the U-Net model. In the evaluation, extensive experiments demonstrate that scDTL can outperform other state-of-the-art methods in quantitative comparisons and downstream analyses.
Affiliation(s)
- Liuyang Zhao
- College of Computer Science and Software Engineering, Shenzhen University, Guangdong 518057, China
| | - Landu Jiang
- College of Future Technology, HKUST(GZ), Guangdong 510641, China
| | - Yufeng Xie
- Shenzhen Hospital of Guangzhou University of Chinese Medicine (Futian), Guangdong 518034, China
| | - JianHao Huang
- Shenzhen Hospital of Guangzhou University of Chinese Medicine (Futian), Guangdong 518034, China
| | - Haoran Xie
- Department of Computing and Decision Sciences, Lingnan University, Hong Kong Special Administrative Region 999077, China
| | - Jun Tian
- Department of Biochemistry, School of Medicine, Southern University of Science and Technology, Guangdong 518055, China
- Key University Laboratory of Metabolism and Health of Guangdong, Southern University of Science and Technology, Shenzhen 518055, China
| | - Dian Zhang
- College of Computer Science and Software Engineering, Shenzhen University, Guangdong 518057, China
| |
|
22
|
Gao H, Shen W, Li R, Liu C, Wu S. Collaborative Structure-Preserved Missing Data Imputation for Single-Cell RNA-Seq Clustering. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2024; 21:1480-1491. [PMID: 38776196 DOI: 10.1109/tcbb.2024.3404013] [Indexed: 05/24/2024]
Abstract
Clustering of single-cell RNA-seq (scRNA-seq) transcriptome profiles can identify cell types, which helps improve the understanding of disease progression. In practice, however, single-cell expression data often contain a significant number of missing values as a result of technical variability. Missing data are a critical challenge in scRNA-seq clustering analysis because a missing value does not reflect the underlying true expression level, which makes it difficult to discover cell types by applying clustering algorithms directly. Various approaches have been developed to overcome the missing data issue in scRNA-seq clustering. Most of them recover missing expression values by borrowing observed data from similar cells or by synthesizing data via generative adversarial networks, so the biologically meaningful cluster structure is not sufficiently exploited. In this work, we introduce ColImpute, a collaborative structure-preserved missing data imputation approach for scRNA-seq clustering. Specifically, a cluster structure-preserved imputation module and a subspace clustering module, which respectively perform missing data imputation and cell subtype identification, are integrated into a unified optimization framework that trains the two networks collaboratively. Consequently, the clustering module contributes cluster-structure information to guide the training of the missing data imputation module. In turn, the cluster structure-preserved imputation module enhances the performance of the clustering module by generating more precisely recovered samples. Experimental results show that the proposed method is effective for both data imputation and cell type identification.
|
23
|
Kirchgaessner R, Watson C, Creason A, Keutler K, Goecks J. Imputing Single-Cell Protein Abundance in Multiplex Tissue Imaging. bioRxiv 2024:2023.12.05.570058. [PMID: 38106203 PMCID: PMC10723289 DOI: 10.1101/2023.12.05.570058] [Indexed: 12/19/2023]
Abstract
Multiplex tissue imaging comprises a collection of increasingly popular single-cell spatial proteomics and transcriptomics assays for characterizing biological tissues both compositionally and spatially. However, several technical issues limit the utility of multiplex tissue imaging, including the limited number of molecules (proteins and RNAs) that can be assayed, tissue loss, and protein probe failure. In this work, we demonstrate how machine learning methods can address these limitations by imputing protein abundance at the single-cell level using multiplex tissue imaging datasets from a breast cancer cohort. We first compared the strengths and weaknesses of machine learning methods for imputing single-cell protein abundance. The methods used in this work include regularized linear regression, gradient-boosted regression trees, and deep learning autoencoders. We also incorporated cellular spatial information to improve imputation performance. Using machine learning, single-cell protein expression can be imputed with a mean absolute error ranging from 0.05 to 0.3 on a [0,1] scale. Finally, we used imputed data to predict whether single cells were more likely to come from pre-treatment or post-treatment biopsies. Our results demonstrate (1) the feasibility of imputing single-cell abundance levels for many proteins using machine learning; (2) how including cellular spatial information can substantially enhance imputation results; and (3) the biological relevance of imputed single-cell protein abundance levels in a use case.
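As a rough illustration of the error metric quoted in the abstract above, the sketch below min-max scales a marker to [0,1] and computes the mean absolute error between held-out and imputed values. The function names and the toy arrays are hypothetical, and the paper's actual normalization may differ from the min-max scaling assumed here.

```python
import numpy as np

def min_max_scale(x):
    """Scale values to [0, 1]; assumption: plain min-max scaling per marker."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    if span == 0:
        return np.zeros_like(x)  # constant marker: map everything to 0
    return (x - x.min()) / span

def mean_absolute_error(y_true, y_pred):
    """Mean of absolute differences between true and imputed values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

# Toy data (hypothetical): raw intensities for one marker and imputed values
# already expressed on the [0, 1] scale.
observed = np.array([10.0, 20.0, 30.0, 50.0])
scaled = min_max_scale(observed)
imputed = np.array([0.1, 0.2, 0.6, 0.9])
mae = mean_absolute_error(scaled, imputed)
```

On a [0,1] scale, an error of 0.05 corresponds to 5% of the marker's observed range, which is why the scaling step matters when comparing errors across markers.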
|
24
|
Zhao J, Ching WK, Wong CW, Cheng X. BANMF-S: a blockwise accelerated non-negative matrix factorization framework with structural network constraints for single cell imputation. Brief Bioinform 2024; 25:bbae432. [PMID: 39242194 PMCID: PMC11379494 DOI: 10.1093/bib/bbae432] [Received: 06/09/2024] [Revised: 07/23/2024] [Accepted: 08/19/2024] [Indexed: 09/09/2024]
Abstract
MOTIVATION Single-cell RNA sequencing (scRNA-seq) enables transcriptome profiling of hundreds to tens of thousands of cells at the level of the individual cell and provides new insights into cell heterogeneity. However, its advantages are hampered by dropout events. To address this problem, we propose a Blockwise Accelerated Non-negative Matrix Factorization framework with Structural network constraints (BANMF-S) to impute those technical zeros. RESULTS BANMF-S constructs a gene-gene similarity network, which integrates prior information from an external PPI network via the Triadic Closure Principle, and a cell-cell similarity network, which captures neighborhood structure and temporal information through a Minimum-Spanning Tree. By employing these two networks jointly as regularizers, BANMF-S encourages the coherence of similar gene and cell pairs in the latent space, enhancing the potential to recover the underlying features. In addition, BANMF-S adopts a blockwise strategy that solves the traditional NMF problem with a distributed Stochastic Gradient Descent method run in parallel to accelerate the optimization. Numerical experiments on simulated and real datasets verify that BANMF-S can improve the accuracy of downstream clustering and pseudo-trajectory inference, and that its performance is superior to seven state-of-the-art algorithms. AVAILABILITY All data used in this work were downloaded from publicly available data sources, and their corresponding accession numbers or source URLs are provided in Supplementary File Section 5.1 Dataset Information. The source code is publicly available in the GitHub repository https://github.com/jiayingzhao/BANMF-S.
Affiliation(s)
- Jiaying Zhao
- Department of Mathematics, The University of Hong Kong, Pokfulam Road, Hong Kong
| | - Wai-Ki Ching
- Department of Mathematics, The University of Hong Kong, Pokfulam Road, Hong Kong
| | - Chi-Wing Wong
- Department of Mathematics, The University of Hong Kong, Pokfulam Road, Hong Kong
| | - Xiaoqing Cheng
- School of Mathematics and Statistics, Xi'an Jiaotong University, No. 28 Xianning West Road, Xi'an, Shaanxi 710049, China
| |
|
25
|
Wang X, Lian Q, Dong H, Xu S, Su Y, Wu X. Benchmarking Algorithms for Gene Set Scoring of Single-cell ATAC-seq Data. Genomics, Proteomics & Bioinformatics 2024; 22:qzae014. [PMID: 39049508 PMCID: PMC11423854 DOI: 10.1093/gpbjnl/qzae014] [Received: 05/01/2022] [Revised: 06/20/2023] [Accepted: 06/25/2023] [Indexed: 07/27/2024]
Abstract
Gene set scoring (GSS) is routinely used in gene expression analysis of bulk or single-cell RNA sequencing (RNA-seq) data, where it helps to decipher single-cell heterogeneity and cell type-specific variability by incorporating prior knowledge from functional gene sets. Single-cell assay for transposase accessible chromatin using sequencing (scATAC-seq) is a powerful technique for interrogating single-cell chromatin-based gene regulation, and genes or gene sets with dynamic regulatory potentials can be regarded as cell type-specific markers, as in single-cell RNA-seq (scRNA-seq). However, there are few GSS tools specifically designed for scATAC-seq, and the applicability and performance of RNA-seq GSS tools on scATAC-seq data remain to be investigated. Here, we systematically benchmarked ten GSS tools, including four bulk RNA-seq tools, five scRNA-seq tools, and one scATAC-seq method. First, using matched scATAC-seq and scRNA-seq datasets, we found that the performance of GSS tools on scATAC-seq data was comparable to that on scRNA-seq, suggesting their applicability to scATAC-seq. Then, the performance of different GSS tools was extensively evaluated using up to ten scATAC-seq datasets. Moreover, we evaluated the impact of gene activity conversion, dropout imputation, and gene set collections on the results of GSS. The results show that dropout imputation can significantly improve the performance of almost all GSS tools, while the impact of gene activity conversion methods or gene set collections depends more on the specific GSS tool or dataset. Finally, we provide practical guidelines for choosing appropriate preprocessing methods and GSS tools in different application scenarios.
Affiliation(s)
- Xi Wang
- Pasteurien College, Suzhou Medical College of Soochow University, Soochow University, Suzhou 215000, China
- Department of Automation, Xiamen University, Xiamen 361005, China
| | - Qiwei Lian
- Pasteurien College, Suzhou Medical College of Soochow University, Soochow University, Suzhou 215000, China
- Department of Automation, Xiamen University, Xiamen 361005, China
| | - Haoyu Dong
- Pasteurien College, Suzhou Medical College of Soochow University, Soochow University, Suzhou 215000, China
| | - Shuo Xu
- Department of Automation, Xiamen University, Xiamen 361005, China
| | - Yaru Su
- College of Mathematics and Computer Science, Fuzhou University, Fuzhou 350116, China
| | - Xiaohui Wu
- Pasteurien College, Suzhou Medical College of Soochow University, Soochow University, Suzhou 215000, China
| |
|
26
|
França GS, Baron M, King BR, Bossowski JP, Bjornberg A, Pour M, Rao A, Patel AS, Misirlioglu S, Barkley D, Tang KH, Dolgalev I, Liberman DA, Avital G, Kuperwaser F, Chiodin M, Levine DA, Papagiannakopoulos T, Marusyk A, Lionnet T, Yanai I. Cellular adaptation to cancer therapy along a resistance continuum. Nature 2024; 631:876-883. [PMID: 38987605 PMCID: PMC11925205 DOI: 10.1038/s41586-024-07690-9] [Received: 06/17/2022] [Accepted: 06/07/2024] [Indexed: 07/12/2024]
Abstract
Advancements in precision oncology over the past decades have led to new therapeutic interventions, but the efficacy of such treatments is generally limited by an adaptive process that fosters drug resistance [1]. In addition to genetic mutations [2], recent research has identified a role for non-genetic plasticity in transient drug tolerance [3] and the acquisition of stable resistance [4,5]. However, the dynamics of cell-state transitions that occur in the adaptation to cancer therapies remain unknown and require a systems-level longitudinal framework. Here we demonstrate that resistance develops through trajectories of cell-state transitions accompanied by a progressive increase in cell fitness, which we denote as the 'resistance continuum'. This cellular adaptation involves a stepwise assembly of gene expression programmes and epigenetically reinforced cell states underpinned by phenotypic plasticity, adaptation to stress and metabolic reprogramming. Our results support the notion that epithelial-to-mesenchymal transition or stemness programmes, often considered a proxy for phenotypic plasticity, enable adaptation, rather than a full resistance mechanism. Through systematic genetic perturbations, we identify the acquisition of metabolic dependencies, exposing vulnerabilities that can potentially be exploited therapeutically. The concept of the resistance continuum highlights the dynamic nature of cellular adaptation and calls for complementary therapies directed at the mechanisms underlying adaptive cell-state transitions.
Affiliation(s)
- Gustavo S França
- Institute for Computational Medicine, NYU Grossman School of Medicine, New York, NY, USA
- Institute for Systems Genetics, NYU Grossman School of Medicine, New York, NY, USA
| | - Maayan Baron
- Institute for Computational Medicine, NYU Grossman School of Medicine, New York, NY, USA
| | - Benjamin R King
- Institute for Systems Genetics, NYU Grossman School of Medicine, New York, NY, USA
- Bristol-Myers Squibb Company, Lawrenceville, NJ, USA
| | - Jozef P Bossowski
- Department of Pathology, NYU Grossman School of Medicine, New York, NY, USA
| | - Alicia Bjornberg
- Department of Cancer Physiology, H. Lee Moffitt Cancer Center and Research Institute, Tampa, FL, USA
| | - Maayan Pour
- Institute for Computational Medicine, NYU Grossman School of Medicine, New York, NY, USA
- Institute for Systems Genetics, NYU Grossman School of Medicine, New York, NY, USA
| | - Anjali Rao
- Institute for Computational Medicine, NYU Grossman School of Medicine, New York, NY, USA
| | - Ayushi S Patel
- Institute for Computational Medicine, NYU Grossman School of Medicine, New York, NY, USA
- Institute for Systems Genetics, NYU Grossman School of Medicine, New York, NY, USA
| | - Selim Misirlioglu
- Laura and Isaac Perlmutter Cancer Center, NYU Grossman School of Medicine, New York, NY, USA
| | - Dalia Barkley
- Institute for Computational Medicine, NYU Grossman School of Medicine, New York, NY, USA
| | - Kwan Ho Tang
- Laura and Isaac Perlmutter Cancer Center, NYU Grossman School of Medicine, New York, NY, USA
- Translational Medicine, Oncology R&D, AstraZeneca, Boston, MA, USA
| | - Igor Dolgalev
- Applied Bioinformatics Laboratories, NYU Grossman School of Medicine, New York, NY, USA
| | - Deborah A Liberman
- Institute for Computational Medicine, NYU Grossman School of Medicine, New York, NY, USA
- Institute for Systems Genetics, NYU Grossman School of Medicine, New York, NY, USA
| | - Gal Avital
- Institute for Computational Medicine, NYU Grossman School of Medicine, New York, NY, USA
| | - Felicia Kuperwaser
- Institute for Computational Medicine, NYU Grossman School of Medicine, New York, NY, USA
- Institute for Systems Genetics, NYU Grossman School of Medicine, New York, NY, USA
| | - Marta Chiodin
- Institute for Computational Medicine, NYU Grossman School of Medicine, New York, NY, USA
| | - Douglas A Levine
- Laura and Isaac Perlmutter Cancer Center, NYU Grossman School of Medicine, New York, NY, USA
- Merck & Co., Rahway, NJ, USA
| | - Thales Papagiannakopoulos
- Laura and Isaac Perlmutter Cancer Center, NYU Grossman School of Medicine, New York, NY, USA
- Bristol-Myers Squibb Company, Lawrenceville, NJ, USA
| | - Andriy Marusyk
- Department of Cancer Physiology, H. Lee Moffitt Cancer Center and Research Institute, Tampa, FL, USA
| | - Timothée Lionnet
- Institute for Systems Genetics, NYU Grossman School of Medicine, New York, NY, USA
- Department of Cell Biology, NYU Grossman School of Medicine, New York, NY, USA
| | - Itai Yanai
- Institute for Computational Medicine, NYU Grossman School of Medicine, New York, NY, USA
- Institute for Systems Genetics, NYU Grossman School of Medicine, New York, NY, USA
- Laura and Isaac Perlmutter Cancer Center, NYU Grossman School of Medicine, New York, NY, USA
- Department of Biochemistry and Molecular Pharmacology, NYU Grossman School of Medicine, New York, NY, USA
| |
|
27
|
Liu W, Pan Y, Teng Z, Xu J. scDMAE: A Generative Denoising Model Adopted Mask Strategy for scRNA-Seq Data Recovery. IEEE J Biomed Health Inform 2024; 28:3772-3780. [PMID: 38568766 DOI: 10.1109/jbhi.2024.3383921] [Indexed: 04/05/2024]
Abstract
The advent of single-cell RNA sequencing (scRNA-seq) technology has revolutionized gene expression studies at the single-cell level. However, technical noise and data sparsity in scRNA-seq often undermine the accuracy of subsequent analyses. Existing methods for denoising and imputing scRNA-seq data often rely on stringent assumptions about data distribution, limiting the effectiveness of data recovery. In this study, we propose the scDMAE model for denoising and recovery of scRNA-seq data. First, the model fuses gene expression features and topological features to discern the primary expression patterns of genes in cells. Then, an autoencoder with a masking strategy is used to model dropout events and separate potential noise in the data. Finally, the model incorporates the original raw data to recover the true biological expression values. In experiments on various types of scRNA-seq datasets, scDMAE demonstrates superior performance compared with competing methods on six distinct evaluation metrics in downstream analysis. The scDMAE method can accurately cluster similar cell populations, identify differential genes and infer cell trajectories.
|
28
|
Kang Y, Zhang H, Guan J. scINRB: single-cell gene expression imputation with network regularization and bulk RNA-seq data. Brief Bioinform 2024; 25:bbae148. [PMID: 38600665 PMCID: PMC11006796 DOI: 10.1093/bib/bbae148] [Received: 01/09/2024] [Revised: 02/26/2024] [Accepted: 03/18/2024] [Indexed: 04/12/2024]
Abstract
Single-cell RNA sequencing (scRNA-seq) facilitates the study of cell type heterogeneity and the construction of cell atlases. However, due to its limitations, many genes may be detected with zero expression, i.e. dropout events, leading to bias in downstream analyses and hindering the identification and characterization of cell types and cell functions. Although many imputation methods have been developed, their performance is generally lower than expected across different kinds and dimensions of data and application scenarios. Therefore, developing an accurate and robust single-cell gene expression data imputation method is still essential. To maintain the original cell-cell and gene-gene correlations and to leverage information from bulk RNA sequencing (bulk RNA-seq) data, we propose scINRB, a single-cell gene expression imputation method with network regularization and bulk RNA-seq data. scINRB adopts network-regularized non-negative matrix factorization to ensure that the imputed data maintain the cell-cell and gene-gene similarities and also approach the average gene expression calculated from bulk RNA-seq data. To evaluate its performance, we test scINRB on simulated and experimental datasets and compare it with other commonly used imputation methods. The results show that scINRB recovers gene expression accurately even at high dropout rates and dimensions, preserves cell-cell and gene-gene similarities and improves various downstream analyses including visualization, clustering and trajectory inference.
Affiliation(s)
- Yue Kang
- Department of Automation, Xiamen University, Xiamen, Fujian, China
| | - Hongyu Zhang
- Department of Automation, Xiamen University, Xiamen, Fujian, China
| | - Jinting Guan
- Department of Automation, Xiamen University, Xiamen, Fujian, China
- National Institute for Data Science in Health and Medicine, Xiamen University, Xiamen, Fujian, China
| |
|
29
|
Tanvir RB, Islam MM, Sobhan M, Luo D, Mondal AM. MOGAT: A Multi-Omics Integration Framework Using Graph Attention Networks for Cancer Subtype Prediction. Int J Mol Sci 2024; 25:2788. [PMID: 38474033 DOI: 10.3390/ijms25052788] [Received: 10/01/2023] [Revised: 02/15/2024] [Accepted: 02/22/2024] [Indexed: 03/14/2024]
Abstract
Accurate cancer subtype prediction is crucial for personalized medicine. Integrating multi-omics data represents a viable approach to comprehending the intricate pathophysiology of complex diseases like cancer. Conventional machine learning techniques are not ideal for analyzing the complex interrelationships among different categories of omics data. Numerous graph-based learning models have been proposed to uncover hidden representations and network structures specific to each type of omics data, with the aim of improving cancer prediction and patient profiling, among other applications in disease management. The existing graph-based state-of-the-art multi-omics integration approaches for cancer subtype prediction, MOGONET and SUPREME, use a graph convolutional network (GCN), which fails to consider the level of importance of neighboring nodes on a particular node. To address this gap, we hypothesize that paying attention to each neighbor, or weighting neighbors according to their importance, might improve cancer subtype prediction. The natural choice for determining the importance of each neighbor of a node in a graph is the graph attention network (GAT). Here, we propose MOGAT, a novel multi-omics integration approach leveraging GAT models that incorporate graph-based learning with an attention mechanism. MOGAT utilizes a multi-head attention mechanism to extract appropriate information for a specific sample by assigning unique attention coefficients to neighboring samples. To our knowledge, our group is the first to explore GAT in multi-omics integration for cancer subtype prediction. To evaluate the performance of MOGAT in predicting cancer subtypes, we explored two sets of breast cancer data from TCGA and METABRIC. Our proposed approach, MOGAT, outperforms MOGONET by 32% to 46% and SUPREME by 2% to 16% in cancer subtype prediction in different scenarios, supporting our hypothesis. Our results also show that GAT embeddings differentiate the high-risk group from the low-risk group better than raw features do.
Affiliation(s)
- Raihanul Bari Tanvir
- Knight Foundation School of Computing and Information Sciences, Florida International University, Miami, FL 33199, USA
| | - Md Mezbahul Islam
- Knight Foundation School of Computing and Information Sciences, Florida International University, Miami, FL 33199, USA
| | - Masrur Sobhan
- Knight Foundation School of Computing and Information Sciences, Florida International University, Miami, FL 33199, USA
| | - Dongsheng Luo
- Knight Foundation School of Computing and Information Sciences, Florida International University, Miami, FL 33199, USA
| | - Ananda Mohan Mondal
- Knight Foundation School of Computing and Information Sciences, Florida International University, Miami, FL 33199, USA
| |
|
30
|
Wei Q, Islam MT, Zhou Y, Xing L. Self-supervised deep learning of gene-gene interactions for improved gene expression recovery. Brief Bioinform 2024; 25:bbae031. [PMID: 38349062 PMCID: PMC10939378 DOI: 10.1093/bib/bbae031] [Received: 08/31/2023] [Revised: 11/18/2023] [Accepted: 01/11/2023] [Indexed: 02/15/2024]
Abstract
Single-cell RNA sequencing (scRNA-seq) has emerged as a powerful tool to gain biological insights at the cellular level. However, due to technical limitations of the existing sequencing technologies, low gene expression values are often omitted, leading to inaccurate gene counts. Existing methods, including advanced deep learning techniques, struggle to reliably impute gene expression because they lack mechanisms that explicitly consider the underlying biological knowledge of the system. It has long been recognized that gene-gene interactions may serve as reflective indicators of underlying biological processes, presenting discriminative signatures of the cells. A genomic data analysis framework capable of leveraging the underlying gene-gene interactions is thus highly desirable and could allow for more reliable identification of distinctive patterns in genomic data by extracting and integrating its intricate biological characteristics. Here we tackle the problem in two steps to exploit the gene-gene interactions of the system. We first reposition the genes into a 2D grid such that their spatial configuration reflects their interactive relationships. To alleviate the need for labeled ground truth gene expression datasets, a self-supervised 2D convolutional neural network is then employed to extract the contextual features of the interactions from the spatially configured genes and impute the omitted values. Extensive experiments with both simulated and experimental scRNA-seq datasets demonstrate the superior performance of the proposed strategy against existing imputation methods.
Affiliation(s)
- Qingyue Wei
- Institute for Computational and Mathematical Engineering, Stanford University, Stanford, 94305 CA, USA
| | - Md Tauhidul Islam
- Department of Radiation Oncology, Stanford University, Stanford, 94305 CA, USA
| | - Yuyin Zhou
- Department of Computer Science and Engineering, University of California, Santa Cruz, Santa Cruz, 95064 CA, USA
| | - Lei Xing
- Department of Radiation Oncology, Stanford University, Stanford, 94305 CA, USA
| |
|
31
|
Lv T, Zhang Y, Li M, Kang Q, Fang S, Zhang Y, Brix S, Xu X. EAGS: efficient and adaptive Gaussian smoothing applied to high-resolved spatial transcriptomics. Gigascience 2024; 13:giad097. [PMID: 38373746 PMCID: PMC10939424 DOI: 10.1093/gigascience/giad097] [Received: 06/01/2023] [Revised: 09/12/2023] [Accepted: 10/13/2023] [Indexed: 02/21/2024]
Abstract
BACKGROUND The emergence of high-resolved spatial transcriptomics (ST) has facilitated the research of novel methods to investigate biological development, organism growth, and other complex biological processes. However, high-resolved and whole transcriptomics ST datasets require customized imputation methods to improve the signal-to-noise ratio and the data quality. FINDINGS We propose an efficient and adaptive Gaussian smoothing (EAGS) imputation method for high-resolved ST. The adaptive 2-factor smoothing of EAGS creates patterns based on the spatial and expression information of the cells, creates adaptive weights for the smoothing of cells in the same pattern, and then utilizes the weights to restore the gene expression profiles. We assessed the performance and efficiency of EAGS using simulated and high-resolved ST datasets of mouse brain and olfactory bulb. CONCLUSIONS Compared with other competitive methods, EAGS shows higher clustering accuracy, better biological interpretations, and significantly reduced computational consumption.
Affiliation(s)
- Tongxuan Lv
- BGI Research, Shenzhen 518083, China
- College of Life Sciences, University of Chinese Academy of Sciences, Beijing 100049, China
| | | | - Mei Li
- BGI Research, Shenzhen 518083, China
- Department of Biotechnology and Biomedicine, Technical University of Denmark, 2800 Kgs. Lyngby, Denmark
| | | | - Shuangsang Fang
- BGI Research, Shenzhen 518083, China
- BGI Research, Beijing 102601, China
| | | | | | - Xun Xu
- BGI Research, Shenzhen 518083, China
- College of Life Sciences, University of Chinese Academy of Sciences, Beijing 100049, China
| |
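To make the smoothing idea in the EAGS abstract above more concrete, here is a minimal sketch of fixed-bandwidth Gaussian smoothing of expression values over spatial coordinates. It is not the EAGS algorithm: the adaptive two-factor weighting and pattern construction described in the abstract are omitted, and the function name, the single `sigma` bandwidth, and the toy data are all assumptions made for illustration.

```python
import numpy as np

def gaussian_smooth(coords, expr, sigma=1.0):
    """Smooth each cell's expression with Gaussian weights over spatial distance.

    coords: (n_cells, 2) spatial positions
    expr:   (n_cells, n_genes) expression matrix
    sigma:  kernel bandwidth (assumption: one fixed value for all cells)
    """
    coords = np.asarray(coords, dtype=float)
    expr = np.asarray(expr, dtype=float)
    # Pairwise squared Euclidean distances between cells.
    diff = coords[:, None, :] - coords[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)
    # Gaussian kernel weights, normalized so each row sums to 1.
    weights = np.exp(-sq_dist / (2.0 * sigma ** 2))
    weights /= weights.sum(axis=1, keepdims=True)
    # Each smoothed profile is a weighted average over all cells.
    return weights @ expr

# Toy data (hypothetical): three cells on a line, two genes.
coords = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
expr = np.array([[1.0, 0.0], [0.0, 4.0], [5.0, 2.0]])
smoothed = gaussian_smooth(coords, expr, sigma=1.0)
```

Because every row of the weight matrix sums to 1, each smoothed value stays within the observed range for that gene, and a very small `sigma` leaves the data essentially unchanged. The dense pairwise distance matrix scales quadratically with cell count, so a real high-resolution dataset would need a neighbor-limited variant.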
|
32
|
Dong S, Liu Y, Gong Y, Dong X, Zeng X. scCAN: Clustering With Adaptive Neighbor-Based Imputation Method for Single-Cell RNA-Seq Data. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2024; 21:95-105. [PMID: 38285569 DOI: 10.1109/tcbb.2023.3337231] [Indexed: 01/31/2024]
Abstract
Single-cell RNA sequencing (scRNA-seq) is widely used to study cellular heterogeneity in different samples. However, due to technical deficiencies, dropout events often result in zero gene expression values in the gene expression matrix. In this paper, we propose a new imputation method called scCAN, based on adaptive neighborhood clustering, to estimate the true values behind dropout zeros. Our method continuously updates cell-cell similarity information by simultaneously learning similarity relationships and clustering structures and by imposing new rank constraints on the Laplacian matrix of the similarity matrix, improving the imputation of dropout zero values. To evaluate the performance of this method, we used four simulated and eight real scRNA-seq datasets for downstream analyses, including cell clustering, recovered gene expression, and reconstructed cell trajectories. Our method improves the performance of the downstream analysis and outperforms other imputation methods.

33
Zheng W, Min W, Wang S. TsImpute: an accurate two-step imputation method for single-cell RNA-seq data. Bioinformatics 2023; 39:btad731. [PMID: 38039139 PMCID: PMC10724850 DOI: 10.1093/bioinformatics/btad731] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/21/2023] [Revised: 11/22/2023] [Accepted: 11/30/2023] [Indexed: 12/03/2023] Open
Abstract
MOTIVATION Single-cell RNA sequencing (scRNA-seq) technology has enabled the discovery of gene expression patterns at single-cell resolution. However, due to technical limitations, there are usually excessive zeros, called "dropouts," in scRNA-seq data, which may mislead the downstream analysis. Therefore, it is crucial to impute these dropouts to recover the biological information. RESULTS We propose a two-step imputation method called tsImpute to impute scRNA-seq data. In the first step, tsImpute adopts a zero-inflated negative binomial distribution to discriminate dropouts from true zeros and performs initial imputation by calculating the expected expression level. In the second step, it conducts clustering with this modified expression matrix, based on which the final distance-weighted imputation is performed. Numerical results based on both simulated and real data show that tsImpute achieves favorable performance in terms of gene expression recovery, cell clustering, and differential expression analysis. AVAILABILITY AND IMPLEMENTATION The R package of tsImpute is available at https://github.com/ZhengWeihuaYNU/tsImpute.
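To make the first step more concrete, here is a minimal sketch of how a zero-inflated negative binomial (ZINB) model can separate dropouts from true zeros. It applies Bayes' rule to an observed zero and is not taken from the tsImpute package (which is written in R); the function names, the parameterization by mean `mu` and dispersion `size`, and the choice of expected value are assumptions for illustration.

```python
def nb_zero_prob(mu, size):
    """P(X = 0) under a negative binomial with mean mu and dispersion size."""
    return (size / (size + mu)) ** size

def dropout_posterior(pi, mu, size):
    """Posterior probability that an observed zero is a dropout.

    pi:   zero-inflation (dropout) probability of the ZINB model
    mu:   negative binomial mean
    size: negative binomial dispersion parameter
    """
    p_zero_nb = nb_zero_prob(mu, size)
    return pi / (pi + (1.0 - pi) * p_zero_nb)

def expected_expression(pi, mu, size):
    """Illustrative initial imputation for an observed zero: dropouts
    contribute the NB mean, true zeros contribute 0 (an assumption)."""
    return dropout_posterior(pi, mu, size) * mu

# Hypothetical parameters for one gene in one cell
posterior = dropout_posterior(pi=0.3, mu=5.0, size=2.0)
initial_value = expected_expression(pi=0.3, mu=5.0, size=2.0)
```

With these hypothetical parameters, the negative binomial component rarely produces a zero on its own, so most of the posterior mass falls on the dropout explanation.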
Affiliation(s)
- Weihua Zheng
  - Department of Computer Science and Engineering, School of Information Science and Engineering, Yunnan University, Kunming 650504, China
- Wenwen Min
  - Department of Computer Science and Engineering, School of Information Science and Engineering, Yunnan University, Kunming 650504, China
  - Yunnan Key Laboratory of Intelligent Systems and Computing, Yunnan University, Kunming 650504, China
- Shunfang Wang
  - Department of Computer Science and Engineering, School of Information Science and Engineering, Yunnan University, Kunming 650504, China
  - Yunnan Key Laboratory of Intelligent Systems and Computing, Yunnan University, Kunming 650504, China

34
Zheng L, Allen GI. Graphical Model Inference with Erosely Measured Data. J Am Stat Assoc 2023; 119:2282-2293. [PMID: 39328784 PMCID: PMC11424035 DOI: 10.1080/01621459.2023.2256503] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/20/2022] [Revised: 06/15/2023] [Accepted: 08/13/2023] [Indexed: 09/28/2024]
Abstract
In this paper, we investigate the Gaussian graphical model inference problem in a novel setting that we call erose measurements, referring to irregularly measured or observed data. For graphs, this results in different node pairs having vastly different sample sizes, a situation that frequently arises in data integration, genomics, neuroscience, and sensor networks. Existing works characterize the graph selection performance using the minimum pairwise sample size, which provides little insight for erosely measured data, and no existing inference method is applicable. We aim to fill this gap by proposing the first inference method that characterizes the different uncertainty levels over the graph caused by the erose measurements, named GI-JOE (Graph Inference when Joint Observations are Erose). Specifically, we develop an edge-wise inference method and an affiliated FDR control procedure, where the variance of each edge depends on the sample sizes associated with corresponding neighbors. We prove statistical validity under erose measurements, thanks to careful localized edge-wise analysis and disentangling the dependencies across the graph. Finally, through simulation studies and a real neuroscience data example, we demonstrate the advantages of our inference methods for graph selection from erosely measured data.
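The notion of node pairs with vastly different sample sizes can be made concrete with a small sketch that counts, for every pair of variables, how many samples observe both. This is only the bookkeeping that motivates the setting, not the GI-JOE inference procedure; the function name and the NaN encoding of missing entries are assumptions.

```python
import math

def pairwise_sample_sizes(data):
    """Count, for every variable pair (j, k), the samples observing both.

    data: list of rows; missing entries are encoded as float('nan').
    Returns a symmetric p x p matrix of joint observation counts.
    """
    p = len(data[0])
    counts = [[0] * p for _ in range(p)]
    for row in data:
        observed = [j for j in range(p) if not math.isnan(row[j])]
        for j in observed:
            for k in observed:
                counts[j][k] += 1
    return counts

nan = float("nan")
data = [
    [1.0, 2.0, nan],
    [0.5, nan, nan],
    [1.5, 1.0, 3.0],
    [2.0, 2.5, nan],
]
n_pairs = pairwise_sample_sizes(data)
```

Here variables 0 and 1 are jointly observed in three samples, while variable 2 is jointly observed with any other variable only once, so a summary based on the minimum pairwise sample size would hide how well the first pair is measured.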
Affiliation(s)
- Lili Zheng
  - Department of Electrical and Computer Engineering, Rice University
- Genevera I Allen
  - Department of Electrical and Computer Engineering, Rice University
  - Department of Computer Science, Rice University
  - Department of Statistics, Rice University
  - Department of Pediatrics-Neurology, Baylor College of Medicine
  - Jan and Dan Duncan Neurological Research Institute, Texas Children's Hospital

35
Huang L, Song M, Shen H, Hong H, Gong P, Deng HW, Zhang C. Deep Learning Methods for Omics Data Imputation. Biology 2023; 12:1313. [PMID: 37887023 PMCID: PMC10604785 DOI: 10.3390/biology12101313] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 08/08/2023] [Revised: 09/28/2023] [Accepted: 10/02/2023] [Indexed: 10/28/2023]
Abstract
One common problem in omics data analysis is missing values, which can arise due to various reasons, such as poor tissue quality and insufficient sample volumes. Instead of discarding missing values and related data, imputation approaches offer an alternative means of handling missing data. However, the imputation of missing omics data is a non-trivial task. Difficulties mainly come from high dimensionality, non-linear or non-monotonic relationships within features, technical variations introduced by sampling methods, sample heterogeneity, and the non-random missingness mechanism. Several advanced imputation methods, including deep learning-based methods, have been proposed to address these challenges. Due to its capability of modeling complex patterns and relationships in large and high-dimensional datasets, many researchers have adopted deep learning models to impute missing omics data. This review provides a comprehensive overview of the currently available deep learning-based methods for omics imputation from the perspective of deep generative model architectures such as autoencoder, variational autoencoder, generative adversarial networks, and Transformer, with an emphasis on multi-omics data imputation. In addition, this review also discusses the opportunities that deep learning brings and the challenges that it might face in this field.
Affiliation(s)
- Lei Huang
  - School of Computing Sciences and Computer Engineering, University of Southern Mississippi, Hattiesburg, MS 39406, USA
- Meng Song
  - School of Computing Sciences and Computer Engineering, University of Southern Mississippi, Hattiesburg, MS 39406, USA
- Hui Shen
  - Center for Biomedical Informatics and Genomics, School of Medicine, Tulane University, New Orleans, LA 70112, USA
- Huixiao Hong
  - Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, U.S. Food and Drug Administration, Jefferson, AR 72079, USA
- Ping Gong
  - Environmental Laboratory, U.S. Army Engineer Research and Development Center, Vicksburg, MS 39180, USA
- Hong-Wen Deng
  - Center for Biomedical Informatics and Genomics, School of Medicine, Tulane University, New Orleans, LA 70112, USA
- Chaoyang Zhang
  - School of Computing Sciences and Computer Engineering, University of Southern Mississippi, Hattiesburg, MS 39406, USA

36
Wang T, Zhao H, Xu Y, Wang Y, Shang X, Peng J, Xiao B. scMultiGAN: cell-specific imputation for single-cell transcriptomes with multiple deep generative adversarial networks. Brief Bioinform 2023; 24:bbad384. [PMID: 37903416 PMCID: PMC11020228 DOI: 10.1093/bib/bbad384] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 07/21/2023] [Revised: 09/13/2023] [Accepted: 10/03/2023] [Indexed: 11/01/2023] Open
Abstract
The emergence of single-cell RNA sequencing (scRNA-seq) technology has revolutionized the identification of cell types and the study of cellular states at a single-cell level. Despite its significant potential, scRNA-seq data analysis is plagued by the issue of missing values. Many existing imputation methods rely on simplistic data distribution assumptions while ignoring the intrinsic gene expression distribution specific to cells. This work presents a novel deep-learning model, named scMultiGAN, for scRNA-seq imputation, which utilizes multiple collaborative generative adversarial networks (GAN). Unlike traditional GAN-based imputation methods that generate missing values based on random noises, scMultiGAN employs a two-stage training process and utilizes multiple GANs to achieve cell-specific imputation. Experimental results show the efficacy of scMultiGAN in imputation accuracy, cell clustering, differential gene expression analysis and trajectory analysis, significantly outperforming existing state-of-the-art techniques. Additionally, scMultiGAN is scalable to large scRNA-seq datasets and consistently performs well across sequencing platforms. The scMultiGAN code is freely available at https://github.com/Galaxy8172/scMultiGAN.
Affiliation(s)
- Tao Wang
  - School of Computer Science, Northwestern Polytechnical University, 1 Dongxiang Rd., 710072 Xi’an, China
  - Key Laboratory of Big Data Storage and Management, Ministry of Industry and Information Technology, Northwestern Polytechnical University, 1 Dongxiang Rd., 710072 Xi’an, China
- Hui Zhao
  - School of Automation, Northwestern Polytechnical University, 1 Dongxiang Rd., 710072 Xi’an, China
- Yungang Xu
  - Department of Cell Biology and Genetics, School of Basic Medical Sciences, Xi’an Jiaotong University Health Science Center, No.28, West Xianning Road, 710061 Xi’an, China
- Yongtian Wang
  - School of Computer Science, Northwestern Polytechnical University, 1 Dongxiang Rd., 710072 Xi’an, China
  - Key Laboratory of Big Data Storage and Management, Ministry of Industry and Information Technology, Northwestern Polytechnical University, 1 Dongxiang Rd., 710072 Xi’an, China
- Xuequn Shang
  - School of Computer Science, Northwestern Polytechnical University, 1 Dongxiang Rd., 710072 Xi’an, China
  - Key Laboratory of Big Data Storage and Management, Ministry of Industry and Information Technology, Northwestern Polytechnical University, 1 Dongxiang Rd., 710072 Xi’an, China
- Jiajie Peng
  - School of Computer Science, Northwestern Polytechnical University, 1 Dongxiang Rd., 710072 Xi’an, China
  - Key Laboratory of Big Data Storage and Management, Ministry of Industry and Information Technology, Northwestern Polytechnical University, 1 Dongxiang Rd., 710072 Xi’an, China
- Bing Xiao
  - School of Automation, Northwestern Polytechnical University, 1 Dongxiang Rd., 710072 Xi’an, China

37
Erfanian N, Heydari AA, Feriz AM, Iañez P, Derakhshani A, Ghasemigol M, Farahpour M, Razavi SM, Nasseri S, Safarpour H, Sahebkar A. Deep learning applications in single-cell genomics and transcriptomics data analysis. Biomed Pharmacother 2023; 165:115077. [PMID: 37393865 DOI: 10.1016/j.biopha.2023.115077] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 04/24/2023] [Revised: 06/22/2023] [Accepted: 06/23/2023] [Indexed: 07/04/2023] Open
Abstract
Traditional bulk sequencing methods are limited to measuring the average signal in a group of cells, potentially masking heterogeneity and rare populations. Single-cell resolution, however, enhances our understanding of complex biological systems and diseases, such as cancer, the immune system, and chronic diseases. At the same time, single-cell technologies generate massive amounts of data that are often high-dimensional, sparse, and complex, making analysis with traditional computational approaches difficult or infeasible. To tackle these challenges, many are turning to deep learning (DL) methods as potential alternatives to the conventional machine learning (ML) algorithms for single-cell studies. DL is a branch of ML capable of extracting high-level features from raw inputs in multiple stages. Compared to traditional ML, DL models have provided significant improvements across many domains and applications. In this work, we examine DL applications in genomics, transcriptomics, spatial transcriptomics, and multi-omics integration, and address whether DL techniques will prove to be advantageous or if the single-cell omics domain poses unique challenges. Through a systematic literature review, we have found that DL has not yet revolutionized the most pressing challenges of the single-cell omics field. However, using DL models for single-cell omics has shown promising results (in many cases outperforming the previous state-of-the-art models) in data preprocessing and downstream analysis. Although developments of DL algorithms for single-cell omics have generally been gradual, recent advances reveal that DL can offer valuable resources in fast-tracking and advancing single-cell research.
Affiliation(s)
- Nafiseh Erfanian
  - Student Research Committee, Birjand University of Medical Sciences, Birjand, Iran
- A Ali Heydari
  - Department of Applied Mathematics, University of California, Merced, CA, USA
  - Health Sciences Research Institute, University of California, Merced, CA, USA
- Adib Miraki Feriz
  - Student Research Committee, Birjand University of Medical Sciences, Birjand, Iran
- Pablo Iañez
  - Cellular Systems Genomics Group, Josep Carreras Research Institute, Barcelona, Spain
- Afshin Derakhshani
  - Department of Biochemistry and Molecular Biology, University of Calgary, Calgary, AB, Canada
- Mohsen Farahpour
  - Department of Electronics, Faculty of Electrical and Computer Engineering, University of Birjand, Birjand, Iran
- Seyyed Mohammad Razavi
  - Department of Electronics, Faculty of Electrical and Computer Engineering, University of Birjand, Birjand, Iran
- Saeed Nasseri
  - Cellular and Molecular Research Center, Birjand University of Medical Sciences, Birjand, Iran
- Hossein Safarpour
  - Cellular and Molecular Research Center, Birjand University of Medical Sciences, Birjand, Iran
- Amirhossein Sahebkar
  - Biotechnology Research Center, Pharmaceutical Technology Institute, Mashhad University of Medical Sciences, Mashhad, Iran
  - Applied Biomedical Research Center, Mashhad University of Medical Sciences, Mashhad, Iran
  - Department of Biotechnology, School of Pharmacy, Mashhad University of Medical Sciences, Mashhad, Iran

38
Abstract
Missing values are a notable challenge when analyzing mass spectrometry-based proteomics data. While the field is still actively debating the best practices, the challenge increased with the emergence of mass spectrometry-based single-cell proteomics and the dramatic increase in missing values. A popular approach to deal with missing values is to perform imputation. Imputation has several drawbacks for which alternatives exist, but currently, imputation is still a practical solution widely adopted in single-cell proteomics data analysis. This perspective discusses the advantages and drawbacks of imputation. We also highlight 5 main challenges linked to missing value management in single-cell proteomics. Future developments should aim to solve these challenges, whether it is through imputation or data modeling. The perspective concludes with recommendations for reporting missing values, for reporting methods that deal with missing values, and for proper encoding of missing values.
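The recommendation on proper encoding of missing values can be illustrated with a short sketch: unquantified entries are kept as NaN rather than zero, the missingness rate is reported explicitly, and summaries are computed over observed entries only. This is a generic example, not code from the perspective; the function names and toy intensities are assumptions.

```python
import math

def missing_rate(matrix):
    """Fraction of entries encoded as NaN (i.e. not quantified)."""
    values = [v for row in matrix for v in row]
    return sum(math.isnan(v) for v in values) / len(values)

def nan_mean(values):
    """Mean over observed entries only; NaN if nothing was observed."""
    obs = [v for v in values if not math.isnan(v)]
    return sum(obs) / len(obs) if obs else float("nan")

nan = float("nan")
# Rows are cells, columns are peptides or proteins (toy intensities)
intensities = [[10.0, nan, 12.0], [nan, nan, 8.0]]
rate = missing_rate(intensities)  # 3 of the 6 entries are missing
col_means = [nan_mean(col) for col in zip(*intensities)]
```

Encoding the gaps as zeros instead would pull the first column's mean from 10 down to 5 and make the never-observed second column look like a measured absence.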
Affiliation(s)
- Christophe Vanderaa
  - Computational Biology and Bioinformatics Unit (CBIO), de Duve Institute, UCLouvain, 1200 Brussels, Belgium
- Laurent Gatto
  - Computational Biology and Bioinformatics Unit (CBIO), de Duve Institute, UCLouvain, 1200 Brussels, Belgium

39
Shi Y, Wan J, Zhang X, Yin Y. CL-Impute: A contrastive learning-based imputation for dropout single-cell RNA-seq data. Comput Biol Med 2023; 164:107263. [PMID: 37531858 DOI: 10.1016/j.compbiomed.2023.107263] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/09/2023] [Revised: 06/27/2023] [Accepted: 07/16/2023] [Indexed: 08/04/2023]
Abstract
BACKGROUND Single-cell RNA-sequencing (scRNA-seq) technology has revolutionized the study of cell heterogeneity and biological interpretation at the single-cell level. However, the dropout events commonly present in scRNA-seq data can markedly reduce the reliability of downstream analysis. Existing imputation methods often overlook the discrepancy between the established cell relationship from dropout noisy data and reality, which limits their performances due to the learned untrustworthy cell representations. METHOD Here, we propose a novel approach called the CL-Impute (Contrastive Learning-based Impute) model for estimating missing genes without relying on preconstructed cell relationships. CL-Impute utilizes contrastive learning and a self-attention network to address this challenge. Specifically, the proposed CL-Impute model leverages contrastive learning to learn cell representations from the self-perspective of dropout events, whereas the self-attention network captures cell relationships from the global-perspective. RESULTS Experimental results on four benchmark datasets, including quantitative assessment, cell clustering, gene identification, and trajectory inference, demonstrate the superior performance of CL-Impute compared with that of existing state-of-the-art imputation methods. Furthermore, our experiment reveals that combining contrastive learning and masking cell augmentation enables the model to learn actual latent features from noisy data with a high rate of dropout events, enhancing the reliability of imputed values. CONCLUSIONS CL-Impute is a novel contrastive learning-based method to impute scRNA-seq data in the context of high dropout rate. The source code of CL-Impute is available at https://github.com/yuchen21-web/Imputation-for-scRNA-seq.
Affiliation(s)
- Yuchen Shi
  - School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou, 310018, China
  - Key Laboratory of Complex Systems Modeling and Simulation Ministry of Education, Ministry of Education, China
- Jian Wan
  - School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou, 310018, China
  - School of Information and Electronic Engineering, Zhejiang University of Science and Technology, Hangzhou, 310023, China
- Xin Zhang
  - School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou, 310018, China
  - Key Laboratory of Complex Systems Modeling and Simulation Ministry of Education, Ministry of Education, China
- Yuyu Yin
  - School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou, 310018, China
  - Key Laboratory of Complex Systems Modeling and Simulation Ministry of Education, Ministry of Education, China

40
Xi NM, Li JJ. Exploring the optimization of autoencoder design for imputing single-cell RNA sequencing data. Comput Struct Biotechnol J 2023; 21:4079-4095. [PMID: 37671239 PMCID: PMC10475479 DOI: 10.1016/j.csbj.2023.07.041] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/10/2023] [Revised: 07/22/2023] [Accepted: 07/31/2023] [Indexed: 09/07/2023] Open
Abstract
Autoencoders are the backbones of many imputation methods that aim to relieve the sparsity issue in single-cell RNA sequencing (scRNA-seq) data. The imputation performance of an autoencoder relies on both the neural network architecture and the hyperparameter choice. So far, literature in the single-cell field lacks a formal discussion on how to design the neural network and choose the hyperparameters. Here, we conducted an empirical study to answer this question. Our study used many real and simulated scRNA-seq datasets to examine the impacts of the neural network architecture, the activation function, and the regularization strategy on imputation accuracy and downstream analyses. Our results show that (i) deeper and narrower autoencoders generally lead to better imputation performance; (ii) the sigmoid and tanh activation functions consistently outperform other commonly used functions including ReLU; (iii) regularization improves the accuracy of imputation and downstream cell clustering and DE gene analyses. Notably, our results differ from common practices in the computer vision field regarding the activation function and the regularization strategy. Overall, our study offers practical guidance on how to optimize the autoencoder design for scRNA-seq data imputation.
Affiliation(s)
- Nan Miles Xi
  - Department of Mathematics and Statistics, Loyola University Chicago, Chicago, IL 60660, USA
- Jingyi Jessica Li
  - Department of Statistics and Data Science, University of California, Los Angeles, CA 90095-1554, USA
  - Department of Human Genetics, University of California, Los Angeles, CA 90095-7088, USA
  - Department of Computational Medicine, University of California, Los Angeles, CA 90095-1766, USA
  - Department of Biostatistics, University of California, Los Angeles, CA 90095-1772, USA

41
Chen S, Zheng R, Tian L, Wu FX, Li M. A posterior probability based Bayesian method for single-cell RNA-seq data imputation. Methods 2023; 216:21-38. [PMID: 37315825 DOI: 10.1016/j.ymeth.2023.06.004] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 02/28/2023] [Revised: 05/19/2023] [Accepted: 06/07/2023] [Indexed: 06/16/2023] Open
Abstract
Single-cell RNA-sequencing (scRNA-seq) data suffer from a large number of zeros. Such dropout events impede downstream data analyses. We propose BayesImpute to infer and impute dropouts from scRNA-seq data. Using the expression rate and coefficient of variation of the genes within the cell subpopulation, BayesImpute first determines likely dropouts, and then constructs the posterior distribution for each gene and uses the posterior mean to impute dropout values. Experiments on simulated and real data show that BayesImpute can effectively identify dropout events and reduce the introduction of false positive signals. Additionally, BayesImpute successfully recovers the true expression levels of missing values, restores the gene-to-gene and cell-to-cell correlation coefficients, and maintains the biological information in bulk RNA-seq data. Furthermore, BayesImpute boosts the clustering and visualization of cell subpopulations and improves the identification of differentially expressed genes. We further demonstrate that, in comparison to other statistical-based imputation methods, BayesImpute is scalable and fast with minimal memory usage.
Affiliation(s)
- Siqi Chen
  - School of Computer Science and Engineering, Central South University, Changsha 410083, China
- Ruiqing Zheng
  - School of Computer Science and Engineering, Central South University, Changsha 410083, China
- Luyi Tian
  - Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Fang-Xiang Wu
  - Department of Mechanical Engineering and Division of Biomedical Engineering, University of Saskatchewan, Saskatoon, SK S7N 5A9, Canada
- Min Li
  - School of Computer Science and Engineering, Central South University, Changsha 410083, China

42
Cheng Y, Ma X, Yuan L, Sun Z, Wang P. Evaluating imputation methods for single-cell RNA-seq data. BMC Bioinformatics 2023; 24:302. [PMID: 37507764 PMCID: PMC10386301 DOI: 10.1186/s12859-023-05417-7] [Citation(s) in RCA: 15] [Impact Index Per Article: 7.5] [Received: 06/29/2020] [Accepted: 07/18/2023] [Indexed: 07/30/2023] Open
Abstract
BACKGROUND Single-cell RNA sequencing (scRNA-seq) enables the high-throughput profiling of gene expression at the single-cell level. However, overwhelming dropouts within data may obscure meaningful biological signals. Various imputation methods have recently been developed to address this problem. Therefore, it is important to perform a systematic evaluation of different imputation algorithms. RESULTS In this study, we evaluated 11 of the most recent imputation methods on 12 real biological datasets from immunological studies and 4 simulated datasets. The performance of these methods was compared based on numerical recovery, cell clustering, and marker gene analysis. Most of the methods brought some benefits on numerical recovery. To some extent, the performance of imputation methods varied among protocols. In the cell clustering analysis, no method performed consistently well across all datasets. Some methods performed poorly on real datasets but excellently on simulated datasets. Surprisingly and importantly, some methods had a negative effect on cell clustering. In marker gene analysis, some methods identified potentially novel cell subsets. However, not all of the marker genes were successfully imputed in gene expression, suggesting that imputation challenges remain. CONCLUSIONS In summary, different imputation methods showed different effects on different datasets, suggesting that imputation may have dataset specificity. Our study reveals the benefits and limitations of various imputation methods and provides data-driven guidance for scRNA-seq data analysis.
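One common way to measure numerical recovery is to hide a fraction of observed values, impute them, and score the imputed values against the hidden truth. The sketch below shows that generic masking scheme, not the benchmarking pipeline of this study; the column-mean imputer is a deliberately simple placeholder, and all function names and the toy matrix are assumptions.

```python
import math
import random

def mask_entries(matrix, frac, seed=0):
    """Zero out a random fraction of non-zero entries; return the masked
    copy and the list of (row, col) positions that were hidden."""
    rng = random.Random(seed)
    nonzero = [(i, j) for i, row in enumerate(matrix)
               for j, v in enumerate(row) if v != 0]
    hidden = rng.sample(nonzero, max(1, int(frac * len(nonzero))))
    masked = [row[:] for row in matrix]
    for i, j in hidden:
        masked[i][j] = 0.0
    return masked, hidden

def column_mean_impute(masked):
    """Placeholder imputer: replace every zero with the mean of the
    non-zero values in its column (true zeros are filled as well)."""
    n_cols = len(masked[0])
    means = []
    for j in range(n_cols):
        obs = [row[j] for row in masked if row[j] != 0]
        means.append(sum(obs) / len(obs) if obs else 0.0)
    return [[v if v != 0 else means[j] for j, v in enumerate(row)]
            for row in masked]

def rmse_on_hidden(truth, imputed, hidden):
    """Root-mean-square error restricted to the artificially hidden entries."""
    sq = [(truth[i][j] - imputed[i][j]) ** 2 for i, j in hidden]
    return math.sqrt(sum(sq) / len(sq))

truth = [[1.0, 2.0, 0.0], [3.0, 0.0, 4.0], [5.0, 6.0, 0.0]]
masked, hidden = mask_entries(truth, frac=0.5)
imputed = column_mean_impute(masked)
error = rmse_on_hidden(truth, imputed, hidden)
```

Scoring only the hidden positions keeps the metric independent of how an imputer treats the zeros that were already present in the data.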
Affiliation(s)
- Yi Cheng
  - School of Intelligence Science and Technology, Key Laboratory of Machine Perception (MOE), Peking University, Beijing, 100871, China
- Xiuli Ma
  - School of Intelligence Science and Technology, Key Laboratory of Machine Perception (MOE), Peking University, Beijing, 100871, China
- Lang Yuan
  - School of Intelligence Science and Technology, Key Laboratory of Machine Perception (MOE), Peking University, Beijing, 100871, China
- Zhaoguo Sun
  - School of Intelligence Science and Technology, Key Laboratory of Machine Perception (MOE), Peking University, Beijing, 100871, China
- Pingzhang Wang
  - Department of Immunology, NHC Key Laboratory of Medical Immunology (Peking University), School of Basic Medical Sciences, Peking University Health Science Center, Beijing, China
  - Peking University Center for Human Disease Genomics, Beijing, 100191, China

43
Hu Y, Zhao Y, Schunk CT, Ma Y, Derr T, Zhou XM. ADEPT: Autoencoder with differentially expressed genes and imputation for robust spatial transcriptomics clustering. iScience 2023; 26:106792. [PMID: 37235055 PMCID: PMC10205785 DOI: 10.1016/j.isci.2023.106792] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/22/2023] [Revised: 04/06/2023] [Accepted: 04/26/2023] [Indexed: 05/28/2023] Open
Abstract
Advancements in spatial transcriptomics (ST) have enabled an in-depth understanding of complex tissues by quantifying gene expression at spatially localized spots. Several notable clustering methods have been introduced to utilize both spatial and transcriptional information in the analysis of ST datasets. However, data quality varies across ST sequencing techniques and types of datasets, and this influences the performance of different methods and benchmarks. To harness spatial context and transcriptional profile in ST data, we developed a graph-based, multi-stage framework for robust clustering, called ADEPT. To control and stabilize data quality, ADEPT relies on a graph autoencoder backbone and performs iterative clustering on imputed matrices based on differentially expressed genes to minimize the variance of clustering results. ADEPT outperformed other popular methods on ST data generated by different platforms across analyses such as spatial domain identification, visualization, spatial trajectory inference, and data denoising.
Affiliation(s)
- Yunfei Hu
  - Department of Computer Science, Vanderbilt University, Nashville, TN, USA
- Yuying Zhao
  - Department of Computer Science, Vanderbilt University, Nashville, TN, USA
- Curtis T. Schunk
  - Department of Biomedical Engineering, Vanderbilt University, Nashville, TN, USA
- Yingxiang Ma
  - Data Science Institute, Vanderbilt University, Nashville, TN, USA
- Tyler Derr
  - Department of Computer Science, Vanderbilt University, Nashville, TN, USA
  - Data Science Institute, Vanderbilt University, Nashville, TN, USA
- Xin Maizie Zhou
  - Department of Computer Science, Vanderbilt University, Nashville, TN, USA
  - Department of Biomedical Engineering, Vanderbilt University, Nashville, TN, USA
  - Data Science Institute, Vanderbilt University, Nashville, TN, USA

44
Zhu H, Liu T, Wang Z. scHiMe: predicting single-cell DNA methylation levels based on single-cell Hi-C data. Brief Bioinform 2023:7193585. [PMID: 37302805 PMCID: PMC10359091 DOI: 10.1093/bib/bbad223] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/09/2023] [Revised: 05/10/2023] [Accepted: 05/23/2023] [Indexed: 06/13/2023] Open
Abstract
Recently a biochemistry experiment named methyl-3C was developed to simultaneously capture the chromosomal conformations and DNA methylation levels on individual single cells. However, the number of data sets generated from this experiment is still small in the scientific community compared with the greater amount of single-cell Hi-C data generated from separate single cells. Therefore, a computational tool to predict single-cell methylation levels based on single-cell Hi-C data on the same individual cells is needed. We developed a graph transformer named scHiMe to accurately predict the base-pair-specific (bp-specific) methylation levels based on both single-cell Hi-C data and DNA nucleotide sequences. We benchmarked scHiMe for predicting the bp-specific methylation levels on all of the promoters of the human genome, all of the promoter regions together with the corresponding first exon and intron regions, and random regions on the whole genome. Our evaluation showed a high consistency between the predicted and methyl-3C-detected methylation levels. Moreover, the predicted DNA methylation levels resulted in accurate classifications of cells into different cell types, which indicated that our algorithm successfully captured the cell-to-cell variability in the single-cell Hi-C data. scHiMe is freely available at http://dna.cs.miami.edu/scHiMe/.
Affiliation(s)
- Hao Zhu
  - Department of Computer Science, University of Miami, 330M Ungar Building, 1365 Memorial Drive, Coral Gables, 33124-4245, FL, USA
- Tong Liu
  - Department of Computer Science, University of Miami, 330M Ungar Building, 1365 Memorial Drive, Coral Gables, 33124-4245, FL, USA
- Zheng Wang
  - Department of Computer Science, University of Miami, 330M Ungar Building, 1365 Memorial Drive, Coral Gables, 33124-4245, FL, USA
|
45
|
Nie X, Qin D, Zhou X, Duo H, Hao Y, Li B, Liang G. Clustering ensemble in scRNA-seq data analysis: Methods, applications and challenges. Comput Biol Med 2023; 159:106939. [PMID: 37075602 DOI: 10.1016/j.compbiomed.2023.106939] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 01/20/2023] [Revised: 03/31/2023] [Accepted: 04/14/2023] [Indexed: 04/21/2023]
Abstract
With the rapid development of single-cell RNA-sequencing techniques, various computational methods and tools have been proposed to analyze these high-throughput data, accelerating the discovery of potential biological information. As one of the core steps of single-cell transcriptome data analysis, clustering plays a crucial role in identifying cell types and interpreting cellular heterogeneity. However, different clustering methods produce differing results, and these unstable partitions can affect the accuracy of the analysis to a certain extent. To overcome this challenge and obtain more accurate results, clustering ensembles are now frequently applied to cluster analysis of single-cell transcriptome datasets, and the results generated by nearly all clustering ensembles are more reliable than those from most single clustering partitions. In this review, we summarize applications and challenges of the clustering ensemble method in single-cell transcriptome data analysis, and provide constructive thoughts and references for researchers in this field.
Affiliation(s)
- Xiner Nie
  - Key Laboratory of Biorheological Science and Technology, Ministry of Education, Bioengineering College, Chongqing University, Chongqing, 400044, China; College of Life Sciences, Chongqing Normal University, Chongqing, 400044, PR China
- Dan Qin
  - Department of Biology, College of Science, Northeastern University, Boston, MA, 02115, USA
- Xinyi Zhou
  - College of Life Sciences, Chongqing Normal University, Chongqing, 400044, PR China
- Hongrui Duo
  - College of Life Sciences, Chongqing Normal University, Chongqing, 400044, PR China
- Youjin Hao
  - College of Life Sciences, Chongqing Normal University, Chongqing, 400044, PR China
- Bo Li
  - College of Life Sciences, Chongqing Normal University, Chongqing, 400044, PR China
- Guizhao Liang
  - Key Laboratory of Biorheological Science and Technology, Ministry of Education, Bioengineering College, Chongqing University, Chongqing, 400044, China
|
46
|
Qiu Y, Yan C, Zhao P, Zou Q. SSNMDI: a novel joint learning model of semi-supervised non-negative matrix factorization and data imputation for clustering of single-cell RNA-seq data. Brief Bioinform 2023; 24:7147025. [PMID: 37122068 DOI: 10.1093/bib/bbad149] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 12/29/2022] [Revised: 02/18/2023] [Accepted: 03/28/2023] [Indexed: 05/02/2023] Open
Abstract
MOTIVATION: Single-cell RNA sequencing (scRNA-seq) technology attracts extensive attention in the biomedical field. It can be used to measure gene expression and analyze the transcriptome at the single-cell level, enabling the identification of cell types based on unsupervised clustering. Data imputation and dimension reduction are conducted before clustering because scRNA-seq has a high 'dropout' rate, noise and linear inseparability. However, performing dimension reduction, imputation and clustering independently cannot fully characterize the pattern of the scRNA-seq data, resulting in poor clustering performance. Herein, we propose a novel and accurate algorithm, SSNMDI, that utilizes a joint learning approach to simultaneously perform imputation, dimensionality reduction and cell clustering in a non-negative matrix factorization (NMF) framework. In addition, we integrate the cell annotation as prior information, then transform the joint learning into a semi-supervised NMF model. Through experiments on 14 datasets, we demonstrate that SSNMDI has a faster convergence speed, better dimensionality reduction performance and a more accurate cell clustering performance than previous methods, providing an accurate and robust strategy for analyzing scRNA-seq data. Biological analyses are also conducted to validate the biological significance of our method, including pseudotime analysis, gene ontology and survival analysis. We believe that we are among the first to introduce imputation, partial label information, dimension reduction and clustering to the single-cell field. AVAILABILITY AND IMPLEMENTATION: The source code for SSNMDI is available at https://github.com/yushanqiu/SSNMDI.
Affiliation(s)
- Yushan Qiu
  - College of Mathematics and Statistics, Shenzhen University, 518000, Guangdong, China
- Chang Yan
  - College of Mathematics and Statistics, Shenzhen University, 518000, Guangdong, China
- Pu Zhao
  - College of Life and Health Sciences, Northeastern University, Shenyang, 110169, China
- Quan Zou
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, 610056, China
|
47
|
Subkhankulova T, Camargo Sosa K, Uroshlev LA, Nikaido M, Shriever N, Kasianov AS, Yang X, Rodrigues FSLM, Carney TJ, Bavister G, Schwetlick H, Dawes JHP, Rocco A, Makeev VJ, Kelsh RN. Zebrafish pigment cells develop directly from persistent highly multipotent progenitors. Nat Commun 2023; 14:1258. [PMID: 36878908 PMCID: PMC9988989 DOI: 10.1038/s41467-023-36876-4] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Received: 06/23/2021] [Accepted: 02/17/2023] [Indexed: 03/08/2023] Open
Abstract
Neural crest cells are highly multipotent stem cells, but it remains unclear how their restriction to specific fates occurs. The direct fate restriction model hypothesises that migrating cells maintain full multipotency, whilst progressive fate restriction envisages fully multipotent cells transitioning to partially-restricted intermediates before committing to individual fates. Using zebrafish pigment cell development as a model, we show, by applying NanoString hybridization single-cell transcriptional profiling and RNAscope in situ hybridization, that neural crest cells retain broad multipotency throughout migration and even in post-migratory cells in vivo, with no evidence for partially-restricted intermediates. We find that leukocyte tyrosine kinase early expression marks a multipotent stage, with signalling driving iridophore differentiation through repression of fate-specific transcription factors for other fates. We reconcile the direct and progressive fate restriction models by proposing that pigment cell development occurs directly, but dynamically, from a highly multipotent state, consistent with our recently-proposed Cyclical Fate Restriction model.
Affiliation(s)
- Karen Camargo Sosa
  - Department of Life Sciences, University of Bath, Claverton Down, Bath, BA2 7AY, UK
- Leonid A Uroshlev
  - Vavilov Institute of General Genetics, Russian Academy of Sciences, Ul. Gubkina 3, Moscow, 119991, Russia
- Masataka Nikaido
  - Department of Life Sciences, University of Bath, Claverton Down, Bath, BA2 7AY, UK
  - Graduate School of Science, University of Hyogo, Ako-gun, Hyogo Pref., 678-1297, Japan
- Noah Shriever
  - Department of Life Sciences, University of Bath, Claverton Down, Bath, BA2 7AY, UK
- Artem S Kasianov
  - Vavilov Institute of General Genetics, Russian Academy of Sciences, Ul. Gubkina 3, Moscow, 119991, Russia
  - Department of Medical and Biological Physics, Moscow Institute of Physics and Technology, 9 Institutskiy per., Dolgoprudny, Moscow Region, 141701, Russia
  - A.A. Kharkevich Institute for Information Transmission Problems (IITP), Russian Academy of Sciences, Bolshoy Karetny per. 19, build.1, Moscow, 127051, Russia
- Xueyan Yang
  - Department of Life Sciences, University of Bath, Claverton Down, Bath, BA2 7AY, UK
  - The MOE Key Laboratory of Contemporary Anthropology, School of Life Sciences, Fudan University, Shanghai, 200438, PR China
- Thomas J Carney
  - Department of Life Sciences, University of Bath, Claverton Down, Bath, BA2 7AY, UK
  - Lee Kong Chian School of Medicine, Experimental Medicine Building, Yunnan Garden Campus, Nanyang Technological University, 59 Nanyang Drive, Yunnan Garden, 636921, Singapore
- Gemma Bavister
  - Department of Life Sciences, University of Bath, Claverton Down, Bath, BA2 7AY, UK
- Hartmut Schwetlick
  - Department of Mathematical Sciences, University of Bath, Claverton Down, Bath, BA2 7AY, UK
- Jonathan H P Dawes
  - Department of Mathematical Sciences, University of Bath, Claverton Down, Bath, BA2 7AY, UK
- Andrea Rocco
  - Department of Microbial Sciences, FHMS, University of Surrey, GU2 7XH, Guildford, UK
  - Department of Physics, FEPS, University of Surrey, GU2 7XH, Guildford, UK
- Vsevolod J Makeev
  - Vavilov Institute of General Genetics, Russian Academy of Sciences, Ul. Gubkina 3, Moscow, 119991, Russia
  - Department of Medical and Biological Physics, Moscow Institute of Physics and Technology, 9 Institutskiy per., Dolgoprudny, Moscow Region, 141701, Russia
  - Laboratory 'Regulatory Genomics', Institute of Fundamental Medicine and Biology, Kazan Federal University, 18 Kremlyovskaya street, Kazan, 420008, Russia
- Robert N Kelsh
  - Department of Life Sciences, University of Bath, Claverton Down, Bath, BA2 7AY, UK
|
48
|
Cheng X, Yan C, Jiang H, Qiu Y. scHOIS: Determining Cell Heterogeneity Through Hierarchical Clustering Based on Optimal Imputation Strategy. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2023; 20:1431-1444. [PMID: 37815942 DOI: 10.1109/tcbb.2022.3203592] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/12/2023]
Abstract
Advances in single-cell RNA sequencing (scRNA-seq) technology provide an unbiased and high-throughput analysis of each cell at single-cell resolution, and further facilitate the development of cellular heterogeneity analysis. Despite the promise of scRNA-seq, the data generated by this method are sparse and noisy because of the presence of dropout events, which can greatly impact downstream analyses such as differential gene expression, cell type annotation, and lineage trajectory reconstruction. The development of effective and robust computational methods to address both dropout and clustering is thus urgently needed. In this study, we propose a flexible, accurate two-stage algorithm for single-cell heterogeneity analysis via hierarchical clustering based on an optimal imputation strategy, called scHOIS. At the first stage, masked non-negative matrix factorization is applied to approximate the original observed scRNA-seq data, with optimal rank determined by variance analysis. At the second stage, hierarchical clustering is applied to group the imputed cells using Pearson correlation to measure similarity, with the optimal number of clusters determined by integrating three classical indexes. We performed extensive experiments on real-world datasets, which showed that scHOIS effectively and robustly distinguished cellular differences and that the clustering performance of this algorithm was superior to that of other state-of-the-art methods.
|
49
|
Xiong Z, Luo J, Shi W, Liu Y, Xu Z, Wang B. scGCL: an imputation method for scRNA-seq data based on graph contrastive learning. Bioinformatics 2023; 39:7056638. [PMID: 36825817 PMCID: PMC9991516 DOI: 10.1093/bioinformatics/btad098] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/31/2022] [Revised: 01/14/2023] [Accepted: 02/24/2023] [Indexed: 02/25/2023] Open
Abstract
MOTIVATION: Single-cell RNA-sequencing (scRNA-seq) is widely used to reveal cellular heterogeneity, complex disease mechanisms and cell differentiation processes. Due to high sparsity and complex gene expression patterns, scRNA-seq data present a large number of dropout events, affecting downstream tasks such as cell clustering and pseudo-time analysis. Restoring the expression levels of genes is essential for reducing technical noise and facilitating downstream analysis. However, existing scRNA-seq data imputation methods ignore the topological structure information of scRNA-seq data and cannot comprehensively utilize the relationships between cells. RESULTS: Here, we propose a single-cell Graph Contrastive Learning method for scRNA-seq data imputation, named scGCL, which integrates graph contrastive learning and Zero-inflated Negative Binomial (ZINB) distribution to estimate dropout values. scGCL summarizes global and local semantic information through contrastive learning and selects positive samples to enhance the representation of target nodes. To capture the global probability distribution, scGCL introduces an autoencoder based on the ZINB distribution, which reconstructs the scRNA-seq data based on the prior distribution. Through extensive experiments, we verify that scGCL outperforms existing state-of-the-art imputation methods in clustering performance and gene imputation on 14 scRNA-seq datasets. Further, we find that scGCL can enhance the expression patterns of specific genes in Alzheimer's disease datasets. AVAILABILITY AND IMPLEMENTATION: The code and data of scGCL are available on GitHub: https://github.com/zehaoxiong123/scGCL. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Zehao Xiong
  - College of Computer Science and Electronic Engineering, Hunan University, Changsha 410083, China
- Jiawei Luo
  - College of Computer Science and Electronic Engineering, Hunan University, Changsha 410083, China
- Wanwan Shi
  - College of Computer Science and Electronic Engineering, Hunan University, Changsha 410083, China
- Ying Liu
  - College of Computer Science and Electronic Engineering, Hunan University, Changsha 410083, China
- Zhongyuan Xu
  - College of Computer Science and Electronic Engineering, Hunan University, Changsha 410083, China
- Bo Wang
  - College of Computer Science and Electronic Engineering, Hunan University, Changsha 410083, China
|
50
|
Bhadani R, Chen Z, An L. Attention-Based Graph Neural Network for Label Propagation in Single-Cell Omics. Genes (Basel) 2023; 14:506. [PMID: 36833434 PMCID: PMC9957137 DOI: 10.3390/genes14020506] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/27/2022] [Revised: 02/13/2023] [Accepted: 02/13/2023] [Indexed: 02/19/2023] Open
Abstract
Single-cell data analysis has been at the forefront of development in biology and medicine since sequencing data have been made available. An important challenge in single-cell data analysis is the identification of cell types. Several methods have been proposed for cell-type identification. However, these methods do not capture the higher-order topological relationship between different samples. In this work, we propose an attention-based graph neural network that captures the higher-order topological relationship between different samples and performs transductive learning for predicting cell types. The evaluation of our method on both simulation and publicly available datasets demonstrates the superiority of our method, scAGN, in terms of prediction accuracy. In addition, our method works best for highly sparse datasets in terms of F1 score, precision, recall, and Matthews correlation coefficient. Further, our method consistently runs faster than other methods.
Affiliation(s)
- Rahul Bhadani
  - Department of Electrical & Computer Engineering, The University of Arizona, Tucson, AZ 85721, USA
  - Interdisciplinary Program in Statistics and Data Science, The University of Arizona, Tucson, AZ 85721, USA
- Zhuo Chen
  - Interdisciplinary Program in Statistics and Data Science, The University of Arizona, Tucson, AZ 85721, USA
- Lingling An
  - Interdisciplinary Program in Statistics and Data Science, The University of Arizona, Tucson, AZ 85721, USA
  - Department of Biosystems Engineering, The University of Arizona, Tucson, AZ 85721, USA
  - Department of Epidemiology and Biostatistics, The University of Arizona, Tucson, AZ 85721, USA
|