1
Zhang T, Zhao Z, Ren J, Zhang Z, Zhang H, Wang G. cfDiffusion: diffusion-based efficient generation of high quality scRNA-seq data with classifier-free guidance. Brief Bioinform 2025; 26:bbaf071. [PMID: 39987461 PMCID: PMC11846686 DOI: 10.1093/bib/bbaf071] [Received: 11/17/2024] [Revised: 01/23/2025] [Accepted: 02/05/2025] [Indexed: 02/25/2025] [Open access]
Abstract
Single-cell RNA sequencing (scRNA-seq) technology provides a powerful means to measure gene expression at the individual cell level, thereby uncovering the intricate cellular heterogeneity that underlies various biological processes, including embryonic development, tumor metastasis, and microbial reproduction. However, the variable amounts of data generated across different cell types within tissues can compromise the accuracy of downstream analyses. Traditional approaches for generating scRNA-seq simulation data often rely on predefined data distributions, which can negatively impact the quality of the simulated data. Furthermore, these methods typically focus on simulating single-attribute cells, necessitating substantial additional data for the simulation of multi-attribute cells, which can lead to increased training times. To address these limitations, we propose cfDiffusion, a novel method grounded in diffusion models that incorporates Classifier-Free Guidance and a high-level feature caching mechanism. By leveraging Classifier-Free Guidance, cfDiffusion significantly reduces the training costs associated with model development compared to traditional Classifier Guidance methods. The integration of a caching mechanism further enhances efficiency by shortening inference times. While the inference duration of cfDiffusion remains longer than that of scDiffusion, it exhibits superior expressiveness and efficiency in generating multi-attribute single-cell data. Evaluated across datasets from multiple sequencing platforms, cfDiffusion consistently outperforms state-of-the-art models across various performance metrics. Additionally, cfDiffusion enables the simulation of single-cell data along a pseudo-time scale, facilitating advanced analyses such as tracking cell differentiation, investigating intercellular communication, and elucidating cellular heterogeneity.
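The Classifier-Free Guidance mentioned in this abstract can be illustrated with a minimal, generic sketch. It is not the cfDiffusion implementation: the denoiser signature, the `toy_model` stand-in, and the `guidance_scale` parameter are assumptions made for illustration. A single network is queried once without a condition and once with it, and the two noise predictions are blended, so no separate classifier has to be trained.

```python
from typing import Callable, List, Optional, Sequence

# Hypothetical denoiser signature (an assumption, not taken from the paper):
# (noisy sample, timestep, condition or None) -> predicted noise
Denoiser = Callable[[Sequence[float], int, Optional[int]], List[float]]


def cfg_noise(model: Denoiser, x_t: Sequence[float], t: int,
              condition: int, guidance_scale: float) -> List[float]:
    """Blend unconditional and conditional noise predictions.

    guidance_scale = 0 gives the unconditional prediction,
    guidance_scale = 1 gives the conditional prediction,
    larger values push the sample further toward the condition.
    """
    eps_uncond = model(x_t, t, None)       # condition dropped
    eps_cond = model(x_t, t, condition)    # condition supplied
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


def toy_model(x_t: Sequence[float], t: int, condition: Optional[int]) -> List[float]:
    """Toy stand-in for a trained network: shifts the input by the condition label."""
    shift = 0.0 if condition is None else float(condition)
    return [x + shift for x in x_t]


if __name__ == "__main__":
    for scale in (0.0, 1.0, 2.0):
        print(scale, cfg_noise(toy_model, [1.0, 2.0], t=0, condition=1, guidance_scale=scale))
```

Other parameterizations of the guidance weight are in common use; this one was chosen because the unconditional and conditional limits are easy to read off.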
Affiliation(s)
- Tianjiao Zhang
  - College of Computer and Control Engineering, Northeast Forestry University, No. 26, Hexing Road, Xiangfang District, Harbin 150040, China
- Zhongqian Zhao
  - College of Computer and Control Engineering, Northeast Forestry University, No. 26, Hexing Road, Xiangfang District, Harbin 150040, China
- Jixiang Ren
  - College of Computer and Control Engineering, Northeast Forestry University, No. 26, Hexing Road, Xiangfang District, Harbin 150040, China
- Ziheng Zhang
  - College of Computer and Control Engineering, Northeast Forestry University, No. 26, Hexing Road, Xiangfang District, Harbin 150040, China
- Hongfei Zhang
  - College of Computer and Control Engineering, Northeast Forestry University, No. 26, Hexing Road, Xiangfang District, Harbin 150040, China
- Guohua Wang
  - College of Computer and Control Engineering, Northeast Forestry University, No. 26, Hexing Road, Xiangfang District, Harbin 150040, China
  - Faculty of Computing, Harbin Institute of Technology, No. 92 Xidazhi Street, Nangang District, Harbin 150001, China
2
Luo E, Hao M, Wei L, Zhang X. scDiffusion: conditional generation of high-quality single-cell data using diffusion model. Bioinformatics 2024; 40:btae518. [PMID: 39171840 PMCID: PMC11368386 DOI: 10.1093/bioinformatics/btae518] [Received: 03/06/2024] [Revised: 08/10/2024] [Accepted: 08/20/2024] [Indexed: 08/23/2024] [Open access]
Abstract
MOTIVATION: Single-cell RNA sequencing (scRNA-seq) data are important for studying the laws of life at the single-cell level. However, obtaining enough high-quality scRNA-seq data remains challenging. To mitigate the limited availability of data, generative models have been proposed to computationally generate synthetic scRNA-seq data. Nevertheless, the data generated with current models are not yet very realistic, especially when data must be generated under controlled conditions. Meanwhile, diffusion models have shown their power in generating high-fidelity data, providing a new opportunity for scRNA-seq generation. RESULTS: In this study, we developed scDiffusion, a generative model combining the diffusion model and foundation model to generate high-quality scRNA-seq data with controlled conditions. We designed multiple classifiers to guide the diffusion process simultaneously, enabling scDiffusion to generate data under multiple condition combinations. We also proposed a new control strategy called Gradient Interpolation, which allows the model to generate continuous trajectories of cell development from a given cell state. Experiments showed that scDiffusion could generate single-cell gene expression data closely resembling real scRNA-seq data. scDiffusion can also conditionally produce data for specific cell types, including rare cell types. Furthermore, the multiple-condition generation of scDiffusion could be used to generate cell types absent from the training data. Leveraging the Gradient Interpolation strategy, we generated a continuous developmental trajectory of mouse embryonic cells. These experiments demonstrate that scDiffusion is a powerful tool for augmenting real scRNA-seq data and can provide insights into cell fate research. AVAILABILITY AND IMPLEMENTATION: scDiffusion is openly available at the GitHub repository https://github.com/EperLuo/scDiffusion or Zenodo https://zenodo.org/doi/10.5281/zenodo.13268742.
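The two control ideas named in this abstract, guidance from several classifiers at once and Gradient Interpolation, can be sketched in a few lines. The abstract does not give the exact formulation, so the following is only an assumed, simplified reading: per-classifier gradients are combined as a weighted sum, and a trajectory is obtained by linearly blending the gradients for a start state and an end state. The function names, the weights, and the `alpha` parameter are hypothetical and are not taken from the scDiffusion code base.

```python
from typing import List, Sequence


def combine_gradients(gradients: Sequence[Sequence[float]],
                      weights: Sequence[float]) -> List[float]:
    """Weighted sum of per-classifier gradients (assumed combination rule)."""
    if len(gradients) != len(weights):
        raise ValueError("one weight is required per classifier gradient")
    dim = len(gradients[0])
    return [sum(w * g[i] for g, w in zip(gradients, weights)) for i in range(dim)]


def interpolate_gradients(grad_start: Sequence[float],
                          grad_end: Sequence[float],
                          alpha: float) -> List[float]:
    """Linear blend of two guidance gradients.

    alpha = 0 guides fully toward the start state, alpha = 1 toward the end
    state; intermediate values give intermediate guidance directions.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(grad_start, grad_end)]


if __name__ == "__main__":
    # Two classifiers (for example cell type and tissue) guiding one sample.
    print(combine_gradients([[1.0, 2.0], [3.0, 4.0]], [1.0, 0.5]))
    # Sweep alpha to move the guidance direction from one state to another.
    for alpha in (0.0, 0.25, 1.0):
        print(alpha, interpolate_gradients([1.0, 0.0], [0.0, 1.0], alpha))
```

In a classifier-guided sampler, the resulting vector would be added to the denoising update at each step; that step is omitted here to keep the sketch short.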
Affiliation(s)
- Erpai Luo
  - MOE Key Lab of Bioinformatics and Bioinformatics Division of BNRIST, Department of Automation, Tsinghua University, Beijing 100084, China
- Minsheng Hao
  - MOE Key Lab of Bioinformatics and Bioinformatics Division of BNRIST, Department of Automation, Tsinghua University, Beijing 100084, China
- Lei Wei
  - MOE Key Lab of Bioinformatics and Bioinformatics Division of BNRIST, Department of Automation, Tsinghua University, Beijing 100084, China
- Xuegong Zhang
  - MOE Key Lab of Bioinformatics and Bioinformatics Division of BNRIST, Department of Automation, Tsinghua University, Beijing 100084, China
  - School of Life Sciences and School of Medicine, Center for Synthetic and Systems Biology, Tsinghua University, Beijing 100084, China
3
Guan Q, Yan X, Wu Y, Zhou D, Hu J. Biclustering analysis on tree-shaped time-series single cell gene expression data of Caenorhabditis elegans. BMC Bioinformatics 2024; 25:183. [PMID: 38724908 PMCID: PMC11080145 DOI: 10.1186/s12859-024-05800-y] [Received: 03/16/2024] [Accepted: 05/01/2024] [Indexed: 05/13/2024] [Open access]
Abstract
BACKGROUND: In recent years, gene clustering analysis has become a widely used tool for studying gene functions, efficiently grouping genes with similar expression patterns to help identify their functions. Caenorhabditis elegans is commonly used in embryonic research because of its consistent cell lineage from fertilized egg to adulthood. Biologists use 4D confocal imaging to observe gene expression dynamics at the single-cell level. However, the observed tree-shaped time-series datasets have characteristics such as non-pairwise data points between different individuals. In addition, the influence of cell type heterogeneity should be considered during clustering in order to obtain more biologically significant results. RESULTS: A biclustering model is proposed for tree-shaped single-cell gene expression data of Caenorhabditis elegans. Specifically, a tree-shaped piecewise polynomial function is first employed to fit non-pairwise gene expression time-series data. Then, four factors are considered in the objective function: Pearson correlation coefficients capturing gene correlations, p-values from the Kolmogorov-Smirnov test measuring the similarity between cells, gene expression size, and bicluster overlapping size. Finally, a genetic algorithm is used to optimize the function. CONCLUSION: The results on the small-scale dataset validate the feasibility and effectiveness of our model and are superior to those of existing classical biclustering models. In addition, gene enrichment analysis is employed to assess the results on the complete real dataset, confirming that the discovered biclusters hold significant biological relevance.
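Two of the four objective-function factors listed in this abstract can be illustrated with a dependency-free sketch: the mean pairwise Pearson correlation among the genes of a candidate bicluster, and a two-sample Kolmogorov-Smirnov comparison between cells. This is not the authors' objective function. As a simplification, the sketch returns the KS statistic rather than the p-value the abstract refers to, and all function names are hypothetical.

```python
from itertools import combinations
from math import sqrt
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def mean_pairwise_correlation(gene_profiles: Sequence[Sequence[float]]) -> float:
    """Average Pearson correlation over all gene pairs in a candidate bicluster."""
    pairs = list(combinations(gene_profiles, 2))
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)


def ks_statistic(sample_a: Sequence[float], sample_b: Sequence[float]) -> float:
    """Two-sample KS statistic: largest gap between the empirical CDFs.

    0 means identical empirical distributions, 1 means fully separated samples.
    """
    def ecdf(sample: Sequence[float], value: float) -> float:
        return sum(1 for s in sample if s <= value) / len(sample)

    points = sorted(set(sample_a) | set(sample_b))
    return max(abs(ecdf(sample_a, v) - ecdf(sample_b, v)) for v in points)


if __name__ == "__main__":
    genes = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 2.0, 1.0]]
    print(mean_pairwise_correlation(genes))
    print(ks_statistic([1.0, 2.0], [3.0, 4.0]))
```

A full objective would also need the gene expression size and bicluster overlap terms and a rule for weighting the four factors, none of which are specified in the abstract.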
Affiliation(s)
- Qi Guan
  - School of Mathematical Sciences, Xiamen University, Xiamen, 361005, Fujian, China
- Xianzhong Yan
  - School of Mathematical Sciences, Xiamen University, Xiamen, 361005, Fujian, China
- Yida Wu
  - School of Mathematical Sciences, Xiamen University, Xiamen, 361005, Fujian, China
- Da Zhou
  - School of Mathematical Sciences, Xiamen University, Xiamen, 361005, Fujian, China
- Jie Hu
  - School of Mathematical Sciences, Xiamen University, Xiamen, 361005, Fujian, China
4
Vallevik VB, Babic A, Marshall SE, Elvatun S, Brøgger HMB, Alagaratnam S, Edwin B, Veeraragavan NR, Befring AK, Nygård JF. Can I trust my fake data - A comprehensive quality assessment framework for synthetic tabular data in healthcare. Int J Med Inform 2024; 185:105413. [PMID: 38493547 DOI: 10.1016/j.ijmedinf.2024.105413] [Received: 12/15/2023] [Revised: 02/17/2024] [Accepted: 03/11/2024] [Indexed: 03/19/2024]
Abstract
BACKGROUND: Ensuring safe adoption of AI tools in healthcare hinges on access to sufficient data for training, testing and validation. Synthetic data has been suggested in response to privacy concerns and regulatory requirements and can be created by training a generator on real data to produce a dataset with similar statistical properties. Competing metrics with differing taxonomies for quality evaluation have been proposed, resulting in a complex landscape. Optimising quality entails balancing considerations that make the data fit for use, yet relevant dimensions are left out of existing frameworks. METHOD: We performed a comprehensive literature review on the use of quality evaluation metrics on synthetic data within the scope of synthetic tabular healthcare data using deep generative methods. Based on this and the collective team experiences, we developed a conceptual framework for quality assurance. The applicability was benchmarked against a practical case from the Dutch National Cancer Registry. CONCLUSION: We present a conceptual framework for quality assurance of synthetic data for AI applications in healthcare that aligns diverging taxonomies, expands on common quality dimensions to include the dimensions of Fairness and Carbon footprint, and proposes stages necessary to support real-life applications. Building trust in synthetic data by increasing transparency and reducing the safety risk will accelerate the development and uptake of trustworthy AI tools for the benefit of patients. DISCUSSION: Despite the growing emphasis on algorithmic fairness and carbon footprint, these metrics were scarce in the literature review. The overwhelming focus was on statistical similarity using distance metrics while sequential logic detection was scarce. A consensus-backed framework that includes all relevant quality dimensions can provide assurance for safe and responsible real-life applications of synthetic data. As the choice of appropriate metrics is highly context dependent, further research is needed on validation studies to guide metric choices and support the development of technical standards.
Affiliation(s)
- Vibeke Binz Vallevik
  - University of Oslo, Boks 1072 Blindern, NO-0316 Oslo, Norway; DNV AS, Veritasveien 1, 1322 Høvik, Norway
- Severin Elvatun
  - Cancer Registry of Norway, Ullernchausseen 64, 0379 Oslo, Norway
- Helga M B Brøgger
  - DNV AS, Veritasveien 1, 1322 Høvik, Norway; Oslo University Hospital, Sognsvannsveien 20, 0372 Oslo, Norway
- Bjørn Edwin
  - University of Oslo, Boks 1072 Blindern, NO-0316 Oslo, Norway; The Intervention Centre and Department of HPB Surgery, Oslo University Hospital and Institute of Clinical Medicine, Faculty of Medicine, University of Oslo, Oslo, Norway
- Jan F Nygård
  - Cancer Registry of Norway, Ullernchausseen 64, 0379 Oslo, Norway; UiT - The Arctic University of Norway, Tromsø, Norway