1
Cui X, Yin Q, Gao Z, Li Z, Chen X, Lv H, Chen S, Liu Q, Zeng W, Jiang R. CREATE: cell-type-specific cis-regulatory element identification via discrete embedding. Nat Commun 2025; 16:4607. [PMID: 40382355 PMCID: PMC12085597 DOI: 10.1038/s41467-025-59780-5] [Received: 09/10/2024] [Accepted: 05/02/2025] [Indexed: 05/20/2025] Open
Abstract
Cis-regulatory elements (CREs), including enhancers, silencers, promoters and insulators, play pivotal roles in orchestrating gene regulatory mechanisms that drive complex biological traits. However, current approaches for CRE identification are predominantly sequence-based and typically focus on individual CRE types, limiting insights into their cell-type-specific functions and regulatory dynamics. Here, we present CREATE, a multimodal deep learning framework based on Vector Quantized Variational AutoEncoder, tailored for comprehensive CRE identification and characterization. CREATE integrates genomic sequences, chromatin accessibility, and chromatin interaction data to generate discrete CRE embeddings, enabling accurate multi-class classification and robust characterization of CREs. CREATE excels in identifying cell-type-specific CREs, and provides quantitative and interpretable insights into CRE-specific features, uncovering the underlying regulatory codes. By facilitating large-scale prediction of CREs in specific cell types, CREATE enhances the recognition of disease- or phenotype-associated biological variabilities of CREs, thus advancing our understanding of gene regulatory landscapes and their roles in health and disease.
Affiliation(s)
- Xuejian Cui
  - Ministry of Education Key Laboratory of Bioinformatics, Bioinformatics Division at the Beijing National Research Center for Information Science and Technology, Center for Synthetic and Systems Biology, Department of Automation, Tsinghua University, Beijing, China
- Qijin Yin
  - Ministry of Education Key Laboratory of Bioinformatics, Bioinformatics Division at the Beijing National Research Center for Information Science and Technology, Center for Synthetic and Systems Biology, Department of Automation, Tsinghua University, Beijing, China
- Zijing Gao
  - Ministry of Education Key Laboratory of Bioinformatics, Bioinformatics Division at the Beijing National Research Center for Information Science and Technology, Center for Synthetic and Systems Biology, Department of Automation, Tsinghua University, Beijing, China
- Zhen Li
  - Ministry of Education Key Laboratory of Bioinformatics, Bioinformatics Division at the Beijing National Research Center for Information Science and Technology, Center for Synthetic and Systems Biology, Department of Automation, Tsinghua University, Beijing, China
- Xiaoyang Chen
  - Ministry of Education Key Laboratory of Bioinformatics, Bioinformatics Division at the Beijing National Research Center for Information Science and Technology, Center for Synthetic and Systems Biology, Department of Automation, Tsinghua University, Beijing, China
- Hairong Lv
  - Ministry of Education Key Laboratory of Bioinformatics, Bioinformatics Division at the Beijing National Research Center for Information Science and Technology, Center for Synthetic and Systems Biology, Department of Automation, Tsinghua University, Beijing, China
- Shengquan Chen
  - School of Mathematical Sciences and LPMC, Nankai University, Tianjin, China
- Qiao Liu
  - Department of Statistics, Stanford University, Stanford, CA, USA
- Wanwen Zeng
  - Department of Statistics, Stanford University, Stanford, CA, USA
- Rui Jiang
  - Ministry of Education Key Laboratory of Bioinformatics, Bioinformatics Division at the Beijing National Research Center for Information Science and Technology, Center for Synthetic and Systems Biology, Department of Automation, Tsinghua University, Beijing, China
2
Li S, Hua H, Chen S. Graph neural networks for single-cell omics data: a review of approaches and applications. Brief Bioinform 2025; 26:bbaf109. [PMID: 40091193 PMCID: PMC11911123 DOI: 10.1093/bib/bbaf109] [Received: 12/04/2024] [Revised: 02/09/2025] [Accepted: 02/25/2025] [Indexed: 03/19/2025] Open
Abstract
Rapid advancement of sequencing technologies now allows for the utilization of precise signals at single-cell resolution in various omics studies. However, the massive volume, ultra-high dimensionality, and high sparsity of single-cell data have introduced substantial difficulties to traditional computational methods. The intricate non-Euclidean networks of intracellular and intercellular signaling molecules within single-cell datasets, coupled with the complex, multimodal structures arising from multi-omics joint analysis, pose significant challenges to conventional deep learning operations reliant on Euclidean geometries. Graph neural networks (GNNs) have extended deep learning to non-Euclidean data, allowing cells and their features in single-cell datasets to be modeled as nodes within a graph structure. GNNs have been successfully applied across a broad range of tasks in single-cell data analysis. In this survey, we systematically review 107 successful applications of GNNs and their six variants in various single-cell omics tasks. We begin by outlining the fundamental principles of GNNs and their six variants, followed by a systematic review of GNN-based models applied in single-cell epigenomics, transcriptomics, spatial transcriptomics, proteomics, and multi-omics. In each section dedicated to a specific omics type, we summarize the publicly available single-cell datasets commonly utilized in the articles reviewed in that section, totaling 77 datasets. Finally, we summarize the potential shortcomings of current research and explore directions for future studies. We anticipate that this review will serve as a guiding resource for researchers to deepen the application of GNNs in single-cell omics.
Affiliation(s)
- Sijie Li
  - School of Mathematical Sciences and The Key Laboratory of Pure Mathematics and Combinatorics, Ministry of Education (LPMC), Nankai University, No. 94 Weijin Road, Nankai District, Tianjin 300071, China
- Heyang Hua
  - School of Mathematical Sciences and The Key Laboratory of Pure Mathematics and Combinatorics, Ministry of Education (LPMC), Nankai University, No. 94 Weijin Road, Nankai District, Tianjin 300071, China
- Shengquan Chen
  - School of Mathematical Sciences and The Key Laboratory of Pure Mathematics and Combinatorics, Ministry of Education (LPMC), Nankai University, No. 94 Weijin Road, Nankai District, Tianjin 300071, China
3
Cui X, Chen X, Li Z, Gao Z, Chen S, Jiang R. Discrete latent embedding of single-cell chromatin accessibility sequencing data for uncovering cell heterogeneity. Nat Comput Sci 2024; 4:346-359. [PMID: 38730185 DOI: 10.1038/s43588-024-00625-4] [Received: 09/25/2023] [Accepted: 04/05/2024] [Indexed: 05/12/2024]
Abstract
Single-cell epigenomic data have been growing continuously at an unprecedented pace, but their characteristics, such as high dimensionality and sparsity, pose substantial challenges to downstream analysis. Although deep learning models, especially variational autoencoders, have been widely used to capture low-dimensional feature embeddings, the prevalent Gaussian assumption somewhat disagrees with real data, and these models tend to struggle to incorporate reference information from abundant cell atlases. Here we propose CASTLE, a deep generative model based on the vector-quantized variational autoencoder framework to extract discrete latent embeddings that interpretably characterize single-cell chromatin accessibility sequencing data. We validate the performance and robustness of CASTLE for accurate cell-type identification and reasonable visualization compared with state-of-the-art methods. We demonstrate the advantages of CASTLE for effective incorporation of existing massive reference datasets in a weakly supervised or supervised manner. We further demonstrate CASTLE's capacity for intuitively distilling cell-type-specific feature spectra that unveil cell heterogeneity and biological implications quantitatively.
Affiliation(s)
- Xuejian Cui
  - Ministry of Education Key Laboratory of Bioinformatics, Bioinformatics Division at the Beijing National Research Center for Information Science and Technology, Center for Synthetic and Systems Biology, Department of Automation, Tsinghua University, Beijing, China
- Xiaoyang Chen
  - Ministry of Education Key Laboratory of Bioinformatics, Bioinformatics Division at the Beijing National Research Center for Information Science and Technology, Center for Synthetic and Systems Biology, Department of Automation, Tsinghua University, Beijing, China
- Zhen Li
  - Ministry of Education Key Laboratory of Bioinformatics, Bioinformatics Division at the Beijing National Research Center for Information Science and Technology, Center for Synthetic and Systems Biology, Department of Automation, Tsinghua University, Beijing, China
- Zijing Gao
  - Ministry of Education Key Laboratory of Bioinformatics, Bioinformatics Division at the Beijing National Research Center for Information Science and Technology, Center for Synthetic and Systems Biology, Department of Automation, Tsinghua University, Beijing, China
- Shengquan Chen
  - School of Mathematical Sciences and LPMC, Nankai University, Tianjin, China
- Rui Jiang
  - Ministry of Education Key Laboratory of Bioinformatics, Bioinformatics Division at the Beijing National Research Center for Information Science and Technology, Center for Synthetic and Systems Biology, Department of Automation, Tsinghua University, Beijing, China
4
Li S, Li Y, Sun Y, Li Y, Chen X, Tang S, Chen S. EpiCarousel: memory- and time-efficient identification of metacells for atlas-level single-cell chromatin accessibility data. Bioinformatics 2024; 40:btae191. [PMID: 38588573 PMCID: PMC11037479 DOI: 10.1093/bioinformatics/btae191] [Received: 08/28/2023] [Revised: 03/02/2024] [Accepted: 04/05/2024] [Indexed: 04/10/2024] Open
Abstract
SUMMARY: Recent technical advancements in single-cell chromatin accessibility sequencing (scCAS) have brought new insights to the characterization of epigenetic heterogeneity. As single-cell genomics experiments scale up to hundreds of thousands of cells, the demand for computational resources for downstream analysis grows intractably large and exceeds the capabilities of most researchers. Here, we propose EpiCarousel, a tailored Python package based on lazy loading, parallel processing, and community detection for memory- and time-efficient identification of metacells, i.e. the emergence of homogenous cells, in large-scale scCAS data. Through comprehensive experiments on five datasets of various protocols, sample sizes, dimensions, number of cell types, and degrees of cell-type imbalance, EpiCarousel outperformed baseline methods in systematic evaluation of memory usage, computational time, and multiple downstream analyses including cell type identification. Moreover, EpiCarousel executes preprocessing and downstream cell clustering on the atlas-level dataset with 707 043 cells and 1 154 611 peaks within 2 h, consuming <75 GB of RAM, and provides better performance in characterizing cell heterogeneity than state-of-the-art methods. AVAILABILITY AND IMPLEMENTATION: The EpiCarousel software is well-documented and freely available at https://github.com/biox-nku/epicarousel. It can be seamlessly interoperated with extensive scCAS analysis toolkits.
Affiliation(s)
- Sijie Li
  - School of Mathematical Sciences and LPMC, Nankai University, Tianjin 300071, China
- Yuxi Li
  - School of Mathematical Sciences and LPMC, Nankai University, Tianjin 300071, China
- Yu Sun
  - Institute of Health Service and Transfusion Medicine, Beijing 100850, China
- Yaru Li
  - Institute of Health Service and Transfusion Medicine, Beijing 100850, China
- Xiaoyang Chen
  - MOE Key Laboratory of Bioinformatics and Bioinformatics Division of BNRIST, Department of Automation, Tsinghua University, Beijing 100084, China
- Songming Tang
  - School of Mathematical Sciences and LPMC, Nankai University, Tianjin 300071, China
- Shengquan Chen
  - School of Mathematical Sciences and LPMC, Nankai University, Tianjin 300071, China