1
Jiang S, Wang C, Sun Q, Zhang Z. A robust multi-scale clustering framework for single-cell RNA-seq data analysis. Sci Rep 2025; 15:18543. [PMID: 40425750 PMCID: PMC12116994 DOI: 10.1038/s41598-025-03603-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/11/2025] [Accepted: 05/21/2025] [Indexed: 05/29/2025] Open
Abstract
Recent advancements in single-cell RNA sequencing (scRNA-seq) technology have unlocked novel opportunities for deep exploration of gene expression patterns. However, the inherent high dimensionality, sparsity, and noise in scRNA-seq data pose significant challenges for existing clustering methods, especially in accurately identifying and classifying diverse cell types. To address these challenges, we introduce a new method, single-cell Multi-Scale Clustering Framework (scMSCF), which combines multi-dimensional PCA for dimensionality reduction, K-means clustering, and a weighted ensemble meta-clustering approach, enhanced by a self-attention-driven Transformer model to optimize clustering performance. scMSCF constructs an initial clustering framework using a multi-layer dimensionality reduction strategy to establish a robust consensus on clustering structure. A voting mechanism within the meta-clustering process selects high-confidence cells from the initial clustering results to provide precise training labels for the Transformer model. This approach enables the model to capture complex dependencies in gene expression data, thereby enhancing clustering accuracy. Comprehensive testing across eight single-cell RNA sequencing datasets demonstrates that scMSCF surpasses existing methods, achieving on average 10-15% higher ARI, NMI, and ACC scores. For example, on the PBMC5k dataset, scMSCF improves ARI from 0.72 to 0.86, demonstrating its ability to accurately identify diverse cell populations. The source code for our algorithm is publicly available at https://github.com/DEREKJ24/scMSCF .
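The voting step described in the abstract above can be illustrated with a minimal sketch. It assumes several clustering runs over the same cells, aligns each run to the first one by greedy label overlap, and keeps the cells whose majority label reaches a chosen agreement threshold. The function names, the unweighted vote, and the `min_agreement` parameter are illustrative assumptions, not the scMSCF implementation (which uses a weighted ensemble).

```python
from collections import Counter


def align_to_reference(reference, labels):
    """Rename each cluster in `labels` to the reference label it overlaps most.

    Greedy and not guaranteed one-to-one; enough for an illustration.
    """
    overlap = {}
    for ref, lab in zip(reference, labels):
        overlap.setdefault(lab, Counter())[ref] += 1
    mapping = {lab: counts.most_common(1)[0][0] for lab, counts in overlap.items()}
    return [mapping[lab] for lab in labels]


def high_confidence_cells(runs, min_agreement=0.8):
    """Return (consensus labels, indices of cells whose vote share >= min_agreement)."""
    reference = runs[0]
    aligned = [reference] + [align_to_reference(reference, r) for r in runs[1:]]
    consensus, confident = [], []
    for i, votes in enumerate(zip(*aligned)):
        label, count = Counter(votes).most_common(1)[0]
        consensus.append(label)
        if count / len(aligned) >= min_agreement:
            confident.append(i)
    return consensus, confident


# Three hypothetical clustering runs over six cells.
runs = [
    [0, 0, 0, 1, 1, 1],
    [1, 1, 1, 0, 0, 0],  # same partition, cluster names swapped
    [0, 0, 1, 1, 1, 1],  # disagrees on the third cell
]
consensus, confident = high_confidence_cells(runs)
print(consensus)  # consensus label per cell
print(confident)  # indices that could serve as training labels
```

Only the confident cells would then be passed on as labeled training data; the remaining cells are left for the downstream classifier to assign.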
Affiliation(s)
- Songrun Jiang
  - College of Computer Science and Technology, Changchun Normal University, Changchun, 130000, China
- Chunyan Wang
  - College of Computer Science and Technology, Changchun Normal University, Changchun, 130000, China
- Qiucheng Sun
  - College of Computer Science and Technology, Changchun Normal University, Changchun, 130000, China
- Zhi Zhang
  - College of Computer Science and Technology, Changchun Normal University, Changchun, 130000, China
2
Liao X, Li Y, Li S, Wen L, Li X, Yu B. Enhanced Integration of Single-Cell Multi-Omics Data Using Graph Attention Networks. ACS Synth Biol 2025; 14:931-942. [PMID: 39888834 DOI: 10.1021/acssynbio.4c00864] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/02/2025]
Abstract
The continuous advancement of single-cell multimodal omics (scMulti-omics) technologies offers unprecedented opportunities to measure various modalities, including RNA expression, protein abundance, gene perturbation, DNA methylation, and chromatin accessibility at single-cell resolution. These advances hold significant potential for breakthroughs by integrating diverse omics modalities. However, the data generated from different omics layers often face challenges due to high dimensionality, heterogeneity, and sparsity, which can adversely impact the accuracy and efficiency of data integration analyses. To address these challenges, we propose a high-precision analysis method called scMGAT (single-cell multiomics data analysis based on multihead graph attention networks). This method effectively coordinates reliable information across multiomics data sets using a multihead attention mechanism, allowing for better management of the heterogeneous characteristics inherent in scMulti-omics data. We evaluated scMGAT's performance on eight sets of real scMulti-omics data, including samples from both human and mouse. The experimental results demonstrate that scMGAT significantly enhances the quality of multiomics data and improves the accuracy of cell-type annotation compared to state-of-the-art methods. scMGAT is now freely accessible at https://github.com/Xingyu-Liao/scMGAT.
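For orientation, the attention computation of a standard multihead graph attention layer can be written as below. This is the textbook formulation rather than a statement of scMGAT's exact architecture, which the abstract does not specify. The symbols are assumptions: $\mathbf{h}_i$ is the feature vector of node (cell) $i$, $\mathcal{N}(i)$ its graph neighbourhood, $\mathbf{W}^{(m)}$ and $\mathbf{a}^{(m)}$ the learnable parameters of head $m$, and $\Vert$ denotes concatenation.

```latex
e_{ij}^{(m)} = \mathrm{LeakyReLU}\!\left( \mathbf{a}^{(m)\top}
    \left[ \mathbf{W}^{(m)} \mathbf{h}_i \,\Vert\, \mathbf{W}^{(m)} \mathbf{h}_j \right] \right),
\qquad
\alpha_{ij}^{(m)} = \frac{\exp\!\left(e_{ij}^{(m)}\right)}
                         {\sum_{k \in \mathcal{N}(i)} \exp\!\left(e_{ik}^{(m)}\right)},
\qquad
\mathbf{h}_i' = \Big\Vert_{m=1}^{M} \, \sigma\!\Big( \sum_{j \in \mathcal{N}(i)}
    \alpha_{ij}^{(m)} \mathbf{W}^{(m)} \mathbf{h}_j \Big)
```

Each head learns its own weighting of a cell's neighbours, so less reliable neighbours can receive small coefficients $\alpha_{ij}^{(m)}$.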
Affiliation(s)
- Xingyu Liao
  - School of Computer Science, Northwestern Polytechnical University (NPU), Chang'an Campus, Xi'an, Shaanxi 710072, P.R. China
- Yanyan Li
  - School of Computer Science, Northwestern Polytechnical University (NPU), Chang'an Campus, Xi'an, Shaanxi 710072, P.R. China
- Shuangyi Li
  - School of Data Science, Qingdao University of Science and Technology, Qingdao 266061, P.R. China
- Long Wen
  - School of Computer Science, Northwestern Polytechnical University (NPU), Chang'an Campus, Xi'an, Shaanxi 710072, P.R. China
- Xingyi Li
  - School of Computer Science, Northwestern Polytechnical University (NPU), Chang'an Campus, Xi'an, Shaanxi 710072, P.R. China
- Bin Yu
  - School of Data Science, Qingdao University of Science and Technology, Qingdao 266061, P.R. China
3
Chen J, Sun Q, Wang C, Gao C. scCCTR: An iterative selection-based semi-supervised clustering model for single-cell RNA-seq data. Comput Struct Biotechnol J 2025; 27:1090-1102. [PMID: 40165824 PMCID: PMC11957811 DOI: 10.1016/j.csbj.2025.03.018] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/14/2024] [Revised: 02/28/2025] [Accepted: 03/10/2025] [Indexed: 04/02/2025] Open
Abstract
Single-cell RNA sequencing (scRNA-seq) enables the analysis of the genome, transcriptome, and epigenome at the single-cell level, providing a critical tool for understanding cellular heterogeneity and diversity. Cell clustering, a key step in scRNA-seq data analysis, reveals population structure by grouping cells with similar expression patterns. However, due to the high dimensionality and sparsity of scRNA-seq data, the performance of existing clustering algorithms remains suboptimal. In this study, we propose a novel clustering algorithm, scCCTR, which performs semi-supervised classification by guiding a deep learning model through iterative selection of high-confidence cells and labels. The algorithm consists of two main components: an iterative selection module and a semi-supervised classification module. In the iterative selection module, scCCTR progressively selects high-confidence cells that exhibit core group features and iteratively optimizes feature representations, constructing a consensus clustering result throughout the iterations. In the semi-supervised classification module, scCCTR uses the selected core data to train a Transformer neural network, which leverages a multi-head attention mechanism to focus on critical information, thereby achieving higher clustering precision. We compared scCCTR with several established cell clustering methods on real datasets, and the results demonstrate that scCCTR outperforms existing methods in terms of accuracy and effectiveness for both cell clustering and visualization. (The code of scCCTR is freely available for academic use at https://github.com/chenjiejie387/scCCTR.)
Affiliation(s)
- Jie Chen
  - School of Computer Science and Technology, Changchun Normal University, Changchun, 130032, China
- Qiucheng Sun
  - School of Computer Science and Technology, Changchun Normal University, Changchun, 130032, China
- Chunyan Wang
  - School of Computer Science and Technology, Changchun Normal University, Changchun, 130032, China
- Changbo Gao
  - School of Computer Science and Technology, Changchun Normal University, Changchun, 130032, China
4
Hackenberg M, Brunn N, Vogel T, Binder H. Infusing structural assumptions into dimensionality reduction for single-cell RNA sequencing data to identify small gene sets. Commun Biol 2025; 8:414. [PMID: 40069486 PMCID: PMC11897155 DOI: 10.1038/s42003-025-07872-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/15/2024] [Accepted: 03/03/2025] [Indexed: 03/15/2025] Open
Abstract
Dimensionality reduction greatly facilitates the exploration of cellular heterogeneity in single-cell RNA sequencing data. While most of such approaches are data-driven, it can be useful to incorporate biologically plausible assumptions about the underlying structure or the experimental design. We propose the boosting autoencoder (BAE) approach, which combines the advantages of unsupervised deep learning for dimensionality reduction and boosting for formalizing assumptions. Specifically, our approach selects small sets of genes that explain latent dimensions. As illustrative applications, we explore the diversity of neural cell identities and temporal patterns of embryonic development.
Grants
- Deutsche Forschungsgemeinschaft (DFG, German Research Foundation): Project-ID 322977937, GRK 2344
- Deutsche Forschungsgemeinschaft (DFG, German Research Foundation): Project-ID 499552394, SFB 1597
Affiliation(s)
- Maren Hackenberg
  - Institute of Medical Biometry and Statistics (IMBI), Faculty of Medicine and Medical Center, University of Freiburg, Freiburg, Germany
  - Freiburg Center for Data Analysis, Modeling and AI, University of Freiburg, Freiburg, Germany
- Niklas Brunn
  - Institute of Medical Biometry and Statistics (IMBI), Faculty of Medicine and Medical Center, University of Freiburg, Freiburg, Germany
  - Freiburg Center for Data Analysis, Modeling and AI, University of Freiburg, Freiburg, Germany
- Tanja Vogel
  - Institute of Anatomy and Cell Biology, Department Molecular Embryology, Faculty of Medicine, University of Freiburg, Freiburg, Germany
- Harald Binder
  - Institute of Medical Biometry and Statistics (IMBI), Faculty of Medicine and Medical Center, University of Freiburg, Freiburg, Germany
  - Freiburg Center for Data Analysis, Modeling and AI, University of Freiburg, Freiburg, Germany
  - Centre for Integrative Biological Signaling Studies (CIBSS), University of Freiburg, Freiburg, Germany
5
Li S, Hua H, Chen S. Graph neural networks for single-cell omics data: a review of approaches and applications. Brief Bioinform 2025; 26:bbaf109. [PMID: 40091193 PMCID: PMC11911123 DOI: 10.1093/bib/bbaf109] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/04/2024] [Revised: 02/09/2025] [Accepted: 02/25/2025] [Indexed: 03/19/2025] Open
Abstract
Rapid advancement of sequencing technologies now allows for the utilization of precise signals at single-cell resolution in various omics studies. However, the massive volume, ultra-high dimensionality, and high sparsity nature of single-cell data have introduced substantial difficulties to traditional computational methods. The intricate non-Euclidean networks of intracellular and intercellular signaling molecules within single-cell datasets, coupled with the complex, multimodal structures arising from multi-omics joint analysis, pose significant challenges to conventional deep learning operations reliant on Euclidean geometries. Graph neural networks (GNNs) have extended deep learning to non-Euclidean data, allowing cells and their features in single-cell datasets to be modeled as nodes within a graph structure. GNNs have been successfully applied across a broad range of tasks in single-cell data analysis. In this survey, we systematically review 107 successful applications of GNNs and their six variants in various single-cell omics tasks. We begin by outlining the fundamental principles of GNNs and their six variants, followed by a systematic review of GNN-based models applied in single-cell epigenomics, transcriptomics, spatial transcriptomics, proteomics, and multi-omics. In each section dedicated to a specific omics type, we have summarized the publicly available single-cell datasets commonly utilized in the articles reviewed in that section, totaling 77 datasets. Finally, we summarize the potential shortcomings of current research and explore directions for future studies. We anticipate that this review will serve as a guiding resource for researchers to deepen the application of GNNs in single-cell omics.
Affiliation(s)
- Sijie Li
  - School of Mathematical Sciences and The Key Laboratory of Pure Mathematics and Combinatorics, Ministry of Education (LPMC), Nankai University, No. 94 Weijin Road, Nankai District, Tianjin 300071, China
- Heyang Hua
  - School of Mathematical Sciences and The Key Laboratory of Pure Mathematics and Combinatorics, Ministry of Education (LPMC), Nankai University, No. 94 Weijin Road, Nankai District, Tianjin 300071, China
- Shengquan Chen
  - School of Mathematical Sciences and The Key Laboratory of Pure Mathematics and Combinatorics, Ministry of Education (LPMC), Nankai University, No. 94 Weijin Road, Nankai District, Tianjin 300071, China
6
Tian S, Ji C, Ni J, Wang Y, Zheng C. Using Multi-Encoder Semi-Implicit Graph Variational Autoencoder to Analyze Single-Cell RNA Sequencing Data. IEEE/ACM Trans Comput Biol Bioinform 2024; 21:2280-2291. [PMID: 39255084 DOI: 10.1109/tcbb.2024.3458170] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/12/2024]
Abstract
Rapid advances in single-cell RNA sequencing (scRNA-seq) have made it possible to characterize cell states at high resolution in large-scale libraries. scRNA-seq data contain a great deal of biological information, which is mainly used to discover cell subtypes and track cell development. However, traditional methods face many challenges in handling high-dimensional, highly sparse scRNA-seq data. To better analyze scRNA-seq data, we propose a new framework called MSVGAE, based on a variational graph autoencoder and graph attention networks. Specifically, we introduce multiple encoders to learn features at different scales and control for uninformative features. Moreover, different noises are added to the encoders to promote the propagation of graph structural information and distribution uncertainty, so that our model can capture some complex posterior distributions. MSVGAE maps high-dimensional, noisy scRNA-seq data into a low-dimensional latent space, which benefits downstream tasks. In particular, MSVGAE can handle extremely sparse data. For the experiments, we created 24 simulated datasets covering various biological scenarios and collected 8 real-world datasets. The results of clustering, visualization, and marker gene analysis indicate that the MSVGAE model has excellent accuracy and robustness in analyzing scRNA-seq data.
7
Liu L, Wu X, Yu J, Zhang Y, Niu K, Yu A. scVGATAE: A Variational Graph Attentional Autoencoder Model for Clustering Single-Cell RNA-seq Data. Biology 2024; 13:713. [PMID: 39336140 PMCID: PMC11428844 DOI: 10.3390/biology13090713] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/02/2024] [Revised: 09/06/2024] [Accepted: 09/07/2024] [Indexed: 09/30/2024]
Abstract
Single-cell RNA sequencing (scRNA-seq) is now a successful technology for identifying cell heterogeneity, revealing new cell subpopulations, and predicting developmental trajectories. A crucial component in scRNA-seq is the precise identification of cell subsets. Although many unsupervised clustering methods have been developed for clustering cell subpopulations, the performance of these methods is prone to be affected by dropout, high dimensionality, and technical noise. Additionally, most existing methods are time-consuming and fail to fully consider the potential correlations between cells. In this paper, we propose a novel unsupervised clustering method called scVGATAE (Single-cell Variational Graph Attention Autoencoder) for scRNA-seq data. This method constructs a reliable cell graph through network denoising, utilizes a novel variational graph autoencoder model integrated with graph attention networks to aggregate neighbor information and learn the distribution of the low-dimensional representations of cells, and adaptively determines the model training iterations for various datasets. Finally, the obtained low-dimensional representations of cells are clustered using kmeans. Experiments on nine public datasets show that scVGATAE outperforms classical and state-of-the-art clustering methods.
Affiliation(s)
- Lijun Liu
  - School of Science, Dalian Minzu University, Dalian 116600, China
- Xiaoyang Wu
  - School of Science, Dalian Minzu University, Dalian 116600, China
- Jun Yu
  - School of Science, Dalian Minzu University, Dalian 116600, China
- Yuduo Zhang
  - School of Science, Dalian Minzu University, Dalian 116600, China
- Kaixing Niu
  - School of Science, Dalian Minzu University, Dalian 116600, China
- Anli Yu
  - School of Science, Dalian Minzu University, Dalian 116600, China
8
Xu L, Li Z, Ren J, Liu S, Xu Y. Single-cell RNA sequencing data analysis utilizing multi-type graph neural networks. Comput Biol Med 2024; 179:108921. [PMID: 39059210 DOI: 10.1016/j.compbiomed.2024.108921] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/05/2024] [Revised: 07/08/2024] [Accepted: 07/16/2024] [Indexed: 07/28/2024]
Abstract
Single-cell RNA sequencing (scRNA-seq) sequences individual cells, whose expression profiles reflect the overall characteristics of each cell, facilitating research at the cellular level. However, challenges such as dimensionality reduction of massive data, technical noise, and visualization of cell-type clustering make scRNA-seq data difficult to analyze and process. In this paper, we propose a new single-cell data analysis model using a denoising autoencoder and multi-type graph neural networks (scDMG), which learns cell-cell topology information and a latent representation of scRNA-seq data. scDMG introduces the zero-inflated negative binomial (ZINB) model into a denoising autoencoder (DAE) to perform dimensionality reduction and denoising on the raw data. scDMG integrates multiple types of graph neural networks as the encoder to further train on the preprocessed data, which better handles various types of scRNA-seq datasets, resolves dropout events, and enables preliminary classification of scRNA-seq data. By applying the t-SNE and PCA algorithms to the trained data and invoking the Louvain algorithm, scDMG achieves better dimensionality reduction and clustering. Compared with other mainstream scRNA-seq clustering algorithms, scDMG outperforms state-of-the-art methods on various clustering performance metrics and shows better scalability and shorter runtime.
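The zero-inflated negative binomial (ZINB) model mentioned in the abstract above mixes a point mass at zero (dropout) with a negative binomial count distribution. A minimal sketch of its probability mass function and negative log-likelihood follows, assuming the common mean/dispersion parameterization with `mu` (mean), `theta` (dispersion) and `pi` (dropout probability). These names are illustrative and shared across all counts here; in an autoencoder they would typically be per-gene, per-cell outputs of the decoder.

```python
import math


def nb_log_pmf(x, mu, theta):
    """Log-probability of count x under a negative binomial (mean mu, dispersion theta)."""
    return (
        math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1)
        + theta * math.log(theta / (theta + mu))
        + x * math.log(mu / (theta + mu))
    )


def zinb_pmf(x, mu, theta, pi):
    """ZINB probability: a structural zero with probability pi, else a negative binomial draw."""
    nb = math.exp(nb_log_pmf(x, mu, theta))
    zero_mass = pi if x == 0 else 0.0
    return zero_mass + (1.0 - pi) * nb


def zinb_nll(counts, mu, theta, pi):
    """Negative log-likelihood of observed counts, usable as a reconstruction loss."""
    return -sum(math.log(zinb_pmf(x, mu, theta, pi)) for x in counts)


# Hypothetical expression counts for one gene across eight cells.
counts = [0, 0, 3, 0, 1, 7, 0, 2]
print(zinb_nll(counts, mu=2.0, theta=1.5, pi=0.3))
```

Because zeros can come from either mixture component, the zero-inflated model assigns more probability to zero than the plain negative binomial with the same `mu` and `theta`, which is what lets it absorb dropout events.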
Affiliation(s)
- Li Xu
  - College of Computer Science and Technology, Harbin Engineering University, Harbin, 150001, Heilongjiang, China
- Zhenpeng Li
  - College of Computer Science and Technology, Harbin Engineering University, Harbin, 150001, Heilongjiang, China
- Jiaxu Ren
  - College of Computer Science and Technology, Harbin Engineering University, Harbin, 150001, Heilongjiang, China
- Shuaipeng Liu
  - College of Computer Science and Technology, Harbin Engineering University, Harbin, 150001, Heilongjiang, China
- Yiming Xu
  - College of Engineering, Tokyo Institute of Technology, Tokyo, 226-0026, Tokyo, Japan
9
Chen H, Ryu J, Vinyard ME, Lerer A, Pinello L. SIMBA: single-cell embedding along with features. Nat Methods 2024; 21:1003-1013. [PMID: 37248389 PMCID: PMC11166568 DOI: 10.1038/s41592-023-01899-8] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Received: 04/04/2022] [Accepted: 04/26/2023] [Indexed: 05/31/2023]
Abstract
Most current single-cell analysis pipelines are limited to cell embeddings and rely heavily on clustering, while lacking the ability to explicitly model interactions between different feature types. Furthermore, these methods are tailored to specific tasks, as distinct single-cell problems are formulated differently. To address these shortcomings, here we present SIMBA, a graph embedding method that jointly embeds single cells and their defining features, such as genes, chromatin-accessible regions and DNA sequences, into a common latent space. By leveraging the co-embedding of cells and features, SIMBA allows for the study of cellular heterogeneity, clustering-free marker discovery, gene regulation inference, batch effect removal and omics data integration. We show that SIMBA provides a single framework that allows diverse single-cell problems to be formulated in a unified way and thus simplifies the development of new analyses and extension to new single-cell modalities. SIMBA is implemented as a comprehensive Python library ( https://simba-bio.readthedocs.io ).
Affiliation(s)
- Huidong Chen
  - Molecular Pathology Unit, Center for Cancer Research, Massachusetts General Hospital, Boston, MA, USA
  - Department of Pathology, Harvard Medical School, Boston, MA, USA
  - Broad Institute of Harvard and MIT, Cambridge, MA, USA
- Jayoung Ryu
  - Molecular Pathology Unit, Center for Cancer Research, Massachusetts General Hospital, Boston, MA, USA
  - Department of Pathology, Harvard Medical School, Boston, MA, USA
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- Michael E Vinyard
  - Molecular Pathology Unit, Center for Cancer Research, Massachusetts General Hospital, Boston, MA, USA
  - Department of Pathology, Harvard Medical School, Boston, MA, USA
  - Broad Institute of Harvard and MIT, Cambridge, MA, USA
  - Department of Chemistry and Chemical Biology, Harvard University, Cambridge, MA, USA
- Adam Lerer
  - Facebook AI Research, New York, NY, USA
- Luca Pinello
  - Molecular Pathology Unit, Center for Cancer Research, Massachusetts General Hospital, Boston, MA, USA
  - Department of Pathology, Harvard Medical School, Boston, MA, USA
  - Broad Institute of Harvard and MIT, Cambridge, MA, USA
10
An S, Shi J, Liu R, Chen Y, Wang J, Hu S, Xia X, Dong G, Bo X, He Z, Ying X. scDAC: deep adaptive clustering of single-cell transcriptomic data with coupled autoencoder and Dirichlet process mixture model. Bioinformatics 2024; 40:btae198. [PMID: 38603616 PMCID: PMC11256937 DOI: 10.1093/bioinformatics/btae198] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/04/2023] [Revised: 03/20/2024] [Accepted: 04/10/2024] [Indexed: 04/13/2024] Open
Abstract
Motivation: Clustering analysis for single-cell RNA sequencing (scRNA-seq) data is an important step in revealing cellular heterogeneity. Many clustering methods have been proposed to discover heterogeneous cell types from scRNA-seq data. However, adaptive clustering with an accurate cluster number that reflects the intrinsic biology of large-scale scRNA-seq data remains quite challenging.
Results: Here, we propose a single-cell Deep Adaptive Clustering (scDAC) model by coupling the Autoencoder (AE) and the Dirichlet Process Mixture Model (DPMM). By jointly optimizing the model parameters of AE and DPMM, scDAC achieves adaptive clustering with accurate cluster numbers on scRNA-seq data. We verify the performance of scDAC on five subsampled datasets with different numbers of cell types and compare it with 15 widely used clustering methods across nine scRNA-seq datasets. Our results demonstrate that scDAC can adaptively find accurate numbers of cell types or subtypes and outperforms other methods. Moreover, the performance of scDAC is robust to hyperparameter changes.
Availability and implementation: scDAC is implemented in Python. The source code is available at https://github.com/labomics/scDAC.
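To give a feel for why a Dirichlet process mixture does not need the cluster number fixed in advance, the sketch below samples cluster assignments from the Chinese restaurant process, one standard construction of the Dirichlet process prior. It illustrates the prior only and is not the scDAC model, which jointly optimizes an autoencoder and the mixture. The function name and the concentration parameter `alpha` are assumptions for illustration.

```python
import random


def chinese_restaurant_process(n_cells, alpha, seed=0):
    """Sample cluster labels for n_cells from a CRP with concentration alpha > 0."""
    rng = random.Random(seed)
    labels, sizes = [], []
    for _ in range(n_cells):
        # An existing cluster k is chosen with weight sizes[k];
        # a brand-new cluster is opened with weight alpha.
        weights = sizes + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(sizes):
            sizes.append(1)
        else:
            sizes[k] += 1
        labels.append(k)
    return labels, sizes


labels, sizes = chinese_restaurant_process(n_cells=200, alpha=1.0)
print(len(sizes), "clusters with sizes", sizes)
```

The number of clusters is an outcome of the sampling rather than an input: larger `alpha` values tend to open more clusters, and in a full mixture model the data likelihood would further shape which clusters survive.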
Affiliation(s)
- Sijing An
  - Center for Computational Biology, Beijing Institute of Basic Medical Sciences, Beijing 100850, China
- Jinhui Shi
  - Center for Computational Biology, Beijing Institute of Basic Medical Sciences, Beijing 100850, China
- Runyan Liu
  - Center for Computational Biology, Beijing Institute of Basic Medical Sciences, Beijing 100850, China
- Yaowen Chen
  - Center for Computational Biology, Beijing Institute of Basic Medical Sciences, Beijing 100850, China
- Jing Wang
  - Center for Computational Biology, Beijing Institute of Basic Medical Sciences, Beijing 100850, China
- Shuofeng Hu
  - Center for Computational Biology, Beijing Institute of Basic Medical Sciences, Beijing 100850, China
- Xinyu Xia
  - Center for Computational Biology, Beijing Institute of Basic Medical Sciences, Beijing 100850, China
- Guohua Dong
  - Center for Computational Biology, Beijing Institute of Basic Medical Sciences, Beijing 100850, China
- Xiaochen Bo
  - Department of Bioinformatics, Institute of Health Service and Transfusion Medicine, Beijing 100850, China
- Zhen He
  - Center for Computational Biology, Beijing Institute of Basic Medical Sciences, Beijing 100850, China
- Xiaomin Ying
  - Center for Computational Biology, Beijing Institute of Basic Medical Sciences, Beijing 100850, China
11
Buterez D, Janet JP, Kiddle SJ, Oglic D, Liò P. Transfer learning with graph neural networks for improved molecular property prediction in the multi-fidelity setting. Nat Commun 2024; 15:1517. [PMID: 38409255 PMCID: PMC11258334 DOI: 10.1038/s41467-024-45566-8] [Citation(s) in RCA: 6] [Impact Index Per Article: 6.0] [Received: 06/28/2023] [Accepted: 01/25/2024] [Indexed: 02/28/2024] Open
Abstract
We investigate the potential of graph neural networks for transfer learning and improving molecular property prediction on sparse and expensive to acquire high-fidelity data by leveraging low-fidelity measurements as an inexpensive proxy for a targeted property of interest. This problem arises in discovery processes that rely on screening funnels for trading off the overall costs against throughput and accuracy. Typically, individual stages in these processes are loosely connected and each one generates data at different scale and fidelity. We consider this setup holistically and demonstrate empirically that existing transfer learning techniques for graph neural networks are generally unable to harness the information from multi-fidelity cascades. Here, we propose several effective transfer learning strategies and study them in transductive and inductive settings. Our analysis involves a collection of more than 28 million unique experimental protein-ligand interactions across 37 targets from drug discovery by high-throughput screening and 12 quantum properties from the dataset QMugs. The results indicate that transfer learning can improve the performance on sparse tasks by up to eight times while using an order of magnitude less high-fidelity training data. Moreover, the proposed methods consistently outperform existing transfer learning strategies for graph-structured data on drug discovery and quantum mechanics datasets.
Affiliation(s)
- David Buterez
  - Department of Computer Science and Technology, University of Cambridge, Cambridge, UK
- Jon Paul Janet
  - Molecular AI, BioPharmaceuticals R&D, AstraZeneca, Gothenburg, Sweden
- Steven J Kiddle
  - Data Science & Advanced Analytics, Data Science & AI, R&D, AstraZeneca, Cambridge, UK
- Dino Oglic
  - Centre for AI, BioPharmaceuticals R&D, AstraZeneca, Cambridge, UK
- Pietro Liò
  - Department of Computer Science and Technology, University of Cambridge, Cambridge, UK
12
Shahir JA, Stanley N, Purvis JE. Cellograph: a semi-supervised approach to analyzing multi-condition single-cell RNA-sequencing data using graph neural networks. BMC Bioinformatics 2024; 25:25. [PMID: 38221640 PMCID: PMC10788980 DOI: 10.1186/s12859-024-05641-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/31/2023] [Accepted: 01/04/2024] [Indexed: 01/16/2024] Open
Abstract
With the growing number of single-cell datasets collected under more complex experimental conditions, there is an opportunity to leverage single-cell variability to reveal deeper insights into how cells respond to perturbations. Many existing approaches rely on discretizing the data into clusters for differential gene expression (DGE), effectively ironing out any information unveiled by the single-cell variability across cell-types. In addition, DGE often assumes a statistical distribution that, if erroneous, can lead to false positive differentially expressed genes. Here, we present Cellograph: a semi-supervised framework that uses graph neural networks to quantify the effects of perturbations at single-cell granularity. Cellograph not only measures how prototypical cells are of each condition but also learns a latent space that is amenable to interpretable data visualization and clustering. The learned gene weight matrix from training reveals pertinent genes driving the differences between conditions. We demonstrate the utility of our approach on publicly-available datasets including cancer drug therapy, stem cell reprogramming, and organoid differentiation. Cellograph outperforms existing methods for quantifying the effects of experimental perturbations and offers a novel framework to analyze single-cell data using deep learning.
Collapse
Affiliation(s)
- Jamshaid A Shahir
- Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
- Curriculum in Bioinformatics and Computational Biology, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
- Computational Medicine Program, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
| | - Natalie Stanley
- Curriculum in Bioinformatics and Computational Biology, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
- Computational Medicine Program, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
- Department of Computer Science, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
| | - Jeremy E Purvis
- Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA.
- Curriculum in Bioinformatics and Computational Biology, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA.
- Computational Medicine Program, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA.
- Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA.
| |
|
13
|
Buterez D, Janet JP, Kiddle SJ, Oglic D, Liò P. Modelling local and general quantum mechanical properties with attention-based pooling. Commun Chem 2023; 6:262. [PMID: 38030692 PMCID: PMC10686994 DOI: 10.1038/s42004-023-01045-7] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 06/08/2023] [Accepted: 10/27/2023] [Indexed: 12/01/2023] Open
Abstract
Atom-centred neural networks represent the state-of-the-art for approximating the quantum chemical properties of molecules, such as internal energies. While the design of machine learning architectures that respect chemical principles has continued to advance, the final atom pooling operation that is necessary to convert from atomic to molecular representations in most models remains relatively undeveloped. The most common choices, sum and average pooling, compute molecular representations that are naturally a good fit for many physical properties, while satisfying properties such as permutation invariance which are desirable from a geometric deep learning perspective. However, there are growing concerns that such simplistic functions might have limited representational power, while also being suboptimal for physical properties that are highly localised or intensive. Based on recent advances in graph representation learning, we investigate the use of a learnable pooling function that leverages an attention mechanism to model interactions between atom representations. The proposed pooling operation is a drop-in replacement requiring no changes to any of the other architectural components. Using SchNet and DimeNet++ as starting models, we demonstrate consistent uplifts in performance compared to sum and mean pooling and a recent physics-aware pooling operation designed specifically for orbital energies, on several datasets, properties, and levels of theory, with up to 85% improvements depending on the specific task.
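The contrast this abstract draws between fixed sum/mean pooling and a learnable attention-based pooling can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the single query vector `q`, the function names, and the random "atom" features are all assumptions made for illustration, and the paper's operator models interactions between atom representations in a richer way. The sketch only shows that a softmax-weighted readout remains permutation invariant while letting individual atoms contribute unequally.

```python
import numpy as np

def sum_pool(H):
    """Sum pooling: molecular vector = sum of atom vectors (extensive)."""
    return H.sum(axis=0)

def mean_pool(H):
    """Mean pooling: molecular vector = average of atom vectors (intensive)."""
    return H.mean(axis=0)

def attention_pool(H, q):
    """Simplified attention pooling (illustrative assumption, not the paper's operator).

    H: (n_atoms, d) atom representations; q: (d,) query vector that would be
    learned in a real model. Returns the pooled vector and the atom weights.
    """
    scores = H @ q / np.sqrt(H.shape[1])      # one scaled score per atom
    scores = scores - scores.max()            # stabilise the softmax
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights @ H, weights

rng = np.random.default_rng(0)
H = rng.normal(size=(5, 4))   # 5 hypothetical atoms, 4 features each
q = rng.normal(size=4)        # hypothetical query vector

pooled, weights = attention_pool(H, q)

# Reordering the atoms must not change any of the pooled representations.
perm = rng.permutation(5)
pooled_perm, _ = attention_pool(H[perm], q)
```

Because the weights depend on the atom features, a localised property can be dominated by a few atoms, which sum and mean pooling cannot express.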
Affiliation(s)
- David Buterez
- Department of Computer Science and Technology, University of Cambridge, Cambridge, CB3 0FD, UK.
| | - Jon Paul Janet
- Molecular AI, Discovery Sciences, R&D, AstraZeneca, Gothenburg, 431 50, Sweden
| | - Steven J Kiddle
- Data Science & Advanced Analytics, Data Science & AI, R&D, AstraZeneca, Cambridge, CB2 8PA, UK
| | - Dino Oglic
- Center for AI, Data Science & AI, R&D, AstraZeneca, Cambridge, CB2 8PA, UK
| | - Pietro Liò
- Department of Computer Science and Technology, University of Cambridge, Cambridge, CB3 0FD, UK
| |
|
14
|
Choi SR, Lee M. Transformer Architecture and Attention Mechanisms in Genome Data Analysis: A Comprehensive Review. Biology 2023; 12:1033. [PMID: 37508462 PMCID: PMC10376273 DOI: 10.3390/biology12071033] [Citation(s) in RCA: 36] [Impact Index Per Article: 18.0] [Received: 06/20/2023] [Revised: 07/18/2023] [Accepted: 07/21/2023] [Indexed: 07/30/2023]
Abstract
The emergence and rapid development of deep learning, specifically transformer-based architectures and attention mechanisms, have had transformative implications across several domains, including bioinformatics and genome data analysis. The analogous nature of genome sequences to language texts has enabled the application of techniques that have exhibited success in fields ranging from natural language processing to genomic data. This review provides a comprehensive analysis of the most recent advancements in the application of transformer architectures and attention mechanisms to genome and transcriptome data. The focus of this review is on the critical evaluation of these techniques, discussing their advantages and limitations in the context of genome data analysis. With the swift pace of development in deep learning methodologies, it becomes vital to continually assess and reflect on the current standing and future direction of the research. Therefore, this review aims to serve as a timely resource for both seasoned researchers and newcomers, offering a panoramic view of the recent advancements and elucidating the state-of-the-art applications in the field. Furthermore, this review paper serves to highlight potential areas of future investigation by critically evaluating studies from 2019 to 2023, thereby acting as a stepping-stone for further research endeavors.
Affiliation(s)
| | - Minhyeok Lee
- School of Electrical and Electronics Engineering, Chung-Ang University, Seoul 06974, Republic of Korea
| |
|
15
|
Davalos OA, Heydari AA, Fertig EJ, Sindi SS, Hoyer KK. Boosting Single-Cell RNA Sequencing Analysis with Simple Neural Attention. bioRxiv: The Preprint Server for Biology 2023:2023.05.29.542760. [PMID: 37398136 PMCID: PMC10312486 DOI: 10.1101/2023.05.29.542760] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/04/2023]
Abstract
A limitation of current deep learning (DL) approaches for single-cell RNA sequencing (scRNAseq) analysis is the lack of interpretability. Moreover, existing pipelines are designed and trained for specific tasks used disjointly for different stages of analysis. We present scANNA, a novel interpretable DL model for scRNAseq studies that leverages neural attention to learn gene associations. After training, the learned gene importance (interpretability) is used to perform downstream analyses (e.g., global marker selection and cell-type classification) without retraining. ScANNA's performance is comparable to or better than state-of-the-art methods designed and trained for specific standard scRNAseq analyses even though scANNA was not trained for these tasks explicitly. ScANNA enables researchers to discover meaningful results without extensive prior knowledge or training separate task-specific models, saving time and enhancing scRNAseq analyses.
Affiliation(s)
- Oscar A. Davalos
- Quantitative and Systems Biology Graduate Program, University of California, Merced, CA, USA
| | - A. Ali Heydari
- Department of Applied Mathematics, University of California, Merced, CA, USA
- Health Sciences Research Institute, University of California, Merced, CA, USA
| | - Elana J. Fertig
- Department of Oncology, Division of Biostatistics and Bioinformatics, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins School of Medicine, Baltimore, MD, USA
| | - Suzanne S. Sindi
- Department of Applied Mathematics, University of California, Merced, CA, USA
- Health Sciences Research Institute, University of California, Merced, CA, USA
| | - Katrina K. Hoyer
- Health Sciences Research Institute, University of California, Merced, CA, USA
- Department of Molecular and Cell Biology, School of Natural Sciences, University of California, Merced, CA, USA
| |
|
16
|
Matsushima A, Pineda SS, Crittenden JR, Lee H, Galani K, Mantero J, Tombaugh G, Kellis M, Heiman M, Graybiel AM. Transcriptional vulnerabilities of striatal neurons in human and rodent models of Huntington's disease. Nat Commun 2023; 14:282. [PMID: 36650127 PMCID: PMC9845362 DOI: 10.1038/s41467-022-35752-x] [Citation(s) in RCA: 33] [Impact Index Per Article: 16.5] [Received: 05/17/2022] [Accepted: 12/23/2022] [Indexed: 01/19/2023] Open
Abstract
Striatal projection neurons (SPNs), which progressively degenerate in human patients with Huntington's disease (HD), are classified along two axes: the canonical direct-indirect pathway division and the striosome-matrix compartmentation. It is well established that the indirect-pathway SPNs are susceptible to neurodegeneration and transcriptomic disturbances, but less is known about how the striosome-matrix axis is compromised in HD in relation to the canonical axis. Here we show, using single-nucleus RNA-sequencing data from male Grade 1 HD patient post-mortem brain samples and male zQ175 and R6/2 mouse models, that the two axes are multiplexed and differentially compromised in HD. In human HD, striosomal indirect-pathway SPNs are the most depleted SPN population. In mouse HD models, the transcriptomic distinctiveness of striosome-matrix SPNs is diminished more than that of direct-indirect pathway SPNs. Furthermore, the loss of striosome-matrix distinction is more prominent within indirect-pathway SPNs. These results open the possibility that the canonical direct-indirect pathway and striosome-matrix compartments are differentially compromised in late and early stages of disease progression, respectively, differentially contributing to the symptoms, thus calling for distinct therapeutic strategies.
Affiliation(s)
- Ayano Matsushima
- McGovern Institute for Brain Research, Massachusetts Institute of Technology, Cambridge, MA, USA
- Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA, USA
| | - Sergio Sebastian Pineda
- Picower Institute for Learning and Memory, Massachusetts Institute of Technology, Cambridge, MA, USA
- Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Department of Electrical Engineering and Computer Science, MIT, Cambridge, MA, USA
- MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA, USA
| | - Jill R Crittenden
- McGovern Institute for Brain Research, Massachusetts Institute of Technology, Cambridge, MA, USA
- Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA, USA
| | - Hyeseung Lee
- Picower Institute for Learning and Memory, Massachusetts Institute of Technology, Cambridge, MA, USA
- Broad Institute of MIT and Harvard, Cambridge, MA, USA
| | - Kyriakitsa Galani
- Broad Institute of MIT and Harvard, Cambridge, MA, USA
- MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA, USA
| | - Julio Mantero
- Broad Institute of MIT and Harvard, Cambridge, MA, USA
- MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA, USA
| | | | - Manolis Kellis
- Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Department of Electrical Engineering and Computer Science, MIT, Cambridge, MA, USA
- MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA, USA
| | - Myriam Heiman
- Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA, USA
- Picower Institute for Learning and Memory, Massachusetts Institute of Technology, Cambridge, MA, USA
| | - Ann M Graybiel
- McGovern Institute for Brain Research, Massachusetts Institute of Technology, Cambridge, MA, USA.
- Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA, USA.
| |
|
17
|
Li MM, Huang K, Zitnik M. Graph representation learning in biomedicine and healthcare. Nat Biomed Eng 2022; 6:1353-1369. [PMID: 36316368 PMCID: PMC10699434 DOI: 10.1038/s41551-022-00942-x] [Citation(s) in RCA: 79] [Impact Index Per Article: 26.3] [Received: 10/30/2021] [Accepted: 08/09/2022] [Indexed: 11/11/2022]
Abstract
Networks, or graphs, are universal descriptors of systems of interacting elements. In biomedicine and healthcare, they can represent, for example, molecular interactions, signalling pathways, disease co-morbidities or healthcare systems. In this Perspective, we posit that representation learning can realize principles of network medicine, discuss successes and current limitations of the use of representation learning on graphs in biomedicine and healthcare, and outline algorithmic strategies that leverage the topology of graphs to embed them into compact vectorial spaces. We argue that graph representation learning will keep pushing forward machine learning for biomedicine and healthcare applications, including the identification of genetic variants underlying complex traits, the disentanglement of single-cell behaviours and their effects on health, the assistance of patients in diagnosis and treatment, and the development of safe and effective medicines.
Affiliation(s)
- Michelle M Li
- Bioinformatics and Integrative Genomics Program, Harvard Medical School, Boston, MA, USA
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
| | - Kexin Huang
- Health Data Science Program, Harvard T.H. Chan School of Public Health, Boston, MA, USA
| | - Marinka Zitnik
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA.
- Broad Institute of MIT and Harvard, Cambridge, MA, USA.
- Harvard Data Science Initiative, Cambridge, MA, USA.
| |
|
18
|
Brendel M, Su C, Bai Z, Zhang H, Elemento O, Wang F. Application of Deep Learning on Single-cell RNA Sequencing Data Analysis: A Review. Genomics, Proteomics & Bioinformatics 2022; 20:814-835. [PMID: 36528240 PMCID: PMC10025684 DOI: 10.1016/j.gpb.2022.11.011] [Citation(s) in RCA: 33] [Impact Index Per Article: 11.0] [Received: 01/23/2022] [Revised: 08/17/2022] [Accepted: 11/24/2022] [Indexed: 12/23/2022]
Abstract
Single-cell RNA sequencing (scRNA-seq) has become a routinely used technique to quantify the gene expression profile of thousands of single cells simultaneously. Analysis of scRNA-seq data plays an important role in the study of cell states and phenotypes, and has helped elucidate biological processes, such as those occurring during the development of complex organisms, and improved our understanding of disease states, such as cancer, diabetes, and coronavirus disease 2019 (COVID-19). Deep learning, a recent advance of artificial intelligence that has been used to address many problems involving large datasets, has also emerged as a promising tool for scRNA-seq data analysis, as it has a capacity to extract informative and compact features from noisy, heterogeneous, and high-dimensional scRNA-seq data to improve downstream analysis. The present review aims at surveying recently developed deep learning techniques in scRNA-seq data analysis, identifying key steps within the scRNA-seq data analysis pipeline that have been advanced by deep learning, and explaining the benefits of deep learning over more conventional analytic tools. Finally, we summarize the challenges in current deep learning approaches faced within scRNA-seq data and discuss potential directions for improvements in deep learning algorithms for scRNA-seq data analysis.
Affiliation(s)
- Matthew Brendel
- Department of Population Health Sciences, Weill Cornell Medicine, Cornell University, New York, NY 10065, USA; Institute for Computational Biomedicine, Caryl and Israel Englander Institute for Precision Medicine, Department of Physiology and Biophysics, Weill Cornell Medicine, Cornell University, New York, NY 10065, USA
| | - Chang Su
- Department of Health Service Administration and Policy, Temple University, Philadelphia, PA 19122, USA.
| | - Zilong Bai
- Department of Population Health Sciences, Weill Cornell Medicine, Cornell University, New York, NY 10065, USA
| | - Hao Zhang
- Department of Population Health Sciences, Weill Cornell Medicine, Cornell University, New York, NY 10065, USA
| | - Olivier Elemento
- Institute for Computational Biomedicine, Caryl and Israel Englander Institute for Precision Medicine, Department of Physiology and Biophysics, Weill Cornell Medicine, Cornell University, New York, NY 10065, USA
| | - Fei Wang
- Department of Population Health Sciences, Weill Cornell Medicine, Cornell University, New York, NY 10065, USA.
| |
|
19
|
Ding Q, Yang W, Luo M, Xu C, Xu Z, Pang F, Cai Y, Anashkina AA, Su X, Chen N, Jiang Q. CBLRR: a cauchy-based bounded constraint low-rank representation method to cluster single-cell RNA-seq data. Brief Bioinform 2022; 23:6649282. [DOI: 10.1093/bib/bbac300] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/31/2022] [Revised: 06/17/2022] [Accepted: 07/02/2022] [Indexed: 11/14/2022] Open
Abstract
The rapid development of single-cell RNA sequencing (scRNA-seq) technology provides unprecedented opportunities for exploring biological phenomena at the single-cell level. The discovery of cell types is one of the major applications for researchers to explore the heterogeneity of cells. Some computational methods have been proposed to solve the problem of scRNA-seq data clustering. However, the unavoidable technical noise and notorious dropouts also reduce the accuracy of clustering methods. Here, we propose the Cauchy-based bounded constraint low-rank representation (CBLRR), a low-rank representation-based method that introduces a Cauchy loss function (CLF) and bounded nuclear norm regulation to alleviate these issues. Specifically, as an effective loss function, the CLF is proven to enhance the robustness of the identification of cell types. Then, we adopt the bounded constraint to keep the entry values of single-cell data within the restricted interval. Finally, the performance of CBLRR is evaluated on 15 scRNA-seq datasets and compared with other state-of-the-art methods. The experimental results demonstrate that CBLRR performs accurately and robustly on clustering scRNA-seq data. Furthermore, CBLRR is an effective tool to cluster cells, and provides great potential for downstream analysis of single-cell data. The source code of CBLRR is available online at https://github.com/Ginnay/CBLRR.
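To see why a Cauchy loss is considered robust to noise and dropouts, the following minimal sketch compares the textbook Cauchy loss, (c²/2)·log(1 + (r/c)²), with the usual squared loss r²/2. The exact form and scale used inside CBLRR are not given in the abstract, so the formula, the scale parameter `c`, and the function names here are assumptions for illustration only.

```python
import numpy as np

def squared_loss(r):
    """Standard least-squares penalty, grows quadratically with the residual."""
    return 0.5 * np.square(r)

def cauchy_loss(r, c=1.0):
    """Textbook Cauchy (Lorentzian) loss with scale c (assumed form, not
    necessarily the one used in CBLRR). Grows only logarithmically for
    large residuals, so outliers are down-weighted."""
    return 0.5 * c**2 * np.log1p(np.square(r / c))

residuals = np.array([0.0, 0.001, 1.0, 100.0])  # last entry mimics an outlier
sq = squared_loss(residuals)
ca = cauchy_loss(residuals)

for r, s, k in zip(residuals, sq, ca):
    print(f"residual={r:8.3f}  squared={s:12.6f}  cauchy={k:10.6f}")
```

For small residuals the two penalties nearly coincide, while a single large residual dominates the squared loss but adds only a modest amount to the Cauchy loss.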
Affiliation(s)
- Qian Ding
- School of Life Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, China
| | - Wenyi Yang
- School of Life Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, China
| | - Meng Luo
- School of Life Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, China
| | - Chang Xu
- School of Life Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, China
| | - Zhaochun Xu
- School of Life Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, China
| | - Fenglan Pang
- School of Life Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, China
| | - Yideng Cai
- School of Life Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, China
| | - Anastasia A Anashkina
- Engelhardt Institute of Molecular Biology, Russian Academy of Sciences, Moscow, Russia
| | - Xi Su
- Foshan Maternity & Child Healthcare Hospital, Southern Medical University, Foshan, Guangdong, China
| | - Na Chen
- Department of Hematology, Shandong Provincial Hospital Affiliated to Shandong First Medical University, Jinan, Shandong, China
| | - Qinghua Jiang
- School of Life Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, China
| |
|