1
Orsoni M, Giovagnoli S, Garofalo S, Mazzoni N, Spinoso M, Benassi M. Comparing factor mixture modeling and conditional Gaussian mixture variational autoencoders for cognitive profile clustering. Front Psychol 2025; 16:1474292. [PMID: 40417028 PMCID: PMC12098581 DOI: 10.3389/fpsyg.2025.1474292] [Received: 08/01/2024] [Accepted: 04/21/2025] [Indexed: 05/27/2025] [Open access]
Abstract
Introduction: Understanding individual cognitive profiles is crucial for developing personalized educational interventions, as cognitive differences can significantly impact how students learn. While traditional methods like factor mixture modeling (FMM) have proven robust for identifying latent cognitive structures, recent advancements in deep learning may offer the potential to capture more intricate and complex cognitive patterns.
Methods: This study compares FMM (specifically, FMM-1 and FMM-2 models using age as a covariate) with a Conditional Gaussian Mixture Variational Autoencoder (CGMVAE). The comparison utilizes six cognitive dimensions obtained from the PROFFILO assessment game.
Results: The FMM-1 model, identified as the superior FMM solution, yielded two well-separated clusters (Silhouette score = 0.959). These clusters represent distinct average cognitive levels, with age significantly predicting class membership. In contrast, the CGMVAE identified ten more nuanced cognitive profiles, exhibiting clear developmental trajectories across different age groups. Notably, one dominant cluster (Cluster 9) showed an increase in representation from 44% to 54% with advancing age, indicating a normative developmental pattern. Other clusters displayed diverse profiles, ranging from subtle domain-specific strengths to atypical profiles characterized by significant deficits balanced by compensatory abilities.
Discussion: These findings highlight a trade-off between the methodologies. FMM provides clear, interpretable groupings suitable for broad classification purposes. Conversely, CGMVAE reveals subtle, non-linear variations in cognitive profiles, potentially reflecting complex developmental pathways. Despite practical challenges associated with CGMVAE's complexity and potential cluster overlap, its capacity to uncover nuanced cognitive patterns demonstrates significant promise for informing the development of highly tailored educational strategies.
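The silhouette score quoted in this abstract measures how much closer each point lies to its own cluster than to the nearest other cluster, averaged over all points. The following is a minimal sketch of that standard definition using Euclidean distance; the function name, the toy points, and the labels are illustrative assumptions and are not taken from the study or its data.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    # Group point indices by cluster label.
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette score needs at least two clusters")

    def mean_dist(i, members):
        # Average distance from point i to the given members, excluding i itself.
        others = [j for j in members if j != i]
        return sum(math.dist(points[i], points[j]) for j in others) / len(others)

    scores = []
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        a = mean_dist(i, own)  # cohesion: mean distance within own cluster
        b = min(  # separation: mean distance to the nearest other cluster
            mean_dist(i, members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy data: two tight, well-separated groups in two dimensions.
toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
toy_labels = [0, 0, 1, 1]
print(silhouette_score(toy_points, toy_labels))
```

Values near 1 indicate compact, well-separated clusters, which is the sense in which a score of 0.959 is described as well separated; values near 0 or below indicate overlapping or misassigned clusters.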
Affiliation(s)
- Matteo Orsoni
- Department of Psychology, University of Bologna, Bologna, Italy
2
Zhao PA, Li R, Adewunmi T, Garber J, Gustafson C, Kim J, Malone J, Savage A, Skene P, Li XJ. SPARROW reveals microenvironment-zone-specific cell states in healthy and diseased tissues. Cell Syst 2025; 16:101235. [PMID: 40112778 DOI: 10.1016/j.cels.2025.101235] [Received: 04/05/2024] [Revised: 10/23/2024] [Accepted: 02/19/2025] [Indexed: 03/22/2025]
Abstract
Spatially resolved transcriptomics technologies have advanced our understanding of cellular characteristics within tissue contexts. However, current analytical tools often treat cell-type inference and cellular neighborhood identification as separate and hard clustering processes, limiting comparability across scales and samples. SPARROW addresses these challenges by jointly learning latent embeddings and soft clusterings of cell types and cellular organization. It outperformed state-of-the-art methods in cell-type inference and microenvironment zone delineation and uncovered zone-specific cell states in human and mouse tissues that competing methods missed. By integrating spatially resolved transcriptomics and single-cell RNA sequencing (scRNA-seq) data in a shared latent space, SPARROW achieves single-cell spatial resolution and whole-transcriptome coverage, enabling the discovery of both established and unknown microenvironment zone-specific ligand-receptor interactions in the human tonsil. Overall, SPARROW is a computational framework that provides a comprehensive characterization of tissue features across scales, samples, and conditions.
Affiliation(s)
- Peiyao A Zhao
- Allen Institute for Immunology, Seattle, WA 98109, USA
- Ruoxin Li
- Allen Institute for Immunology, Seattle, WA 98109, USA
- Temi Adewunmi
- Allen Institute for Immunology, Seattle, WA 98109, USA
- June Kim
- Allen Institute for Immunology, Seattle, WA 98109, USA
- Adam Savage
- Allen Institute for Immunology, Seattle, WA 98109, USA
- Peter Skene
- Allen Institute for Immunology, Seattle, WA 98109, USA
- Xiao-Jun Li
- Allen Institute for Immunology, Seattle, WA 98109, USA
3
Hu D, Guan R, Liang K, Yu H, Quan H, Zhao Y, Liu X, He K. scEGG: an exogenous gene-guided clustering method for single-cell transcriptomic data. Brief Bioinform 2024; 25:bbae483. [PMID: 39344711 PMCID: PMC11440090 DOI: 10.1093/bib/bbae483] [Received: 05/24/2024] [Revised: 09/05/2024] [Accepted: 09/12/2024] [Indexed: 10/01/2024] [Open access]
Abstract
In recent years, single-cell data analysis has advanced significantly, particularly in the development of clustering methods. Despite these advancements, most algorithms continue to focus primarily on analyzing the provided single-cell matrix data. However, within medical contexts, single-cell data often comes with a wealth of exogenous information, such as gene networks. Overlooking this information could result in information loss and produce clustering outcomes lacking significant clinical relevance. To address this limitation, we introduce an innovative deep clustering method for single-cell data that leverages exogenous gene information to generate discriminative cell representations. Specifically, an attention-enhanced graph autoencoder is developed to efficiently capture topological signal patterns among cells. Concurrently, a random walk on an exogenous protein-protein interaction network is used to obtain gene embeddings. Finally, the clustering process integrates and reconstructs gene-cell cooperative embeddings, yielding a discriminative representation. Extensive experiments demonstrate the effectiveness of the proposed method. This research provides enhanced insights into the characteristics of cells, thus laying the foundation for the early diagnosis and treatment of diseases. The datasets and code can be publicly accessed in the repository at https://github.com/DayuHuu/scEGG.
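The random-walk step mentioned in this abstract can be illustrated with a generic sketch. The code below samples uniform random walks over a small graph; in DeepWalk-style approaches such walks are then typically passed to a skip-gram model to learn node embeddings. The abstract does not specify the walk strategy or the embedding model, so the function `random_walks`, its parameters, and the toy network with placeholder gene names are all assumptions for illustration and do not reproduce the scEGG implementation.

```python
import random


def random_walks(adjacency, walk_length, walks_per_node, seed=0):
    """Sample uniform random walks starting from every node of a graph.

    adjacency maps each node to a list of its neighbours.
    """
    rng = random.Random(seed)  # local generator so results are reproducible
    walks = []
    for start in sorted(adjacency):
        for _ in range(walks_per_node):
            walk = [start]
            while len(walk) < walk_length:
                neighbours = adjacency[walk[-1]]
                if not neighbours:
                    break  # dead end: stop this walk early
                walk.append(rng.choice(neighbours))
            walks.append(walk)
    return walks


# Hypothetical undirected toy network; names and edges are placeholders,
# not real protein-protein interactions.
toy_network = {
    "G1": ["G2", "G3"],
    "G2": ["G1"],
    "G3": ["G1", "G4"],
    "G4": ["G3"],
}

walks = random_walks(toy_network, walk_length=5, walks_per_node=3)
for walk in walks[:3]:
    print(" -> ".join(walk))
```

Genes that are close in the network co-occur in many walks, which is what lets a downstream embedding model place them near each other in the learned space.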
Affiliation(s)
- Dayu Hu
- School of Computer, National University of Defense Technology, No. 109 Deya Road, 410073 Changsha, Hunan, China
- Renxiang Guan
- School of Computer, National University of Defense Technology, No. 109 Deya Road, 410073 Changsha, Hunan, China
- Ke Liang
- School of Computer, National University of Defense Technology, No. 109 Deya Road, 410073 Changsha, Hunan, China
- Hao Yu
- School of Computer, National University of Defense Technology, No. 109 Deya Road, 410073 Changsha, Hunan, China
- Hao Quan
- College of Medicine and Biological Information Engineering, Northeastern University, No. 195 Chuangxin Road, 110169 Shenyang, Liaoning, China
- Yawei Zhao
- Medical Big Data Research Center, Chinese PLA General Hospital, No. 28 Fuxing Road, 100853 Beijing, China
- Xinwang Liu
- School of Computer, National University of Defense Technology, No. 109 Deya Road, 410073 Changsha, Hunan, China
- Kunlun He
- Medical Big Data Research Center, Chinese PLA General Hospital, No. 28 Fuxing Road, 100853 Beijing, China
4
|
Jin W, Xia Y, Thela SR, Liu Y, Chen L. In silico generation and augmentation of regulatory variants from massively parallel reporter assay using conditional variational autoencoder. bioRxiv 2024:2024.06.25.600715. [PMID: 38979263 PMCID: PMC11230389 DOI: 10.1101/2024.06.25.600715] [Indexed: 07/10/2024]
Abstract
Predicting the functional consequences of genetic variants in non-coding regions is a challenging problem. Massively parallel reporter assays (MPRAs), an in vitro high-throughput method, can simultaneously test thousands of variants by evaluating the existence of allele-specific regulatory activity. Nevertheless, the labelled variants identified by MPRAs, which show differential allelic regulatory effects on gene expression, usually number only in the hundreds, limiting their potential as a training set for robust genome-wide prediction. To address this limitation, we propose a deep generative model, MpraVAE, to generate labelled variants in silico and thereby augment the training sample size. By benchmarking on several MPRA datasets, we demonstrate that MpraVAE significantly improves the prediction performance for MPRA regulatory variants compared to the baseline method, conventional data augmentation approaches, and existing variant scoring methods. Taking autoimmune diseases as one example, we apply MpraVAE to perform a genome-wide prediction of regulatory variants and find that predicted regulatory variants are more enriched than background variants in enhancers, active histone marks, open chromatin regions in immune-related cell types, and chromatin states associated with promoter and enhancer activity and with binding sites of cMyC and Pol II, which regulate gene expression. Importantly, predicted regulatory variants are found to be linked to immune-related genes by leveraging chromatin loops and accessible chromatin, demonstrating the importance of MpraVAE in genetic and gene discovery for complex traits.
Affiliation(s)
- Weijia Jin
- Department of Biostatistics, University of Florida, Gainesville, FL, 32603, USA
- Yi Xia
- Department of Biostatistics, University of Florida, Gainesville, FL, 32603, USA
- Sai Ritesh Thela
- Department of Biostatistics, University of Florida, Gainesville, FL, 32603, USA
- Yunlong Liu
- Department of Medical and Molecular Genetics, Indiana University School of Medicine, Indianapolis, IN 46202, USA
- Li Chen
- Department of Biostatistics, University of Florida, Gainesville, FL, 32603, USA
5
|
Li R, Shi F, Song L, Yu Z. scGAL: unmask tumor clonal substructure by jointly analyzing independent single-cell copy number and scRNA-seq data. BMC Genomics 2024; 25:393. [PMID: 38649804 PMCID: PMC11034052 DOI: 10.1186/s12864-024-10319-w] [Received: 11/09/2023] [Accepted: 04/17/2024] [Indexed: 04/25/2024] [Open access]
Abstract
Background: Accurately deciphering clonal copy number substructure can provide insights into the evolutionary mechanism of cancer, and clustering single-cell copy number profiles has become an effective means to unmask intra-tumor heterogeneity (ITH). However, copy numbers inferred from single-cell DNA sequencing (scDNA-seq) data are error-prone due to technical confounding factors such as amplification bias and allele dropout, which makes it difficult to identify the ITH precisely.
Results: We introduce a hybrid model called scGAL to infer clonal copy number substructure. It combines an autoencoder with a generative adversarial network to jointly analyze independent single-cell copy number profiles and gene expression data from the same cell line. Under an adversarial learning framework, scGAL exploits complementary information from gene expression data to relieve the effects of noise in copy number data, and learns latent representations of scDNA-seq cells for accurate inference of the ITH. Evaluation results on three real cancer datasets suggest that scGAL is able to accurately infer clonal architecture and surpasses other similar methods. In addition, assessment of scGAL on various simulated datasets demonstrates its high robustness to changes in data size and distribution. scGAL can be accessed at: https://github.com/zhyu-lab/scgal.
Conclusions: Joint analysis of independent single-cell copy number and gene expression data from the same cell line can effectively exploit complementary information from individual omics, and thus gives a more refined indication of clonal copy number substructure.
Affiliation(s)
- Ruixiang Li
- School of Information Engineering, Ningxia University, Yinchuan, 750021, China
- Fangyuan Shi
- School of Information Engineering, Ningxia University, Yinchuan, 750021, China
- Collaborative Innovation Center for Ningxia Big Data and Artificial Intelligence Co-founded by Ningxia Municipality and Ministry of Education, Ningxia University, Yinchuan, 750021, China
- Lijuan Song
- School of Information Engineering, Ningxia University, Yinchuan, 750021, China
- Collaborative Innovation Center for Ningxia Big Data and Artificial Intelligence Co-founded by Ningxia Municipality and Ministry of Education, Ningxia University, Yinchuan, 750021, China
- Zhenhua Yu
- School of Information Engineering, Ningxia University, Yinchuan, 750021, China
- Collaborative Innovation Center for Ningxia Big Data and Artificial Intelligence Co-founded by Ningxia Municipality and Ministry of Education, Ningxia University, Yinchuan, 750021, China
6
|
Liu F, Shi F, Du F, Cao X, Yu Z. CoT: a transformer-based method for inferring tumor clonal copy number substructure from scDNA-seq data. Brief Bioinform 2024; 25:bbae187. [PMID: 38670159 PMCID: PMC11052634 DOI: 10.1093/bib/bbae187] [Received: 12/06/2023] [Revised: 03/08/2024] [Accepted: 04/16/2024] [Indexed: 04/28/2024] [Open access]
Abstract
Single-cell DNA sequencing (scDNA-seq) has been an effective means to unscramble intra-tumor heterogeneity, while joint inference of tumor clones and their respective copy number profiles remains a challenging task due to the noisy nature of scDNA-seq data. We introduce a new bioinformatics method called CoT for deciphering clonal copy number substructure. The backbone of CoT is a Copy number Transformer autoencoder that leverages a multi-head attention mechanism to explore correlations between different genomic regions, and thus captures global features to create latent embeddings for the cells. CoT makes it convenient to first infer cell subpopulations based on the learned embeddings, and then estimate single-cell copy numbers through joint analysis of read count data for the cells belonging to the same cluster. This exploitation of clonal substructure information in copy number analysis helps to alleviate the effect of read count non-uniformity and yields robust estimates of the tumor copy numbers. Performance evaluation on synthetic and real datasets shows that CoT outperforms state-of-the-art methods and is highly useful for deciphering clonal copy number substructure.
Affiliation(s)
- Furui Liu
- School of Information Engineering, Ningxia University, 750021, Ningxia, China
- Fangyuan Shi
- School of Information Engineering, Ningxia University, 750021, Ningxia, China
- Collaborative Innovation Center for Ningxia Big Data and Artificial Intelligence Co-founded by Ningxia Municipality and Ministry of Education, Ningxia University, 750021, Ningxia, China
- Fang Du
- School of Information Engineering, Ningxia University, 750021, Ningxia, China
- Collaborative Innovation Center for Ningxia Big Data and Artificial Intelligence Co-founded by Ningxia Municipality and Ministry of Education, Ningxia University, 750021, Ningxia, China
- Xiangmei Cao
- Basic Medical School, Ningxia Medical University, 750001, Ningxia, China
- Zhenhua Yu
- School of Information Engineering, Ningxia University, 750021, Ningxia, China
- Collaborative Innovation Center for Ningxia Big Data and Artificial Intelligence Co-founded by Ningxia Municipality and Ministry of Education, Ningxia University, 750021, Ningxia, China
7
|
Atitey K, Motsinger-Reif AA, Anchang B. Model-based evaluation of spatiotemporal data reduction methods with unknown ground truth through optimal visualization and interpretability metrics. Brief Bioinform 2023; 25:bbad455. [PMID: 38113074 PMCID: PMC10729792 DOI: 10.1093/bib/bbad455] [Received: 08/24/2023] [Revised: 11/06/2023] [Accepted: 11/20/2023] [Indexed: 12/21/2023] [Open access]
Abstract
Optimizing and benchmarking data reduction methods for dynamic or spatial visualization and interpretation (DSVI) face challenges due to many factors, including data complexity, lack of ground truth, time-dependent metrics, dimensionality bias and different visual mappings of the same data. Current studies often focus on independent static visualization or interpretability metrics that require ground truth. To overcome this limitation, we propose the MIBCOVIS framework, a comprehensive and interpretable benchmarking and computational approach. MIBCOVIS enhances the visualization and interpretability of high-dimensional data without relying on ground truth by integrating five robust metrics, including a novel time-ordered Markov-based structural metric, into a semi-supervised hierarchical Bayesian model. The framework assesses method accuracy and considers interaction effects among metric features. We apply MIBCOVIS using linear and nonlinear dimensionality reduction methods to evaluate optimal DSVI for four distinct dynamic and spatial biological processes captured by three single-cell data modalities: CyTOF, scRNA-seq and CODEX. These data vary in complexity based on feature dimensionality, unknown cell types and dynamic or spatial differences. Unlike traditional single-summary score approaches, MIBCOVIS compares accuracy distributions across methods. Our findings underscore the joint evaluation of visualization and interpretability, rather than relying on separate metrics. We reveal that prioritizing average performance can obscure method feature performance. Additionally, we explore the impact of data complexity on visualization and interpretability. Specifically, we provide optimal parameters and features and recommend methods, like the optimized variational contractive autoencoder, for targeted DSVI for various data complexities. MIBCOVIS shows promise for evaluating dynamic single-cell atlases and spatiotemporal data reduction models.
Affiliation(s)
- Komlan Atitey
- Biostatistics and Computational Biology Branch, National Institute of Environmental Health Sciences, 111 T W Alexander Dr, David P Rall Building, Research Triangle Park, NC 27709, USA
- Alison A Motsinger-Reif
- Biostatistics and Computational Biology Branch, National Institute of Environmental Health Sciences, 111 T W Alexander Dr, David P Rall Building, Research Triangle Park, NC 27709, USA
- Benedict Anchang
- Biostatistics and Computational Biology Branch, National Institute of Environmental Health Sciences, 111 T W Alexander Dr, David P Rall Building, Research Triangle Park, NC 27709, USA
8
|
Kim G, Chun H. Similarity-assisted variational autoencoder for nonlinear dimension reduction with application to single-cell RNA sequencing data. BMC Bioinformatics 2023; 24:432. [PMID: 37964243 PMCID: PMC10647110 DOI: 10.1186/s12859-023-05552-1] [Received: 03/21/2023] [Accepted: 10/30/2023] [Indexed: 11/16/2023] [Open access]
Abstract
Background: Deep generative models naturally serve as nonlinear dimension reduction tools for visualizing large-scale datasets, such as single-cell RNA sequencing datasets, to reveal latent grouping patterns or identify outliers. The variational autoencoder (VAE) is a popular deep generative method equipped with encoder/decoder structures. The encoder is useful for mapping a new sample to the latent space, and the decoder for generating a data point from a point in the latent space. However, the VAE tends not to show grouping patterns clearly without additional annotation information. On the other hand, similarity-based dimension reduction methods such as t-SNE or UMAP present clear grouping patterns even though these methods do not have encoder/decoder structures.
Results: To bridge this gap, we propose a new approach that adopts similarity information in the VAE framework. In addition, for biological applications, we extend our approach to a conditional VAE to account for covariate effects in the dimension reduction step. In the simulation study and real single-cell RNA sequencing data analyses, our method shows great performance compared to existing state-of-the-art methods by producing clear grouping structures using an inferred encoder and decoder. Our method also successfully adjusts for covariate effects, resulting in more useful dimension reduction.
Conclusions: Our method is able to produce clearer grouping patterns than those of other regularized VAE methods by utilizing similarity information encoded in the data via the highly celebrated UMAP loss function.
Affiliation(s)
- Gwangwoo Kim
- Graduate School of Data Science, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Republic of Korea
- Hyonho Chun
- Department of Mathematical Sciences, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Republic of Korea