1
Wei X, Wu J, Li G, Liu J, Wu X, He C. scPEDSSC: proximity enhanced deep sparse subspace clustering method for scRNA-seq data. PLoS Comput Biol 2025; 21:e1012924. [PMID: 40294099 PMCID: PMC12036905 DOI: 10.1371/journal.pcbi.1012924] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/05/2024] [Accepted: 03/03/2025] [Indexed: 04/30/2025] Open
Abstract
Identifying cell types by clustering single-cell RNA sequencing (scRNA-seq) data is a significant step in single-cell analysis. However, great challenges remain because scRNA-seq data are inherently high-dimensional, noisy, and sparse. In this study, scPEDSSC, a deep sparse subspace clustering method based on proximity enhancement, is proposed. The self-expression matrix (SEM), learned from a deep auto-encoder with a two-part generalized gamma (TPGG) distribution, is used together with its second power to generate the similarity matrix. Compared with eight state-of-the-art single-cell clustering methods on twelve real biological datasets, scPEDSSC achieves superior performance on most datasets, as verified by a number of experiments.
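The idea of combining a self-expression matrix with its second power can be illustrated with a small sketch. This is not the scPEDSSC implementation: the symmetrization step, the formula `S = A + A^2`, the toy matrix `C`, and the function names are all assumptions made for illustration. It only shows how the second power gives nonzero similarity to cells that share a neighbor but are not directly connected.

```python
# Hypothetical sketch: similarity from a self-expression matrix and its
# second power. The exact construction used by scPEDSSC may differ.

def matmul(a, b):
    """Multiply two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def proximity_enhanced_similarity(c):
    """Return S = A + A^2, where A = (|C| + |C|^T) / 2 (assumed form)."""
    n = len(c)
    # Symmetrize the absolute self-expression coefficients.
    a = [[(abs(c[i][j]) + abs(c[j][i])) / 2 for j in range(n)]
         for i in range(n)]
    # Second-order proximity: paths of length two between cells.
    a2 = matmul(a, a)
    return [[a[i][j] + a2[i][j] for j in range(n)] for i in range(n)]


# Toy self-expression matrix for 4 cells: cells 0, 1, 2 form a chain,
# cell 3 is not expressed by (and does not express) any other cell.
C = [
    [0.0, 0.8, 0.0, 0.0],
    [0.6, 0.0, 0.4, 0.0],
    [0.0, 0.6, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
S = proximity_enhanced_similarity(C)
```

In this toy case cells 0 and 2 have no direct coefficient, yet `S[0][2]` is positive because both are linked to cell 1, while the isolated cell 3 keeps zero similarity to cell 0. A similarity matrix of this kind would typically be passed to spectral clustering.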
Affiliation(s)
- Xiaopeng Wei
  - Guangxi Key Lab of Multi-source Information Mining & Security, Guangxi Normal University, Guilin, Guangxi, China
  - College of Computer Science and Engineering, Guangxi Normal University, Guilin, Guangxi, China
- Jingli Wu
  - Guangxi Key Lab of Multi-source Information Mining & Security, Guangxi Normal University, Guilin, Guangxi, China
  - College of Computer Science and Engineering, Guangxi Normal University, Guilin, Guangxi, China
  - Key Lab of Education Blockchain and Intelligent Technology, Ministry of Education, Guangxi Normal University, Guilin, Guangxi, China
- Gaoshi Li
  - Guangxi Key Lab of Multi-source Information Mining & Security, Guangxi Normal University, Guilin, Guangxi, China
  - College of Computer Science and Engineering, Guangxi Normal University, Guilin, Guangxi, China
- Jiafei Liu
  - Guangxi Key Lab of Multi-source Information Mining & Security, Guangxi Normal University, Guilin, Guangxi, China
  - College of Computer Science and Engineering, Guangxi Normal University, Guilin, Guangxi, China
- Xi Wu
  - Guangxi Key Lab of Multi-source Information Mining & Security, Guangxi Normal University, Guilin, Guangxi, China
  - College of Computer Science and Engineering, Guangxi Normal University, Guilin, Guangxi, China
- Chang He
  - Guangxi Key Lab of Multi-source Information Mining & Security, Guangxi Normal University, Guilin, Guangxi, China
  - College of Computer Science and Engineering, Guangxi Normal University, Guilin, Guangxi, China
2
Lan J, Zhuo X, Ye S, Deng J. A semi-supervised non-negative matrix factorization model for scRNA-seq data analysis. Appl Soft Comput 2025; 174:112982. [DOI: 10.1016/j.asoc.2025.112982] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/02/2025]
3
Liu X, Chapple RH, Bennett D, Wright WC, Sanjali A, Culp E, Zhang Y, Pan M, Geeleher P. CSI-GEP: A GPU-based unsupervised machine learning approach for recovering gene expression programs in atlas-scale single-cell RNA-seq data. Cell Genomics 2025; 5:100739. [PMID: 39788105 PMCID: PMC11770216 DOI: 10.1016/j.xgen.2024.100739] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/13/2024] [Revised: 11/06/2024] [Accepted: 12/13/2024] [Indexed: 01/12/2025]
Abstract
Exploratory analysis of single-cell RNA sequencing (scRNA-seq) typically relies on hard clustering over two-dimensional projections like uniform manifold approximation and projection (UMAP). However, such methods can severely distort the data and have many arbitrary parameter choices. Methods that can model scRNA-seq data as non-discrete "gene expression programs" (GEPs) can better preserve the data's structure, but currently, they are often not scalable, not consistent across repeated runs, and lack an established method for choosing key parameters. Here, we developed a GPU-based unsupervised learning approach, "consensus and scalable inference of gene expression programs" (CSI-GEP). We show that CSI-GEP can recover ground truth GEPs in real and simulated atlas-scale scRNA-seq datasets, significantly outperforming cutting-edge methods, including GPT-based neural networks. We applied CSI-GEP to a whole mouse brain atlas of 2.2 million cells, disentangling endothelial cell types missed by other methods, and to an integrated scRNA-seq atlas of human tumors and cell lines, discovering mesenchymal-like GEPs unique to cancer cells growing in culture.
Affiliation(s)
- Xueying Liu
  - Department of Computational Biology, St. Jude Children's Research Hospital, Memphis, TN 38105, USA
- Richard H Chapple
  - Department of Computational Biology, St. Jude Children's Research Hospital, Memphis, TN 38105, USA
- Declan Bennett
  - Department of Computational Biology, St. Jude Children's Research Hospital, Memphis, TN 38105, USA
- William C Wright
  - Department of Computational Biology, St. Jude Children's Research Hospital, Memphis, TN 38105, USA
- Ankita Sanjali
  - Department of Computational Biology, St. Jude Children's Research Hospital, Memphis, TN 38105, USA
- Erielle Culp
  - Department of Computational Biology, St. Jude Children's Research Hospital, Memphis, TN 38105, USA; Department of Genetics, Genomics, and Informatics, University of Tennessee Health Science Center, Memphis, TN 38163, USA
- Yinwen Zhang
  - Department of Computational Biology, St. Jude Children's Research Hospital, Memphis, TN 38105, USA
- Min Pan
  - Department of Computational Biology, St. Jude Children's Research Hospital, Memphis, TN 38105, USA
- Paul Geeleher
  - Department of Computational Biology, St. Jude Children's Research Hospital, Memphis, TN 38105, USA
4
Anter JM, Yakimovich A. Artificial Intelligence Methods in Infection Biology Research. Methods Mol Biol 2025; 2890:291-333. [PMID: 39890733 DOI: 10.1007/978-1-0716-4326-6_15] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/03/2025]
Abstract
Despite unprecedented achievements, the domain-specific application of artificial intelligence (AI) in the realm of infection biology was still in its infancy just a couple of years ago. This is largely attributable to the proneness of the infection biology community to shirk quantitative techniques. The so-called "sorting machine" paradigm was prevailing at that time, meaning that AI applications were primarily confined to the automation of tedious laboratory tasks. However, fueled by the severe acute respiratory syndrome coronavirus 2 pandemic, AI-driven applications in infection biology made giant leaps beyond mere automation. Instead, increasingly sophisticated tasks were successfully tackled, thereby ushering in the transition to the "Swiss army knife" paradigm. Incentivized by the urgent need to subdue a raging pandemic, AI achieved maturity in infection biology and became a versatile tool. In this chapter, the maturation of AI in the field of infection biology from the "sorting machine" paradigm to the "Swiss army knife" paradigm is outlined. Successful applications are illustrated for the three data modalities in the domain, that is, images, molecular data, and language data, with a particular emphasis on disentangling host-pathogen interactions. Along the way, fundamental terminology mentioned in the same breath as AI is elaborated on, and relationships between the subfields these terms represent are established. Notably, in order to dispel the fears of infection biologists toward quantitative methodologies and lower the initial hurdle, this chapter features a hands-on guide on software installation, virtual environment setup, data preparation, and utilization of pretrained models at its very end.
Affiliation(s)
- Jacob Marcel Anter
  - Center for Advanced Systems Understanding (CASUS), Görlitz, Germany
  - Helmholtz-Zentrum Dresden-Rossendorf e. V. (HZDR), Dresden, Germany
- Artur Yakimovich
  - Center for Advanced Systems Understanding (CASUS), Görlitz, Germany
  - Helmholtz-Zentrum Dresden-Rossendorf e. V. (HZDR), Dresden, Germany
  - Institute of Computer Science, University of Wrocław, Wrocław, Poland
5
Xu Y, Lv D, Zou X, Wu L, Xu X, Zhao X. BFAST: joint dimension reduction and spatial clustering with Bayesian factor analysis for zero-inflated spatial transcriptomics data. Brief Bioinform 2024; 25:bbae594. [PMID: 39552067 PMCID: PMC11570543 DOI: 10.1093/bib/bbae594] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/10/2024] [Revised: 09/03/2024] [Accepted: 11/01/2024] [Indexed: 11/19/2024] Open
Abstract
The development of spatially resolved transcriptomics (ST) technologies has made it possible to measure gene expression profiles coupled with cellular spatial context, helping biologists comprehensively characterize cellular phenotype heterogeneity and the tissue microenvironment. Spatial clustering is vital for downstream biological analysis. However, high noise and dropout events make clustering spatial transcriptomics data challenging, and effective algorithms are still lacking. Here we develop a novel method that jointly performs dimension reduction and spatial clustering with Bayesian Factor Analysis for zero-inflated Spatial Transcriptomics data (BFAST). BFAST has shown exceptional performance on simulated data and real spatial transcriptomics datasets in benchmarks against currently available methods. It extracts more biologically informative low-dimensional features than traditional dimensionality reduction approaches, thereby enhancing the accuracy and precision of clustering.
Affiliation(s)
- Yang Xu
  - BGI-Research, 313, Gaoteng Avenue, Jiulongpo, Chongqing 400039, China
  - BGI-Research, 9, Yunhua Road, Yantian, Shenzhen 518083, China
- Dian Lv
  - BGI-Research, 313, Gaoteng Avenue, Jiulongpo, Chongqing 400039, China
  - BGI-Research, 9, Yunhua Road, Yantian, Shenzhen 518083, China
- Xuanxuan Zou
  - BGI-Research, 313, Gaoteng Avenue, Jiulongpo, Chongqing 400039, China
  - BGI-Research, 9, Yunhua Road, Yantian, Shenzhen 518083, China
- Liang Wu
  - BGI-Research, 313, Gaoteng Avenue, Jiulongpo, Chongqing 400039, China
  - BGI-Research, 9, Yunhua Road, Yantian, Shenzhen 518083, China
- Xun Xu
  - BGI-Research, 9, Yunhua Road, Yantian, Shenzhen 518083, China
- Xin Zhao
  - BGI-Research, 313, Gaoteng Avenue, Jiulongpo, Chongqing 400039, China
  - BGI-Research, 9, Yunhua Road, Yantian, Shenzhen 518083, China
6
Rana V, Peng J, Pan C, Lyu H, Cheng A, Kim M, Milenkovic O. Interpretable online network dictionary learning for inferring long-range chromatin interactions. PLoS Comput Biol 2024; 20:e1012095. [PMID: 38753877 PMCID: PMC11135774 DOI: 10.1371/journal.pcbi.1012095] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/16/2023] [Revised: 05/29/2024] [Accepted: 04/20/2024] [Indexed: 05/18/2024] Open
Abstract
Dictionary learning (DL), implemented via matrix factorization (MF), is commonly used in computational biology to tackle ubiquitous clustering problems. The method is favored due to its conceptual simplicity and relatively low computational complexity. However, DL algorithms produce results that lack interpretability in terms of real biological data. Additionally, they are not optimized for graph-structured data and hence often fail to handle them in a scalable manner. In order to address these limitations, we propose a novel DL algorithm called online convex network dictionary learning (online cvxNDL). Unlike classical DL algorithms, online cvxNDL is implemented via MF and designed to handle extremely large datasets by virtue of its online nature. Importantly, it enables the interpretation of dictionary elements, which serve as cluster representatives, through convex combinations of real measurements. Moreover, the algorithm can be applied to data with a network structure by incorporating specialized subnetwork sampling techniques. To demonstrate the utility of our approach, we apply cvxNDL to 3D-genome RNAPII ChIA-Drop data with the goal of identifying important long-range interaction patterns (long-range dictionary elements). ChIA-Drop probes higher-order interactions and produces data in the form of hypergraphs whose nodes represent genomic fragments. The hyperedges represent observed physical contacts. Our hypergraph model analysis has the objective of creating an interpretable dictionary of long-range interaction patterns that accurately represent global chromatin physical contact maps. Through the use of dictionary information, one can also associate the contact maps with RNA transcripts and infer cellular functions. To accomplish the task at hand, we focus on RNAPII-enriched ChIA-Drop data from Drosophila melanogaster S2 cell lines. Our results offer two key insights. First, we demonstrate that online cvxNDL retains the accuracy of classical DL (MF) methods while simultaneously ensuring unique interpretability and scalability. Second, we identify distinct collections of proximal and distal interaction patterns involving chromatin elements shared by related processes across different chromosomes, as well as patterns unique to specific chromosomes. To associate the dictionary elements with biological properties of the corresponding chromatin regions, we employ Gene Ontology (GO) enrichment analysis and perform multiple RNA coexpression studies.
Affiliation(s)
- Vishal Rana
  - Department of Electrical and Computer Engineering, University of Illinois, Urbana-Champaign, Illinois, United States of America
- Jianhao Peng
  - Department of Electrical and Computer Engineering, University of Illinois, Urbana-Champaign, Illinois, United States of America
- Chao Pan
  - Department of Electrical and Computer Engineering, University of Illinois, Urbana-Champaign, Illinois, United States of America
- Hanbaek Lyu
  - Department of Mathematics, University of Wisconsin - Madison, Madison, Wisconsin, United States of America
- Albert Cheng
  - School of Biological and Health Systems Engineering, Arizona State University, Phoenix, Arizona, United States of America
- Minji Kim
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, United States of America
- Olgica Milenkovic
  - Department of Electrical and Computer Engineering, University of Illinois, Urbana-Champaign, Illinois, United States of America
7
Xu Y, Zhang W, Zheng X, Cai X. Combining Global-Constrained Concept Factorization and a Regularized Gaussian Graphical Model for Clustering Single-Cell RNA-seq Data. Interdiscip Sci 2024; 16:1-15. [PMID: 37815679 DOI: 10.1007/s12539-023-00587-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/06/2023] [Revised: 09/14/2023] [Accepted: 09/17/2023] [Indexed: 10/11/2023]
Abstract
Single-cell RNA sequencing technology is one of the most cost-effective ways to uncover transcriptomic heterogeneity. With the rapid rise of this technology, enormous amounts of scRNA-seq data have been produced. Due to the high dimensionality, noise, sparsity and missing features of the available scRNA-seq data, accurately clustering the scRNA-seq data for downstream analysis is a significant challenge. Many computational methods have been designed to address this issue; nevertheless, the efficacy of the available methods is still inadequate. In addition, most similarity-based methods require the number of clusters as input, which is difficult to determine in real applications. In this study, we developed a novel computational method, named GCFG, for clustering scRNA-seq data by considering both global and local information. This method characterizes the global properties of data by applying concept factorization, and the regularized Gaussian graphical model is utilized to evaluate the local embedding relationship of data. To learn the cell-cell similarity matrix, we integrated the two components, and an iterative optimization algorithm was developed. The categorization of single cells is obtained by applying Louvain, a modularity-based community discovery algorithm, to the similarity matrix. The behavior of the GCFG approach is assessed on 14 real scRNA-seq datasets in terms of ACC and ARI, and comparison results with 17 other competitive methods suggest that GCFG is effective and robust.
Affiliation(s)
- Yaxin Xu
  - School of Sciences, East China Jiaotong University, Nanchang, 330013, China
- Wei Zhang
  - School of Sciences, East China Jiaotong University, Nanchang, 330013, China
- Xiaoying Zheng
  - Operations Research and Planning Department, Naval University of Engineering, Wuhan, 430033, China
- Xianxian Cai
  - School of Sciences, East China Jiaotong University, Nanchang, 330013, China
8
Zhang H, Lu X, Lu B, Gullo G, Chen L. Measuring the composition of the tumor microenvironment with transcriptome analysis: past, present and future. Future Oncol 2024; 20:1207-1220. [PMID: 38362731 PMCID: PMC11318690 DOI: 10.2217/fon-2023-0658] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/31/2023] [Accepted: 01/24/2024] [Indexed: 02/17/2024] Open
Abstract
Interactions between tumor cells and immune cells in the tumor microenvironment (TME) play a vital role in the mechanisms of immune evasion, by which cancer cells escape immune elimination. Thus, the characterization and quantification of different components in the TME is a hot topic in molecular biology and drug discovery. Since the development of transcriptome sequencing in bulk tissue, single cells and spatial dimensions, a growing number of methods have emerged to deconvolute and subtype the TME. This review discusses and compares such computational strategies and downstream subtyping analyses. Integrative analyses of the transcriptome with other data, such as epigenetics and T-cell receptor sequencing, are needed to obtain comprehensive knowledge of the dynamic TME.
Affiliation(s)
- Han Zhang
  - Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA 15206, USA
- Xinghua Lu
  - Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA 15206, USA
  - UPMC Hillman Cancer Center, Pittsburgh, PA 15232, USA
- Binfeng Lu
  - Center for Discovery & Innovation, Hackensack Meridian Health, Nutley, NJ 07110, USA
- Giuseppe Gullo
  - Department of Obstetrics & Gynecology, Villa Sofia Cervello Hospital, University of Palermo, 90146, Palermo, Italy
- Lujia Chen
  - Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA 15206, USA
9
Lan W, Liu M, Chen J, Ye J, Zheng R, Zhu X, Peng W. JLONMFSC: Clustering scRNA-seq data based on joint learning of non-negative matrix factorization and subspace clustering. Methods 2024; 222:1-9. [PMID: 38128706 DOI: 10.1016/j.ymeth.2023.11.019] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/29/2023] [Revised: 11/07/2023] [Accepted: 11/29/2023] [Indexed: 12/23/2023] Open
Abstract
The development of single-cell RNA sequencing (scRNA-seq) has provided new perspectives for studying biological problems at the single-cell level. One of the key issues in scRNA-seq data analysis is to divide cells into several clusters in order to discover the heterogeneity and diversity of cells. However, existing scRNA-seq data are high-dimensional, sparse, and noisy, which challenges existing single-cell clustering methods. In this study, we propose a joint learning framework (JLONMFSC) for clustering scRNA-seq data. In our method, the dimension of the original data is reduced to minimize the effect of noise. In addition, graph regularized matrix factorization is used to learn the local features. Further, Low-Rank Representation (LRR) subspace clustering is utilized to learn the global features. Finally, joint learning of local and global features is performed to obtain the clustering results. We compare the proposed algorithm with eight state-of-the-art algorithms for clustering performance on six datasets, and the experimental results demonstrate that JLONMFSC achieves better performance on all datasets. The code is available at https://github.com/lanbiolab/JLONMFSC.
Affiliation(s)
- Wei Lan
  - School of Computer, Electronic and Information, Guangxi University, Nanning, China; Guangxi Key Laboratory of Multimedia Communications and Network Technology, Guangxi University, Nanning, China
- Mingyang Liu
  - School of Computer, Electronic and Information, Guangxi University, Nanning, China
- Jianwei Chen
  - School of Computer, Electronic and Information, Guangxi University, Nanning, China
- Jin Ye
  - School of Computer, Electronic and Information, Guangxi University, Nanning, China
- Ruiqing Zheng
  - Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha, China
- Xiaoshu Zhu
  - School of Computer Science and Information Security, Guilin University of Science and Technology, Guilin, China
- Wei Peng
  - School of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, China
10
Johnson JAI, Tsang AP, Mitchell JT, Zhou DL, Bowden J, Davis-Marcisak E, Sherman T, Liefeld T, Loth M, Goff LA, Zimmerman JW, Kinny-Köster B, Jaffee EM, Tamayo P, Mesirov JP, Reich M, Fertig EJ, Stein-O'Brien GL. Inferring cellular and molecular processes in single-cell data with non-negative matrix factorization using Python, R and GenePattern Notebook implementations of CoGAPS. Nat Protoc 2023; 18:3690-3731. [PMID: 37989764 PMCID: PMC10961825 DOI: 10.1038/s41596-023-00892-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/08/2022] [Accepted: 07/21/2023] [Indexed: 11/23/2023]
Abstract
Non-negative matrix factorization (NMF) is an unsupervised learning method well suited to high-throughput biology. However, inferring biological processes from an NMF result still requires additional post hoc statistics and annotation for interpretation of learned features. Here, we introduce a suite of computational tools that implement NMF and provide methods for accurate and clear biological interpretation and analysis. A generalized discussion of NMF covering its benefits, limitations and open questions is followed by four procedures for the Bayesian NMF algorithm Coordinated Gene Activity across Pattern Subsets (CoGAPS). Each procedure will demonstrate NMF analysis to quantify cell state transitions in a public domain single-cell RNA-sequencing dataset. The first demonstrates PyCoGAPS, our new Python implementation that enhances runtime for large datasets, and the second allows its deployment in Docker. The third procedure steps through the same single-cell NMF analysis using our R CoGAPS interface. The fourth introduces a beginner-friendly CoGAPS platform using GenePattern Notebook, aimed at users with a working conceptual knowledge of data analysis but without a basic proficiency in the R or Python programming language. We also constructed a user-facing website to serve as a central repository for information and instructional materials about CoGAPS and its application programming interfaces. The expected time to set up the packages and conduct a test run is around 15 min, with an additional 30 min to conduct analyses on a precomputed result. The expected runtime on the user's desired dataset can vary from hours to days depending on factors such as dataset size or input parameters.
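For readers unfamiliar with NMF, the factorization of a non-negative matrix V into non-negative factors W and H can be sketched in a few lines. The example below uses the classical multiplicative update rules for the squared-error objective, not the Bayesian CoGAPS algorithm described in the abstract; the function names, the toy matrix `V`, and the parameters `n_iter`, `seed`, and `eps` are assumptions chosen for illustration.

```python
import random


def matmul(a, b):
    """Multiply matrices given as lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def frobenius_error(v, w, h):
    """Squared Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, rank, n_iter=200, seed=0, eps=1e-9):
    """Plain multiplicative-update NMF (not CoGAPS): V ~ W H, with W, H >= 0."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    # Strictly positive random initialization.
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    errors = [frobenius_error(v, w, h)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H)
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(rank)]
        # W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(rank)]
             for i in range(n)]
        errors.append(frobenius_error(v, w, h))
    return w, h, errors


# Toy genes-by-cells matrix with two non-overlapping expression patterns.
V = [
    [5, 5, 0, 0],
    [4, 4, 0, 0],
    [0, 0, 3, 3],
    [0, 0, 2, 2],
]
W, H, errors = nmf(V, rank=2)
```

Each column of `W` can be read as a gene pattern and each row of `H` as that pattern's weight across cells; the `errors` list tracks the reconstruction error so convergence can be inspected.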
Affiliation(s)
- Jeanette A I Johnson
  - Department of Oncology, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University, Baltimore, MD, USA
  - Convergence Institute, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University, Baltimore, MD, USA
- Ashley P Tsang
  - Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
- Jacob T Mitchell
  - Convergence Institute, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University, Baltimore, MD, USA
  - Department of Genetic Medicine, Johns Hopkins University, Baltimore, MD, USA
- David L Zhou
  - Department of Neuroscience, Johns Hopkins University, Baltimore, MD, USA
- Julia Bowden
  - Department of Oncology, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University, Baltimore, MD, USA
  - Convergence Institute, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University, Baltimore, MD, USA
- Emily Davis-Marcisak
  - Convergence Institute, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University, Baltimore, MD, USA
  - Department of Genetic Medicine, Johns Hopkins University, Baltimore, MD, USA
- Thomas Sherman
  - Department of Oncology, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University, Baltimore, MD, USA
- Ted Liefeld
  - Department of Medicine, Moores Cancer Center, University of California San Diego, San Diego, CA, USA
- Melanie Loth
  - Department of Oncology, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University, Baltimore, MD, USA
  - Convergence Institute, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University, Baltimore, MD, USA
- Loyal A Goff
  - Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
  - Department of Neuroscience, Johns Hopkins University, Baltimore, MD, USA
  - Kavli Neurodiscovery Institute, Johns Hopkins University, Baltimore, MD, USA
  - Single Cell Training and Analysis Center, Johns Hopkins University, Baltimore, MD, USA
- Jacquelyn W Zimmerman
  - Department of Oncology, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University, Baltimore, MD, USA
  - Convergence Institute, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University, Baltimore, MD, USA
- Ben Kinny-Köster
  - Department of Surgery, Johns Hopkins University School of Medicine, Baltimore, MD, USA
- Elizabeth M Jaffee
  - Department of Oncology, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University, Baltimore, MD, USA
  - Convergence Institute, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University, Baltimore, MD, USA
- Pablo Tamayo
  - Department of Medicine, Moores Cancer Center, University of California San Diego, San Diego, CA, USA
- Jill P Mesirov
  - Department of Medicine, Moores Cancer Center, University of California San Diego, San Diego, CA, USA
- Michael Reich
  - Department of Medicine, Moores Cancer Center, University of California San Diego, San Diego, CA, USA
- Elana J Fertig
  - Department of Oncology, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University, Baltimore, MD, USA
  - Convergence Institute, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University, Baltimore, MD, USA
  - Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
  - Single Cell Training and Analysis Center, Johns Hopkins University, Baltimore, MD, USA
  - Department of Applied Mathematics and Statistics, Johns Hopkins University, Baltimore, MD, USA
- Genevieve L Stein-O'Brien
  - Department of Oncology, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University, Baltimore, MD, USA
  - Convergence Institute, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University, Baltimore, MD, USA
  - Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
  - Department of Neuroscience, Johns Hopkins University, Baltimore, MD, USA
  - Kavli Neurodiscovery Institute, Johns Hopkins University, Baltimore, MD, USA
  - Single Cell Training and Analysis Center, Johns Hopkins University, Baltimore, MD, USA
11
Yoon SH, Nam JW. Clustering malignant cell states using universally variable genes. Brief Bioinform 2023; 25:bbad460. [PMID: 38084922 PMCID: PMC10783859 DOI: 10.1093/bib/bbad460] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/09/2023] [Revised: 11/20/2023] [Accepted: 11/22/2023] [Indexed: 12/18/2023] Open
Abstract
Single-cell RNA sequencing (scRNA-seq) has revealed important insights into the heterogeneity of malignant cells. However, sample-specific genomic alterations often confound such analysis, resulting in patient-specific clusters that are difficult to interpret. Here, we present a novel approach to address the issue. By normalizing gene expression variances to identify universally variable genes (UVGs), we were able to reduce the formation of sample-specific clusters and identify underlying molecular hallmarks in malignant cells. In contrast to highly variable genes vulnerable to a specific sample bias, UVGs led to better detection of clusters corresponding to distinct malignant cell states. Our results demonstrate the utility of this approach for analyzing scRNA-seq data and suggest avenues for further exploration of malignant cell heterogeneity.
Affiliation(s)
- Sang-Ho Yoon
  - Department of Life Science, College of Natural Sciences, Hanyang University, Seoul 04763, Republic of Korea
  - Hanyang Institute of Advanced BioConvergence, Hanyang University, Seoul 04763, Republic of Korea
  - Hanyang Institute of Bioscience and Biotechnology, Bio-BigData Research Center, Hanyang University, Seoul 04763, Republic of Korea
- Jin-Wu Nam
  - Department of Life Science, College of Natural Sciences, Hanyang University, Seoul 04763, Republic of Korea
  - Hanyang Institute of Advanced BioConvergence, Hanyang University, Seoul 04763, Republic of Korea
  - Research Institute for Convergence of Basic Sciences, Hanyang University, Seoul 04763, Republic of Korea
  - Hanyang Institute of Bioscience and Biotechnology, Bio-BigData Research Center, Hanyang University, Seoul 04763, Republic of Korea
12
Li R, Guan J, Wang Z, Zhou S. A new and effective two-step clustering approach for single cell RNA sequencing data. BMC Genomics 2023; 23:864. [PMID: 37946133 PMCID: PMC10636845 DOI: 10.1186/s12864-023-09577-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/14/2021] [Accepted: 08/10/2023] [Indexed: 11/12/2023] Open
Abstract
BACKGROUND The rapid development of single cell RNA sequencing (scRNA-seq) technology has produced huge amounts of scRNA-seq data, which greatly advance research in many biomedical fields involving tissue heterogeneity, pathogenesis of disease, drug resistance, etc. One major task in scRNA-seq data analysis is to cluster cells in terms of their expression characteristics. Up to now, a number of methods have been proposed to infer cell clusters, yet there is still much room to improve their performance. RESULTS In this paper, we develop a new two-step clustering approach to effectively cluster scRNA-seq data, called TSC (Two-Step Clustering). Specifically, we divide all cells into two types: core cells (those possibly lying around the centers of clusters) and non-core cells (those located in the boundary areas of clusters). We first cluster the core cells by hierarchical clustering (the first step) and then assign the non-core cells to the corresponding nearest clusters (the second step). Extensive experiments on 12 real scRNA-seq datasets show that TSC outperforms state-of-the-art methods. CONCLUSION TSC is an effective clustering method due to its two-step clustering strategy, and it is a useful tool for scRNA-seq data analysis.
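The two-step strategy described in this abstract can be sketched as follows. This is a minimal illustration, not the TSC implementation: the abstract does not say how core cells are identified, so the k-th nearest-neighbor distance used here as a density proxy, the single-linkage merge rule, the nearest-centroid assignment, and the parameters `k` and `core_fraction` are all assumptions.

```python
import math


def two_step_cluster(points, n_clusters, k=2, core_fraction=0.8):
    """Cluster core points hierarchically, then attach non-core points."""
    n = len(points)
    # Density proxy (assumption): distance to the k-th nearest neighbour.
    kth = []
    for i in range(n):
        d = sorted(math.dist(points[i], points[j]) for j in range(n) if j != i)
        kth.append(d[k - 1])
    order = sorted(range(n), key=lambda i: kth[i])
    n_core = max(n_clusters, int(n * core_fraction))
    core, non_core = order[:n_core], order[n_core:]

    # Step 1: single-linkage agglomerative clustering of the core points.
    clusters = [[i] for i in core]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(math.dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters[b])
        del clusters[b]

    # Step 2: assign each non-core point to the nearest cluster centroid.
    dim = len(points[0])
    centroids = [[sum(points[i][t] for i in members) / len(members)
                  for t in range(dim)] for members in clusters]
    labels = [None] * n
    for c, members in enumerate(clusters):
        for i in members:
            labels[i] = c
    for i in non_core:
        labels[i] = min(range(len(clusters)),
                        key=lambda c: math.dist(points[i], centroids[c]))
    return labels


# Two tight groups of four points, each with one boundary point nearby.
points = [
    (0.0, 0.0), (0.2, 0.0), (0.0, 0.2), (0.2, 0.2), (1.5, 1.5),
    (10.0, 10.0), (10.2, 10.0), (10.0, 10.2), (10.2, 10.2), (8.5, 8.5),
]
labels = two_step_cluster(points, n_clusters=2)
```

With these toy points, the eight tightly packed points are treated as core cells and the two outlying points are attached in the second step, so the first five points share one label and the last five share the other. On real scRNA-seq data the points would be cells in a reduced-dimension expression space.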
Affiliation(s)
- Ruiyi Li
- Translational Medical Center for Stem Cell Therapy, Shanghai East Hospital, and School of Medicine, Tongji University, 1239 Siping Road, 200092, Shanghai, China
- Department of Computer Science and Technology, Tongji University, 4800 Caoan Road, 201804, Shanghai, China
| | - Jihong Guan
- Department of Computer Science and Technology, Tongji University, 4800 Caoan Road, 201804, Shanghai, China
| | - Zhiye Wang
- Department of Computer Science and Technology, Tongji University, 4800 Caoan Road, 201804, Shanghai, China
| | - Shuigeng Zhou
- Shanghai Key Lab of Intelligent Information Processing, and School of Computer Science, Fudan University, 2005 Songhu Road, 200438, Shanghai, China
| |
|
13
|
Wu W, Zhang W, Hou W, Ma X. Multi-View Clustering With Graph Learning for scRNA-Seq Data. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:3535-3546. [PMID: 37486829 DOI: 10.1109/tcbb.2023.3298334] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Indexed: 07/26/2023]
Abstract
Advances in single-cell biotechnologies have made it possible to profile gene expression at the level of individual cells with single-cell RNA sequencing (scRNA-seq), providing an opportunity to study cellular distribution. Although significant effort has been devoted to the analysis of these data, many problems remain in studying cell-type distribution because of the heterogeneity, high dimensionality, and noise of scRNA-seq. In this study, a multi-view clustering with graph learning algorithm (MCGL) for scRNA-seq data is proposed, which consists of multi-view learning, graph learning, and cell type clustering. Because a single feature space is inadequate to comprehensively characterize the functions of cells, MCGL constructs multiple feature spaces and uses multi-view learning to characterize scRNA-seq data from different perspectives. MCGL adaptively learns the similarity graphs of cells, which removes the dependence on a fixed similarity measure and transforms scRNA-seq analysis into a multi-view clustering problem. MCGL decomposes the networks of cells into view-specific and common networks in multi-view learning, which better characterizes the topological relationship of cells. MCGL simultaneously utilizes multiple types of cell-cell networks and fully exploits the connection relationship between cells through the complementarity between networks to improve clustering performance. The graph learning, graph factorization, and cell-type clustering processes are accomplished simultaneously under one optimization framework. The performance of the MCGL algorithm is validated with ten scRNA-seq datasets of different scales, and experimental results imply that the proposed algorithm significantly outperforms fourteen state-of-the-art scRNA-seq algorithms.
|
14
|
Carbonetto P, Luo K, Sarkar A, Hung A, Tayeb K, Pott S, Stephens M. GoM DE: interpreting structure in sequence count data with differential expression analysis allowing for grades of membership. Genome Biol 2023; 24:236. [PMID: 37858253 PMCID: PMC10588049 DOI: 10.1186/s13059-023-03067-9] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Received: 03/03/2023] [Accepted: 09/20/2023] [Indexed: 10/21/2023] Open
Abstract
Parts-based representations, such as non-negative matrix factorization and topic modeling, have been used to identify structure from single-cell sequencing data sets, in particular structure that is not as well captured by clustering or other dimensionality reduction methods. However, interpreting the individual parts remains a challenge. To address this challenge, we extend methods for differential expression analysis by allowing cells to have partial membership to multiple groups. We call this grade of membership differential expression (GoM DE). We illustrate the benefits of GoM DE for annotating topics identified in several single-cell RNA-seq and ATAC-seq data sets.
Affiliation(s)
- Peter Carbonetto
- Department of Human Genetics, University of Chicago, Chicago, IL, USA
- Research Computing Center, University of Chicago, Chicago, IL, USA
| | - Kaixuan Luo
- Department of Human Genetics, University of Chicago, Chicago, IL, USA
| | - Abhishek Sarkar
- Department of Human Genetics, University of Chicago, Chicago, IL, USA
- Vesalius Therapeutics, Cambridge, MA, USA
| | - Anthony Hung
- Department of Human Genetics, University of Chicago, Chicago, IL, USA
- Section of Genetic Medicine, University of Chicago, Chicago, IL, USA
| | - Karl Tayeb
- Department of Human Genetics, University of Chicago, Chicago, IL, USA
- Committee on Genetics, Genomics and Systems Biology, University of Chicago, Chicago, IL, USA
| | - Sebastian Pott
- Department of Human Genetics, University of Chicago, Chicago, IL, USA
- Section of Genetic Medicine, University of Chicago, Chicago, IL, USA
| | - Matthew Stephens
- Department of Human Genetics, University of Chicago, Chicago, IL, USA
- Department of Statistics, University of Chicago, Chicago, IL, USA
| |
|
15
|
Wrobel TJ, Brilhaus D, Stefanski A, Stühler K, Weber APM, Linka N. Mapping the castor bean endosperm proteome revealed a metabolic interaction between plastid, mitochondria, and peroxisomes to optimize seedling growth. FRONTIERS IN PLANT SCIENCE 2023; 14:1182105. [PMID: 37868318 PMCID: PMC10588648 DOI: 10.3389/fpls.2023.1182105] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/08/2023] [Accepted: 08/07/2023] [Indexed: 10/24/2023]
Abstract
In this work, we studied castor-oil plant Ricinus communis as a classical system for endosperm reserve breakdown. The seeds of castor beans consist of a centrally located embryo with the two thin cotyledons surrounded by the endosperm. The endosperm functions as major storage tissue and is packed with nutritional reserves, such as oil, proteins, and starch. Upon germination, mobilization of the storage reserves requires inter-organellar interplay of plastids, mitochondria, and peroxisomes to optimize growth for the developing seedling. To understand their metabolic interactions, we performed a large-scale organellar proteomic study on castor bean endosperm. Organelles from endosperm of etiolated seedlings were isolated and subjected to liquid chromatography-tandem mass spectrometry (LC-MS/MS). Computer-assisted deconvolution algorithms were applied to reliably assign the identified proteins to their correct subcellular localization and to determine the abundance of the different organelles in the heterogeneous protein samples. The data obtained were used to build a comprehensive metabolic model for plastids, mitochondria, and peroxisomes during storage reserve mobilization in castor bean endosperm.
Affiliation(s)
- Thomas J. Wrobel
- Institute of Plant Biochemistry and Cluster of Excellence on Plant Sciences (CEPLAS), Heinrich Heine University, Düsseldorf, Germany
| | - Dominik Brilhaus
- Institute of Plant Biochemistry and Cluster of Excellence on Plant Sciences (CEPLAS), Heinrich Heine University, Düsseldorf, Germany
| | - Anja Stefanski
- Molecular Proteomics Laboratory, Biologisch-Medizinisches Forschungszentrum (BMFZ), Universitätsklinikum, Düsseldorf, Germany
| | - Kai Stühler
- Molecular Proteomics Laboratory, Biologisch-Medizinisches Forschungszentrum (BMFZ), Universitätsklinikum, Düsseldorf, Germany
| | - Andreas P. M. Weber
- Institute of Plant Biochemistry and Cluster of Excellence on Plant Sciences (CEPLAS), Heinrich Heine University, Düsseldorf, Germany
| | - Nicole Linka
- Institute of Plant Biochemistry and Cluster of Excellence on Plant Sciences (CEPLAS), Heinrich Heine University, Düsseldorf, Germany
| |
|
16
|
Gunawan I, Vafaee F, Meijering E, Lock JG. An introduction to representation learning for single-cell data analysis. CELL REPORTS METHODS 2023; 3:100547. [PMID: 37671013 PMCID: PMC10475795 DOI: 10.1016/j.crmeth.2023.100547] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Indexed: 09/07/2023]
Abstract
Single-cell-resolved systems biology methods, including omics- and imaging-based measurement modalities, generate a wealth of high-dimensional data characterizing the heterogeneity of cell populations. Representation learning methods are routinely used to analyze these complex, high-dimensional data by projecting them into lower-dimensional embeddings. This facilitates the interpretation and interrogation of the structures, dynamics, and regulation of cell heterogeneity. Reflecting their central role in analyzing diverse single-cell data types, a myriad of representation learning methods exist, with new approaches continually emerging. Here, we contrast general features of representation learning methods spanning statistical, manifold learning, and neural network approaches. We consider key steps involved in representation learning with single-cell data, including data pre-processing, hyperparameter optimization, downstream analysis, and biological validation. Interdependencies and contingencies linking these steps are also highlighted. This overview is intended to guide researchers in the selection, application, and optimization of representation learning strategies for current and future single-cell research applications.
Affiliation(s)
- Ihuan Gunawan
- School of Biomedical Sciences, Faculty of Medicine and Health, University of New South Wales, Sydney, NSW, Australia
- School of Computer Science and Engineering, Faculty of Engineering, University of New South Wales, Sydney, NSW, Australia
| | - Fatemeh Vafaee
- School of Biotechnology and Biomolecular Sciences, Faculty of Science, University of New South Wales, Sydney, NSW, Australia
- UNSW Data Science Hub, University of New South Wales, Sydney, NSW, Australia
| | - Erik Meijering
- School of Computer Science and Engineering, Faculty of Engineering, University of New South Wales, Sydney, NSW, Australia
| | - John George Lock
- School of Biomedical Sciences, Faculty of Medicine and Health, University of New South Wales, Sydney, NSW, Australia
- UNSW Data Science Hub, University of New South Wales, Sydney, NSW, Australia
- Ingham Institute for Applied Medical Research, Liverpool, NSW, Australia
| |
|
17
|
Zhang H, Lu X, Lu B, Chen L. scGEM: Unveiling the Nested Tree-Structured Gene Co-Expressing Modules in Single Cell Transcriptome Data. Cancers (Basel) 2023; 15:4277. [PMID: 37686554 PMCID: PMC10486867 DOI: 10.3390/cancers15174277] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/20/2023] [Revised: 08/22/2023] [Accepted: 08/25/2023] [Indexed: 09/10/2023] Open
Abstract
BACKGROUND Single-cell transcriptome analysis has fundamentally changed biological research by allowing higher-resolution computational analysis of individual cells and subsets of cell types. However, few methods have met the need to recognize and quantify the underlying cellular programs that determine the specialization and differentiation of the cell types. METHODS In this study, we present scGEM, a nested tree-structured nonparametric Bayesian model, to reveal the gene co-expression modules (GEMs) reflecting transcriptome processes in single cells. RESULTS We show that scGEM can discover shared and specialized transcriptome signals across different cell types using peripheral blood mononuclear single cells and early brain development single cells. scGEM outperformed other methods in perplexity and topic coherence (p < 0.001) on our simulation data. Larger datasets, deeper trees and pre-trained models are shown to be positively associated with better scGEM performance. The GEMs obtained from triple-negative breast cancer single cells exhibited better correlations with lymphocyte infiltration (p = 0.009) and the cell cycle (p < 0.001) than other methods in additional validation on the bulk RNAseq dataset. CONCLUSIONS Altogether, we demonstrate that scGEM can be used to model the hidden cellular functions of single cells, thereby unveiling the specialization and generalization of transcriptomic programs across different types of cells.
Affiliation(s)
- Han Zhang
- Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA 15206, USA
| | - Xinghua Lu
- Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA 15206, USA
- UPMC Hillman Cancer Center, Pittsburgh, PA 15232, USA
| | - Binfeng Lu
- Center for Discovery and Innovation, Hackensack Meridian Health, Nutley, NJ 07110, USA
| | - Lujia Chen
- Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA 15206, USA
| |
|
18
|
Su Y, Lin R, Wang J, Tan D, Zheng C. Denoising adaptive deep clustering with self-attention mechanism on single-cell sequencing data. Brief Bioinform 2023; 24:7008799. [PMID: 36715275 DOI: 10.1093/bib/bbad021] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/20/2022] [Revised: 12/20/2022] [Accepted: 01/05/2023] [Indexed: 01/31/2023] Open
Abstract
A large number of studies have used single-cell RNA sequencing (scRNA-seq) to study the diversity and biological functions of cells at the single-cell level. Clustering identifies unknown cell types, which is essential for downstream analysis of scRNA-seq samples. However, the high dimensionality, high noise and pervasive dropout rate of scRNA-seq samples pose a significant challenge to their cluster analysis. Herein, we propose a new adaptive fuzzy clustering model based on the denoising autoencoder and self-attention mechanism, called scDASFK. It uses comparative learning to integrate cell similarity information into the clustering method and uses a deep denoising network module to denoise the data. scDASFK includes a self-attention mechanism for further denoising, in which an adaptive clustering optimization function for iterative clustering is implemented. To make the denoised latent features better reflect the cell structure, we introduce a new adaptive feedback mechanism that supervises the denoising process through the clustering results. Experiments on 16 real scRNA-seq datasets show that scDASFK performs well in terms of clustering accuracy, scalability and stability. Overall, scDASFK is an effective clustering model with great potential for the analysis of scRNA-seq samples. The scDASFK code is freely available at https://github.com/LRX2022/scDASFK.
Affiliation(s)
- Yansen Su
- Key Lab of Intelligent Computing and Signal Processing of Ministry of Education, School of Artificial Intelligence, Anhui University, Hefei, 230601, China
| | - Rongxin Lin
- School of Computer Science and Technology, Anhui University, Hefei, 230601, China
| | - Jing Wang
- School of Computer Science and Technology, Anhui University, Hefei, 230601, China
| | - Dayu Tan
- Institutes of Physical Science and Information Technology, Anhui University, Hefei, 230601, China
| | - Chunhou Zheng
- Key Lab of Intelligent Computing and Signal Processing of Ministry of Education, School of Artificial Intelligence, Anhui University, Hefei, 230601, China
| |
|
19
|
Cheng X, Yan C, Jiang H, Qiu Y. scHOIS: Determining Cell Heterogeneity Through Hierarchical Clustering Based on Optimal Imputation Strategy. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:1431-1444. [PMID: 37815942 DOI: 10.1109/tcbb.2022.3203592] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/12/2023]
Abstract
Advances in single-cell RNA sequencing (scRNA-seq) technology provide an unbiased and high-throughput analysis of each cell at single-cell resolution, and further facilitate the development of cellular heterogeneity analysis. Despite the promise of scRNA-seq, the data generated by this method are sparse and noisy because of the presence of dropout events, which can greatly impact downstream analyses such as differential gene expression, cell type annotation, and lineage trajectory reconstruction. The development of effective and robust computational methods to address both dropout and clustering is thus urgently needed. In this study, we propose a flexible, accurate two-stage algorithm for single cell heterogeneity analysis via hierarchical clustering based on an optimal imputation strategy, called scHOIS. At the first stage, masked non-negative matrix factorization is applied to approximate the original observed scRNA-seq data, with optimal rank determined by variance analysis. At the second stage, hierarchical clustering is applied to group the imputed cells using Pearson correlation to measure similarity, with the optimal number of clusters determined by integrating three classical indexes. We performed extensive experiments on real-world datasets, which showed that scHOIS effectively and robustly distinguished cellular differences and that the clustering performance of this algorithm was superior to that of other state-of-the-art methods.
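The first stage described above, masked non-negative matrix factorization, can be sketched as follows. This is a generic weighted-NMF sketch with multiplicative updates, not the scHOIS code: the binary `mask` convention (1 = observed, 0 = treated as dropout), the fixed `rank`, and the iteration count are assumptions, and the variance-based rank selection mentioned in the abstract is not reproduced.

```python
import numpy as np


def masked_nmf(X, mask, rank, n_iter=200, eps=1e-9, seed=0):
    """Fit X ~ W @ H on the entries where mask == 1, with W, H >= 0."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W = rng.random((n, rank)) + 0.1
    H = rng.random((rank, m)) + 0.1
    MX = mask * X
    errors = []
    for _ in range(n_iter):
        # Multiplicative updates for the masked squared error.
        W *= (MX @ H.T) / ((mask * (W @ H)) @ H.T + eps)
        H *= (W.T @ MX) / (W.T @ (mask * (W @ H)) + eps)
        errors.append(float(np.sum((mask * (X - W @ H)) ** 2)))
    return W, H, errors


# Toy example: a rank-2 non-negative matrix with about 30% of entries hidden.
rng = np.random.default_rng(1)
truth = rng.random((6, 2)) @ rng.random((2, 8))
mask = (rng.random(truth.shape) > 0.3).astype(float)
observed = truth * mask
W, H, errors = masked_nmf(observed, mask, rank=2)
imputed = W @ H   # masked entries are filled in by the low-rank model
```

Because the loss ignores masked entries, zeros caused by dropout do not pull the factors toward zero; the product `W @ H` then supplies imputed values for those positions before the clustering stage.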
|
20
|
Jee DJ, Kong Y, Chun H. Deep Nonnegative Matrix Factorization Using a Variational Autoencoder With Application to Single-Cell RNA Sequencing Data. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:883-893. [PMID: 35511832 DOI: 10.1109/tcbb.2022.3172723] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/04/2023]
Abstract
Single-cell RNA sequencing is used to analyze the gene expression data of individual cells, thereby adding to existing knowledge of biological phenomena. Accordingly, this technology is widely used in numerous biomedical studies. Recently, the variational autoencoder has emerged and has been adopted for the analysis of single-cell data owing to its high capacity to manage large-scale data. Many different variants of the variational autoencoder have been applied, and have yielded superior results. However, because it is nonlinear, the model does not provide parameters that can be used to explain the underlying biological patterns. In this paper, we propose an interpretable nonnegative matrix factorization method that decomposes parameters into those shared across cells and those that are cell-specific. Effective nonlinear dimension reduction was achieved via a variational autoencoder applied to the cell-specific parameters. In addition to achieving nonlinear dimension reduction, our model could estimate the cell-type-specific gene expression. To improve the estimation accuracy, we introduced log-regularization, which reflects the single-cell property. Overall, our approach displayed excellent performance in a simulation study and in real data analyses, while maintaining good biological interpretability.
|
21
|
Ning Z, Dai Z, Zhang H, Chen Y, Yuan Z. A clustering method for small scRNA-seq data based on subspace and weighted distance. PeerJ 2023; 11:e14706. [PMID: 36710872 PMCID: PMC9879162 DOI: 10.7717/peerj.14706] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 07/25/2022] [Accepted: 12/15/2022] [Indexed: 01/24/2023] Open
Abstract
BACKGROUND Identifying cell types using unsupervised methods is essential for scRNA-seq research. However, conventional similarity measures introduce challenges to single-cell data clustering because of the high dimensionality, high noise, and high dropout rate of the data. METHODS We proposed a clustering method for small scRNA-seq data based on Subspace and Weighted Distance (SSWD), which follows the assumption that gene subspaces composed of genes with similar density distributions can better distinguish cell groups. To accurately capture the intrinsic relationship among cells or genes, a new distance metric that combines Euclidean and Pearson distance through a weighting strategy was proposed. The relative Calinski-Harabasz (CH) index was used to estimate the number of clusters instead of the CH index because it is comparable across degrees of freedom. RESULTS We compared SSWD with seven prevailing methods on eight publicly available scRNA-seq datasets. The experimental results show that SSWD achieves better clustering accuracy and better partitioning of cell groups. SSWD can be downloaded at https://github.com/ningzilan/SSWD.
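A distance that blends Euclidean and Pearson components, as this abstract describes, might look like the sketch below. The abstract does not state the weighting formula, so the convex combination with weight `w`, the rescaling of both components to the range 0 to 1, and the function name `weighted_distance` are assumptions for illustration rather than the SSWD definition.

```python
import numpy as np


def weighted_distance(X, w=0.5):
    """Blend scaled Euclidean distance with Pearson distance between rows of X.

    Rows are cells (or genes). Rows with zero variance would give NaN
    correlations and should be filtered out beforehand.
    """
    eucl = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    if eucl.max() > 0:
        eucl = eucl / eucl.max()              # assumed scaling to [0, 1]
    pearson = (1.0 - np.corrcoef(X)) / 2.0    # 1 - r lies in [0, 2]; halve it
    return w * eucl + (1.0 - w) * pearson


profiles = np.array([[1.0, 2.0, 3.0, 4.0],
                     [2.0, 4.0, 6.0, 8.0],    # same shape as row 0, larger magnitude
                     [4.0, 3.0, 2.0, 1.0]])   # reversed trend
D = weighted_distance(profiles, w=0.5)
D_pearson_only = weighted_distance(profiles, w=0.0)
```

The Euclidean term is sensitive to expression magnitude while the Pearson term responds only to the shape of the profile, so the weight controls which of the two dominates the comparison.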
Affiliation(s)
- Zilan Ning
- Hunan Engineering & Technology Research Centre for Agricultural Big Data Analysis & Decision-Making, Hunan Agricultural University, Changsha, Hunan, China
- Hunan Agricultural University, College of Information and Intelligence, Changsha, Hunan, China
| | - Zhijun Dai
- Hunan Engineering & Technology Research Centre for Agricultural Big Data Analysis & Decision-Making, Hunan Agricultural University, Changsha, Hunan, China
| | - Hongyan Zhang
- Hunan Agricultural University, College of Information and Intelligence, Changsha, Hunan, China
| | - Yuan Chen
- Hunan Engineering & Technology Research Centre for Agricultural Big Data Analysis & Decision-Making, Hunan Agricultural University, Changsha, Hunan, China
| | - Zheming Yuan
- Hunan Engineering & Technology Research Centre for Agricultural Big Data Analysis & Decision-Making, Hunan Agricultural University, Changsha, Hunan, China
| |
|
22
|
Wu W, Ma X. Network-Based Structural Learning Nonnegative Matrix Factorization Algorithm for Clustering of scRNA-Seq Data. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:566-575. [PMID: 35316190 DOI: 10.1109/tcbb.2022.3161131] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Indexed: 02/02/2023]
Abstract
Single-cell RNA sequencing (scRNA-seq) measures expression profiles at the single-cell level, which sheds light on the heterogeneity and functional diversity among cell populations. The vast majority of current algorithms identify cell types by directly clustering transcriptional profiles, which ignores indirect relations among cells and results in poor performance on cell type discovery and trajectory inference. Therefore, there is a critical need for inferring cell types and trajectories by exploiting the interactions among cells. In this study, we propose a network-based structural learning nonnegative matrix factorization algorithm (SLNMF) for the identification of cell types in scRNA-seq, which is formulated as a constrained optimization problem. SLNMF first constructs the similarity network for cells and then extracts latent features of the cells by exploiting the topological structure of the cell-cell network. To improve clustering performance, a structural constraint is imposed on the model so that the learned latent features preserve the structural information of the networks. Finally, we track the trajectory of cells by exploring the relationships among cell types. Fourteen scRNA-seq datasets, with the number of single cells varying from 49 to 26,484, are adopted to validate the performance of the algorithm. The experimental results demonstrate that SLNMF significantly outperforms fifteen state-of-the-art methods with a 15.32% improvement in accuracy, and it accurately identifies the trajectories of cells. The proposed model and methods provide an effective strategy to analyze scRNA-seq data. (The software is written in MATLAB and is freely available for academic use at https://github.com/xkmaxidian/SLNMF.)
|
23
|
Shu Z, Long Q, Zhang L, Yu Z, Wu XJ. Robust Graph Regularized NMF with Dissimilarity and Similarity Constraints for ScRNA-seq Data Clustering. J Chem Inf Model 2022; 62:6271-6286. [PMID: 36459053 DOI: 10.1021/acs.jcim.2c01305] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Indexed: 12/03/2022]
Abstract
The notable progress in single-cell RNA sequencing (ScRNA-seq) technology makes it possible to accurately discover the heterogeneity and diversity of cells. Clustering is an extremely important step in ScRNA-seq data analysis. However, directly clustering ScRNA-seq data cannot achieve satisfactory performance because of the high dimensionality and noise of the data. To address these issues, we propose a novel ScRNA-seq data representation model, termed Robust Graph regularized Non-Negative Matrix Factorization with Dissimilarity and Similarity constraints (RGNMF-DS), for ScRNA-seq data clustering. To accurately characterize the structure information of the labeled samples and the unlabeled samples, respectively, the proposed RGNMF-DS model adopts a pair of complementary regularizers (i.e., similarity and dissimilarity regularizers) to guide matrix decomposition. In addition, we construct a graph regularizer to discover the local geometric structure hidden in ScRNA-seq data. Moreover, we adopt the l2,1-norm to measure the reconstruction error and thereby effectively improve the robustness of the proposed RGNMF-DS model to noise. Experimental results on several ScRNA-seq datasets have demonstrated that our proposed RGNMF-DS model outperforms other state-of-the-art competitors in clustering.
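The robustness argument for the l2,1-norm can be seen in a small numerical sketch. The convention used here, one column per cell with the l2,1-norm taken as the sum of the Euclidean norms of the columns, is an assumption; the abstract does not state the orientation, and the helper name `l21_norm` is illustrative.

```python
import numpy as np


def l21_norm(E):
    """Sum of the Euclidean norms of the columns of E (one column per cell)."""
    return float(np.linalg.norm(E, axis=0).sum())


# Reconstruction residual for three cells: the first column is an outlier.
residual = np.array([[3.0, 0.0, 0.0],
                     [4.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0]])
l21 = l21_norm(residual)                      # column norms are 5, 1 and 0
squared_fro = float(np.sum(residual ** 2))    # squared entries sum to 25 + 1 + 0
outlier_share_l21 = 5.0 / l21
outlier_share_fro = 25.0 / squared_fro
```

Under the squared Frobenius loss the outlier cell enters quadratically and accounts for nearly all of the error, whereas under the l2,1-norm it enters linearly, which is why a badly fitted cell has less influence on the factorization.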
Affiliation(s)
- Zhenqiu Shu
- Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650093, China
| | - Qinghan Long
- Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650093, China
| | - Luping Zhang
- Library of Kunming Medical University, Kunming 650031, China
| | - Zhengtao Yu
- Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650093, China
| | - Xiao-Jun Wu
- Jiangsu Provincial Engineering Laboratory of Pattern Recognition and Computational Intelligence, Jiangnan University, Wuxi 214122, China
| |
|
24
|
Su M, Pan T, Chen QZ, Zhou WW, Gong Y, Xu G, Yan HY, Li S, Shi QZ, Zhang Y, He X, Jiang CJ, Fan SC, Li X, Cairns MJ, Wang X, Li YS. Data analysis guidelines for single-cell RNA-seq in biomedical studies and clinical applications. Mil Med Res 2022; 9:68. [PMID: 36461064 PMCID: PMC9716519 DOI: 10.1186/s40779-022-00434-8] [Citation(s) in RCA: 35] [Impact Index Per Article: 11.7] [Received: 09/27/2022] [Accepted: 11/18/2022] [Indexed: 12/03/2022] Open
Abstract
The application of single-cell RNA sequencing (scRNA-seq) in biomedical research has advanced our understanding of the pathogenesis of disease and provided valuable insights into new diagnostic and therapeutic strategies. With the expansion of capacity for high-throughput scRNA-seq, including clinical samples, the analysis of these huge volumes of data has become a daunting prospect for researchers entering this field. Here, we review the workflow for typical scRNA-seq data analysis, covering raw data processing and quality control, basic data analysis applicable for almost all scRNA-seq data sets, and advanced data analysis that should be tailored to specific scientific questions. While summarizing the current methods for each analysis step, we also provide an online repository of software and wrapped-up scripts to support the implementation. Recommendations and caveats are pointed out for some specific analysis tasks and approaches. We hope this resource will be helpful to researchers engaging with scRNA-seq, in particular for emerging clinical applications.
Affiliation(s)
- Min Su
- State Key Laboratory of Reproductive Medicine, Nanjing Medical University, Nanjing, 211166 China
| | - Tao Pan
- College of Biomedical Information and Engineering, the First Affiliated Hospital of Hainan Medical University, Hainan Medical University, Haikou, 571199 Hainan China
| | - Qiu-Zhen Chen
- State Key Laboratory of Reproductive Medicine, Nanjing Medical University, Nanjing, 211166 China
| | - Wei-Wei Zhou
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, 150081 Heilongjiang China
| | - Yi Gong
- State Key Laboratory of Reproductive Medicine, Nanjing Medical University, Nanjing, 211166 China
- Department of Immunology, Nanjing Medical University, Nanjing, 211166 China
| | - Gang Xu
- College of Biomedical Information and Engineering, the First Affiliated Hospital of Hainan Medical University, Hainan Medical University, Haikou, 571199 Hainan China
| | - Huan-Yu Yan
- State Key Laboratory of Reproductive Medicine, Nanjing Medical University, Nanjing, 211166 China
| | - Si Li
- College of Biomedical Information and Engineering, the First Affiliated Hospital of Hainan Medical University, Hainan Medical University, Haikou, 571199 Hainan China
| | - Qiao-Zhen Shi
- State Key Laboratory of Reproductive Medicine, Nanjing Medical University, Nanjing, 211166 China
| | - Ya Zhang
- College of Biomedical Information and Engineering, the First Affiliated Hospital of Hainan Medical University, Hainan Medical University, Haikou, 571199 Hainan China
| | - Xiao He
- Department of Laboratory Medicine, Women and Children’s Hospital of Chongqing Medical University, Chongqing, 401174 China
| | | | - Shi-Cai Fan
- Shenzhen Institute for Advanced Study, University of Electronic Science and Technology of China, Shenzhen, 518110 Guangdong China
| | - Xia Li
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, 150081 Heilongjiang China
| | - Murray J. Cairns
- School of Biomedical Sciences and Pharmacy, Faculty of Health and Medicine, the University of Newcastle, University Drive, Callaghan, NSW 2308 Australia
- Precision Medicine Research Program, Hunter Medical Research Institute, New Lambton Heights, NSW 2305 Australia
| | - Xi Wang
- State Key Laboratory of Reproductive Medicine, Nanjing Medical University, Nanjing, 211166 China
| | - Yong-Sheng Li
- College of Biomedical Information and Engineering, the First Affiliated Hospital of Hainan Medical University, Hainan Medical University, Haikou, 571199 Hainan China
| |
|
25
|
Li RY, Wang Z, Guan J, Zhou S. Effectively Clustering Single Cell RNA Sequencing Data by Sparse Representation. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:3425-3434. [PMID: 34788219 DOI: 10.1109/tcbb.2021.3128576] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 06/13/2023]
Abstract
Clustering analysis has been widely used in analyzing single-cell RNA-sequencing (scRNA-seq) data to study various biological problems at the cellular level. Although a number of scRNA-seq data clustering methods have been developed, most of them evaluate the similarity of pairs of cells while ignoring the global relationships among cells, and therefore sometimes fail to capture the latent structure of cells. In this paper, we propose a new clustering method, SPARC, for scRNA-seq data. The most important feature of SPARC is a novel similarity metric that uses the sparse representation coefficients of each cell in terms of the other cells to measure the relationships among cells. In addition, we develop an outlier detection method to help parameter selection in SPARC. We compare SPARC with nine existing scRNA-seq data clustering methods on twelve real datasets. Experimental results show that SPARC achieves state-of-the-art performance. By further analyzing the cell similarity data derived from sparse representations, we find that SPARC is much more effective in mining high-quality clusters of scRNA-seq data than two traditional similarity metrics. In conclusion, this study provides a new way to effectively cluster scRNA-seq data and achieves more accurate clustering results than state-of-the-art methods.
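A similarity built from sparse representation coefficients, as described above, can be sketched in the style of sparse subspace clustering. This is not the SPARC implementation: the Lasso objective, the ISTA solver, the penalty `lam`, and the symmetrization `|C| + |C|^T` are assumptions chosen to illustrate the general idea of expressing each cell as a sparse combination of the other cells.

```python
import numpy as np


def sparse_codes(X, lam=0.1, n_iter=500):
    """Column j of C approximately minimizes
    0.5 * ||x_j - D c||^2 + lam * ||c||_1 subject to c_j = 0,
    where D holds all cells as columns (solved by proximal gradient, ISTA)."""
    D = X.T                                     # genes x cells dictionary
    n = D.shape[1]
    step = 1.0 / np.linalg.norm(D, 2) ** 2      # 1 / Lipschitz constant
    C = np.zeros((n, n))
    for _ in range(n_iter):
        C = C - step * (D.T @ (D @ C - D))      # gradient step on the squared error
        C = np.sign(C) * np.maximum(np.abs(C) - step * lam, 0.0)  # soft threshold
        np.fill_diagonal(C, 0.0)                # a cell may not represent itself
    return C


def sparse_similarity(X, lam=0.1):
    C = sparse_codes(X, lam=lam)
    return np.abs(C) + np.abs(C).T              # symmetric, non-negative similarity


# Toy data: cells 0-2 express only genes 0-2, cells 3-5 express only genes 3-5.
cells = np.array([[1.0, 2.0, 3.0, 0.0, 0.0, 0.0],
                  [2.0, 4.0, 6.0, 0.0, 0.0, 0.0],
                  [1.0, 1.0, 1.0, 0.0, 0.0, 0.0],
                  [0.0, 0.0, 0.0, 3.0, 1.0, 2.0],
                  [0.0, 0.0, 0.0, 6.0, 2.0, 4.0],
                  [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]])
S = sparse_similarity(cells)
```

Each coefficient depends on all other cells at once, so the resulting similarity reflects global structure rather than isolated pairwise comparisons; in this toy case cells from different groups share no genes and receive zero similarity.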
|
26
|
Cuevas-Diaz Duran R, González-Orozco JC, Velasco I, Wu JQ. Single-cell and single-nuclei RNA sequencing as powerful tools to decipher cellular heterogeneity and dysregulation in neurodegenerative diseases. Front Cell Dev Biol 2022; 10:884748. [PMID: 36353512 PMCID: PMC9637968 DOI: 10.3389/fcell.2022.884748] [Citation(s) in RCA: 30] [Impact Index Per Article: 10.0] [Received: 02/27/2022] [Accepted: 10/06/2022] [Indexed: 08/10/2023] Open
Abstract
Neurodegenerative diseases affect millions of people worldwide and there are currently no cures. Two types of common neurodegenerative diseases are Alzheimer's (AD) and Parkinson's disease (PD). Single-cell and single-nuclei RNA sequencing (scRNA-seq and snRNA-seq) have become powerful tools to elucidate the inherent complexity and dynamics of the central nervous system at cellular resolution. This technology has allowed the identification of cell types and states, providing new insights into cellular susceptibilities and molecular mechanisms underlying neurodegenerative conditions. Exciting research using high throughput scRNA-seq and snRNA-seq technologies to study AD and PD is emerging. Herein we review the recent progress in understanding these neurodegenerative diseases using these state-of-the-art technologies. We discuss the fundamental principles and implications of single-cell sequencing of the human brain. Moreover, we review some examples of the computational and analytical tools required to interpret the extensive amount of data generated from these assays. We conclude by highlighting challenges and limitations in the application of these technologies in the study of AD and PD.
Affiliation(s)
| | | | - Iván Velasco
- Instituto de Fisiología Celular—Neurociencias, Universidad Nacional Autónoma de México, Mexico City, Mexico
- Laboratorio de Reprogramación Celular, Instituto Nacional de Neurología y Neurocirugía “Manuel Velasco Suárez”, Mexico City, Mexico
| | - Jia Qian Wu
- The Vivian L. Smith Department of Neurosurgery, McGovern Medical School, The University of Texas Health Science Center at Houston, Houston, TX, United States
- Center for Stem Cell and Regenerative Medicine, UT Brown Foundation Institute of Molecular Medicine, Houston, TX, United States
- MD Anderson Cancer Center UTHealth Graduate School of Biomedical Sciences, Houston, TX, United States
| |
|
27
|
Breitenbach T, Schmitt MJ, Dandekar T. Optimization of synthetic molecular reporters for a mesenchymal glioblastoma transcriptional program by integer programing. Bioinformatics 2022; 38:4162-4171. [PMID: 35809064 DOI: 10.1093/bioinformatics/btac488] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/07/2022] [Revised: 06/05/2022] [Accepted: 07/07/2022] [Indexed: 12/24/2022] Open
Abstract
MOTIVATION A recent approach to perform genetic tracing of complex biological problems involves the generation of synthetic deoxyribonucleic acid (DNA) probes that specifically mark cells with a phenotype of interest. These synthetic locus control regions (sLCRs), in turn, drive the expression of a reporter gene, such as fluorescent protein. To build functional and specific sLCRs, it is critical to accurately select multiple bona fide cis-regulatory elements from the target cell phenotype cistrome. This selection occurs by maximizing the number and diversity of transcription factors (TFs) within the sLCR, yet the size of the final sLCR should remain limited. RESULTS In this work, we discuss how optimization, in particular integer programing, can be used to systematically address the construction of a specific sLCR and optimize pre-defined properties of the sLCR. Our presented instance of a linear optimization problem maximizes the activation potential of the sLCR such that its size is limited to a pre-defined length and a minimum number of all TFs deemed sufficiently characteristic for the phenotype of interest is covered. We generated an sLCR to trace the mesenchymal glioblastoma program in patients by solving our corresponding linear program with the software optimizer Gurobi. Considering the binding strength of transcription factor binding sites (TFBSs) with their TFs as a proxy for activation potential, the optimized sLCR scores similarly to an sLCR experimentally validated in vivo, and is smaller in size while having the same coverage of TFBSs. AVAILABILITY AND IMPLEMENTATION We provide a Python implementation of the presented framework in the Supplementary Material with which an optimal selection of cis-regulatory elements can be calculated once the target set of TFs and their binding strength with their TFBSs is known. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
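The selection problem described above (maximize activation potential subject to a length limit and a minimum number of covered TFs) can be written down as a tiny 0/1 program. The sketch below is not the authors' implementation and does not use Gurobi: it solves a toy instance by exhaustive enumeration, and the element names, lengths, binding strengths, and the use of summed binding strength as the objective are all hypothetical.

```python
from itertools import combinations

# Hypothetical candidate cis-regulatory elements (all values are made up):
# name -> (length in bp, {TF name: binding strength})
ELEMENTS = {
    "cre_a": (200, {"TF1": 0.9, "TF2": 0.4}),
    "cre_b": (150, {"TF2": 0.8}),
    "cre_c": (300, {"TF3": 0.7, "TF1": 0.2}),
    "cre_d": (100, {"TF4": 0.5}),
}


def best_selection(elements, max_length, min_tfs):
    """Exhaustively solve the 0/1 selection problem:
    maximize the summed binding strength of the chosen elements, subject to
    total length <= max_length and at least min_tfs distinct TFs covered."""
    best_score, best_subset = None, None
    names = sorted(elements)
    for k in range(1, len(names) + 1):
        for subset in combinations(names, k):
            length = sum(elements[name][0] for name in subset)
            if length > max_length:
                continue  # violates the size constraint
            covered = set()
            for name in subset:
                covered.update(elements[name][1])
            if len(covered) < min_tfs:
                continue  # violates the coverage constraint
            score = sum(sum(elements[name][1].values()) for name in subset)
            if best_score is None or score > best_score:
                best_score, best_subset = score, subset
    return best_subset, best_score


subset, score = best_selection(ELEMENTS, max_length=500, min_tfs=3)
print(subset, round(score, 3))
```

Enumeration visits every subset, so it is only practical for a handful of candidates; a dedicated integer programming solver, as used in the paper, is the appropriate tool at realistic scale.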
Affiliation(s)
- Tim Breitenbach
- Biozentrum, Julius-Maximilians-Universität, Würzburg 97074, Germany
| | - Matthias Jürgen Schmitt
- Max-Delbrück-Centrum für Molekulare Medizin (MDC), Helmholtz-Gemeinschaft, Berlin 13125, Germany
| | - Thomas Dandekar
- Biozentrum, Julius-Maximilians-Universität, Würzburg 97074, Germany
| |
|
28
|
Unified K-means coupled self-representation and neighborhood kernel learning for clustering single-cell RNA-sequencing data. Neurocomputing 2022. [DOI: 10.1016/j.neucom.2022.06.046] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/01/2023]
|
29
|
Single-cell multiomics analysis reveals regulatory programs in clear cell renal cell carcinoma. Cell Discov 2022; 8:68. [PMID: 35853872 PMCID: PMC9296597 DOI: 10.1038/s41421-022-00415-0] [Citation(s) in RCA: 50] [Impact Index Per Article: 16.7] [Received: 11/16/2021] [Accepted: 04/26/2022] [Indexed: 01/01/2023] Open
Abstract
The clear cell renal cell carcinoma (ccRCC) microenvironment consists of many different cell types and structural components that play critical roles in cancer progression and drug resistance, but the cellular architecture and underlying gene regulatory features of ccRCC have not been fully characterized. Here, we applied single-cell RNA sequencing (scRNA-seq) and single-cell assay for transposase-accessible chromatin sequencing (scATAC-seq) to generate transcriptional and epigenomic landscapes of ccRCC. We identified tumor cell-specific regulatory programs mediated by four key transcription factors (TFs) (HOXC5, VENTX, ISL1, and OTP), and these TFs have prognostic significance in The Cancer Genome Atlas (TCGA) database. Targeting these TFs via short hairpin RNAs (shRNAs) or small molecule inhibitors decreased tumor cell proliferation. We next performed an integrative analysis of chromatin accessibility and gene expression for CD8+ T cells and macrophages to reveal the different regulatory elements in their subgroups. Furthermore, we delineated the intercellular communications mediated by ligand–receptor interactions within the tumor microenvironment. Taken together, our multiomics approach further clarifies the cellular heterogeneity of ccRCC and identifies potential therapeutic targets.
|
30
|
Liang Z, Zheng R, Chen S, Yan X, Li M. A deep matrix factorization based approach for single-cell RNA-seq data clustering. Methods 2022; 205:114-122. [PMID: 35777719 DOI: 10.1016/j.ymeth.2022.06.010] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/20/2022] [Revised: 05/28/2022] [Accepted: 06/24/2022] [Indexed: 11/17/2022] Open
Abstract
The rapid development of single-cell sequencing technologies makes it possible to analyze cellular heterogeneity at the single-cell level. Cell clustering is one of the most fundamental and common steps in heterogeneity analysis. However, due to the high noise level, high dimensionality and high sparsity, accurate cell clustering is still challenging. Here, we present DeepCI, a new clustering approach for scRNA-seq data. Using two autoencoders to obtain cell embedding and gene embedding, DeepCI can simultaneously learn a low-dimensional cell representation and a clustering. In addition, the recovered gene expression matrix can be obtained by the matrix multiplication of the cell and gene embeddings. To evaluate the performance of DeepCI, we applied it to several real scRNA-seq datasets for clustering and visualization analysis. The experimental results show that DeepCI achieves better overall performance than several popular single cell analysis methods. We also evaluated the imputation performance of DeepCI by a dedicated experiment. The corresponding results show that the imputed gene expression of known specific marker genes can greatly improve the accuracy of cell type classification.
Affiliation(s)
- Zhenlan Liang
- School of Computer Science and Engineering, Central South University, Changsha 410083, China
| | - Ruiqing Zheng
- School of Computer Science and Engineering, Central South University, Changsha 410083, China.
| | - Siqi Chen
- School of Computer Science and Engineering, Central South University, Changsha 410083, China
| | - Xuhua Yan
- School of Computer Science and Engineering, Central South University, Changsha 410083, China
| | - Min Li
- School of Computer Science and Engineering, Central South University, Changsha 410083, China.
| |
|
31
|
Zeira R, Land M, Strzalkowski A, Raphael BJ. Alignment and integration of spatial transcriptomics data. Nat Methods 2022; 19:567-575. [PMID: 35577957 PMCID: PMC9334025 DOI: 10.1038/s41592-022-01459-6] [Citation(s) in RCA: 116] [Impact Index Per Article: 38.7] [Received: 07/16/2021] [Accepted: 03/17/2022] [Indexed: 01/05/2023]
Abstract
Spatial transcriptomics (ST) measures mRNA expression across thousands of spots from a tissue slice while recording the two-dimensional (2D) coordinates of each spot. We introduce probabilistic alignment of ST experiments (PASTE), a method to align and integrate ST data from multiple adjacent tissue slices. PASTE computes pairwise alignments of slices using an optimal transport formulation that models both transcriptional similarity and physical distances between spots. PASTE further combines pairwise alignments to construct a stacked 3D alignment of a tissue. Alternatively, PASTE can integrate multiple ST slices into a single consensus slice. We show that PASTE accurately aligns spots across adjacent slices in both simulated and real ST data, demonstrating the advantages of using both transcriptional similarity and spatial information. We further show that the PASTE integrated slice improves the identification of cell types and differentially expressed genes compared with existing approaches that either analyze single ST slices or ignore spatial information.
Affiliation(s)
- Ron Zeira
- Department of Computer Science, Princeton University, Princeton, NJ, USA
| | - Max Land
- Department of Computer Science, Princeton University, Princeton, NJ, USA
| | | | - Benjamin J Raphael
- Department of Computer Science, Princeton University, Princeton, NJ, USA.
| |
|
32
|
Wu W, Zhang W, Ma X. Network-based integrative analysis of single-cell transcriptomic and epigenomic data for cell types. Brief Bioinform 2022; 23:bbab546. [PMID: 35043143 DOI: 10.1093/bib/bbab546] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 09/07/2021] [Revised: 11/09/2021] [Accepted: 11/27/2021] [Indexed: 02/02/2023] Open
Abstract
Advances in single-cell biotechnologies simultaneously generate the transcriptomic and epigenomic profiles at the cell level, providing an opportunity for investigating cell fates. Although great efforts have been devoted to either of them, the integrative analysis of single-cell multi-omics data is really limited because of the heterogeneity, noise and sparsity of single-cell profiles. In this study, a network-based integrative clustering algorithm (aka NIC) is presented for the identification of cell types by fusing the parallel single-cell transcriptomic (scRNA-seq) and epigenomic profiles (scATAC-seq or DNA methylation). To avoid heterogeneity of multi-omics data, NIC automatically learns the cell-cell similarity graphs, which transforms the fusion of multi-omics data into the analysis of multiple networks. Then, NIC employs joint non-negative matrix factorization to learn the shared features of cells by exploiting the structure of the learned cell-cell similarity networks, providing a better way to characterize the features of cells. The graph learning and integrative analysis procedures are jointly formulated as an optimization problem, and then the update rules are derived. Thirteen single-cell multi-omics datasets from various tissues and organisms are adopted to validate the performance of NIC, and the experimental results demonstrate that the proposed algorithm significantly outperforms the state-of-the-art methods in terms of various measurements. The proposed algorithm provides an effective strategy for the integrative analysis of single-cell multi-omics data (the software is coded in Matlab and is freely available for academic use at https://github.com/xkmaxidian/NIC).
Affiliation(s)
- Wenming Wu
- School of Computer Science and Technology, Xidian University, Xi'an, 710071, China
| | - Wensheng Zhang
- Institute of Automation, Chinese Academy of Sciences, Beijing, 100190, China
| | - Xiaoke Ma
- School of Computer Science and Technology, Xidian University, Xi'an, 710071, China
| |
|
33
|
Ou-Yang L, Lu F, Zhang ZC, Wu M. Matrix factorization for biomedical link prediction and scRNA-seq data imputation: an empirical survey. Brief Bioinform 2021; 23:6447434. [PMID: 34864871 DOI: 10.1093/bib/bbab479] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 08/09/2021] [Revised: 09/25/2021] [Accepted: 10/18/2021] [Indexed: 02/02/2023] Open
Abstract
Advances in high-throughput experimental technologies promote the accumulation of vast number of biomedical data. Biomedical link prediction and single-cell RNA-sequencing (scRNA-seq) data imputation are two essential tasks in biomedical data analyses, which can facilitate various downstream studies and gain insights into the mechanisms of complex diseases. Both tasks can be transformed into matrix completion problems. For a variety of matrix completion tasks, matrix factorization has shown promising performance. However, the sparseness and high dimensionality of biomedical networks and scRNA-seq data have raised new challenges. To resolve these issues, various matrix factorization methods have emerged recently. In this paper, we present a comprehensive review on such matrix factorization methods and their usage in biomedical link prediction and scRNA-seq data imputation. Moreover, we select representative matrix factorization methods and conduct a systematic empirical comparison on 15 real data sets to evaluate their performance under different scenarios. By summarizing the experimental results, we provide general guidelines for selecting matrix factorization methods for different biomedical matrix completion tasks and point out some future directions to further improve the performance for biomedical link prediction and scRNA-seq data imputation.
Affiliation(s)
- Le Ou-Yang
- Guangdong Key Laboratory of Intelligent Information Processing, Shenzhen Key Laboratory of Media Security, and Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), College of Electronics and Information Engineering, Shenzhen University, Shenzhen, 518060, China; Shenzhen Institute of Artificial Intelligence and Robotics for Society, Shenzhen, 518172, China
| | - Fan Lu
- Guangdong Key Laboratory of Intelligent Information Processing, Shenzhen Key Laboratory of Media Security, and Guangdong Laboratory of Artificial Intelligence and Digital Economy(SZ), College of Electronics and Information Engineering, Shenzhen University, Shenzhen, 518060, China
| | - Zi-Chao Zhang
- Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai, 200433, China
| | - Min Wu
- Institute for Infocomm Research (I2R), A*STAR, 138632, Singapore
| |
|
34
|
Oh S, Park H, Zhang X. Hybrid Clustering of Single-Cell Gene Expression and Spatial Information via Integrated NMF and K-Means. Front Genet 2021; 12:763263. [PMID: 34819947 PMCID: PMC8606648 DOI: 10.3389/fgene.2021.763263] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 08/23/2021] [Accepted: 10/13/2021] [Indexed: 11/13/2022] Open
Abstract
Advances in single cell transcriptomics have allowed us to study the identity of single cells. This has led to the discovery of new cell types and high resolution tissue maps of them. Technologies that measure multiple modalities of such data add more detail, but they also complicate data integration. We offer an integrated analysis of the spatial location and gene expression profiles of cells to determine their identity. We propose scHybridNMF (single-cell Hybrid Nonnegative Matrix Factorization), which performs cell type identification by combining sparse nonnegative matrix factorization (sparse NMF) with k-means clustering to cluster high-dimensional gene expression and low-dimensional location data. We show that, under multiple scenarios, including the cases where there is a small number of genes profiled and the location data is noisy, scHybridNMF outperforms sparse NMF, k-means, and an existing method that uses a hidden Markov random field to encode cell location and gene expression data for cell type identification.
Affiliation(s)
- Sooyoun Oh
- School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, GA, United States
| | - Haesun Park
- School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, GA, United States
| | - Xiuwei Zhang
- School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, GA, United States
| |
|
35
|
Fang Q, Su D, Ng W, Feng J. An Effective Biclustering-Based Framework for Identifying Cell Subpopulations From scRNA-seq Data. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2021; 18:2249-2260. [PMID: 32167906 DOI: 10.1109/tcbb.2020.2979717] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Indexed: 06/10/2023]
Abstract
The advent of single-cell RNA sequencing (scRNA-seq) techniques opens up new opportunities for studying the cell-specific changes in the transcriptomic data. An important research problem related with scRNA-seq data analysis is to identify cell subpopulations with distinct functions. However, the expression profiles of individual cells are usually measured over tens of thousands of genes, and it remains a difficult problem to effectively cluster the cells based on the high-dimensional profiles. An additional challenge of performing the analysis is that, the scRNA-seq data are often noisy and sometimes extremely sparse due to technical limitations and sampling deficiencies. In this paper, we propose a biclustering-based framework called DivBiclust that effectively identifies the cell subpopulations based on the high-dimensional noisy scRNA-seq data. Compared with nine state-of-the-art methods, DivBiclust excels in identifying cell subpopulations with high accuracy as evidenced by our experiments on ten real scRNA-seq datasets with different size and diverse dropout rates. The supplemental materials of DivBiclust, including the source codes, data, and a supplementary document, are available at https://www.github.com/Qiong-Fang/DivBiclust.
|
36
|
Shiga M, Seno S, Onizuka M, Matsuda H. SC-JNMF: single-cell clustering integrating multiple quantification methods based on joint non-negative matrix factorization. PeerJ 2021; 9:e12087. [PMID: 34532161 PMCID: PMC8404576 DOI: 10.7717/peerj.12087] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 03/10/2021] [Accepted: 08/07/2021] [Indexed: 11/20/2022] Open
Abstract
Single-cell RNA-sequencing is a rapidly evolving technology that enables us to understand biological processes at unprecedented resolution. Single-cell expression analysis requires a complex data processing pipeline, and the pipeline is divided into two main parts: The quantification part, which converts the sequence information into gene-cell matrix data; the analysis part, which analyzes the matrix data using statistics and/or machine learning techniques. In the analysis part, unsupervised cell clustering plays an important role in identifying cell types and discovering cell diversity and subpopulations. Identified cell clusters are also used for subsequent analysis, such as finding differentially expressed genes and inferring cell trajectories. However, single-cell clustering using gene expression profiles shows different results depending on the quantification methods. Clustering results are greatly affected by the quantification method used in the upstream process. In other words, even if the original RNA-sequence data is the same, gene expression profiles processed by different quantification methods will produce different clusters. In this article, we propose a robust and highly accurate clustering method based on joint non-negative matrix factorization (joint-NMF) by utilizing the information from multiple gene expression profiles quantified using different methods from the same RNA-sequence data. Our joint-NMF can extract common factors among multiple gene expression profiles by applying each NMF under the constraint that one of the factorized matrices is shared among multiple NMFs. The joint-NMF determines more robust and accurate cell clustering results by leveraging multiple quantification methods compared to conventional clustering methods, which use only a single gene expression profile. Additionally, we showed the usefulness of discovering marker genes with the extracted features using our method.
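The shared-factor constraint described above can be sketched as follows: each expression profile X_i (genes x cells) is approximated as W_i H, with a single H shared across profiles. This is a minimal illustration using standard multiplicative updates for the summed squared error, not the SC-JNMF implementation; the abstract does not state the authors' objective, regularization, or update rules, and the function names and toy matrices are hypothetical.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def loss(xs, ws, h):
    """Summed squared reconstruction error over all profiles."""
    total = 0.0
    for x, w in zip(xs, ws):
        approx = matmul(w, h)
        total += sum((xv - av) ** 2
                     for xr, ar in zip(x, approx) for xv, av in zip(xr, ar))
    return total


def joint_nmf(xs, rank, iters=200, seed=0, eps=1e-9):
    """Factorize each X_i (genes x cells) as W_i @ H with one shared H."""
    rng = random.Random(seed)
    n_cells = len(xs[0][0])
    ws = [[[rng.random() + 0.1 for _ in range(rank)] for _ in x] for x in xs]
    h = [[rng.random() + 0.1 for _ in range(n_cells)] for _ in range(rank)]
    history = [loss(xs, ws, h)]
    for _ in range(iters):
        # Update each profile-specific W_i with the shared H held fixed.
        ht = transpose(h)
        hht = matmul(h, ht)
        for i, x in enumerate(xs):
            num = matmul(x, ht)
            den = matmul(ws[i], hht)
            ws[i] = [[w * n / (d + eps) for w, n, d in zip(wr, nr, dr)]
                     for wr, nr, dr in zip(ws[i], num, den)]
        # Update the shared H by pooling the contributions of all profiles.
        num = [[0.0] * n_cells for _ in range(rank)]
        den = [[0.0] * n_cells for _ in range(rank)]
        for x, w in zip(xs, ws):
            wt = transpose(w)
            n_i = matmul(wt, x)
            d_i = matmul(matmul(wt, w), h)
            num = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(num, n_i)]
            den = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(den, d_i)]
        h = [[hv * n / (d + eps) for hv, n, d in zip(hr, nr, dr)]
             for hr, nr, dr in zip(h, num, den)]
        history.append(loss(xs, ws, h))
    return ws, h, history


# Two made-up quantifications of the same four cells (rows are genes).
x1 = [[5, 4, 0, 0], [4, 5, 0, 0], [0, 0, 3, 4], [0, 0, 4, 3]]
x2 = [[2, 2, 0, 0], [0, 0, 6, 5], [0, 0, 1, 1]]
ws, h, history = joint_nmf([x1, x2], rank=2)
# One simple way to read clusters off the shared factor: the dominant component per cell.
labels = [max(range(2), key=lambda k: h[k][c]) for c in range(4)]
```

Sharing H is equivalent to factorizing the row-wise stack of the X_i, which is why one set of cell factors can summarize several quantifications at once.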
Affiliation(s)
- Mikio Shiga
- Graduate School of Information Science and Technology, Osaka University, Osaka, Japan
| | - Shigeto Seno
- Graduate School of Information Science and Technology, Osaka University, Osaka, Japan
| | - Makoto Onizuka
- Graduate School of Information Science and Technology, Osaka University, Osaka, Japan
| | - Hideo Matsuda
- Graduate School of Information Science and Technology, Osaka University, Osaka, Japan
| |
|
37
|
A Multiple Comprehensive Analysis of scATAC-seq Based on Auto-Encoder and Matrix Decomposition. Symmetry (Basel) 2021. [DOI: 10.3390/sym13081467] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/16/2022] Open
Abstract
Single-cell ATAC-seq (scATAC-seq), as the updating of ATAC-seq, provides a novel method for probing open chromatin sites. Currently, research of scATAC-seq is faced with the problem of high dimensionality and the inherent sparsity of the generated data. Recently, several works proposed the use of an autoencoder–decoder, a symmetry neural network architecture, and non-negative matrix factorization methods to characterize the high-dimensional data. To evaluate the performance of multiple methods, in this work, we performed a multiple comparison for characterizing scATAC-seq based on four kinds of auto-encoders known as a symmetry neural network, and two kinds of matrix factorization methods. Different sizes of latent features were used to generate the UMAP plots and for further K-means clustering. Using a gold-standard data set, we practically explored the performance among the methods and the number of latent features in a comprehensive way. Finally, we briefly discuss the underlying difficulties and future directions for scATAC-seq characterizing. As a result, the method designed for handling the sparsity outperforms other tools in the generated dataset.
|
38
|
Zhao Y, Fang ZY, Lin CX, Deng C, Xu YP, Li HD. RFCell: A Gene Selection Approach for scRNA-seq Clustering Based on Permutation and Random Forest. Front Genet 2021; 12:665843. [PMID: 34386033 PMCID: PMC8354212 DOI: 10.3389/fgene.2021.665843] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 02/09/2021] [Accepted: 04/01/2021] [Indexed: 11/13/2022] Open
Abstract
In recent years, the application of single cell RNA-seq (scRNA-seq) has become more and more popular in fields such as biology and medical research. Analyzing scRNA-seq data can discover complex cell populations and infer single-cell trajectories in cell development. Clustering is one of the most important methods to analyze scRNA-seq data. In this paper, we focus on improving scRNA-seq clustering through gene selection, which also reduces the dimensionality of scRNA-seq data. Studies have shown that gene selection for scRNA-seq data can improve clustering accuracy. Therefore, it is important to select genes with cell type specificity. Gene selection not only helps to reduce the dimensionality of scRNA-seq data, but also can improve cell type identification in combination with clustering methods. Here, we proposed RFCell, a supervised gene selection method, which is based on permutation and random forest classification. We first use RFCell and three existing gene selection methods to select gene sets on 10 scRNA-seq data sets. Then, three classical clustering algorithms are used to cluster the cells obtained by these gene selection methods. We found that the gene selection performance of RFCell was better than other gene selection methods.
Affiliation(s)
- Yuan Zhao
- Hunan Provincial Key Laboratory on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha, China
| | - Zhao-Yu Fang
- School of Mathematics and Statistics, Central South University, Changsha, China
| | - Cui-Xiang Lin
- Hunan Provincial Key Laboratory on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha, China
| | - Chao Deng
- Hunan Provincial Key Laboratory on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha, China
| | - Yun-Pei Xu
- Hunan Provincial Key Laboratory on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha, China
| | - Hong-Dong Li
- Hunan Provincial Key Laboratory on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha, China
| |
|
39
|
Zhang W, Xue X, Zheng X, Fan Z. NMFLRR: Clustering scRNA-seq data by integrating non-negative matrix factorization with low rank representation. IEEE J Biomed Health Inform 2021; 26:1394-1405. [PMID: 34310328 DOI: 10.1109/jbhi.2021.3099127] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Indexed: 02/01/2023]
Abstract
Fast-developing single-cell technologies create unprecedented opportunities to reveal cell heterogeneity and diversity. Accurate classification of single cells is a critical prerequisite for recovering the mechanisms of heterogeneity. However, the scRNA-seq profiles obtained at present have high dimensionality, sparsity, and noise, which pose challenges for existing clustering methods in grouping cells that belong to the same subpopulation based on transcriptomic profiles. Although many computational methods have been proposed, developing novel and effective computational methods to accurately identify cell types remains a considerable challenge. We present a new computational framework to identify cell types by integrating low-rank representation (LRR) and nonnegative matrix factorization (NMF); this framework is named NMFLRR. The LRR captures the global properties of the original data by using nuclear norms, and a locality constrained graph regularization term is introduced to characterize the data's local geometric information. The similarity matrix and low-dimensional features of the data can be simultaneously obtained by applying the alternating direction method of multipliers (ADMM) algorithm to handle each variable alternately in an iterative way. We finally obtained the predicted cell types by using a spectral algorithm based on the optimized similarity matrix. Nine real scRNA-seq datasets were used to test the performance of NMFLRR and fifteen other competitive methods, and the accuracy and robustness of the simulation results suggest that NMFLRR is a promising algorithm for the classification of single cells. The simulation code is freely available at: https://github.com/wzhangwhu/NMFLRR_code.
|
40
|
Zhu YL, Yuan SS, Liu JX. Similarity and Dissimilarity Regularized Nonnegative Matrix Factorization for Single-Cell RNA-seq Analysis. Interdiscip Sci 2021; 14:45-54. [PMID: 34231183 DOI: 10.1007/s12539-021-00457-0] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 03/17/2021] [Revised: 06/24/2021] [Accepted: 06/27/2021] [Indexed: 10/20/2022]
Abstract
In traditional sequencing techniques, the different functions of cells and the different roles they play in differentiation are often ignored. With the advancement of single-cell RNA sequencing (scRNA-seq) techniques, scientists can measure the gene expression value at the single-cell level, which helps to understand the heterogeneity hidden in cells. One of the most powerful ways to find heterogeneity is using an unsupervised clustering method to obtain separate subpopulations. In this paper, we propose a novel clustering method, Similarity and Dissimilarity Regularized Nonnegative Matrix Factorization (SDCNMF), that simultaneously imposes similarity and dissimilarity constraints on low-dimensional representations. SDCNMF considers both the similarity of closer cells and the dissimilarity of cells that are farther away. It not only keeps similar cells close together in the low-dimensional space, but also pushes dissimilar cells away from each other. We test the validity of our proposed method on five scRNA-seq datasets. Clustering results show that SDCNMF is better than other comparative methods, and the gene markers we find are also consistent with previous studies. Therefore, we can conclude that SDCNMF is effective in scRNA-seq data analysis.
Affiliation(s)
- Ya-Li Zhu
- School of Computer Science, Qufu Normal University, Rizhao, China
| | - Sha-Sha Yuan
- School of Computer Science, Qufu Normal University, Rizhao, China.
| | - Jin-Xing Liu
- School of Computer Science, Qufu Normal University, Rizhao, China.,Rizhao Huilian Zhongchuang Institute of Intelligent Technology, Rizhao, 276826, China
| |
|
41
|
Kharchenko PV. The triumphs and limitations of computational methods for scRNA-seq. Nat Methods 2021; 18:723-732. [PMID: 34155396 DOI: 10.1038/s41592-021-01171-x] [Citation(s) in RCA: 154] [Impact Index Per Article: 38.5] [Received: 11/08/2018] [Accepted: 04/29/2021] [Indexed: 02/05/2023]
Abstract
The rapid progress of protocols for sequencing single-cell transcriptomes over the past decade has been accompanied by equally impressive advances in the computational methods for analysis of such data. As capacity and accuracy of the experimental techniques grew, the emerging algorithm developments revealed increasingly complex facets of the underlying biology, from cell type composition to gene regulation to developmental dynamics. At the same time, rapid growth has forced continuous reevaluation of the underlying statistical models, experimental aims, and sheer volumes of data processing that are handled by these computational tools. Here, I review key computational steps of single-cell RNA sequencing (scRNA-seq) analysis, examine assumptions made by different approaches, and highlight successes, remaining ambiguities, and limitations that are important to keep in mind as scRNA-seq becomes a mainstream technique for studying biology.
Affiliation(s)
- Peter V Kharchenko
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA.
| |
|
42
|
Li HD, Xu Y, Zhu X, Liu Q, Omenn GS, Wang J. ClusterMine: A knowledge-integrated clustering approach based on expression profiles of gene sets. J Bioinform Comput Biol 2021; 18:2040009. [PMID: 32698720 DOI: 10.1142/s0219720020400090] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 01/01/2023]
Abstract
Clustering analysis of gene expression data is essential for understanding complex biological data, and is widely used in important biological applications such as the identification of cell subpopulations and disease subtypes. In commonly used methods such as hierarchical clustering (HC) and consensus clustering (CC), holistic expression profiles of all genes are often used to assess the similarity between samples for clustering. While these methods have been proven successful in identifying sample clusters in many areas, they do not provide information about which gene sets (functions) contribute most to the clustering, thus limiting the interpretability of the resulting cluster. We hypothesize that integrating prior knowledge of annotated gene sets would not only achieve satisfactory clustering performance but also, more importantly, enable potential biological interpretation of clusters. Here we report ClusterMine, an approach that identifies clusters by assessing functional similarity between samples through integrating known annotated gene sets in functional annotation databases such as Gene Ontology. In addition to the cluster membership of each sample as provided by conventional approaches, it also outputs gene sets that most likely contribute to the clustering, thus facilitating biological interpretation. We compare ClusterMine with conventional approaches on nine real-world experimental datasets that represent different application scenarios in biology. We find that ClusterMine achieves better performances and that the gene sets prioritized by our method are biologically meaningful. ClusterMine is implemented as an R package and is freely available at: www.genemine.org/clustermine.php.
Affiliation(s)
- Hong-Dong Li
- Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha 400083, P. R. China
| | - Yunpei Xu
- Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha 400083, P. R. China
| | - Xiaoshu Zhu
- Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha 400083, P. R. China.,School of Computer Science and Engineering, Yulin Normal University, Yulin, Guangxi, P. R. China
| | - Quan Liu
- Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha 400083, P. R. China
| | - Gilbert S Omenn
- Departments of Computational Medicine and Bioinformatics, Internal Medicine, Human Genetics and School of Public Health, University of Michigan, Ann Arbor, MI 48109-2218, USA
| | - Jianxin Wang
- Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha 400083, P. R. China
| |
|
43
|
Li RY, Guan J, Zhou S. Boosting scRNA-seq data clustering by cluster-aware feature weighting. BMC Bioinformatics 2021; 22:130. [PMID: 34078287 PMCID: PMC8171019 DOI: 10.1186/s12859-021-04033-7] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 02/07/2021] [Accepted: 02/16/2021] [Indexed: 12/26/2022] Open
Abstract
BACKGROUND The rapid development of single-cell RNA sequencing (scRNA-seq) enables the exploration of cell heterogeneity, which is usually done by scRNA-seq data clustering. The essence of scRNA-seq data clustering is to group cells by measuring the similarities among genes/transcripts of cells. The selection of features for cell similarity evaluation is therefore of great importance and significantly impacts clustering effectiveness and efficiency. RESULTS In this paper, we propose a novel method called CaFew to select genes based on cluster-aware feature weighting. By optimizing the clustering objective function, CaFew obtains a feature weight matrix, which is further used for feature selection. Genes that have large weights in at least one cluster, or whose weights vary greatly across clusters, are selected. Experiments on 8 real scRNA-seq datasets show that CaFew can clearly improve the clustering performance of existing scRNA-seq data clustering methods. In particular, the combination of CaFew with SC3 achieves state-of-the-art performance. Furthermore, CaFew also benefits the visualization of scRNA-seq data. CONCLUSION CaFew is an effective scRNA-seq data clustering method due to its gene selection mechanism based on cluster-aware feature weighting, and it is a useful tool for scRNA-seq data analysis.
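The selection rule stated in this abstract (keep a gene if its weight is large in at least one cluster, or if its weights vary strongly across clusters) can be illustrated with a small sketch. This is not the CaFew implementation: the weight values, the two thresholds, and the function name `select_genes` are hypothetical, and the weight matrix is simply assumed to be given rather than learned from a clustering objective.

```python
from statistics import pvariance


def select_genes(weights, max_threshold, var_threshold):
    """Return genes whose per-cluster weights pass either criterion.

    weights: dict mapping gene name -> list of weights, one per cluster
             (assumed to be already learned; values here are made up).
    max_threshold: hypothetical cutoff for "large weight in some cluster".
    var_threshold: hypothetical cutoff for "weights vary greatly".
    """
    selected = []
    for gene, per_cluster in weights.items():
        large_somewhere = max(per_cluster) >= max_threshold
        varies_strongly = pvariance(per_cluster) >= var_threshold
        if large_somewhere or varies_strongly:
            selected.append(gene)
    return selected


# Toy weight matrix: three genes, three clusters.
weights = {
    "geneA": [0.90, 0.05, 0.05],  # large weight in cluster 0
    "geneB": [0.30, 0.35, 0.35],  # flat profile, uninformative
    "geneC": [0.10, 0.60, 0.10],  # moderate peak, high variance
}
selected = select_genes(weights, max_threshold=0.8, var_threshold=0.05)
print(selected)
```

With these toy values, `geneA` passes the maximum-weight test, `geneC` passes the variance test, and `geneB` is dropped.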
Affiliation(s)
- Rui-Yi Li
- Department of Computer Science and Technology, Tongji University, 4800 Caoan Road, Shanghai, 201804 China
| | - Jihong Guan
- Department of Computer Science and Technology, Tongji University, 4800 Caoan Road, Shanghai, 201804 China
| | - Shuigeng Zhou
- Shanghai Key Lab of Intelligent Information Processing, and School of Computer Science, Fudan University, 220 Handan Road, Shanghai, 200433 China
| |
|
44
|
Li Y, Luo P, Lu Y, Wu FX. Identifying cell types from single-cell data based on similarities and dissimilarities between cells. BMC Bioinformatics 2021; 22:255. [PMID: 34006217 PMCID: PMC8132444 DOI: 10.1186/s12859-020-03873-z] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 11/04/2020] [Accepted: 11/09/2020] [Indexed: 12/15/2022] Open
Abstract
Background With the development of single-cell sequencing technology, revealing homogeneity and heterogeneity between cells has become a new area of computational systems biology research. However, the clustering of cell types becomes more complex with the mutual penetration between different types of cells and the instability of gene expression. One way of overcoming this problem is to group similar, related single cells together by means of various clustering analysis methods. Although some methods such as spectral clustering can do well in the identification of cell types, they only consider the similarities between cells and ignore the influence of dissimilarities on clustering results. This may limit the performance of most conventional clustering algorithms in identifying clusters, so special methods need to be developed for high-dimensional sparse categorical data. Results Inspired by the phenomenon that cells of the same type have similar gene expression patterns, whereas cells of different types show dissimilar gene expression patterns, we improve the existing spectral clustering method for clustering single-cell data so that it is based on both similarities and dissimilarities between cells. The method first measures the similarity/dissimilarity among cells, then constructs the incidence matrix by fusing the similarity matrix with the dissimilarity matrix, and, finally, uses the eigenvalues of the incidence matrix to perform dimensionality reduction and employs the K-means algorithm in the low-dimensional space to achieve clustering. The proposed improved spectral clustering method is compared with the conventional spectral clustering method in recognizing cell types on several real single-cell RNA-seq datasets.
Conclusions In summary, we show that adding intercellular dissimilarity can effectively improve accuracy and achieve robustness, and that the improved spectral clustering method outperforms the traditional spectral clustering method in grouping cells.
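The first two steps described in this abstract (measure similarity and dissimilarity, then fuse them into one matrix) can be sketched as follows. The abstract does not give the formulas, so everything specific here is an assumption made for illustration: a Gaussian kernel as the similarity, a max-scaled squared distance as the dissimilarity, and the fusion rule `similarity - alpha * dissimilarity`. The spectral decomposition and K-means steps are not shown.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two expression vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def fused_matrix(cells, sigma=1.0, alpha=0.5):
    """Fuse a similarity and a dissimilarity matrix into one matrix.

    Assumed (not from the paper): Gaussian-kernel similarity,
    max-scaled squared distance as dissimilarity, and a simple
    weighted difference as the fusion rule.
    """
    n = len(cells)
    d2 = [[sq_dist(cells[i], cells[j]) for j in range(n)] for i in range(n)]
    d_max = max(max(row) for row in d2) or 1.0  # avoid dividing by zero
    fused = []
    for i in range(n):
        row = []
        for j in range(n):
            similarity = math.exp(-d2[i][j] / (2 * sigma ** 2))
            dissimilarity = d2[i][j] / d_max
            row.append(similarity - alpha * dissimilarity)
        fused.append(row)
    return fused


# Toy data: two cells near the origin, two cells near (5, 5).
cells = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0]]
M = fused_matrix(cells)
for row in M:
    print([round(v, 3) for v in row])
```

In this toy case, entries for pairs within the same group stay positive while entries for pairs across groups become negative, which is the kind of contrast a dissimilarity term is meant to add before the eigen-decomposition step.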
Affiliation(s)
- Yuanyuan Li
- School of Mathematics and Physics, Wuhan Institute of Technology, No.206, Guanggu 1st road, Wuhan, 430205, Hubei, China.,Division of Biomedical Engineering, University of Saskatchewan, 57 Campus Drive, Saskatoon, SK, S7N 5A9, Canada.
| | - Ping Luo
- Division of Biomedical Engineering, University of Saskatchewan, 57 Campus Drive, Saskatoon, SK, S7N 5A9, Canada
| | - Yi Lu
- Division of Biomedical Engineering, University of Saskatchewan, 57 Campus Drive, Saskatoon, SK, S7N 5A9, Canada
| | - Fang-Xiang Wu
- Division of Biomedical Engineering, University of Saskatchewan, 57 Campus Drive, Saskatoon, SK, S7N 5A9, Canada.,Department of Mechanical Engineering, University of Saskatchewan, 57 Campus Drive, Saskatoon, SK, S7N 5A9, Canada.,Department of Computer Science, University of Saskatchewan, 57 Campus Drive, Saskatoon, SK, S7N 5A9, Canada
| |
|
45
|
Liang Z, Li M, Zheng R, Tian Y, Yan X, Chen J, Wu FX, Wang J. SSRE: Cell Type Detection Based on Sparse Subspace Representation and Similarity Enhancement. Genomics Proteomics & Bioinformatics 2021; 19:282-291. [PMID: 33647482 PMCID: PMC8602764 DOI: 10.1016/j.gpb.2020.09.004] [Citation(s) in RCA: 18] [Impact Index Per Article: 4.5] [Received: 10/30/2019] [Revised: 08/13/2020] [Accepted: 10/29/2020] [Indexed: 11/25/2022]
Abstract
Accurate identification of cell types from single-cell RNA sequencing (scRNA-seq) data plays a critical role in a variety of scRNA-seq analysis studies. This task corresponds to solving an unsupervised clustering problem, in which the similarity measurement between cells affects the result significantly. Although many approaches for cell type identification have been proposed, the accuracy still needs to be improved. In this study, we proposed a novel single-cell clustering framework based on similarity learning, called SSRE. SSRE models the relationships between cells based on the subspace assumption and generates a sparse representation of the cell-to-cell similarity. The sparse representation retains the most similar neighbors for each cell. In addition, three classical pairwise similarities are incorporated with a gene selection and enhancement strategy to further improve the effectiveness of SSRE. Tested on ten real scRNA-seq datasets and five simulated datasets, SSRE achieved superior performance in most cases compared to several state-of-the-art single-cell clustering methods. SSRE can also be extended to visualization of scRNA-seq data and identification of differentially expressed genes. The MATLAB and Python implementations of SSRE are available at https://github.com/CSUBioGroup/SSRE.
Affiliation(s)
- Zhenlan Liang
- School of Computer Science and Engineering, Central South University, Changsha 410083, China
| | - Min Li
- School of Computer Science and Engineering, Central South University, Changsha 410083, China.
| | - Ruiqing Zheng
- School of Computer Science and Engineering, Central South University, Changsha 410083, China
| | - Yu Tian
- School of Computer Science and Engineering, Central South University, Changsha 410083, China
| | - Xuhua Yan
- School of Computer Science and Engineering, Central South University, Changsha 410083, China
| | - Jin Chen
- College of Medicine, University of Kentucky, Lexington, KY 40536, USA
| | - Fang-Xiang Wu
- Division of Biomedical Engineering, University of Saskatchewan, Saskatoon, SK S7N 5A9, Canada
| | - Jianxin Wang
- School of Computer Science and Engineering, Central South University, Changsha 410083, China
| |
|
46
|
Xu Y, Li HD, Pan Y, Luo F, Wu FX, Wang J. A Gene Rank Based Approach for Single Cell Similarity Assessment and Clustering. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2021; 18:431-442. [PMID: 31369384 DOI: 10.1109/tcbb.2019.2931582] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Indexed: 06/10/2023]
Abstract
Single-cell RNA sequencing (scRNA-seq) technology provides quantitative gene expression profiles at single-cell resolution. As a result, researchers have established new ways to explore cell population heterogeneity and genetic variability of cells. One of the current research directions for scRNA-seq data is to identify different cell types accurately through unsupervised clustering methods. However, scRNA-seq data analysis is challenging because of the high noise level, high dimensionality and sparsity of the data. Moreover, the impact of multiple latent factors on gene expression heterogeneity and on the ability to accurately identify cell types remains unclear. Overcoming these challenges to reveal the biological differences between cell types has become key to analyzing scRNA-seq data. For these reasons, unsupervised learning for cell population discovery based on scRNA-seq data analysis has become an important research area. A cell similarity assessment method plays a significant role in cell clustering. Here, we present BioRank, a new cell similarity assessment method based on annotated gene sets and gene ranks. To evaluate its performance, we cluster cells with two classical clustering algorithms based on the similarity between cells obtained by BioRank. In addition, BioRank can be used by any clustering algorithm that requires a similarity matrix. Applying BioRank to 12 public scRNA-seq datasets, we show that it performs better than, or at least as well as, several popular similarity assessment methods for single cell clustering.
|
47
|
Zhang W, Li Y, Zou X. SCCLRR: A Robust Computational Method for Accurate Clustering Single Cell RNA-Seq Data. IEEE J Biomed Health Inform 2021; 25:247-256. [PMID: 32356764 DOI: 10.1109/jbhi.2020.2991172] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Indexed: 11/09/2022]
Abstract
Single-cell RNA transcriptome data present a tremendous opportunity for studying cellular heterogeneity. Identifying subpopulations based on scRNA-seq data has been a hot topic in recent years. Although many researchers have focused on designing elegant computational methods for identifying new cell types, the performance of these methods is still unsatisfactory due to the high dimensionality, sparsity and noise of scRNA-seq data. In this study, we propose a new cell type detection method, named SCCLRR, that learns a robust and accurate similarity matrix. The method simultaneously captures both global and local intrinsic properties of data based on a low rank representation (LRR) framework. The integrated normalized Euclidean distance and cosine similarity are used to balance the intrinsic linear and nonlinear manifold of data in the local regularization term. To solve the non-convex optimization model, we present an iterative optimization procedure using the alternating direction method of multipliers (ADMM) algorithm. We evaluate the performance of the SCCLRR method on nine real scRNA-seq datasets and compare it with seven state-of-the-art methods. The simulation results show that SCCLRR outperforms other methods and is robust and effective for clustering scRNA-seq data. The code of SCCLRR is freely available for academic use at https://github.com/wzhangwhu/SCCLRR.
|
48
|
Yu B, Chen C, Qi R, Zheng R, Skillman-Lawrence PJ, Wang X, Ma A, Gu H. scGMAI: a Gaussian mixture model for clustering single-cell RNA-Seq data based on deep autoencoder. Brief Bioinform 2020; 22:6029147. [PMID: 33300547 DOI: 10.1093/bib/bbaa316] [Citation(s) in RCA: 34] [Impact Index Per Article: 6.8] [Received: 07/24/2020] [Revised: 10/19/2020] [Indexed: 01/01/2023] Open
Abstract
The rapid development of single-cell RNA sequencing (scRNA-Seq) technology provides strong technical support for accurate and efficient analysis of single-cell gene expression data. However, the analysis of scRNA-Seq data is accompanied by many obstacles, including dropout events and the curse of dimensionality. Here, we propose scGMAI, a new single-cell Gaussian mixture clustering method based on autoencoder networks and fast independent component analysis (FastICA). Specifically, scGMAI utilizes autoencoder networks to reconstruct gene expression values from scRNA-Seq data, and FastICA is used to reduce the dimensions of the reconstructed data. The integration of these computational techniques in scGMAI leads to better results than existing tools, including Seurat, in clustering cells from 17 public scRNA-Seq datasets. In summary, scGMAI is an effective tool for accurately clustering and identifying cell types from scRNA-Seq data and shows great potential for application in scRNA-Seq data analysis. The source code is available at https://github.com/QUST-AIBBDRC/scGMAI/.
Affiliation(s)
- Bin Yu
- College of Mathematics and Physics, Qingdao University of Science and Technology, China
| | - Chen Chen
- College of Mathematics and Physics, Qingdao University of Science and Technology, China
| | - Ren Qi
- College of Intelligence and Computing, Tianjin University, China
| | - Ruiqing Zheng
- School of Computer Science and Engineering, Central South University, China
| | | | - Xiaolin Wang
- College of Mathematics and Physics, Qingdao University of Science and Technology, China
| | - Anjun Ma
- Department of Biomedical Informatics, The Ohio State University, USA
| | - Haiming Gu
- College of Mathematics and Physics, Qingdao University of Science and Technology, China
| |
|
49
|
Wu P, An M, Zou HR, Zhong CY, Wang W, Wu CP. A robust semi-supervised NMF model for single cell RNA-seq data. PeerJ 2020; 8:e10091. [PMID: 33088619 PMCID: PMC7571410 DOI: 10.7717/peerj.10091] [Citation(s) in RCA: 16] [Impact Index Per Article: 3.2] [Received: 04/27/2020] [Accepted: 09/13/2020] [Indexed: 11/20/2022] Open
Abstract
Background Single-cell RNA-sequencing (scRNA-seq) technology is a powerful tool for studying organisms from a single-cell perspective and exploring the heterogeneity between cells. Clustering is a fundamental step in scRNA-seq data analysis; it is key to understanding cell function and forms the basis of other advanced analyses. Nonnegative Matrix Factorization (NMF) has been widely used in clustering analysis of transcriptome data and has achieved good performance. However, the existing NMF model is unsupervised and ignores known gene functions in the process of clustering. Considerable knowledge of cell marker genes (genes that are expressed only in specific cells) in human and model organisms has accumulated, for example in the Molecular Signatures Database (MSigDB), and can be used as prior information in the clustering analysis of scRNA-seq data. Because cells of the same kind are likely to have similar biological functions and specific gene expression patterns, the marker genes of cells can be utilized as prior knowledge in the clustering analysis. Methods We propose a robust and semi-supervised NMF (rssNMF) model, which introduces a new variable to absorb noise in the data and incorporates marker genes as prior information into a graph regularization term. We use rssNMF to solve the clustering problem of scRNA-seq data. Results Twelve scRNA-seq datasets with true labels are used to test the model performance, and the results illustrate that our model outperforms the original NMF and other common methods such as KMeans and Hierarchical Clustering. Biological significance analysis shows that rssNMF can identify key subclasses and latent biological processes. To our knowledge, this is the first method that incorporates prior knowledge into the clustering analysis of scRNA-seq data.
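For reference, the unregularized NMF baseline that this abstract compares against can be sketched with the standard multiplicative updates for the squared-error objective. This is not rssNMF: the noise-absorbing variable and the marker-gene graph regularizer are deliberately left out, and the toy matrix, the rank, and the helper names (`matmul`, `nmf`, and so on) are illustrative assumptions.

```python
import random


def matmul(A, B):
    """Plain-Python matrix product of two list-of-lists matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def frobenius_error(X, W, H):
    """Squared Frobenius norm of X - W H."""
    WH = matmul(W, H)
    return sum((x - y) ** 2 for rx, ry in zip(X, WH) for x, y in zip(rx, ry))


def nmf(X, rank, n_iter=200, seed=0, eps=1e-9):
    """Factor a nonnegative matrix X (cells x genes) as W H.

    Uses multiplicative updates; eps guards against division by zero.
    Returns W (cells x rank), H (rank x genes) and the error history.
    """
    rng = random.Random(seed)
    n, m = len(X), len(X[0])
    W = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    errors = [frobenius_error(X, W, H)]
    for _ in range(n_iter):
        # H <- H * (W^T X) / (W^T W H), element-wise
        Wt = transpose(W)
        num, den = matmul(Wt, X), matmul(matmul(Wt, W), H)
        H = [[h * a / (b + eps) for h, a, b in zip(hr, nr, dr)]
             for hr, nr, dr in zip(H, num, den)]
        # W <- W * (X H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num, den = matmul(X, Ht), matmul(W, matmul(H, Ht))
        W = [[w * a / (b + eps) for w, a, b in zip(wr, nr, dr)]
             for wr, nr, dr in zip(W, num, den)]
        errors.append(frobenius_error(X, W, H))
    return W, H, errors


# Toy expression matrix: 4 cells x 4 genes with two obvious groups.
X = [[5, 4, 0, 0], [4, 5, 0, 0], [0, 0, 3, 4], [0, 0, 4, 3]]
W, H, errors = nmf(X, rank=2)
# A common way to read clusters from NMF: the dominant factor per cell.
labels = [row.index(max(row)) for row in W]
print(labels, round(errors[0], 3), round(errors[-1], 3))
```

Reading cluster labels from the largest entry in each row of `W` is one common convention; semi-supervised variants such as the one described above change the objective, not this final read-out step.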
Affiliation(s)
- Peng Wu
- Department of Neurosurgery, The People's Hospital of Longhua District, Shenzhen, Guangdong Province, China
| | - Mo An
- Department of Neurosurgery, The People's Hospital of Longhua District, Shenzhen, Guangdong Province, China
| | - Hai-Ren Zou
- Department of Neurosurgery, The People's Hospital of Longhua District, Shenzhen, Guangdong Province, China
| | - Cai-Ying Zhong
- Department of Neurosurgery, The People's Hospital of Longhua District, Shenzhen, Guangdong Province, China
| | - Wei Wang
- Department of Neurosurgery, The People's Hospital of Longhua District, Shenzhen, Guangdong Province, China
| | - Chang-Peng Wu
- Department of Neurosurgery, The People's Hospital of Longhua District, Shenzhen, Guangdong Province, China
| |
|
50
|
Sun YS, Ou-Yang L, Dai DQ. LRSK: a low-rank self-representation K-means method for clustering single-cell RNA-sequencing data. Mol Omics 2020; 16:465-473. [PMID: 32572422 DOI: 10.1039/d0mo00034e] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Indexed: 01/20/2023]
Abstract
The development of single-cell RNA-sequencing (scRNA-seq) technologies brings tremendous opportunities for quantitative research and analyses at the cellular level. In particular, as a crucial task of scRNA-seq analysis, single cell clustering shines a light on natural groupings of cells to give new insights into the biological mechanisms and disease studies. However, it remains a challenge to identify cell clusters from lots of cell mixtures effectively and accurately. In this paper, we propose a novel adaptive joint clustering framework, named the low-rank self-representation K-means method (LRSK), to learn the data representation matrix and cluster indicator matrix jointly from scRNA-seq data. Specifically, instead of calculating the similarities among cells from the original data, we seek a low-rank representation of the original data to better reflect the underlying relationships among cells. Moreover, an Augmented Lagrangian Multiplier (ALM) based optimization algorithm is adopted to solve this problem. Experimental results on various scRNA-seq datasets and case studies demonstrate that our method performs better than other state-of-the-art single cell clustering algorithms. The analysis of unlabeled large single-cell liver cancer sequencing data further shows that our prediction results are more reasonable and interpretable.
Affiliation(s)
- Ye-Sen Sun
- Intelligent Data Center, School of Mathematics, Sun Yat-sen University, Guangzhou, China.
| | | | | |
|