1
Xie J, Ruan S, Tu M, Yuan Z, Hu J, Li H, Li S. Clustering single-cell RNA sequencing data via iterative smoothing and self-supervised discriminative embedding. Oncogene 2024:10.1038/s41388-024-03074-5. [PMID: 38834657 DOI: 10.1038/s41388-024-03074-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/21/2024] [Revised: 05/22/2024] [Accepted: 05/28/2024] [Indexed: 06/06/2024]
Abstract
Single-cell transcriptome sequencing (scRNA-seq) is a high-throughput technique used to study gene expression at the single-cell level. Clustering analysis is a commonly used method in scRNA-seq data analysis, helping researchers identify cell types and uncover interactions between cells. However, the choice of a robust similarity metric in the clustering procedure is still an open challenge due to the complex underlying structures of the data and the inherent noise in data acquisition. Here, we propose a deep clustering method for scRNA-seq data called scRISE (scRNA-seq Iterative Smoothing and self-supervised discriminative Embedding model) to resolve this challenge. The model consists of two main modules: an iterative smoothing module based on graph autoencoders designed to denoise the data and refine the pairwise similarity in turn to gradually incorporate cell structural features and enrich the data information; and a self-supervised discriminative embedding module with adaptive similarity threshold for partitioning samples into correct clusters. Our approach has shown improved quality of data representation and clustering on seventeen scRNA-seq datasets against a number of state-of-the-art deep learning clustering methods. Furthermore, utilizing the scRISE method in biological analysis against the HNSCC dataset has unveiled 62 informative genes, highlighting their potential roles as therapeutic targets and biomarkers.
Affiliation(s)
- Jinxin Xie
  - Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, Shanghai, 200237, China
- Shanshan Ruan
  - Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, Shanghai, 200237, China
- Mingyan Tu
  - Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, Shanghai, 200237, China
- Zhen Yuan
  - Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, Shanghai, 200237, China
- Jianguo Hu
  - Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, Shanghai, 200237, China
- Honglin Li
  - Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, Shanghai, 200237, China
  - Innovation Center for AI and Drug Discovery, School of Pharmacy, East China Normal University, Shanghai, 200062, China
  - Lingang Laboratory, Shanghai, 200031, China
- Shiliang Li
  - Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, Shanghai, 200237, China
  - Innovation Center for AI and Drug Discovery, School of Pharmacy, East China Normal University, Shanghai, 200062, China
2
Gong Y, Haeri M, Zhang X, Li Y, Liu A, Wu D, Zhang Q, Jazwinski SM, Zhou X, Wang X, Jiang L, Chen YP, Yan X, Swerdlow RH, Shen H, Deng HW. Spatial Dissection of the Distinct Cellular Responses to Normal Aging and Alzheimer's Disease in Human Prefrontal Cortex at Single-Nucleus Resolution. medRxiv: The Preprint Server for Health Sciences 2024:2024.05.21.24306783. [PMID: 38826275 PMCID: PMC11142279 DOI: 10.1101/2024.05.21.24306783] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/04/2024]
Abstract
Aging significantly elevates the risk for Alzheimer's disease (AD), contributing to the accumulation of AD pathologies, such as amyloid-β (Aβ), inflammation, and oxidative stress. The human prefrontal cortex (PFC) is highly vulnerable to the impacts of both aging and AD. Unveiling and understanding the molecular alterations in PFC associated with normal aging (NA) and AD is essential for elucidating the mechanisms of AD progression and developing novel therapeutics for this devastating disease. In this study, for the first time, we employed a cutting-edge spatial transcriptome platform, STOmics® SpaTial Enhanced Resolution Omics-sequencing (Stereo-seq), to generate the first comprehensive, subcellular resolution spatial transcriptome atlas of the human PFC from six AD cases at various neuropathological stages and six age, sex, and ethnicity matched controls. Our analyses revealed distinct transcriptional alterations across six neocortex layers, highlighted the AD-associated disruptions in laminar architecture, and identified changes in layer-to-layer interactions as AD progresses. Further, throughout the progression from NA to various stages of AD, we discovered specific genes that were significantly upregulated in neurons experiencing high stress and in nearby non-neuronal cells, compared to cells distant from the source of stress. Notably, the cell-cell interactions between the neurons under the high stress and adjacent glial cells that promote Aβ clearance and neuroprotection were diminished in AD in response to stressors compared to NA. Through cell-type specific gene co-expression analysis, we identified three modules in excitatory and inhibitory neurons associated with neuronal protection, protein dephosphorylation, and negative regulation of Aβ plaque formation. These modules negatively correlated with AD progression, indicating a reduced capacity for toxic substance clearance in AD subject samples. 
Moreover, we have discovered a novel transcription factor, ZNF460, that regulates all three modules, establishing it as a potential new therapeutic target for AD. Overall, utilizing the latest spatial transcriptome platform, our study developed the first transcriptome-wide atlas with subcellular resolution for assessing the molecular alterations in the human PFC due to AD. This atlas sheds light on the potential mechanisms underlying the progression from NA to AD.
Affiliation(s)
- Yun Gong
  - Tulane Center for Biomedical Informatics and Genomics, Deming Department of Medicine, School of Medicine, Tulane University, New Orleans, LA, 70112, USA
- Mohammad Haeri
  - Department of Pathology & Laboratory Medicine, University of Kansas Medical Center, Kansas City, MO, 66160, USA
- Xiao Zhang
  - Tulane Center for Biomedical Informatics and Genomics, Deming Department of Medicine, School of Medicine, Tulane University, New Orleans, LA, 70112, USA
- Yisu Li
  - Department of Cell and Molecular Biology, School of Science and Engineering, Tulane University, New Orleans, LA, 70118, USA
- Anqi Liu
  - Tulane Center for Biomedical Informatics and Genomics, Deming Department of Medicine, School of Medicine, Tulane University, New Orleans, LA, 70112, USA
- Di Wu
  - Tulane Center for Biomedical Informatics and Genomics, Deming Department of Medicine, School of Medicine, Tulane University, New Orleans, LA, 70112, USA
- Qilei Zhang
  - School of Basic Medical Sciences, Central South University, Changsha, Hunan, 410008, China
- S. Michal Jazwinski
  - Tulane Center for Aging, Deming Department of Medicine, Tulane University School of Medicine, New Orleans, LA 70112, USA
- Xiang Zhou
  - Department of Biostatistics, University of Michigan, Ann Arbor, MI, 48109, USA
- Xiaoying Wang
  - Clinical Neuroscience Research Center, Departments of Neurosurgery and Neurology, Tulane University School of Medicine, New Orleans, LA 70112, USA
- Lindong Jiang
  - Tulane Center for Biomedical Informatics and Genomics, Deming Department of Medicine, School of Medicine, Tulane University, New Orleans, LA, 70112, USA
- Yi-Ping Chen
  - Department of Cell and Molecular Biology, School of Science and Engineering, Tulane University, New Orleans, LA, 70118, USA
- Xiaoxin Yan
  - School of Basic Medical Sciences, Central South University, Changsha, Hunan, 410008, China
- Russell H. Swerdlow
  - Department of Neurology, University of Kansas Medical Center, Kansas City, MO, 66160, USA
- Hui Shen
  - Tulane Center for Biomedical Informatics and Genomics, Deming Department of Medicine, School of Medicine, Tulane University, New Orleans, LA, 70112, USA
- Hong-Wen Deng
  - Tulane Center for Biomedical Informatics and Genomics, Deming Department of Medicine, School of Medicine, Tulane University, New Orleans, LA, 70112, USA
3
Zou X, Liu Y, Wang M, Zou J, Shi Y, Su X, Xu J, Tong HHY, Ji Y, Gui L, Hao J. scCURE identifies cell types responding to immunotherapy and enables outcome prediction. Cell Reports Methods 2023; 3:100643. [PMID: 37989083 PMCID: PMC10694528 DOI: 10.1016/j.crmeth.2023.100643] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/17/2023] [Revised: 07/17/2023] [Accepted: 10/23/2023] [Indexed: 11/23/2023]
Abstract
A deep understanding of immunotherapy response/resistance mechanisms and a highly reliable therapy response prediction are vital for cancer treatment. Here, we developed scCURE (single-cell RNA sequencing [scRNA-seq] data-based Changed and Unchanged cell Recognition during immunotherapy). Based on Gaussian mixture modeling, Kullback-Leibler (KL) divergence, and mutual nearest-neighbors criteria, scCURE can faithfully discriminate between cells affected or unaffected by immunotherapy intervention. By conducting scCURE analyses in melanoma and breast cancer immunotherapy scRNA-seq data, we found that the baseline profiles of specific CD8+ T and macrophage cells (identified by scCURE) can determine the way in which tumor microenvironment immune cells respond to immunotherapy, e.g., antitumor immunity activation or de-activation; therefore, these cells could be predictive factors for treatment response. In this work, we demonstrated that the immunotherapy-associated cell-cell heterogeneities revealed by scCURE can be utilized to integrate the therapy response mechanism study and prediction model construction.
Affiliation(s)
- Xin Zou
  - Center for Tumor Diagnosis & Therapy, Jinshan Hospital, Fudan University, Shanghai 201508, China; Department of Pathology, Jinshan Hospital, Fudan University, Shanghai 201508, China
- Yujun Liu
  - Department of Radiation Oncology, Fudan University Shanghai Cancer Center, Fudan University, Shanghai, China
- Miaochen Wang
  - Department of Oral and Maxillofacial-Head & Neck Oncology, Shanghai Ninth People's Hospital, Shanghai Jiao Tong University School of Medicine; College of Stomatology, Shanghai Jiao Tong University; National Center for Stomatology; National Clinical Research Center for Oral Diseases; Shanghai Key Laboratory of Stomatology, Shanghai, China
- Jiawei Zou
  - Institute of Clinical Science, Zhongshan Hospital, Fudan University, Shanghai 200032, China
- Yi Shi
  - Bio-X Institutes, Key Laboratory for the Genetics of Developmental and Neuropsychiatric Disorders, Shanghai Jiao Tong University, 1954 Huashan Road, Shanghai 200030, China
- Xianbin Su
  - Key Laboratory of Systems Biomedicine (Ministry of Education), Shanghai Center for Systems Biomedicine, Shanghai JiaoTong University, Shanghai, China
- Juan Xu
  - Department of Stomatology, Sijing Hospital, Shanghai 201601, China
- Henry H Y Tong
  - Centre for Artificial Intelligence Driven Drug Discovery, Faculty of Applied Sciences, Macao Polytechnic University, Macao SAR, China
- Yuan Ji
  - Molecular Pathology Center, Department Pathology, Zhongshan Hospital, Fudan University, Shanghai, China
- Lv Gui
  - Department of Pathology, Jinshan Hospital, Fudan University, Shanghai 201508, China
- Jie Hao
  - Institute of Clinical Science, Zhongshan Hospital, Fudan University, Shanghai 200032, China
4
Cui T, Wang T. A comprehensive assessment of hurdle and zero-inflated models for single cell RNA-sequencing analysis. Brief Bioinform 2023; 24:bbad272. [PMID: 37507115 PMCID: PMC10516395 DOI: 10.1093/bib/bbad272] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/18/2023] [Revised: 06/17/2023] [Accepted: 07/06/2023] [Indexed: 07/30/2023]
Abstract
Single cell RNA-sequencing (scRNA-seq) technology has significantly advanced the understanding of transcriptomic signatures. Although various statistical models have been used to describe the distribution of gene expression across cells, a comprehensive assessment of the different models is missing. Moreover, the growing number of features associated with scRNA-seq datasets creates new challenges for analytical accuracy and computing speed. Here, we developed a Python-based package (TensorZINB) to solve the zero-inflated negative binomial (ZINB) model using the TensorFlow deep learning framework. We used a sequential initialization method to solve the numerical stability issues associated with hurdle and zero-inflated models. A recursive feature selection protocol was used to optimize feature selections for data processing and downstream differentially expressed gene (DEG) analysis. We proposed a class of hybrid models combining nested models to further improve the model's performance. Additionally, we developed a new method to convert a continuous distribution to its equivalent discrete form, so that statistical models can be fairly compared. Finally, we showed that the proposed TensorFlow algorithm (TensorZINB) was numerically stable and that its computing speed and performance were superior to those of existing ZINB solvers. Moreover, we implemented seven hurdle and zero-inflated statistical models in Python and systematically assessed their performance using a real scRNA-seq dataset. We demonstrated that the ZINB model achieved the lowest Akaike information criterion compared with other models tested. Taken together, TensorZINB was accurate, efficient and scalable for the implementation of ZINB and for large-scale scRNA-seq data analysis with DEG identification.
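For readers unfamiliar with the two model families this abstract compares, the following is a minimal sketch of a zero-inflated negative binomial (ZINB) probability mass function next to a hurdle negative binomial one. It is not the TensorZINB implementation; the function names and the mean/dispersion parameterization (`mu`, `theta`, `pi`) are assumptions chosen for illustration.

```python
import math


def nb_log_pmf(k, mu, theta):
    """Log-probability of count k under a negative binomial with mean mu > 0
    and dispersion (size) theta > 0. Parameterization is an assumption."""
    return (
        math.lgamma(k + theta) - math.lgamma(theta) - math.lgamma(k + 1)
        + theta * math.log(theta / (theta + mu))
        + k * math.log(mu / (theta + mu))
    )


def zinb_pmf(k, mu, theta, pi):
    """Zero-inflated NB: with probability pi the count is a structural zero,
    otherwise it is drawn from the NB (which can itself produce zeros)."""
    nb = math.exp(nb_log_pmf(k, mu, theta))
    return pi + (1.0 - pi) * nb if k == 0 else (1.0 - pi) * nb


def hurdle_nb_pmf(k, mu, theta, pi):
    """Hurdle NB: all zeros come from the hurdle (probability pi);
    positive counts follow a zero-truncated NB."""
    if k == 0:
        return pi
    nb_zero = math.exp(nb_log_pmf(0, mu, theta))
    return (1.0 - pi) * math.exp(nb_log_pmf(k, mu, theta)) / (1.0 - nb_zero)
```

The difference between the two is where zeros come from: the zero-inflated form mixes structural zeros with sampling zeros from the count distribution, while the hurdle form assigns every zero to a separate component.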
Affiliation(s)
- Tao Cui
  - Department of Pharmacology and Physiology Georgetown University Medical Center SE407 Med/Dent 3900 Reservoir Road, N.W. Washington D.C., USA
- Tingting Wang
  - Department of Pharmacology and Physiology Georgetown University Medical Center SE407 Med/Dent 3900 Reservoir Road, N.W. Washington D.C., USA
5
Erfanian N, Heydari AA, Feriz AM, Iañez P, Derakhshani A, Ghasemigol M, Farahpour M, Razavi SM, Nasseri S, Safarpour H, Sahebkar A. Deep learning applications in single-cell genomics and transcriptomics data analysis. Biomed Pharmacother 2023; 165:115077. [PMID: 37393865 DOI: 10.1016/j.biopha.2023.115077] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 04/24/2023] [Revised: 06/22/2023] [Accepted: 06/23/2023] [Indexed: 07/04/2023]
Abstract
Traditional bulk sequencing methods are limited to measuring the average signal in a group of cells, potentially masking heterogeneity, and rare populations. The single-cell resolution, however, enhances our understanding of complex biological systems and diseases, such as cancer, the immune system, and chronic diseases. However, the single-cell technologies generate massive amounts of data that are often high-dimensional, sparse, and complex, thus making analysis with traditional computational approaches difficult and unfeasible. To tackle these challenges, many are turning to deep learning (DL) methods as potential alternatives to the conventional machine learning (ML) algorithms for single-cell studies. DL is a branch of ML capable of extracting high-level features from raw inputs in multiple stages. Compared to traditional ML, DL models have provided significant improvements across many domains and applications. In this work, we examine DL applications in genomics, transcriptomics, spatial transcriptomics, and multi-omics integration, and address whether DL techniques will prove to be advantageous or if the single-cell omics domain poses unique challenges. Through a systematic literature review, we have found that DL has not yet revolutionized the most pressing challenges of the single-cell omics field. However, using DL models for single-cell omics has shown promising results (in many cases outperforming the previous state-of-the-art models) in data preprocessing and downstream analysis. Although developments of DL algorithms for single-cell omics have generally been gradual, recent advances reveal that DL can offer valuable resources in fast-tracking and advancing research in single-cell.
Affiliation(s)
- Nafiseh Erfanian
  - Student Research Committee, Birjand University of Medical Sciences, Birjand, Iran
- A Ali Heydari
  - Department of Applied Mathematics, University of California, Merced, CA, USA; Health Sciences Research Institute, University of California, Merced, CA, USA
- Adib Miraki Feriz
  - Student Research Committee, Birjand University of Medical Sciences, Birjand, Iran
- Pablo Iañez
  - Cellular Systems Genomics Group, Josep Carreras Research Institute, Barcelona, Spain
- Afshin Derakhshani
  - Department of Biochemistry and Molecular Biology, University of Calgary, Calgary, AB, Canada
- Mohsen Farahpour
  - Department of Electronics, Faculty of Electrical and Computer Engineering, University of Birjand, Birjand, Iran
- Seyyed Mohammad Razavi
  - Department of Electronics, Faculty of Electrical and Computer Engineering, University of Birjand, Birjand, Iran
- Saeed Nasseri
  - Cellular and Molecular Research Center, Birjand University of Medical Sciences, Birjand, Iran
- Hossein Safarpour
  - Cellular and Molecular Research Center, Birjand University of Medical Sciences, Birjand, Iran
- Amirhossein Sahebkar
  - Biotechnology Research Center, Pharmaceutical Technology Institute, Mashhad University of Medical Sciences, Mashhad, Iran; Applied Biomedical Research Center, Mashhad University of Medical Sciences, Mashhad, Iran; Department of Biotechnology, School of Pharmacy, Mashhad University of Medical Sciences, Mashhad, Iran
6
Zhang S, Li X, Lin J, Lin Q, Wong KC. Review of single-cell RNA-seq data clustering for cell-type identification and characterization. RNA (New York, N.Y.) 2023; 29:517-530. [PMID: 36737104 PMCID: PMC10158997 DOI: 10.1261/rna.078965.121] [Citation(s) in RCA: 6] [Impact Index Per Article: 6.0] [Received: 03/27/2022] [Accepted: 01/03/2023] [Indexed: 05/06/2023]
Abstract
In recent years, the advances in single-cell RNA-seq techniques have enabled us to perform large-scale transcriptomic profiling at single-cell resolution in a high-throughput manner. Unsupervised learning such as data clustering has become the central component to identify and characterize novel cell types and gene expression patterns. In this study, we review the existing single-cell RNA-seq data clustering methods with critical insights into the related advantages and limitations. In addition, we also review the upstream single-cell RNA-seq data processing techniques such as quality control, normalization, and dimension reduction. We conduct performance comparison experiments to evaluate several popular single-cell RNA-seq clustering approaches on simulated and multiple single-cell transcriptomic data sets.
Affiliation(s)
- Shixiong Zhang
  - School of Computer Science and Technology, Xidian University, Xi'an 710071, China
  - Department of Computer Science, City University of Hong Kong, Hong Kong SAR, China
- Xiangtao Li
  - School of Artificial Intelligence, Jilin University, Jilin 130012, China
- Jiecong Lin
  - Department of Computer Science, City University of Hong Kong, Hong Kong SAR, China
- Qiuzhen Lin
  - College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China
- Ka-Chun Wong
  - Department of Computer Science, City University of Hong Kong, Hong Kong SAR, China
7
Multi-Objective Genetic Algorithm for Cluster Analysis of Single-Cell Transcriptomes. J Pers Med 2023; 13:jpm13020183. [PMID: 36836417 PMCID: PMC9960600 DOI: 10.3390/jpm13020183] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/21/2022] [Revised: 01/15/2023] [Accepted: 01/16/2023] [Indexed: 01/22/2023]
Abstract
Cells are the basic building blocks of human organisms, and the identification of their types and states in transcriptomic data is an important and challenging task. Many of the existing approaches to cell-type prediction are based on clustering methods that optimize only one criterion. In this paper, a multi-objective Genetic Algorithm for cluster analysis is proposed, implemented, and systematically validated on 48 experimental and 60 synthetic datasets. The results demonstrate that the performance and the accuracy of the proposed algorithm are reproducible, stable, and better than those of single-objective clustering methods. Computational run times of multi-objective clustering of large datasets were studied and used in supervised machine learning to accurately predict the execution times of clustering of new single-cell transcriptomes.
8
Xu L, Xue T, Ding W, Shen L. Comparison of scRNA-seq data analysis method combinations. Brief Funct Genomics 2022; 21:433-440. [PMID: 36124658 DOI: 10.1093/bfgp/elac027] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/13/2022] [Revised: 07/29/2022] [Accepted: 07/29/2022] [Indexed: 12/14/2022]
Abstract
Single-cell ribonucleic acid (RNA)-sequencing (scRNA-seq) data analysis refers to the use of appropriate methods to analyze the dataset generated by RNA-sequencing performed on the single-cell transcriptome. It usually contains three steps: normalization to eliminate the technical noise, dimensionality reduction to facilitate visual understanding and data compression, and clustering to divide the data into several similarity-based clusters. In addition, the gene expression data contain a large number of zero counts. These zero counts are considered relevant to random dropout events induced by multiple factors in the sequencing experiments, such as low RNA input, and the stochastic nature of the gene expression pattern at the single-cell level. The zero counts can be eliminated only through the analysis of the scRNA-seq data, and although many methods have been proposed to this end, there is still a lack of research on the combined effect of existing methods. In this paper, we summarize the two kinds of normalization, two kinds of dimension reduction and three kinds of clustering methods widely used in the current mainstream scRNA-seq data analysis. Furthermore, we propose to combine these methods into 12 technology combinations, each with a whole set of scRNA-seq data analysis processes. We evaluated the proposed combinations using Goolam, a publicly available scRNA-seq dataset, by comparing the final clustering results and found the most suitable collection scheme of these classic methods. Our results showed that using appropriate technology combinations can improve the efficiency and accuracy of the scRNA-seq data analysis. The combinations not only satisfy the basic requirements of noise reduction, dimension reduction and cell clustering but also preserve the heterogeneity of cells in downstream analysis. The dataset, Goolam, used in the study can be obtained from the ArrayExpress database under the accession number E-MTAB-3321.
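The count of 12 combinations in this abstract follows from crossing two normalization choices, two dimension-reduction choices and three clustering choices (2 × 2 × 3). A minimal sketch of that enumeration is given here; the abstract does not name the individual methods, so the labels below are placeholders rather than the methods the authors used.

```python
from itertools import product

# Placeholder labels: the abstract does not name the specific methods.
normalizations = ["normalization_A", "normalization_B"]
reductions = ["reduction_A", "reduction_B"]
clusterings = ["clustering_A", "clustering_B", "clustering_C"]


def build_pipelines(norms, reds, clusts):
    """Return every (normalization, reduction, clustering) combination
    as a dict describing one complete analysis pipeline."""
    return [
        {"normalization": n, "dimension_reduction": r, "clustering": c}
        for n, r, c in product(norms, reds, clusts)
    ]


pipelines = build_pipelines(normalizations, reductions, clusterings)
print(len(pipelines))  # 12
```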
9
Zeng Y, Wei Z, Zhong F, Pan Z, Lu Y, Yang Y. A parameter-free deep embedded clustering method for single-cell RNA-seq data. Brief Bioinform 2022; 23:6582003. [PMID: 35524494 DOI: 10.1093/bib/bbac172] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 12/21/2021] [Revised: 03/25/2022] [Accepted: 04/18/2022] [Indexed: 11/12/2022]
Abstract
Clustering analysis is widely used in single-cell ribonucleic acid (RNA)-sequencing (scRNA-seq) data to discover cell heterogeneity and cell states. While many clustering methods have been developed for scRNA-seq analysis, most of them require the number of clusters to be specified in advance. However, the exact number of cell types is rarely known beforehand, and estimates based on experience are not always reliable. Here, we have developed ADClust, an automatic deep embedding clustering method for scRNA-seq data, which can accurately cluster cells without requiring a predefined number of clusters. Specifically, ADClust first obtains a low-dimensional representation through a pre-trained autoencoder and uses the representations to cluster cells into initial micro-clusters. The micro-clusters are then compared pairwise by a statistical test, and similar micro-clusters are merged into larger clusters. According to the clustering, cell representations are updated so that each cell is pulled toward the centers of its assigned cluster and similar clusters, while cells are separated to keep distances between clusters. This is accomplished through jointly optimizing the carefully designed clustering and autoencoder loss functions. This merging process continues until convergence. ADClust was tested on 11 real scRNA-seq datasets and was shown to outperform existing methods in terms of both clustering performance and the accuracy of the estimated number of clusters. More importantly, our model provides high speed and scalability for large datasets.
Affiliation(s)
- Yuansong Zeng
  - School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 510000, China
- Zhuoyi Wei
  - School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 510000, China
- Fengqi Zhong
  - School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 510000, China
- Zixiang Pan
  - School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 510000, China
- Yutong Lu
  - School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 510000, China
- Yuedong Yang
  - School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 510000, China; Key Laboratory of Machine Intelligence and Advanced Computing (MOE), Guangzhou 510000, China
10
He S, Dou L, Li X, Zhang Y. Review of bioinformatics in Azheimer's Disease Research. Comput Biol Med 2022; 143:105269. [PMID: 35158118 DOI: 10.1016/j.compbiomed.2022.105269] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 01/06/2022] [Revised: 01/21/2022] [Accepted: 01/23/2022] [Indexed: 01/05/2023]
Abstract
Alzheimer's disease (AD) is a severe neurodegenerative disease with a slow onset and progressive deterioration over time. With the acceleration of global aging, AD has become a disease that seriously threatens the physical health of the elderly; therefore, the effective prevention and treatment of AD is an extremely important area of study for researchers and clinicians. Rapid technological developments have promoted the analysis of various kinds of complex data sets using machine learning methods. Common machine learning algorithms, such as Lasso, SVM and Random Forest, are very important in AD research. To help accelerate AD-related research, we review some recent research progress on Alzheimer's disease, including databases, image analysis, gene expression, etc., which can provide AD researchers with more comprehensive research methods.
Affiliation(s)
- Shida He
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, Zhejiang, China; Department of Computer Science, University of Tsukuba, Japan
- Lijun Dou
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, Zhejiang, China; School of Automotive and Transportation Engineering, Shenzhen Polytechnic, Shenzhen, China
- Xuehong Li
  - Beidahuang Industry Group General Hospital, Harbin, China
- Ying Zhang
  - Department of Anesthesiology, Hospital (T.C.M) Affiliated To Southwest Medical University, Luzhou, China
11
Single Cell Self-Paced Clustering with Transcriptome Sequencing Data. Int J Mol Sci 2022; 23:ijms23073900. [PMID: 35409258 PMCID: PMC8999118 DOI: 10.3390/ijms23073900] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/06/2022] [Revised: 03/28/2022] [Accepted: 03/29/2022] [Indexed: 11/17/2022]
Abstract
Single cell RNA sequencing (scRNA-seq) allows researchers to explore tissue heterogeneity, distinguish unusual cell identities, and find novel cellular subtypes by providing transcriptome profiling for individual cells. Clustering analysis is usually used to predict cell class assignments and infer cell identities. However, the performance of existing single-cell clustering methods is extremely sensitive to the presence of noise data and outliers. Existing clustering algorithms can easily fall into local optimal solutions. There is still no consensus on the best performing method. To address this issue, we introduce a single cell self-paced clustering (scSPaC) method with F-norm based nonnegative matrix factorization (NMF) for scRNA-seq data and a sparse single cell self-paced clustering (sscSPaC) method with l21-norm based nonnegative matrix factorization for scRNA-seq data. We gradually add single cells from simple to complex to our model until all cells are selected. In this way, the influences of noisy data and outliers can be significantly reduced. The proposed method achieved the best performance on both simulation data and real scRNA-seq data. A case study about human clara cells and ependymal cells scRNA-seq data clustering shows that scSPaC is more advantageous near the clustering dividing line.
12
Simultaneous Learning the Dimension and Parameter of a Statistical Model with Big Data. Statistics in Biosciences 2021. [DOI: 10.1007/s12561-021-09324-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/27/2022]
13
Su K, Yu T, Wu H. Accurate feature selection improves single-cell RNA-seq cell clustering. Brief Bioinform 2021; 22:6145899. [PMID: 33611426 DOI: 10.1093/bib/bbab034] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 09/23/2020] [Revised: 01/06/2021] [Accepted: 01/22/2021] [Indexed: 02/04/2023]
Abstract
Cell clustering is one of the most important and commonly performed tasks in single-cell RNA sequencing (scRNA-seq) data analysis. An important step in cell clustering is to select a subset of genes (referred to as 'features'), whose expression patterns will then be used for downstream clustering. A good set of features should include the ones that distinguish different cell types, and the quality of such a set could have a significant impact on the clustering accuracy. All existing scRNA-seq clustering tools include a feature selection step relying on simple unsupervised feature selection methods, mostly based on the statistical moments of gene-wise expression distributions. In this work, we carefully evaluate the impact of feature selection on cell clustering accuracy. In addition, we develop a feature selection algorithm named FEAture SelecTion (FEAST), which provides more representative features. We apply the method on 12 public scRNA-seq datasets and demonstrate that using features selected by FEAST with existing clustering tools significantly improves the clustering accuracy.
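To make concrete what this abstract calls simple unsupervised feature selection based on statistical moments, here is a minimal sketch that ranks genes by their variance-to-mean ratio. This is the kind of baseline the abstract contrasts against, not the FEAST algorithm itself; the function name, the dictionary input layout and the choice of dispersion score are all assumptions for illustration.

```python
from statistics import mean, pvariance


def select_features_by_dispersion(expr, n_top):
    """Rank genes by variance-to-mean ratio and return the top n_top names.

    expr maps a gene name to its expression values across cells
    (an assumed toy layout). Genes with zero mean get a score of 0.
    """
    scores = {}
    for gene, values in expr.items():
        m = mean(values)
        scores[gene] = pvariance(values) / m if m > 0 else 0.0
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:n_top]


# Toy data: geneA varies strongly across cells, geneB and geneD are constant.
toy_expr = {
    "geneA": [0, 0, 10, 12],
    "geneB": [5, 5, 5, 5],
    "geneC": [1, 2, 1, 2],
    "geneD": [0, 0, 0, 0],
}
print(select_features_by_dispersion(toy_expr, 2))  # ['geneA', 'geneC']
```

A moment-based score like this uses no information about cell groups, which is the limitation the abstract points to when motivating a more representative selection.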
Affiliation(s)
- Kenong Su
- Department of Computer Science, Emory University
| | - Tianwei Yu
- School of Data Science, The Chinese University of Hong Kong, Shenzhen
| | - Hao Wu
- Department of Biostatistics and Bioinformatics, Emory University, 201 Dowman Dr, Atlanta, GA 30322, USA
| |
|
14
|
Liu Z. Clustering Single-Cell RNA-Seq Data with Regularized Gaussian Graphical Model. Genes (Basel) 2021; 12:311. [PMID: 33671799 PMCID: PMC7927011 DOI: 10.3390/genes12020311] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 12/27/2020] [Revised: 02/07/2021] [Accepted: 02/15/2021] [Indexed: 11/20/2022] Open
Abstract
Single-cell RNA-seq (scRNA-seq) is a powerful tool to measure the expression patterns of individual cells and discover heterogeneity and functional diversity among cell populations. Due to variability, it is challenging to analyze such data efficiently. Many clustering methods have been developed using at least one free parameter. Different choices for free parameters may lead to substantially different visualizations and clusters. Tuning free parameters is also time consuming. Thus there is need for a simple, robust, and efficient clustering method. In this paper, we propose a new regularized Gaussian graphical clustering (RGGC) method for scRNA-seq data. RGGC is based on high-order (partial) correlations and subspace learning, and is robust over a wide-range of a regularized parameter λ. Therefore, we can simply set λ=2 or λ=log(p) for AIC (Akaike information criterion) or BIC (Bayesian information criterion) without cross-validation. Cell subpopulations are discovered by the Louvain community detection algorithm that determines the number of clusters automatically. There is no free parameter to be tuned with RGGC. When evaluated with simulated and benchmark scRNA-seq data sets against widely used methods, RGGC is computationally efficient and one of the top performers. It can detect inter-sample cell heterogeneity, when applied to glioblastoma scRNA-seq data.
Affiliation(s)
- Zhenqiu Liu
- Department of Public Health Sciences, Pennsylvania State University College of Medicine, 500 University Drive, Hershey, PA 17033, USA
| |
|
15
|
Nayak R, Hasija Y. A hitchhiker's guide to single-cell transcriptomics and data analysis pipelines. Genomics 2021; 113:606-619. [PMID: 33485955 DOI: 10.1016/j.ygeno.2021.01.007] [Citation(s) in RCA: 24] [Impact Index Per Article: 8.0] [Received: 08/09/2020] [Revised: 12/30/2020] [Accepted: 01/18/2021] [Indexed: 12/20/2022]
Abstract
Single-cell transcriptomics (SCT) is a tour de force in the era of big omics data that has led to the accumulation of massive cellular transcription data at an astounding resolution of single cells. It provides valuable insights into cells previously unachieved by bulk cell analysis and is proving crucial in uncovering cellular heterogeneity, identifying rare cell populations, distinct cell-lineage trajectories, and mechanisms involved in complex cellular processes. SCT data is highly complex and necessitates advanced statistical and computational methods for analysis. This review provides a comprehensive overview of the steps in a typical SCT workflow, starting from experimental protocol to data analysis, deliberating various pipelines used. We discuss recent trends, challenges, machine learning methods for data analysis, and future prospects. We conclude by listing the multitude of scRNA-seq data applications and how it shall revolutionize our understanding of cellular biology and diseases.
Affiliation(s)
- Richa Nayak
- Department of Biotechnology, Delhi Technological University, Delhi 110042, India
| | - Yasha Hasija
- Department of Biotechnology, Delhi Technological University, Delhi 110042, India.
| |
|
16
|
Abstract
Kidney fibrosis is the hallmark of chronic kidney disease progression; however, at present no antifibrotic therapies exist [1-3]. The origin, functional heterogeneity and regulation of scar-forming cells that occur during human kidney fibrosis remain poorly understood [1,2,4]. Here, using single-cell RNA sequencing, we profiled the transcriptomes of cells from the proximal and non-proximal tubules of healthy and fibrotic human kidneys to map the entire human kidney. This analysis enabled us to map all matrix-producing cells at high resolution, and to identify distinct subpopulations of pericytes and fibroblasts as the main cellular sources of scar-forming myofibroblasts during human kidney fibrosis. We used genetic fate-tracing, time-course single-cell RNA sequencing and ATAC-seq (assay for transposase-accessible chromatin using sequencing) experiments in mice, and spatial transcriptomics in human kidney fibrosis, to shed light on the cellular origins and differentiation of human kidney myofibroblasts and their precursors at high resolution. Finally, we used this strategy to detect potential therapeutic targets, and identified NKD2 as a myofibroblast-specific target in human kidney fibrosis.
|
17
|
Xie K, Huang Y, Zeng F, Liu Z, Chen T. scAIDE: clustering of large-scale single-cell RNA-seq data reveals putative and rare cell types. NAR Genom Bioinform 2020; 2:lqaa082. [PMID: 33575628 PMCID: PMC7671411 DOI: 10.1093/nargab/lqaa082] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Received: 05/09/2020] [Revised: 08/20/2020] [Accepted: 09/18/2020] [Indexed: 02/07/2023] Open
Abstract
Recent advancements in both single-cell RNA-sequencing technology and computational resources facilitate the study of cell types on global populations. Up to millions of cells can now be sequenced in one experiment; thus, accurate and efficient computational methods are needed to provide clustering and post-analysis of assigning putative and rare cell types. Here, we present a novel unsupervised deep learning clustering framework that is robust and highly scalable. To overcome the high level of noise, scAIDE first incorporates an autoencoder-imputation network with a distance-preserved embedding network (AIDE) to learn a good representation of data, and then applies a random projection hashing based k-means algorithm to accommodate the detection of rare cell types. We analyzed a 1.3 million neural cell dataset within 30 min, obtaining 64 clusters which were mapped to 19 putative cell types. In particular, we further identified three different neural stem cell developmental trajectories in these clusters. We also classified two subpopulations of malignant cells in a small glioblastoma dataset using scAIDE. We anticipate that scAIDE would provide a more in-depth understanding of cell development and diseases.
Affiliation(s)
- Kaikun Xie
- Institute for Artificial Intelligence, Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
- Tsinghua-Fuzhou Institute of Digital Technology, Beijing National Research Center for Information Science and Technology, Tsinghua University, Beijing 100084, China
| | - Yu Huang
- Institute for Artificial Intelligence, Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
- Tsinghua-Fuzhou Institute of Digital Technology, Beijing National Research Center for Information Science and Technology, Tsinghua University, Beijing 100084, China
| | - Feng Zeng
- Department of Automation, Xiamen University, Xiamen 361005, China
- National Institute for Data Science in Health and Medicine, Xiamen University, Xiamen 361005, China
| | - Zehua Liu
- Center for Computational and Integrative Biology, Massachusetts General Hospital, Harvard Medical School, Boston, MA 02114, USA
- Department of Molecular Biology, Massachusetts General Hospital, Harvard Medical School, Boston, MA 02114, USA
- Broad Institute of Massachusetts Institute of Technology and Harvard, Cambridge, MA 02142, USA
| | - Ting Chen
- Institute for Artificial Intelligence, Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
- Tsinghua-Fuzhou Institute of Digital Technology, Beijing National Research Center for Information Science and Technology, Tsinghua University, Beijing 100084, China
| |
|
18
|
Cao W, Lee H, Wu W, Zaman A, McCorkle S, Yan M, Chen J, Xing Q, Sinnott-Armstrong N, Xu H, Sailani MR, Tang W, Cui Y, Liu J, Guan H, Lv P, Sun X, Sun L, Han P, Lou Y, Chang J, Wang J, Gao Y, Guo J, Schenk G, Shain AH, Biddle FG, Collisson E, Snyder M, Bivona TG. Multi-faceted epigenetic dysregulation of gene expression promotes esophageal squamous cell carcinoma. Nat Commun 2020; 11:3675. [PMID: 32699215 PMCID: PMC7376194 DOI: 10.1038/s41467-020-17227-z] [Citation(s) in RCA: 59] [Impact Index Per Article: 14.8] [Received: 12/20/2019] [Accepted: 06/17/2020] [Indexed: 12/20/2022] Open
Abstract
Epigenetic landscapes can shape physiologic and disease phenotypes. We used integrative, high resolution multi-omics methods to delineate the methylome landscape and characterize the oncogenic drivers of esophageal squamous cell carcinoma (ESCC). We found 98% of CpGs are hypomethylated across the ESCC genome. Hypo-methylated regions are enriched in areas with heterochromatin binding markers (H3K9me3, H3K27me3), while hyper-methylated regions are enriched in polycomb repressive complex (EZH2/SUZ12) recognizing regions. Altered methylation in promoters, enhancers, and gene bodies, as well as in polycomb repressive complex occupancy and CTCF binding sites are associated with cancer-specific gene dysregulation. Epigenetic-mediated activation of non-canonical WNT/β-catenin/MMP signaling and a YY1/lncRNA ESCCAL-1/ribosomal protein network are uncovered and validated as potential novel ESCC driver alterations. This study advances our understanding of how epigenetic landscapes shape cancer pathogenesis and provides a resource for biomarker and target discovery.
Grants
- U01 CA217882 NCI NIH HHS
- R01 CA239604 NCI NIH HHS
- K22 CA217997 NCI NIH HHS
- R01 CA227807 NCI NIH HHS
- U54 CA224081 NCI NIH HHS
- R01 CA211052 NCI NIH HHS
- S10 OD020141 NIH HHS
- U24 CA210974 NCI NIH HHS
- R01 CA222862 NCI NIH HHS
- R01 CA230263 NCI NIH HHS
- R01 CA169338 NCI NIH HHS
- R01 CA204302 NCI NIH HHS
- R01 CA178015 NCI NIH HHS
- the National Natural Science Foundation of China (Grants 81171992, 31570899), the Natural Science Foundation of Henan (Grants 182102310328, 162300410279, 182300410374, 192102310096); the Education Department of Henan Province(18B310022,19A320037).
- National Natural Science Foundation of China (National Science Foundation of China)
- the Natural Science Foundation of Henan (Grants 182102310328, 162300410279, 182300410374, 192102310096); the Education Department of Henan Province(18B310022,19A320037). This work used the Genome Sequencing Service Center by Stanford Center for Genomics and Personalized Medicine Sequencing Center, supported by the grant award NIH S10OD020141. E.A.C acknowledge funding support from NCI Grants R01 [CA178015, CA222862, CA227807, CA239604, CA230263] and U24 [CA210974]. T.G.B acknowledges funding support from NIH / NCI U01CA217882, NIH / NCI U54CA224081, NIH / NCI R01CA204302, NIH / NCI R01CA211052, NIH / NCI R01CA169338, and the Pew-Stewart Foundations.
Affiliation(s)
- Wei Cao
- Translational Medical Center, Zhengzhou Central Hospital Affiliated Zhengzhou University, Zhengzhou, China.
| | - Hayan Lee
- Department of Genetics, School of Medicine, Stanford University, CA, USA
| | - Wei Wu
- Department of Medicine, University of California San Francisco, San Francisco, CA, USA.
- Helen Diller Family Comprehensive Cancer Center, University of California San Francisco, San Francisco, CA, USA.
| | - Aubhishek Zaman
- Department of Medicine, University of California San Francisco, San Francisco, CA, USA
- Helen Diller Family Comprehensive Cancer Center, University of California San Francisco, San Francisco, CA, USA
| | - Sean McCorkle
- Computational Science Initiative, Brookhaven National Laboratory, Upton, NY, USA
| | - Ming Yan
- Basic Medical College, Zhengzhou University, Zhengzhou, China
| | - Justin Chen
- Department of Genetics, School of Medicine, Stanford University, CA, USA
| | - Qinghe Xing
- Institutes of Biomedical Sciences and Children's Hospital, Fudan University, Shanghai, China
| | | | - Hongen Xu
- Precision Medicine Center, The Academy of Medical Sciences, Zhengzhou University, Zhengzhou, China
| | - M Reza Sailani
- Department of Genetics, School of Medicine, Stanford University, CA, USA
| | - Wenxue Tang
- Precision Medicine Center, The Academy of Medical Sciences, Zhengzhou University, Zhengzhou, China
| | - Yuanbo Cui
- Translational Medical Center, Zhengzhou Central Hospital Affiliated Zhengzhou University, Zhengzhou, China
| | - Jia Liu
- Translational Medical Center, Zhengzhou Central Hospital Affiliated Zhengzhou University, Zhengzhou, China
| | - Hongyan Guan
- Translational Medical Center, Zhengzhou Central Hospital Affiliated Zhengzhou University, Zhengzhou, China
| | - Pengju Lv
- Translational Medical Center, Zhengzhou Central Hospital Affiliated Zhengzhou University, Zhengzhou, China
| | - Xiaoyan Sun
- Translational Medical Center, Zhengzhou Central Hospital Affiliated Zhengzhou University, Zhengzhou, China
| | - Lei Sun
- Translational Medical Center, Zhengzhou Central Hospital Affiliated Zhengzhou University, Zhengzhou, China
| | - Pengli Han
- Translational Medical Center, Zhengzhou Central Hospital Affiliated Zhengzhou University, Zhengzhou, China
| | - Yanan Lou
- Translational Medical Center, Zhengzhou Central Hospital Affiliated Zhengzhou University, Zhengzhou, China
| | - Jing Chang
- Jiangsu Mai Jian Biotechnology Development Company, Wuxi, China
| | - Jinwu Wang
- Department of Pathology, Linzhou Cancer Hospital, Linzhou, China
| | - Yuchi Gao
- Annoroad Gene Company, Beijing, China
| | - Jiancheng Guo
- Precision Medicine Center, The Academy of Medical Sciences, Zhengzhou University, Zhengzhou, China
| | - Gundolf Schenk
- Bakar Computational Health Sciences Institute, University of California San Francisco, San Francisco, CA, USA
| | - Alan Hunter Shain
- Department of Dermatology, University of California San Francisco, San Francisco, CA, USA
| | - Fred G Biddle
- Department of Biological Sciences, University of Calgary, Calgary, Canada
| | - Eric Collisson
- Department of Medicine, University of California San Francisco, San Francisco, CA, USA
- Helen Diller Family Comprehensive Cancer Center, University of California San Francisco, San Francisco, CA, USA
| | - Michael Snyder
- Department of Genetics, School of Medicine, Stanford University, CA, USA.
| | - Trever G Bivona
- Department of Medicine, University of California San Francisco, San Francisco, CA, USA.
- Helen Diller Family Comprehensive Cancer Center, University of California San Francisco, San Francisco, CA, USA.
| |
|
19
|
Dimension Reduction and Clustering Models for Single-Cell RNA Sequencing Data: A Comparative Study. Int J Mol Sci 2020; 21:ijms21062181. [PMID: 32235704 PMCID: PMC7139673 DOI: 10.3390/ijms21062181] [Citation(s) in RCA: 19] [Impact Index Per Article: 4.8] [Received: 01/20/2020] [Revised: 03/09/2020] [Accepted: 03/20/2020] [Indexed: 12/30/2022] Open
Abstract
With recent advances in single-cell RNA sequencing, enormous transcriptome datasets have been generated. These datasets have furthered our understanding of cellular heterogeneity and its underlying mechanisms in homogeneous populations. Single-cell RNA sequencing (scRNA-seq) data clustering can group cells belonging to the same cell type based on patterns embedded in gene expression. However, scRNA-seq data are high-dimensional, noisy, and sparse, owing to the limitations of existing scRNA-seq technologies. Traditional clustering methods are neither effective nor efficient for high-dimensional and sparse matrix computations. Therefore, several dimension reduction methods have been introduced. To validate a reliable and standard research routine, we conducted a comprehensive review and evaluation of four classical dimension reduction methods and five clustering models. Four experiments were progressively performed on two large scRNA-seq datasets using 20 models. Results showed that the feature selection method contributed positively to high-dimensional and sparse scRNA-seq data. Moreover, feature-extraction methods were able to promote clustering performance, although this benefit was not consistent across all settings. Independent component analysis (ICA) performed well in those small compressed feature spaces, whereas principal component analysis was steadier than all the other feature-extraction methods. In addition, ICA was not ideal for fuzzy C-means clustering in scRNA-seq data analysis. K-means clustering was combined with feature-extraction methods to achieve good results.
|
20
|
Mieth B, Hockley JRF, Görnitz N, Vidovic MMC, Müller KR, Gutteridge A, Ziemek D. Using transfer learning from prior reference knowledge to improve the clustering of single-cell RNA-Seq data. Sci Rep 2019; 9:20353. [PMID: 31889137 PMCID: PMC6937257 DOI: 10.1038/s41598-019-56911-z] [Citation(s) in RCA: 16] [Impact Index Per Article: 3.2] [Received: 04/29/2019] [Accepted: 12/13/2019] [Indexed: 01/21/2023] Open
Abstract
In many research areas scientists are interested in clustering objects within small datasets while making use of prior knowledge from large reference datasets. We propose a method to apply the machine learning concept of transfer learning to unsupervised clustering problems and show its effectiveness in the field of single-cell RNA sequencing (scRNA-Seq). The goal of scRNA-Seq experiments is often the definition and cataloguing of cell types from the transcriptional output of individual cells. To improve the clustering of small disease- or tissue-specific datasets, for which the identification of rare cell types is often problematic, we propose a transfer learning method to utilize large and well-annotated reference datasets, such as those produced by the Human Cell Atlas. Our approach modifies the dataset of interest while incorporating key information from the larger reference dataset via Non-negative Matrix Factorization (NMF). The modified dataset is subsequently provided to a clustering algorithm. We empirically evaluate the benefits of our approach on simulated scRNA-Seq data as well as on publicly available datasets. Finally, we present results for the analysis of a recently published small dataset and find improved clustering when transferring knowledge from a large reference dataset. Implementations of the method are available at https://github.com/nicococo/scRNA.
Affiliation(s)
- Bettina Mieth
- Machine Learning Group, Technische Universität Berlin, Berlin, 10587, Germany
| | - James R F Hockley
- Department of Pharmacology, University of Cambridge, Cambridge, CB2 1PD, United Kingdom
- GlaxoSmithKline, Stevenage, SG1 2NY, United Kingdom
| | - Nico Görnitz
- Machine Learning Group, Technische Universität Berlin, Berlin, 10587, Germany
| | - Marina M-C Vidovic
- Machine Learning Group, Technische Universität Berlin, Berlin, 10587, Germany
| | - Klaus-Robert Müller
- Machine Learning Group, Technische Universität Berlin, Berlin, 10587, Germany.
- Department of Brain and Cognitive Engineering, Korea University, Seoul, 02841, Republic of Korea.
- Max Planck Institute for Informatics, Saarbrücken, 66123, Germany.
| | | | - Daniel Ziemek
- Pfizer, Worldwide Research and Development, Berlin, 10785, Germany.
| |
|
21
|
Petegrosso R, Li Z, Kuang R. Machine learning and statistical methods for clustering single-cell RNA-sequencing data. Brief Bioinform 2019; 21:1209-1223. [DOI: 10.1093/bib/bbz063] [Citation(s) in RCA: 65] [Impact Index Per Article: 13.0] [Received: 01/25/2019] [Revised: 04/04/2019] [Accepted: 04/29/2019] [Indexed: 01/08/2023] Open
Abstract
Single-cell RNA sequencing (scRNA-seq) technologies have enabled the large-scale whole-transcriptome profiling of each individual single cell in a cell population. A core analysis of the scRNA-seq transcriptome profiles is to cluster the single cells to reveal cell subtypes and infer cell lineages based on the relations among the cells. This article reviews the machine learning and statistical methods for clustering scRNA-seq transcriptomes developed in the past few years. The review focuses on how conventional clustering techniques such as hierarchical clustering, graph-based clustering, mixture models, $k$-means, ensemble learning, neural networks and density-based clustering are modified or customized to tackle the unique challenges in scRNA-seq data analysis, such as the dropout of low-expression genes, low and uneven read coverage of transcripts, highly variable total mRNAs from single cells and ambiguous cell markers in the presence of technical biases and irrelevant confounding biological variations. We review how cell-specific normalization, the imputation of dropouts and dimension reduction methods can be applied with new statistical or optimization strategies to improve the clustering of single cells. We will also introduce those more advanced approaches to cluster scRNA-seq transcriptomes in time series data and multiple cell populations and to detect rare cell types. Several software packages developed to support the cluster analysis of scRNA-seq data are also reviewed and experimentally compared to evaluate their performance and efficiency. Finally, we conclude with useful observations and possible future directions in scRNA-seq data analytics.
Availability
All the source code and data are available at https://github.com/kuanglab/single-cell-review.
Affiliation(s)
| | - Zhuliu Li
- CREST (Ensai, Université Bretagne Loire), Bruz, France
| | - Rui Kuang
- CREST (Ensai, Université Bretagne Loire), Bruz, France
| |
|
22
|
Li X, Zhang S, Wong KC. Single-cell RNA-seq interpretations using evolutionary multiobjective ensemble pruning. Bioinformatics 2018; 35:2809-2817. [DOI: 10.1093/bioinformatics/bty1056] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.2] [Received: 07/26/2018] [Revised: 10/31/2018] [Accepted: 12/21/2018] [Indexed: 11/14/2022] Open
Abstract
Motivation
In recent years, single-cell RNA sequencing enables us to discover cell types or even subtypes. Its increasing availability provides opportunities to identify cell populations from single-cell RNA-seq data. Computational methods have been employed to reveal the gene expression variations among multiple cell populations. Unfortunately, the existing ones can suffer from realistic restrictions such as experimental noises, numerical instability, high dimensionality and computational scalability.
Results
We propose an evolutionary multiobjective ensemble pruning algorithm (EMEP) that addresses those realistic restrictions. Our EMEP algorithm first applies unsupervised dimensionality reduction to project data from the original high dimensions to low-dimensional subspaces; basic clustering algorithms are applied in those new subspaces to generate different clustering results to form cluster ensembles. However, most of those cluster ensembles are unnecessarily bulky, at the expense of extra time costs and memory consumption. To overcome that problem, EMEP is designed to dynamically select the suitable clustering results from the ensembles. Moreover, to guide the multiobjective ensemble evolution, three cluster validity indices including the overall cluster deviation, the within-cluster compactness and the number of basic partition clusters are formulated as the objective functions to unleash its cell type discovery performance using evolutionary multiobjective optimization. We applied EMEP to 55 simulated datasets and seven real single-cell RNA-seq datasets, including six single-cell RNA-seq datasets and one large-scale dataset with 3005 cells and 4412 genes. Two case studies are also conducted to reveal mechanistic insights into the biological relevance of EMEP. We found that EMEP can achieve superior performance over the other clustering algorithms, demonstrating that EMEP can identify cell populations clearly.
Availability and implementation
EMEP is written in Matlab and available at https://github.com/lixt314/EMEP
Supplementary information
Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Xiangtao Li
- School of Computer Science and Information Technology, Northeast Normal University, Changchun, Jilin, China
- Department of Computer Science, City University of Hong Kong, Hong Kong SAR
| | - Shixiong Zhang
- Department of Computer Science, City University of Hong Kong, Hong Kong SAR
| | - Ka-Chun Wong
- Department of Computer Science, City University of Hong Kong, Hong Kong SAR
| |
|
23
|
The International Conference on Intelligent Biology and Medicine (ICIBM) 2016: summary and innovation in genomics. BMC Genomics 2017; 18:703. [PMID: 28984207 PMCID: PMC5629612 DOI: 10.1186/s12864-017-4018-6] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Indexed: 01/06/2023] Open
Abstract
In this editorial, we first summarize the 2016 International Conference on Intelligent Biology and Medicine (ICIBM 2016) that was held on December 8–10, 2016 in Houston, Texas, USA, and then briefly introduce the ten research articles included in this supplement issue. ICIBM 2016 included four workshops or tutorials, four keynote lectures, four conference invited talks, eight concurrent scientific sessions and a poster session for 53 accepted abstracts, covering current topics in bioinformatics, systems biology, intelligent computing, and biomedical informatics. Through our call for papers, a total of 77 original manuscripts were submitted to ICIBM 2016. After peer review, 11 articles were selected in this special issue, covering topics such as single cell RNA-seq analysis method, genome sequence and variation analysis, bioinformatics method for vaccine development, and cancer genomics.
|