1
Traversa D, Chiara M. Mapping Cell Identity from scRNA-seq: A primer on computational methods. Comput Struct Biotechnol J 2025; 27:1559-1569. [PMID: 40270709 PMCID: PMC12017876 DOI: 10.1016/j.csbj.2025.03.051] [Received: 11/15/2024] [Revised: 03/29/2025] [Accepted: 03/31/2025] [Indexed: 04/25/2025]
Abstract
Single-cell (sc) technologies mark a conceptual and methodological breakthrough in the way we study cells, the basic units of life. Thanks to these technological developments, large-scale initiatives are under way to map all the cell types in the human body, with the ambitious aim of gaining cell-level resolution of physiological development and disease. Owing to its broad applicability and ease of interpretation, scRNA-seq is probably the most common sc-based application. This assay uses high-throughput RNA sequencing to capture gene expression profiles at the sc level. Subsequently, under the assumption that differences in transcriptional programs correspond to distinct cellular identities, ad hoc computational methods are used to infer cell types from gene expression patterns. A wide array of computational methods has been developed for this task. However, depending on the underlying algorithmic approach and associated computational requirements, each method might have a specific range of application, with implications that are not always clear to the end user. Here we provide a concise overview of state-of-the-art computational methods for cell identity annotation in scRNA-seq, tailored for new users and non-computational scientists. To this end, we classify existing tools into five main categories and discuss their key strengths, limitations and range of application.
Affiliation(s)
- Daniele Traversa
  - Department of Biosciences, Università degli Studi di Milano, via Celoria 26, Milan 20133, Italy
- Matteo Chiara
  - Department of Biosciences, Università degli Studi di Milano, via Celoria 26, Milan 20133, Italy
2
Sujana STA, Shahjaman M, Singha AC. Application of bioinformatic tools in cell type classification for single-cell RNA-seq data. Comput Biol Chem 2025; 115:108332. [PMID: 39793515 DOI: 10.1016/j.compbiolchem.2024.108332] [Received: 10/03/2024] [Revised: 12/06/2024] [Accepted: 12/24/2024] [Indexed: 01/13/2025]
Abstract
Advances in single-cell RNA sequencing (scRNA-seq) technology have significantly transformed genomics research, enabling the handling of thousands of cells in each experiment. As of now, 32,068 research studies have been cataloged in the PubMed database. The primary aims of scRNA-seq investigations are to identify cell types, understand the antitumor immune response, and identify new and uncommon cell types. Traditional techniques for identifying cell types rely on microscopy, histology, and pathological characteristics. However, the complexity of the instruments and the need for precise experimental design make it difficult to fully capture the overall heterogeneity. Unsupervised clustering and supervised classification methods have been used to solve this task. Supervised cell type classification methods have gained popularity because large-scale, high-quality, well-annotated data have become available and because they give more robust results than clustering methods. A recent study showed that the support vector machine (SVM) gives high-quality classification performance in different scenarios. In this article, we compare and evaluate the performance of four different SVM kernels (sigmoid, linear, radial, polynomial). Experiments on three standard scRNA-seq datasets indicate that SVM with the linear kernel and SVM with the sigmoid kernel classify the cells more accurately (approximately 99%), and that the linear kernel has a remarkably fast computation time. We also evaluate the results using evaluation metrics commonly applied to single-cell data: F1 score, MCC, and AUC. Additionally, the study sheds light on the potential of SVM kernels to extract the underlying information in scRNA-seq data more effectively.
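As a worked illustration of two of the metrics named in this abstract, the sketch below computes the F1 score and the Matthews correlation coefficient (MCC) from binary confusion-matrix counts. It is not the authors' code: the function names and the example counts are illustrative assumptions, and a multi-class cell type problem would typically apply these per class (one cell type versus the rest).

```python
import math


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 score: harmonic mean of precision and recall."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient for a binary confusion matrix."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0.0 when any marginal total is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical counts for one cell type versus all others.
tp, tn, fp, fn = 90, 95, 5, 10
print(round(f1_score(tp, fp, fn), 3))
print(round(mcc(tp, tn, fp, fn), 3))
```

MCC uses all four cells of the confusion matrix, which is one reason it is often reported alongside F1 when cell types are imbalanced.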
Affiliation(s)
- Shah Tania Akter Sujana
  - Bioinformatics Lab, Department of Statistics, Begum Rokeya University, Rangpur 5404, Bangladesh
- Md Shahjaman
  - Bioinformatics Lab, Department of Statistics, Begum Rokeya University, Rangpur 5404, Bangladesh
- Atul Chandra Singha
  - Bioinformatics Lab, Department of Statistics, Begum Rokeya University, Rangpur 5404, Bangladesh
3
Zou Z, Liu Y, Bai Y, Luo J, Zhang Z. scTrans: Sparse attention powers fast and accurate cell type annotation in single-cell RNA-seq data. PLoS Comput Biol 2025; 21:e1012904. [PMID: 40184563 PMCID: PMC11970913 DOI: 10.1371/journal.pcbi.1012904] [Received: 10/02/2024] [Accepted: 02/24/2025] [Indexed: 04/06/2025]
Abstract
Cell type annotation is crucial in single-cell RNA sequencing data analysis because it enables significant biological discoveries and deepens our understanding of tissue biology. Given the high-dimensional and highly sparse nature of single-cell RNA sequencing data, most existing annotation tools focus on highly variable genes to reduce dimensionality and computational load. However, this approach inevitably results in information loss, potentially weakening the model's generalization performance and adaptability to novel datasets. To mitigate this issue, we developed scTrans, a single-cell Transformer-based model, which employs sparse attention to utilize all non-zero genes, thereby effectively reducing the input data dimensionality while minimizing information loss. We validated the speed and accuracy of scTrans by performing cell type annotation on 31 different tissues within the Mouse Cell Atlas. Remarkably, even with datasets nearing a million cells, scTrans efficiently performs cell type annotation with limited computational resources. Furthermore, scTrans demonstrates strong generalization capabilities, accurately annotating cells in novel datasets and generating high-quality latent representations, which are essential for precise clustering and trajectory analysis.
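To make the idea of "using only non-zero genes" concrete, here is a toy sketch that keeps the indices and values of expressed genes for one cell. It only illustrates why a sparse profile yields a much shorter input than the full gene vector; it is not the scTrans implementation, and the helper name and example values are hypothetical.

```python
from typing import List, Tuple


def nonzero_gene_tokens(expression: List[float]) -> Tuple[List[int], List[float]]:
    """Return (gene indices, expression values) for genes with non-zero counts."""
    indices = [i for i, value in enumerate(expression) if value != 0]
    values = [expression[i] for i in indices]
    return indices, values


# Toy cell with 10 genes, most of them unexpressed (hypothetical values).
cell = [0, 0, 3.0, 0, 0, 1.0, 0, 0, 0, 2.0]
indices, values = nonzero_gene_tokens(cell)
print(indices, values)
```

In this toy cell only three of ten genes would be passed on, and no gene is discarded by a variability filter.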
Affiliation(s)
- Zhiyi Zou
  - College of Computer Science and Electronic Engineering, Hunan University, Changsha, Hunan, China
- Ying Liu
  - College of Computer Science and Electronic Engineering, Hunan University, Changsha, Hunan, China
- Yuting Bai
  - College of Computer Science and Electronic Engineering, Hunan University, Changsha, Hunan, China
- Jiawei Luo
  - College of Computer Science and Electronic Engineering, Hunan University, Changsha, Hunan, China
- Zhaolei Zhang
  - Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, Ontario, Canada
  - Department of Computer Science, University of Toronto, Toronto, Ontario, Canada
  - Department of Molecular Genetics, University of Toronto, Toronto, Ontario, Canada
4
Fievet G, Broséus J, Meyre D, Hergalant S. adverSCarial: assessing the vulnerability of single-cell RNA-sequencing classifiers to adversarial attacks. Bioinformatics 2025; 41:btaf168. [PMID: 40234247 PMCID: PMC12036967 DOI: 10.1093/bioinformatics/btaf168] [Received: 02/08/2024] [Revised: 04/01/2025] [Accepted: 04/13/2025] [Indexed: 04/17/2025]
Abstract
MOTIVATION Several machine learning (ML) algorithms dedicated to the detection of healthy and diseased cell types from single-cell RNA sequencing (scRNA-seq) data have been proposed for biomedical purposes. This raises concerns about their vulnerability to adversarial attacks, exploiting threats causing malicious alterations of the classifiers' output with defective and well-crafted input.
RESULTS With adverSCarial, adversarial attacks of single-cell transcriptomic data can easily be simulated in a range of ways, from expanded but undetectable modifications to aggressive and targeted ones, enabling vulnerability assessment of scRNA-seq classifiers to variations of gene expression, whether technical, biological, or intentional. We exemplify the usefulness and performance with a panel of attack modes proposed in adverSCarial by assessing the robustness of five scRNA-seq classifiers, each belonging to a distinct class of ML algorithm, and explore the potential unlocked by exposing their inner workings and sensitivities on four different datasets. These analyses can guide the development of more reliable models, with improved interpretability, usable in biomedical research and future clinical applications.
AVAILABILITY AND IMPLEMENTATION adverSCarial is a freely available R package accessible from Bioconductor: https://bioconductor.org/packages/adverSCarial/ or https://doi.org/10.18129/B9.bioc.adverSCarial. A development version is available at https://github.com/GhislainFievet/adverSCarial.
Affiliation(s)
- Ghislain Fievet
  - INSERM U1256, Nutrition, Genetics, and Environmental Risk Exposure (NGERE), University of Lorraine, Nancy, 54500, France
- Julien Broséus
  - INSERM U1256, Nutrition, Genetics, and Environmental Risk Exposure (NGERE), University of Lorraine, Nancy, 54500, France
  - Department of Biological Hematology, Laboratory Center, University Hospital of Nancy, Nancy, 54500, France
- David Meyre
  - INSERM U1256, Nutrition, Genetics, and Environmental Risk Exposure (NGERE), University of Lorraine, Nancy, 54500, France
  - Department of Molecular Medicine, Division of Biochemistry, Molecular Biology, and Nutrition, University Hospital of Nancy, Nancy, 54500, France
- Sébastien Hergalant
  - INSERM U1256, Nutrition, Genetics, and Environmental Risk Exposure (NGERE), University of Lorraine, Nancy, 54500, France
5
Ito K, Hirakawa T, Shigenobu S, Fujiyoshi H, Yamashita T. Mouse-Geneformer: A deep learning model for mouse single-cell transcriptome and its cross-species utility. PLoS Genet 2025; 21:e1011420. [PMID: 40106407 PMCID: PMC11964219 DOI: 10.1371/journal.pgen.1011420] [Received: 09/10/2024] [Revised: 04/02/2025] [Accepted: 02/17/2025] [Indexed: 03/22/2025]
Abstract
Deep learning techniques are increasingly utilized to analyze large-scale single-cell RNA sequencing (scRNA-seq) data, offering valuable insights from complex transcriptome datasets. Geneformer, a pre-trained model using a Transformer Encoder architecture and human scRNA-seq datasets, has demonstrated remarkable success in human transcriptome analysis. However, given the prominence of the mouse, Mus musculus, as a primary mammalian model in biological and medical research, there is an acute need for a mouse-specific version of Geneformer. In this study, we developed a mouse-specific Geneformer (mouse-Geneformer) by constructing a large transcriptome dataset consisting of 21 million mouse scRNA-seq profiles and pre-training Geneformer on this dataset. The mouse-Geneformer effectively models the mouse transcriptome and, upon fine-tuning for downstream tasks, enhances the accuracy of cell type classification. In silico perturbation experiments using mouse-Geneformer successfully identified disease-causing genes that have been validated in in vivo experiments. These results demonstrate the feasibility of analyzing mouse data with mouse-Geneformer and highlight the robustness of the Geneformer architecture, applicable to any species with large-scale transcriptome data available. Furthermore, we found that mouse-Geneformer can analyze human transcriptome data in a cross-species manner. After the ortholog-based gene name conversion, the analysis of human scRNA-seq data using mouse-Geneformer, followed by fine-tuning with human data, achieved cell type classification accuracy comparable to that obtained using the original human Geneformer. In in silico simulation experiments using human disease models, we obtained results similar to human-Geneformer for the myocardial infarction model but only partially consistent results for the COVID-19 model, a trait unique to humans (laboratory mice are not susceptible to SARS-CoV-2). 
These findings suggest the potential for cross-species application of the Geneformer model while emphasizing the importance of species-specific models for capturing the full complexity of disease mechanisms. Despite the existence of the original Geneformer tailored for humans, human research could benefit from mouse-Geneformer due to its inclusion of samples that are ethically or technically inaccessible for humans, such as embryonic tissues and certain disease models. Additionally, this cross-species approach indicates potential use for non-model organisms, where obtaining large-scale single-cell transcriptome data is challenging.
Affiliation(s)
- Keita Ito
  - Graduate School of Engineering, Chubu University, Kasugai, Aichi, Japan
- Tsubasa Hirakawa
  - Department of Artificial Intelligence and Robotics, Center for Mathematical Science and Artificial Intelligence, Chubu University, Kasugai, Aichi, Japan
- Shuji Shigenobu
  - Trans-Scale Biology Center, National Institute for Basic Biology, Okazaki, Aichi, Japan
  - Life Science Center for Survival Dynamics, Tsukuba Advanced Research Alliance (TARA), University of Tsukuba, Tsukuba, Ibaraki, Japan
- Takayoshi Yamashita
  - Department of Artificial Intelligence and Robotics, Chubu University, Kasugai, Aichi, Japan
6
Ceccarelli F, Liò P, Holden S. AnnoGCD: a generalized category discovery framework for automatic cell type annotation. NAR Genom Bioinform 2024; 6:lqae166. [PMID: 39660254 PMCID: PMC11629990 DOI: 10.1093/nargab/lqae166] [Received: 09/01/2024] [Revised: 10/15/2024] [Accepted: 11/11/2024] [Indexed: 12/12/2024]
Abstract
The identification of cell types in single-cell RNA sequencing (scRNA-seq) data is a critical task in understanding complex biological systems. Traditional supervised machine learning methods rely on large, well-labeled datasets, which are often impractical to obtain in open-world scenarios due to budget constraints and incomplete information. To address these challenges, we propose a novel computational framework, named AnnoGCD, building on Generalized Category Discovery (GCD) and Anomaly Detection (AD) for automatic cell type annotation. Our semi-supervised method combines labeled and unlabeled data to accurately classify known cell types and to discover novel ones, even in imbalanced datasets. AnnoGCD includes a semi-supervised block to first classify known cell types, followed by an unsupervised block aimed at identifying and clustering novel cell types. We evaluated our approach on five human scRNA-seq datasets and a mouse model atlas, demonstrating superior performance in both known and novel cell type identification compared to existing methods. Our model also exhibited robustness in datasets with significant class imbalance. The results suggest that AnnoGCD is a powerful tool for the automatic annotation of cell types in scRNA-seq data, providing a scalable solution for biological research and clinical applications. Our code and the datasets used for evaluations are publicly available on GitHub: https://github.com/cecca46/AnnoGCD/.
Affiliation(s)
- Francesco Ceccarelli
  - Department of Computer Science and Technology, University of Cambridge, 15 JJ Thomson Ave, CB3 0FD, Cambridge, UK
- Pietro Liò
  - Department of Computer Science and Technology, University of Cambridge, 15 JJ Thomson Ave, CB3 0FD, Cambridge, UK
- Sean B Holden
  - Department of Computer Science and Technology, University of Cambridge, 15 JJ Thomson Ave, CB3 0FD, Cambridge, UK
7
Hong R, Tong Y, Tang H, Zeng T, Liu R. eMCI: An Explainable Multimodal Correlation Integration Model for Unveiling Spatial Transcriptomics and Intercellular Signaling. Research (Washington, D.C.) 2024; 7:0522. [PMID: 39494219 PMCID: PMC11528068 DOI: 10.34133/research.0522] [Received: 07/22/2024] [Revised: 09/23/2024] [Accepted: 10/14/2024] [Indexed: 11/05/2024]
Abstract
Current integration methods for single-cell RNA sequencing (scRNA-seq) data and spatial transcriptomics (ST) data are typically designed for specific tasks, such as deconvolution of cell types or spatial distribution prediction of RNA transcripts. These methods usually offer only a partial analysis of ST data, neglecting the complex relationship between spatial expression patterns underlying cell-type specificity and intercellular cross-talk. Here, we present eMCI, an explainable multimodal correlation integration model based on a deep neural network framework. eMCI leverages the fusion of scRNA-seq and ST data using different spot-cell correlations to integrate multiple synthetic analysis tasks of ST data at the cellular level. First, eMCI achieves better or comparable accuracy in cell-type classification and deconvolution, according to extensive evaluations and comparisons with state-of-the-art methods on both simulated and real ST datasets. Second, eMCI can identify key components across spatial domains responsible for different cell types and elucidate the spatial expression patterns underlying cell-type specificity and intercellular communication, by employing an attribution algorithm to dissect the visual input. In particular, eMCI has been applied to three cross-species datasets (zebrafish melanomas, soybean nodule maturation, and human embryonic lung), where it accurately and efficiently estimates per-spot cell composition and infers proximal and distal cellular interactions within the spatial and temporal context. In summary, eMCI serves as an integrative analytical framework to better resolve the spatial transcriptome based on existing single-cell datasets and to elucidate proximal and distal intercellular signal transduction mechanisms over spatial domains without requiring a prior biological reference. This approach is expected to facilitate the discovery of spatial expression patterns of potential biomolecules with cell-type and cell-cell communication specificity.
Affiliation(s)
- Renhao Hong
  - School of Mathematics, South China University of Technology, Guangzhou 510640, China
- Yuyan Tong
  - School of Mathematics, South China University of Technology, Guangzhou 510640, China
- Hui Tang
  - School of Mathematics and Big Data, Foshan University, Foshan 528000, China
- Tao Zeng
  - Guangzhou National Laboratory, Guangzhou, China
  - GMU-GIBH Joint School of Life Sciences, The Guangdong-Hong Kong-Macau Joint Laboratory for Cell Fate Regulation and Diseases, Guangzhou Laboratory, Guangzhou Medical University, Guangzhou, China
- Rui Liu
  - School of Mathematics, South China University of Technology, Guangzhou 510640, China
8
Qi Q, Wang Y, Huang Y, Fan Y, Li X. PredGCN: a Pruning-enabled Gene-Cell Net for automatic cell annotation of single cell transcriptome data. Bioinformatics 2024; 40:btae421. [PMID: 38924517 PMCID: PMC11236098 DOI: 10.1093/bioinformatics/btae421] [Received: 02/21/2024] [Revised: 05/27/2024] [Accepted: 06/25/2024] [Indexed: 06/28/2024]
Abstract
MOTIVATION The annotation of cell types from single-cell transcriptomics is essential for understanding the biological identity and functionality of cellular populations. Although manual annotation remains the gold standard, the advent of automatic pipelines has become crucial for scalable, unbiased, and cost-effective annotations. Nonetheless, the effectiveness of these automatic methods, particularly those employing deep learning, significantly depends on the architecture of the classifier and the quality and diversity of the training datasets.
RESULTS To address these limitations, we present a Pruning-enabled Gene-Cell Net (PredGCN) incorporating a Coupled Gene-Cell Net (CGCN) to enable representation learning and information storage. PredGCN integrates a Gene Splicing Net (GSN) and a Cell Stratification Net (CSN), employing a pruning operation (PrO) to dynamically tackle the complexity of heterogeneous cell identification. Among them, GSN leverages multiple statistical and hypothesis-driven feature extraction methods to selectively assemble genes with specificity for scRNA-seq data while CSN unifies elements based on diverse region demarcation principles, exploiting the representations from GSN and precise identification from different regional homogeneity perspectives. Furthermore, we develop a multi-objective Pareto pruning operation (Pareto PrO) to expand the dynamic capabilities of CGCN, optimizing the sub-network structure for accurate cell type annotation. Multiple comparison experiments on real scRNA-seq datasets from various species have demonstrated that PredGCN surpasses existing state-of-the-art methods, including its scalability to cross-species datasets. Moreover, PredGCN can uncover unknown cell types and provide functional genomic analysis by quantifying the influence of genes on cell clusters, bringing new insights into cell type identification and characterizing scRNA-seq data from different perspectives.
AVAILABILITY AND IMPLEMENTATION The source code is available at https://github.com/IrisQi7/PredGCN and test data is available at https://figshare.com/articles/dataset/PredGCN/25251163.
Affiliation(s)
- Qi Qi
  - School of Artificial Intelligence, Jilin University, Changchun 130012, China
- Yunhe Wang
  - School of Artificial Intelligence, Hebei University of Technology, Tianjin 300401, China
- Yujian Huang
  - College of Computer Science and Cyber Security, Chengdu University of Technology, Chengdu 610059, China
- Yi Fan
  - School of Artificial Intelligence, Jilin University, Changchun 130012, China
- Xiangtao Li
  - School of Artificial Intelligence, Jilin University, Changchun 130012, China
9
Mongia A, Zohora FT, Burget NG, Zhou Y, Saunders DC, Wang YJ, Brissova M, Powers AC, Kaestner KH, Vahedi G, Naji A, Schwartz GW, Faryabi RB. AnnoSpat annotates cell types and quantifies cellular arrangements from spatial proteomics. Nat Commun 2024; 15:3744. [PMID: 38702321 PMCID: PMC11068798 DOI: 10.1038/s41467-024-47334-0] [Received: 04/10/2023] [Accepted: 03/25/2024] [Indexed: 05/06/2024]
Abstract
Cellular composition and anatomical organization influence normal and aberrant organ functions. Emerging spatial single-cell proteomic assays such as Imaging Mass Cytometry (IMC) and Co-Detection by Indexing (CODEX) have facilitated the study of cellular composition and organization by enabling high-throughput measurement of cells and their localization directly in intact tissues. However, annotation of cell types and quantification of their relative localization in tissues remain challenging. To address these unmet needs for atlas-scale datasets like those of the Human Pancreas Analysis Program (HPAP), we develop AnnoSpat (Annotator and Spatial Pattern Finder), which uses neural network and point process algorithms to automatically identify cell types and quantify cell-cell proximity relationships. Our study of data from IMC and CODEX shows that AnnoSpat annotates cell types more rapidly and accurately than alternative approaches. Moreover, the application of AnnoSpat to type 1 diabetic, non-diabetic autoantibody-positive, and non-diabetic organ donor cohorts recapitulates known islet pathobiology and shows differential dynamics of pancreatic polypeptide (PP) cell abundance and CD8+ T cell infiltration in islets during type 1 diabetes progression.
Affiliation(s)
- Aanchal Mongia
  - Department of Pathology and Laboratory Medicine, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA
  - Epigenetics Institute, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA
- Fatema Tuz Zohora
  - Princess Margaret Cancer Centre, University Health Network, Toronto, ON, Canada
  - Vector Institute, University of Toronto, Toronto, ON, Canada
- Noah G Burget
  - Department of Pathology and Laboratory Medicine, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA
  - Epigenetics Institute, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA
- Yeqiao Zhou
  - Department of Pathology and Laboratory Medicine, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA
  - Epigenetics Institute, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA
- Diane C Saunders
  - Department of Medicine, Vanderbilt University Medical Center, Nashville, TN, USA
- Yue J Wang
  - Department of Genetics, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA
- Marcela Brissova
  - Department of Medicine, Vanderbilt University Medical Center, Nashville, TN, USA
- Alvin C Powers
  - Department of Medicine, Vanderbilt University Medical Center, Nashville, TN, USA
  - Department of Molecular Physiology and Biophysics, Vanderbilt University, Nashville, TN, USA
  - VA Tennessee Valley Healthcare System, Nashville, TN, USA
- Klaus H Kaestner
  - Epigenetics Institute, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA
  - Department of Genetics, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA
  - Institute for Diabetes, Obesity and Metabolism, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA
- Golnaz Vahedi
  - Epigenetics Institute, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA
  - Department of Genetics, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA
  - Institute for Diabetes, Obesity and Metabolism, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA
- Ali Naji
  - Institute for Diabetes, Obesity and Metabolism, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA
  - Department of Surgery, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA
- Gregory W Schwartz
  - Princess Margaret Cancer Centre, University Health Network, Toronto, ON, Canada
  - Vector Institute, University of Toronto, Toronto, ON, Canada
  - Department of Medical Biophysics, University of Toronto, Toronto, ON, Canada
- Robert B Faryabi
  - Department of Pathology and Laboratory Medicine, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA
  - Epigenetics Institute, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA
10
Bottomly D, McWeeney S. Just how transformative will AI/ML be for immuno-oncology? J Immunother Cancer 2024; 12:e007841. [PMID: 38531545 DOI: 10.1136/jitc-2023-007841] [Accepted: 01/15/2024] [Indexed: 03/28/2024]
Abstract
Immuno-oncology involves the study of approaches which harness the patient's immune system to fight malignancies. Immuno-oncology, as with every other biomedical and clinical research field as well as clinical operations, is in the midst of technological revolutions, which vastly increase the amount of available data. Recent advances in artificial intelligence and machine learning (AI/ML) have received much attention in terms of their potential to harness available data to improve insights and outcomes in many areas including immuno-oncology. In this review, we discuss important aspects to consider when evaluating the potential impact of AI/ML applications in the clinic. We highlight four clinical/biomedical challenges relevant to immuno-oncology and how they might be addressed by the latest advancements in AI/ML. These challenges include (1) efficiency in clinical workflows, (2) curation of high-quality image data, (3) finding, extracting and synthesizing text knowledge, and (4) addressing small cohort size in immunotherapeutic evaluation cohorts. Finally, we outline how advancements in reinforcement and federated learning, as well as the development of best practices for ethical and unbiased data generation, are likely to drive future innovations.
Affiliation(s)
- Daniel Bottomly
  - Knight Cancer Institute, Oregon Health and Science University, Portland, Oregon, USA
- Shannon McWeeney
  - Knight Cancer Institute, Oregon Health and Science University, Portland, Oregon, USA
11
Li C, Ye G, Jiang Y, Wang Z, Yu H, Yang M. Artificial Intelligence in battling infectious diseases: A transformative role. J Med Virol 2024; 96:e29355. [PMID: 38179882 DOI: 10.1002/jmv.29355] [Received: 10/16/2023] [Revised: 12/01/2023] [Accepted: 12/17/2023] [Indexed: 01/06/2024]
Abstract
It is widely acknowledged that infectious diseases have wrought immense havoc on human society and are regarded as adversaries that humanity cannot elude. In recent years, the advancement of Artificial Intelligence (AI) technology has ushered in a revolutionary era in infectious disease prevention and control. This evolution encompasses early warning of outbreaks, contact tracing, infection diagnosis, drug discovery, and the facilitation of drug design, alongside other facets of epidemic management. This article presents an overview of the use of AI systems in the field of infectious diseases, with a specific focus on their role during the COVID-19 pandemic. The article also highlights the current challenges that AI faces within this domain and proposes strategies for mitigating them. There is a pressing need to further harness the potential applications of AI across multiple domains to strengthen its capacity to address future disease outbreaks effectively.
Affiliation(s)
- Chunhui Li
  - School of Life Science, Advanced Research Institute of Multidisciplinary Science, Key Laboratory of Molecular Medicine and Biotherapy, Beijing Institute of Technology, Beijing, People's Republic of China
- Guoguo Ye
  - Shenzhen Key Laboratory of Pathogen and Immunity, National Clinical Research Center for Infectious Disease, The Third People's Hospital of Shenzhen, Second Hospital Affiliated to Southern University of Science and Technology, Shenzhen, China
- Yinghan Jiang
  - School of Life Science, Advanced Research Institute of Multidisciplinary Science, Key Laboratory of Molecular Medicine and Biotherapy, Beijing Institute of Technology, Beijing, People's Republic of China
- Zhiming Wang
  - School of Life Science, Advanced Research Institute of Multidisciplinary Science, Key Laboratory of Molecular Medicine and Biotherapy, Beijing Institute of Technology, Beijing, People's Republic of China
- Haiyang Yu
  - Hangzhou Yalla Information Technology Service Co., Ltd., Hangzhou, People's Republic of China
- Minghui Yang
  - School of Life Science, Advanced Research Institute of Multidisciplinary Science, Key Laboratory of Molecular Medicine and Biotherapy, Beijing Institute of Technology, Beijing, People's Republic of China
12
|
Wang S, Shen B, Guo L, Shang M, Liu J, Sun Q, Shen B. scFed: federated learning for cell type classification with scRNA-seq. Brief Bioinform 2023; 25:bbad507. [PMID: 38221903 PMCID: PMC10788680 DOI: 10.1093/bib/bbad507] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/20/2023] [Revised: 12/03/2023] [Accepted: 12/12/2023] [Indexed: 01/16/2024] Open
Abstract
The advent of single-cell RNA sequencing (scRNA-seq) has revolutionized our understanding of cellular heterogeneity and complexity in biological tissues. However, the nature of large, sparse scRNA-seq datasets and privacy regulations present challenges for efficient cell identification. Federated learning provides a solution, allowing efficient and private data use. Here, we introduce scFed, a unified federated learning framework that allows for benchmarking of four classification algorithms without violating data privacy, including single-cell-specific and general-purpose classifiers. We evaluated scFed using eight publicly available scRNA-seq datasets with diverse sizes, species and technologies, assessing its performance via intra-dataset and inter-dataset experimental setups. We find that scFed performs well on a variety of datasets with competitive accuracy to centralized models. Though Transformer-based model excels in centralized training, its performance slightly lags behind single-cell-specific model within the scFed framework, coupled with a notable time complexity concern. Our study not only helps select suitable cell identification methods but also highlights federated learning's potential for privacy-preserving, collaborative biomedical research.
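As a rough illustration of the federated idea in this abstract, the sketch below averages model parameters from several sites, weighting each site by its number of cells, so that only parameters and never expression matrices leave a site. This is the generic federated averaging (FedAvg) rule, assumed here for illustration; the abstract does not state which aggregation rule scFed uses. The function name `federated_average` and all numbers are hypothetical.

```python
def federated_average(client_weights, client_sizes):
    """Combine per-site parameter vectors into one global vector.

    Each site's vector is weighted by its number of local cells.
    Generic FedAvg rule; an assumption, not scFed's documented method.
    """
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(n_params)
    ]


# Made-up example: three sites, each holding a two-parameter local model.
client_weights = [[0.0, 2.0], [1.0, 4.0], [4.0, 0.0]]
client_sizes = [100, 300, 100]  # number of cells held at each site

global_weights = federated_average(client_weights, client_sizes)
print(global_weights)
```

In a full training loop the global vector would be sent back to the sites for another round of local training; that loop is omitted here.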
Affiliation(s)
- Shuang Wang
  - Joint Laboratory of Artificial Intelligence for Critical Care Medicine, Department of Critical Care Medicine and Institutes for Systems Genetics, Frontiers Science Center for Disease-related Molecular Network, West China Hospital, Sichuan University, 610212, Chengdu, China
  - Department of Bioinformatics, Hangzhou Nuowei Information Technology Co., Ltd, 310053, Hangzhou, China
- Bochen Shen
  - Department of Bioinformatics, Hangzhou Nuowei Information Technology Co., Ltd, 310053, Hangzhou, China
- Lanting Guo
  - Department of Bioinformatics, Hangzhou Nuowei Information Technology Co., Ltd, 310053, Hangzhou, China
- Mengqi Shang
  - Department of Bioinformatics, Hangzhou Nuowei Information Technology Co., Ltd, 310053, Hangzhou, China
- Jinze Liu
  - Department of Biostatistics, Virginia Commonwealth University, 23298, Richmond, VA, USA
- Qi Sun
  - Department of Bioinformatics, Hangzhou Nuowei Information Technology Co., Ltd, 310053, Hangzhou, China
- Bairong Shen
  - Joint Laboratory of Artificial Intelligence for Critical Care Medicine, Department of Critical Care Medicine and Institutes for Systems Genetics, Frontiers Science Center for Disease-related Molecular Network, West China Hospital, Sichuan University, 610212, Chengdu, China
13
Ng GYL, Tan SC, Ong CS. On the use of QDE-SVM for gene feature selection and cell type classification from scRNA-seq data. PLoS One 2023; 18:e0292961. [PMID: 37856458 PMCID: PMC10586655 DOI: 10.1371/journal.pone.0292961] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/10/2023] [Accepted: 10/03/2023] [Indexed: 10/21/2023] Open
Abstract
Cell type identification is one of the fundamental tasks in single-cell RNA sequencing (scRNA-seq) studies. It is a key step to facilitate downstream interpretations such as differential expression, trajectory inference, etc. scRNA-seq data contains technical variations that could affect the interpretation of the cell types. Therefore, gene selection, also known as feature selection in data science, plays an important role in selecting informative genes for scRNA-seq cell type identification. Generally speaking, feature selection methods are categorized into filter-, wrapper-, and embedded-based approaches. From the existing literature, methods from filter- and embedded-based approaches are widely applied in scRNA-seq gene selection tasks. The wrapper-based method that gives promising results in other fields has yet to be extensively utilized for selecting gene features from scRNA-seq data; in addition, most of the existing wrapper methods used in this field are clustering instead of classification-based. With a large number of annotated data available today, this study applied a classification-based approach as an alternative to the clustering-based wrapper method. In our work, a quantum-inspired differential evolution (QDE) wrapped with a classification method was introduced to select a subset of genes from twelve well-known scRNA-seq transcriptomic datasets to identify cell types. In particular, the QDE was combined with different machine-learning (ML) classifiers namely logistic regression, decision tree, support vector machine (SVM) with linear and radial basis function kernels, as well as extreme learning machine. The linear SVM wrapped with QDE, namely QDE-SVM, was chosen by referring to the feature selection results from the experiment. QDE-SVM showed a superior cell type classification performance among QDE wrapping with other ML classifiers as well as the recent wrapper methods (i.e., FSCAM, SSD-LAHC, MA-HS, and BSF). QDE-SVM achieved an average accuracy of 0.9559, while the other wrapper methods achieved average accuracies in the range of 0.8292 to 0.8872.
Affiliation(s)
- Grace Yee Lin Ng
  - Faculty of Information Science and Technology, Multimedia University, Bukit Beruang, Melaka, Malaysia
- Shing Chiang Tan
  - Faculty of Information Science and Technology, Multimedia University, Bukit Beruang, Melaka, Malaysia
- Chia Sui Ong
  - Faculty of Information Science and Technology, Multimedia University, Bukit Beruang, Melaka, Malaysia
14
Carido M, Völkner M, Steinheuer LM, Wagner F, Kurth T, Dumler N, Ulusoy S, Wieneke S, Norniella AV, Golfieri C, Khattak S, Schönfelder B, Scamozzi M, Zoschke K, Canzler S, Hackermüller J, Ader M, Karl MO. Reliability of human retina organoid generation from hiPSC-derived neuroepithelial cysts. Front Cell Neurosci 2023; 17:1166641. [PMID: 37868194 PMCID: PMC10587494 DOI: 10.3389/fncel.2023.1166641] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/15/2023] [Accepted: 09/18/2023] [Indexed: 10/24/2023] Open
Abstract
The possible applications for human retinal organoids (HROs) derived from human induced pluripotent stem cells (hiPSC) rely on the robustness and transferability of the methodology for their generation. Standardized strategies and parameters to effectively assess, compare, and optimize organoid protocols are starting to be established, but are not yet complete. To advance this, we explored the efficiency and reliability of a differentiation method, called CYST protocol, that facilitates retina generation by forming neuroepithelial cysts from hiPSC clusters. Here, we tested seven different hiPSC lines which reproducibly generated HROs. Histological and ultrastructural analyses indicate that HRO differentiation and maturation are regulated. The different hiPSC lines appeared to be a larger source of variance than experimental rounds. Although previous reports have shown that HROs in several other protocols contain a rather low number of cones, HROs from the CYST protocol are consistently richer in cones and with a comparable ratio of cones, rods, and Müller glia. To provide further insight into HRO cell composition, we studied single cell RNA sequencing data and applied CaSTLe, a transfer learning approach. Additionally, we devised a potential strategy to systematically evaluate different organoid protocols side-by-side through parallel differentiation from the same hiPSC batches: In an explorative study, the CYST protocol was compared to a conceptually different protocol based on the formation of cell aggregates from single hiPSCs. Comparing four hiPSC lines showed that both protocols reproduced key characteristics of retinal epithelial structure and cell composition, but the CYST protocol provided a higher HRO yield. So far, our data suggest that CYST-derived HROs remained stable up to at least day 200, while single hiPSC-derived HROs showed spontaneous pathologic changes by day 200. Overall, our data provide insights into the efficiency, reproducibility, and stability of the CYST protocol for generating HROs, which will be useful for further optimizing organoid systems, as well as for basic and translational research applications.
Affiliation(s)
- Madalena Carido
  - Center for Regenerative Therapies Dresden (CRTD), TU Dresden, Dresden, Germany
- Manuela Völkner
  - Center for Regenerative Therapies Dresden (CRTD), TU Dresden, Dresden, Germany
  - German Center for Neurodegenerative Diseases (DZNE) Dresden, Dresden, Germany
- Lisa Maria Steinheuer
  - Department Computational Biology, Helmholtz Centre for Environmental Research-UFZ, Leipzig, Germany
  - Department of Computer Science, Leipzig University, Leipzig, Germany
- Felix Wagner
  - Center for Regenerative Therapies Dresden (CRTD), TU Dresden, Dresden, Germany
- Thomas Kurth
  - Center for Molecular and Cellular Bioengineering (CMCB), Technology Platform, Core Facility Electron Microscopy and Histology, TU Dresden, Dresden, Germany
- Natalie Dumler
  - Center for Regenerative Therapies Dresden (CRTD), TU Dresden, Dresden, Germany
- Selen Ulusoy
  - Center for Regenerative Therapies Dresden (CRTD), TU Dresden, Dresden, Germany
- Stephanie Wieneke
  - German Center for Neurodegenerative Diseases (DZNE) Dresden, Dresden, Germany
- Cristina Golfieri
  - German Center for Neurodegenerative Diseases (DZNE) Dresden, Dresden, Germany
- Shahryar Khattak
  - Center for Molecular and Cellular Bioengineering (CMCB), Stem Cell Engineering Facility, TU Dresden, Dresden, Germany
- Bruno Schönfelder
  - German Center for Neurodegenerative Diseases (DZNE) Dresden, Dresden, Germany
- Maria Scamozzi
  - Center for Regenerative Therapies Dresden (CRTD), TU Dresden, Dresden, Germany
- Katja Zoschke
  - German Center for Neurodegenerative Diseases (DZNE) Dresden, Dresden, Germany
- Sebastian Canzler
  - Department Computational Biology, Helmholtz Centre for Environmental Research-UFZ, Leipzig, Germany
- Jörg Hackermüller
  - Department Computational Biology, Helmholtz Centre for Environmental Research-UFZ, Leipzig, Germany
  - Department of Computer Science, Leipzig University, Leipzig, Germany
- Marius Ader
  - Center for Regenerative Therapies Dresden (CRTD), TU Dresden, Dresden, Germany
- Mike O Karl
  - Center for Regenerative Therapies Dresden (CRTD), TU Dresden, Dresden, Germany
  - German Center for Neurodegenerative Diseases (DZNE) Dresden, Dresden, Germany
15
Madadi Y, Monavarfeshani A, Chen H, Stamer WD, Williams RW, Yousefi S. Artificial Intelligence Models for Cell Type and Subtype Identification Based on Single-Cell RNA Sequencing Data in Vision Science. IEEE/ACM Trans Comput Biol Bioinform 2023; 20:2837-2852. [PMID: 37294649 PMCID: PMC10631573 DOI: 10.1109/tcbb.2023.3284795] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/11/2023]
Abstract
Single-cell RNA sequencing (scRNA-seq) provides a high throughput, quantitative and unbiased framework for scientists in many research fields to identify and characterize cell types within heterogeneous cell populations from various tissues. However, scRNA-seq based identification of discrete cell-types is still labor intensive and depends on prior molecular knowledge. Artificial intelligence has provided faster, more accurate, and user-friendly approaches for cell-type identification. In this review, we discuss recent advances in cell-type identification methods using artificial intelligence techniques based on single-cell and single-nucleus RNA sequencing data in vision science. The main purpose of this review paper is to assist vision scientists not only to select suitable datasets for their problems, but also to be aware of the appropriate computational tools to perform their analysis. Developing novel methods for scRNA-seq data analysis remains to be addressed in future studies.
16
Toseef M, Olayemi Petinrin O, Wang F, Rahaman S, Liu Z, Li X, Wong KC. Deep transfer learning for clinical decision-making based on high-throughput data: comprehensive survey with benchmark results. Brief Bioinform 2023:bbad254. [PMID: 37455245 DOI: 10.1093/bib/bbad254] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/16/2023] [Revised: 06/04/2023] [Accepted: 06/20/2023] [Indexed: 07/18/2023] Open
Abstract
The rapid growth of omics-based data has revolutionized biomedical research and precision medicine, allowing machine learning models to be developed for cutting-edge performance. However, despite the wealth of high-throughput data available, the performance of these models is hindered by the lack of sufficient training data, particularly in clinical research (in vivo experiments). As a result, translating this knowledge into clinical practice, such as predicting drug responses, remains a challenging task. Transfer learning is a promising tool that bridges the gap between data domains by transferring knowledge from the source to the target domain. Researchers have proposed transfer learning to predict clinical outcomes by leveraging pre-clinical data (mouse, zebrafish), highlighting its vast potential. In this work, we present a comprehensive literature review of deep transfer learning methods for health informatics and clinical decision-making, focusing on high-throughput molecular data. Previous reviews mostly covered image-based transfer learning works, while we present a more detailed analysis of transfer learning papers. Furthermore, we evaluated original studies based on different evaluation settings across cross-validations, data splits and model architectures. The result shows that those transfer learning methods have great potential; high-throughput sequencing data and state-of-the-art deep learning models lead to significant insights and conclusions. Additionally, we explored various datasets in transfer learning papers with statistics and visualization.
Affiliation(s)
- Muhammad Toseef
  - Department of Computer Science, City University of Hong Kong, Hong Kong SAR
- Fuzhou Wang
  - Department of Computer Science, City University of Hong Kong, Hong Kong SAR
- Saifur Rahaman
  - Department of Computer Science, City University of Hong Kong, Hong Kong SAR
- Zhe Liu
  - Department of Computer Science, City University of Hong Kong, Hong Kong SAR
- Xiangtao Li
  - School of Artificial Intelligence, Jilin University, Jilin, China
- Ka-Chun Wong
  - Department of Computer Science, City University of Hong Kong, Hong Kong SAR
  - Hong Kong Institute for Data Science, City University of Hong Kong, Hong Kong SAR
17
Kim H, Kim HK, Hong D, Kim M, Jang S, Yang CS, Yoon S. Identification of ulcerative colitis-specific immune cell signatures from public single-cell RNA-seq data. Genes Genomics 2023; 45:957-967. [PMID: 37133723 DOI: 10.1007/s13258-023-01390-w] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 03/03/2023] [Accepted: 04/13/2023] [Indexed: 05/04/2023]
Abstract
BACKGROUND Single-cell RNA-seq enabled microscopic studies on tissue microenvironment of many diseases. Inflammatory bowel disease, an autoimmune disease, is involved with various dysfunction of immune cells, for which single-cell RNA-seq may provide us a deeper insight into the causes and mechanism of this complex disease. OBJECTIVE In this work, we used public single-cell RNA-seq data to study tissue microenvironment around ulcerative colitis, an inflammatory bowel disease causing chronic inflammation and ulcers in large intestine. METHODS Since not all the datasets provide cell-type annotations, we first identified cell identities to select cell populations of our interest. Differentially expressed genes and gene set enrichment analysis was then performed to infer the polarization/activation state of macrophages and T cells. Cell-to-cell interaction analysis was also performed to discover distinct interactions in ulcerative colitis. RESULTS Differentially expressed genes analysis of the two datasets confirmed the regulation of CTLA4, IL2RA, and CCL5 genes in the T cell subset and regulation of S100A8/A9, CLEC10A genes in macrophages. Cell-to-cell interaction analysis showed CD4+ T cells and macrophages interact actively to each other. We also identified IL-18 pathway activation in inflammatory macrophages, evidence that CD4+ T cells induce Th1 and Th2 differentiation, and also found that macrophages regulate T cell activation through different ligand-receptor pairs, viz. CD86-CTL4, LGALS9-CD47, SIRPA-CD47, and GRN-TNFRSF1B. CONCLUSION Analysis of these immune cell subsets may suggest novel strategies for the treatment of inflammatory bowel disease.
Affiliation(s)
- Hanbyeol Kim
  - Dept of Computer Science, College of SW Convergence, Dankook Univ, Yongin-si, 16890, Korea
- Hyo Keun Kim
  - Dept of Molecular and Life Science and Center for Bionano Intelligence Education and Research, Hanyang University, Ansan-si, 15588, Korea
- Dawon Hong
  - Dept of Molecular Biology, Graduate Department of Bioconvergence Engineering, Dankook University, Yongin-si, 16890, Korea
- Minsu Kim
  - Dept of Computer Science, College of SW Convergence, Dankook Univ, Yongin-si, 16890, Korea
- Sein Jang
  - Dept of Molecular and Life Science and Center for Bionano Intelligence Education and Research, Hanyang University, Ansan-si, 15588, Korea
- Chul-Su Yang
  - Dept of Medicinal/Molecular and Life Science and Center for Bionano Intelligence Education and Research, Hanyang University, Ansan-si, 15588, Korea
- Seokhyun Yoon
  - Dept of Electronics & Electrical Eng, College of Engineering, Dankook Univ, Yongin-si, 16890, Korea.
18
Theodoris CV, Xiao L, Chopra A, Chaffin MD, Al Sayed ZR, Hill MC, Mantineo H, Brydon EM, Zeng Z, Liu XS, Ellinor PT. Transfer learning enables predictions in network biology. Nature 2023; 618:616-624. [PMID: 37258680 PMCID: PMC10949956 DOI: 10.1038/s41586-023-06139-9] [Citation(s) in RCA: 247] [Impact Index Per Article: 123.5] [Received: 03/29/2022] [Accepted: 04/27/2023] [Indexed: 06/02/2023]
Abstract
Mapping gene networks requires large amounts of transcriptomic data to learn the connections between genes, which impedes discoveries in settings with limited data, including rare diseases and diseases affecting clinically inaccessible tissues. Recently, transfer learning has revolutionized fields such as natural language understanding and computer vision by leveraging deep learning models pretrained on large-scale general datasets that can then be fine-tuned towards a vast array of downstream tasks with limited task-specific data. Here, we developed a context-aware, attention-based deep learning model, Geneformer, pretrained on a large-scale corpus of about 30 million single-cell transcriptomes to enable context-specific predictions in settings with limited data in network biology. During pretraining, Geneformer gained a fundamental understanding of network dynamics, encoding network hierarchy in the attention weights of the model in a completely self-supervised manner. Fine-tuning towards a diverse panel of downstream tasks relevant to chromatin and network dynamics using limited task-specific data demonstrated that Geneformer consistently boosted predictive accuracy. Applied to disease modelling with limited patient data, Geneformer identified candidate therapeutic targets for cardiomyopathy. Overall, Geneformer represents a pretrained deep learning model from which fine-tuning towards a broad range of downstream applications can be pursued to accelerate discovery of key network regulators and candidate therapeutic targets.
Affiliation(s)
- Christina V Theodoris
  - Department of Data Science, Dana-Farber Cancer Institute, Boston, MA, USA.
  - Cardiovascular Disease Initiative and Precision Cardiology Laboratory, Broad Institute of MIT and Harvard, Cambridge, MA, USA.
  - Division of Genetics and Genomics, Boston Children's Hospital, Boston, MA, USA.
  - Harvard Medical School Genetics Training Program, Boston, USA.
- Ling Xiao
  - Cardiovascular Disease Initiative and Precision Cardiology Laboratory, Broad Institute of MIT and Harvard, Cambridge, MA, USA
  - Cardiovascular Research Center, Massachusetts General Hospital, Boston, MA, USA
- Anant Chopra
  - Precision Cardiology Laboratory, Bayer US LLC, Cambridge, MA, USA
- Mark D Chaffin
  - Cardiovascular Disease Initiative and Precision Cardiology Laboratory, Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Zeina R Al Sayed
  - Cardiovascular Disease Initiative and Precision Cardiology Laboratory, Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Matthew C Hill
  - Cardiovascular Disease Initiative and Precision Cardiology Laboratory, Broad Institute of MIT and Harvard, Cambridge, MA, USA
  - Cardiovascular Research Center, Massachusetts General Hospital, Boston, MA, USA
- Helene Mantineo
  - Cardiovascular Disease Initiative and Precision Cardiology Laboratory, Broad Institute of MIT and Harvard, Cambridge, MA, USA
  - Cardiovascular Research Center, Massachusetts General Hospital, Boston, MA, USA
- Zexian Zeng
  - Department of Data Science, Dana-Farber Cancer Institute, Boston, MA, USA
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA
- X Shirley Liu
  - Department of Data Science, Dana-Farber Cancer Institute, Boston, MA, USA
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA
  - Center for Functional Cancer Epigenetics, Dana-Farber Cancer Institute, Boston, MA, USA
- Patrick T Ellinor
  - Cardiovascular Disease Initiative and Precision Cardiology Laboratory, Broad Institute of MIT and Harvard, Cambridge, MA, USA.
  - Cardiovascular Research Center, Massachusetts General Hospital, Boston, MA, USA.
19
Yan D, Sun Z, Fang J, Cao S, Wang W, Chang X, Badirli S, Fu H, Liu Y. scRAA: the development of a robust and automatic annotation procedure for single-cell RNA sequencing data. J Biopharm Stat 2023:1-14. [PMID: 37162278 DOI: 10.1080/10543406.2023.2208671] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/11/2023]
Abstract
A critical task in single-cell RNA sequencing (scRNA-Seq) data analysis is to identify cell types from heterogeneous tissues. While the majority of classification methods demonstrated high performance in scRNA-Seq annotation problems, a robust and accurate solution is desired to generate reliable outcomes for downstream analyses, for instance, marker genes identification, differentially expressed genes, and pathway analysis. It is hard to establish a universally good metric. Thus, a universally good classification method for all kinds of scenarios does not exist. In addition, reference and query data in cell classification are usually from different experimental batches, and failure to consider batch effects may result in misleading conclusions. To overcome this bottleneck, we propose a robust ensemble approach to classify cells and utilize a batch correction method between reference and query data. We simulated four scenarios that comprise simple to complex batch effect and account for varying cell-type proportions. We further tested our approach on both lung and pancreas data. We found improved prediction accuracy and robust performance across simulation scenarios and real data. The incorporation of batch effect correction between reference and query, and the ensemble approach improve cell-type prediction accuracy while maintaining robustness. We demonstrated these through simulated and real scRNA-Seq data.
Affiliation(s)
- Dongyan Yan
  - Global Statistical Science, Eli Lilly & Co, Indianapolis, Indiana, USA
- Zhe Sun
  - Global Statistical Science, Eli Lilly & Co, Indianapolis, Indiana, USA
- Jiyuan Fang
  - Global Statistical Science, Eli Lilly & Co, Indianapolis, Indiana, USA
- Shanshan Cao
  - Global Statistical Science, Eli Lilly & Co, Indianapolis, Indiana, USA
- Wenjie Wang
  - Advance Analytics and Data Science, Eli Lilly & Co, Indianapolis, Indiana, USA
- Xinyue Chang
  - Advance Analytics and Data Science, Eli Lilly & Co, Indianapolis, Indiana, USA
- Sarkhan Badirli
  - Advance Analytics and Data Science, Eli Lilly & Co, Indianapolis, Indiana, USA
- Haoda Fu
  - Advance Analytics and Data Science, Eli Lilly & Co, Indianapolis, Indiana, USA
- Yushi Liu
  - Global Statistical Science, Eli Lilly & Co, Indianapolis, Indiana, USA
20
Lee J, Kim M, Kang K, Yang CS, Yoon S. Hierarchical cell-type identifier accurately distinguishes immune-cell subtypes enabling precise profiling of tissue microenvironment with single-cell RNA-sequencing. Brief Bioinform 2023; 24:bbad006. [PMID: 36681937 PMCID: PMC10025442 DOI: 10.1093/bib/bbad006] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/07/2022] [Revised: 12/22/2022] [Accepted: 01/02/2023] [Indexed: 01/23/2023] Open
Abstract
Single-cell RNA-seq enabled in-depth study on tissue micro-environment and immune-profiling, where a crucial step is to annotate cell identity. Immune cells play key roles in many diseases, whereas their activities are hard to track due to their diverse and highly variable nature. Existing cell-type identifiers had limited performance for this purpose. We present HiCAT, a hierarchical, marker-based cell-type identifier utilising gene set analysis for statistical scoring for given markers. It features successive identification of major-type, minor-type and subsets utilising subset markers structured in a three-level taxonomy tree. Comparison with manual annotation and pairwise match test showed HiCAT outperforms others in major- and minor-type identification. For subsets, we qualitatively evaluated the marker expression profile demonstrating that HiCAT provide the clearest immune-cell landscape. HiCAT was also used for immune-cell profiling in ulcerative colitis and discovered distinct features of the disease in macrophage and T-cell subsets that could not be identified previously.
Affiliation(s)
- Joongho Lee
  - Dept. of Computer Science, College of SW Convergence, Dankook University, Yongin-si, Korea, 16890
- Minsoo Kim
  - Dept. of Computer Science, College of SW Convergence, Dankook University, Yongin-si, Korea, 16890
- Keunsoo Kang
  - Dept. of Microbiology, College of Natural Sciences, Dankook University, Cheonan-si, Korea, 31116
- Chul-Su Yang
  - Dept. of Molecular and Life Science, Center for Bionano Intelligence Education and Research, Hanyang University, Ansan, Korea, 15588
- Seokhyun Yoon
  - Dept. of Electronics & Electrical Eng., College of Engineering, Dankook University, Yongin-si Korea, 16890
21
Dong X, Chowdhury S, Victor U, Li X, Qian L. Semi-Supervised Deep Learning for Cell Type Identification From Single-Cell Transcriptomic Data. IEEE/ACM Trans Comput Biol Bioinform 2023; 20:1492-1505. [PMID: 35536811 DOI: 10.1109/tcbb.2022.3173587] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 05/04/2023]
Abstract
Cell type identification from single-cell transcriptomic data is a common goal of single-cell RNA sequencing (scRNAseq) data analysis. Deep neural networks have been employed to identify cell types from scRNAseq data with high performance. However, it requires a large amount of individual cells with accurate and unbiased annotated types to train the identification models. Unfortunately, labeling the scRNAseq data is cumbersome and time-consuming as it involves manual inspection of marker genes. To overcome this challenge, we propose a semi-supervised learning model "SemiRNet" to use unlabeled scRNAseq cells and a limited amount of labeled scRNAseq cells to implement cell identification. The proposed model is based on recurrent convolutional neural networks (RCNN), which includes a shared network, a supervised network and an unsupervised network. The proposed model is evaluated on two large scale single-cell transcriptomic datasets. It is observed that the proposed model is able to achieve encouraging performance by learning on the very limited amount of labeled scRNAseq cells together with a large number of unlabeled scRNAseq cells.
22
Cell Type Annotation Model Selection: General-Purpose vs. Pattern-Aware Feature Gene Selection in Single-Cell RNA-Seq Data. Genes (Basel) 2023; 14:genes14030596. [PMID: 36980868 PMCID: PMC10048047 DOI: 10.3390/genes14030596] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/21/2022] [Revised: 02/22/2023] [Accepted: 02/24/2023] [Indexed: 03/03/2023] Open
Abstract
With the advances in high-throughput sequencing technology, a growing body of research has examined heterogeneity among cells. Differences between individual cells’ functionality are determined based on the differences in the gene expression profiles. Although clustering methods perform well, manual annotation of the resulting cell clusters remains a challenge that has yet to be made faster and more scalable. On the other hand, due to the lack of enough labelled datasets, just a few supervised techniques have been used in cell type identification, and they obtained more robust results compared to clustering methods. A recent study showed that a complementary step of feature selection helped support vector machine (SVM) to outperform other classifiers in different scenarios. In this article, we compare and evaluate the performance of two state-of-the-art supervised methods, XGBoost and SVM, with information gain as a feature selection method. The results of the experiments on three standard scRNA-seq datasets indicate that XGBoost automatically annotates cell types in a simpler and more scalable framework. Additionally, the study sheds light on the potential use of boosting tree approaches combined with deep neural networks to capture underlying information of single-cell RNA-Seq data more effectively. The approach can also be used to identify marker genes and in other applications in biological studies.
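To make the feature selection step named in this abstract concrete, the sketch below scores genes by information gain, IG(Y; X) = H(Y) − H(Y | X), where Y is the cell-type label and X is a gene's expression state. It assumes expression has been binarised to expressed / not expressed, which is a simplification chosen here and not something the abstract specifies; the gene names and toy data are made up.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a sequence of discrete labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(expressed, labels):
    """IG(labels; expressed) = H(labels) - H(labels | expressed)."""
    n = len(labels)
    conditional = 0.0
    for value in set(expressed):
        subset = [lab for flag, lab in zip(expressed, labels) if flag == value]
        conditional += len(subset) / n * entropy(subset)
    return entropy(labels) - conditional


# Toy data (hypothetical): six cells, two cell types, three binarised genes.
labels = ["T", "T", "T", "B", "B", "B"]
genes = {
    "geneA": [1, 1, 1, 0, 0, 0],  # separates the two types perfectly
    "geneB": [1, 0, 1, 0, 1, 0],  # only weakly informative
    "geneC": [1, 1, 1, 1, 1, 1],  # constant, carries no information
}

scores = {name: information_gain(values, labels) for name, values in genes.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

The top-ranked genes would then be passed to a classifier such as XGBoost or an SVM; that training step is left out of this sketch.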
23
Singh A, Saint-Antoine M. Probing transient memory of cellular states using single-cell lineages. Front Microbiol 2023; 13:1050516. [PMID: 36824587 PMCID: PMC9942930 DOI: 10.3389/fmicb.2022.1050516] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/21/2022] [Accepted: 12/22/2022] [Indexed: 02/10/2023]
Abstract
The inherent stochasticity in gene product levels can drive single cells within an isoclonal population to different phenotypic states. The dynamic nature of this intercellular variation, where individual cells can transition between different states over time, makes it a particularly hard phenomenon to characterize. We reviewed recent progress in leveraging the classical Luria-Delbrück experiment to infer the transient heritability of cellular states. Similar to the original experiment, individual cells were first grown into cell colonies, and then the fraction of cells residing in different states was assayed for each colony. We discuss modeling approaches for capturing dynamic state transitions in a growing cell population and highlight formulas that identify the kinetics of state switching from the extent of colony-to-colony fluctuations. The utility of this method in identifying multi-generational memory of both expression and phenotypic states is illustrated across diverse biological systems, including cancer drug resistance, reactivation of human viruses, and cellular immune responses. In summary, this fluctuation-based methodology provides a powerful approach for elucidating cell-state transitions from a single time point measurement, which is particularly relevant in situations where measurements lead to cell death (as in single-cell RNA-seq or drug treatment) or cause an irreversible change in cell physiology.
Affiliation(s)
- Abhyudai Singh
- Departments of Electrical and Computer Engineering, Biomedical Engineering, Mathematical Sciences University of Delaware, Newark, DE, United States
24
Wang K, Li Z, You ZH, Han P, Nie R. Adversarial dense graph convolutional networks for single-cell classification. Bioinformatics 2023; 39:6994183. [PMID: 36661313 PMCID: PMC9919433 DOI: 10.1093/bioinformatics/btad043] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 10/03/2022] [Revised: 12/30/2022] [Accepted: 01/19/2023] [Indexed: 01/21/2023]
Abstract
MOTIVATION In single-cell transcriptomics applications, effective identification of cell types in multicellular organisms and in-depth study of the relationships between genes has become one of the main goals of bioinformatics research. However, data heterogeneity and random noise pose significant difficulties for scRNA-seq data analysis. RESULTS We have proposed an adversarial dense graph convolutional network architecture for single-cell classification. Specifically, to enhance the representation of higher-order features and the organic combination between features, dense connectivity mechanism and attention-based feature aggregation are introduced for feature learning in convolutional neural networks. To preserve the features of the original data, we use a feature reconstruction module to assist the goal of single-cell classification. In addition, HNNVAT uses virtual adversarial training to improve the generalization and robustness. Experimental results show that our model outperforms the existing classical methods in terms of classification accuracy on benchmark datasets. AVAILABILITY AND IMPLEMENTATION The source code of HNNVAT is available at https://github.com/DisscLab/HNNVAT. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Kangwei Wang
- School of Computer Science and Technology, China University of Mining and Technology, Xuzhou 221116, China
- Zhengwei Li
- School of Computer Science and Technology, China University of Mining and Technology, Xuzhou 221116, China
- Zhu-Hong You
- School of Computer Science, Northwestern Polytechnical University, Xi'an 710072, China
- Pengyong Han
- Central Lab, Changzhi Medical College, Changzhi 046000, China
- Ru Nie
- School of Computer Science and Technology, China University of Mining and Technology, Xuzhou 221116, China
25
Zhou W, Jie Q, Pan T, Shi J, Jiang T, Zhang Y, Ding N, Xu J, Ma Y, Li Y. Single-cell RNA binding protein regulatory network analyses reveal oncogenic HNRNPK-MYC signalling pathway in cancer. Commun Biol 2023; 6:82. [PMID: 36681772 PMCID: PMC9867709 DOI: 10.1038/s42003-023-04457-2] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 08/01/2022] [Accepted: 01/10/2023] [Indexed: 01/22/2023]
Abstract
RNA-binding proteins (RBPs) are key players in gene expression, and perturbations of the RBP-RNA regulatory network have been observed in various cancer types. Here, we propose a computational method, RBPreg, to identify RBP regulators by integrating single-cell RNA-Seq (N = 233,591) and RBP binding data. Pan-cancer analyses suggest that RBP regulators exhibit cancer and cell specificity and that perturbations of the RBP regulatory network are involved in cancer hallmark-related functions. We prioritize an oncogenic RBP, HNRNPK, which is highly expressed in tumors and associated with poor patient prognosis. Functional assays performed in cancer cells reveal that HNRNPK promotes cancer cell proliferation, migration, and invasion in vitro and in vivo. Mechanistic investigations further demonstrate that HNRNPK promotes tumorigenesis and progression by directly binding to MYC and perturbing the MYC targets pathway in lung cancer. Our results provide a valuable resource for characterizing RBP regulatory networks in cancer, yielding potential biomarkers for precision medicine.
Affiliation(s)
- Weiwei Zhou
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, Heilongjiang, 150081, China
- Qiuling Jie
- Hainan Provincial Key Laboratory for Human Reproductive Medicine and Genetic Research, Hainan Clinical Research Center for Thalassemia, Reproductive Medical Center, National Center for International Research "China-Myanmar Joint Research Center for Prevention and Treatment of Regional Major Disease", The First Affiliated Hospital of Hainan Medical University, Hainan Medical University, Haikou, 571199, China
- Tao Pan
- College of Biomedical Information and Engineering, Hainan Women and Children's Medical Center, Hainan Medical University, Haikou, 571199, China
- Jingyi Shi
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, Heilongjiang, 150081, China
- Tiantongfei Jiang
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, Heilongjiang, 150081, China
- Ya Zhang
- College of Biomedical Information and Engineering, Hainan Women and Children's Medical Center, Hainan Medical University, Haikou, 571199, China
- Na Ding
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, Heilongjiang, 150081, China
- Juan Xu
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, Heilongjiang, 150081, China
- Yanlin Ma
- Hainan Provincial Key Laboratory for Human Reproductive Medicine and Genetic Research, Hainan Clinical Research Center for Thalassemia, Reproductive Medical Center, National Center for International Research "China-Myanmar Joint Research Center for Prevention and Treatment of Regional Major Disease", The First Affiliated Hospital of Hainan Medical University, Hainan Medical University, Haikou, 571199, China
- Yongsheng Li
- Hainan Provincial Key Laboratory for Human Reproductive Medicine and Genetic Research, Hainan Clinical Research Center for Thalassemia, Reproductive Medical Center, National Center for International Research "China-Myanmar Joint Research Center for Prevention and Treatment of Regional Major Disease", The First Affiliated Hospital of Hainan Medical University, Hainan Medical University, Haikou, 571199, China
- College of Biomedical Information and Engineering, Hainan Women and Children's Medical Center, Hainan Medical University, Haikou, 571199, China
26
Chen J, Xu H, Tao W, Chen Z, Zhao Y, Han JDJ. Transformer for one stop interpretable cell type annotation. Nat Commun 2023; 14:223. [PMID: 36641532 PMCID: PMC9840170 DOI: 10.1038/s41467-023-35923-4] [Citation(s) in RCA: 73] [Impact Index Per Article: 36.5] [Received: 07/29/2022] [Accepted: 01/09/2023] [Indexed: 01/15/2023]
Abstract
Consistent annotation transfer from reference dataset to query dataset is fundamental to the development and reproducibility of single-cell research. Compared with traditional annotation methods, deep learning based methods are faster and more automated. A series of useful single cell analysis tools based on autoencoder architecture have been developed but these struggle to strike a balance between depth and interpretability. Here, we present TOSICA, a multi-head self-attention deep learning model based on Transformer that enables interpretable cell type annotation using biologically understandable entities, such as pathways or regulons. We show that TOSICA achieves fast and accurate one-stop annotation and batch-insensitive integration while providing biologically interpretable insights for understanding cellular behavior during development and disease progressions. We demonstrate TOSICA's advantages by applying it to scRNA-seq data of tumor-infiltrating immune cells, and CD14+ monocytes in COVID-19 to reveal rare cell types, heterogeneity and dynamic trajectories associated with disease progression and severity.
Affiliation(s)
- Jiawei Chen
- Peking-Tsinghua Center for Life Sciences, Academy for Advanced Interdisciplinary Studies, Center for Quantitative Biology (CQB), Peking University, Beijing, 100871, China
- Hao Xu
- Peking-Tsinghua Center for Life Sciences, Academy for Advanced Interdisciplinary Studies, Center for Quantitative Biology (CQB), Peking University, Beijing, 100871, China
- Wanyu Tao
- Peking-Tsinghua Center for Life Sciences, Academy for Advanced Interdisciplinary Studies, Center for Quantitative Biology (CQB), Peking University, Beijing, 100871, China
- Zhaoxiong Chen
- Peking-Tsinghua Center for Life Sciences, Academy for Advanced Interdisciplinary Studies, Center for Quantitative Biology (CQB), Peking University, Beijing, 100871, China
- Yuxuan Zhao
- Peking-Tsinghua Center for Life Sciences, Academy for Advanced Interdisciplinary Studies, Center for Quantitative Biology (CQB), Peking University, Beijing, 100871, China
- Jing-Dong J Han
- Peking-Tsinghua Center for Life Sciences, Academy for Advanced Interdisciplinary Studies, Center for Quantitative Biology (CQB), Peking University, Beijing, 100871, China
27
Christensen E, Luo P, Turinsky A, Husić M, Mahalanabis A, Naidas A, Diaz-Mejia JJ, Brudno M, Pugh T, Ramani A, Shooshtari P. Evaluation of single-cell RNAseq labelling algorithms using cancer datasets. Brief Bioinform 2022; 24:6965910. [PMID: 36585784 PMCID: PMC9851326 DOI: 10.1093/bib/bbac561] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 06/02/2022] [Revised: 09/19/2022] [Accepted: 11/01/2022] [Indexed: 01/01/2023]
Abstract
Single-cell RNA sequencing (scRNA-seq) clustering and labelling methods are used to determine precise cellular composition of tissue samples. Automated labelling methods rely on either unsupervised, cluster-based approaches or supervised, cell-based approaches to identify cell types. The high complexity of cancer poses a unique challenge, as tumor microenvironments are often composed of diverse cell subpopulations with unique functional effects that may lead to disease progression, metastasis and treatment resistance. Here, we assess 17 cell-based and 9 cluster-based scRNA-seq labelling algorithms using 8 cancer datasets, providing a comprehensive large-scale assessment of such methods in a cancer-specific context. Using several performance metrics, we show that cell-based methods generally achieved higher performance and were faster compared to cluster-based methods. Cluster-based methods more successfully labelled non-malignant cell types, likely because of a lack of gene signatures for relevant malignant cell subpopulations. Larger cell numbers present in some cell types in training data positively impacted prediction scores for cell-based methods. Finally, we examined which methods performed favorably when trained and tested on separate patient cohorts in scenarios similar to clinical applications, and which were able to accurately label particularly small or under-represented cell populations in the given datasets. We conclude that scPred and SVM show the best overall performances with cancer-specific data and provide further suggestions for algorithm selection. Our analysis pipeline for assessing the performance of cell type labelling algorithms is available in https://github.com/shooshtarilab/scRNAseq-Automated-Cell-Type-Labelling.
Affiliation(s)
- Andrei Turinsky
- Centre for Computational Medicine, The Hospital for Sick Children, Toronto, ON, Canada
- Mia Husić
- Centre for Computational Medicine, The Hospital for Sick Children, Toronto, ON, Canada
- Alaina Mahalanabis
- Centre for Computational Medicine, The Hospital for Sick Children, Toronto, ON, Canada
- Alaine Naidas
- Children’s Health Research Institute, Lawson Research Institute, London, ON, Canada
- Department of Pathology and Lab Medicine, University of Western Ontario, London, ON, Canada
- Michael Brudno
- Department of Computer Science, University of Toronto, Toronto, ON, Canada
- Trevor Pugh
- Princess Margaret Cancer Centre, University Health Network, Toronto, ON, Canada
- Ontario Institute for Cancer Research, Toronto, ON, Canada
- Department of Medical Biophysics, University of Toronto, Toronto, ON, Canada
- Arun Ramani
- Centre for Computational Medicine, The Hospital for Sick Children, Toronto, ON, Canada
- Parisa Shooshtari
- Corresponding author: Parisa Shooshtari, Department of Pathology and Lab Medicine, University of Western Ontario, London, ON, Canada. Tel.: +1 (519) 685-8500 x55427. E-mail:
28
Grabski IN, Irizarry RA. A probabilistic gene expression barcode for annotation of cell types from single-cell RNA-seq data. Biostatistics 2022; 23:1150-1164. [PMID: 35770795 PMCID: PMC9802389 DOI: 10.1093/biostatistics/kxac021] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 12/21/2021] [Revised: 05/10/2022] [Accepted: 05/22/2022] [Indexed: 01/07/2023]
Abstract
Single-cell RNA sequencing (scRNA-seq) quantifies gene expression for individual cells in a sample, which allows distinct cell-type populations to be identified and characterized. An important step in many scRNA-seq analysis pipelines is the annotation of cells into known cell types. While this can be achieved using experimental techniques, such as fluorescence-activated cell sorting, these approaches are impractical for large numbers of cells. This motivates the development of data-driven cell-type annotation methods. We find limitations with current approaches due to the reliance on known marker genes or from overfitting because of systematic differences, or batch effects, between studies. Here, we present a statistical approach that leverages public data sets to combine information across thousands of genes, uses a latent variable model to define cell-type-specific barcodes and account for batch effect variation, and probabilistically annotates cell-type identity from a reference of known cell types. The barcoding approach also provides a new way to discover marker genes. Using a range of data sets, including those generated to represent imperfect real-world reference data, we demonstrate that our approach substantially outperforms current reference-based methods, particularly when predicting across studies.
Affiliation(s)
- Isabella N Grabski
- Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, MA, USA
- Rafael A Irizarry
- Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, MA, USA and Department of Data Science, Dana-Farber Cancer Institute, Boston, MA, USA
29
Guo H, Yang Z, Jiang T, Liu S, Wang Y, Cui Z. Evaluation of classification in single cell atac-seq data with machine learning methods. BMC Bioinformatics 2022; 23:249. [PMID: 36131234 PMCID: PMC9494763 DOI: 10.1186/s12859-022-04774-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/07/2022] [Accepted: 06/08/2022] [Indexed: 12/04/2022]
Abstract
BACKGROUND Technological advances in the single-cell Assay for Transposase Accessible Chromatin using sequencing (scATAC-seq) have made it possible to profile thousands of single cells in a relatively easy and economical manner, and are rapidly advancing the understanding of the cellular composition of complex organisms and tissues. The data structure and features of scRNA-seq are similar to those of scATAC-seq; it is therefore appealing to identify and classify cell types in scATAC-seq with traditional supervised machine learning methods, which have proved reliable on scRNA-seq datasets. RESULTS In this study, we evaluated the classification performance of 6 well-known machine learning methods on scATAC-seq. A total of 4 public scATAC-seq datasets, varying in tissue, size and technology, were used to evaluate the performance of the methods. We assessed these methods using a 5-fold cross-validation experiment, called the intra-dataset experiment, based on recall, precision and the percentage of correctly predicted cells. The results show that these methods performed well for some specific cell types in a specific scATAC-seq dataset, while the overall performance is not as good as that in scRNA-seq analysis. In addition, we evaluated the classification performance of these methods by training and predicting on different datasets generated from the same sample, called inter-dataset experiments, which may help us to assess the performance of these methods in more realistic scenarios. CONCLUSIONS In both the intra-dataset and the inter-dataset experiments, SVM and NMC overall outperformed the other methods across all 4 datasets. Thus, we recommend that researchers use SVM and NMC as the underlying classifier when developing an automatic cell-type classification method for scATAC-seq.
Affiliation(s)
- Hongzhe Guo
- Faculty of Computing, Harbin Institute of Technology, Harbin, 150001, China
- Zhongbo Yang
- Faculty of Computing, Harbin Institute of Technology, Harbin, 150001, China
- Tao Jiang
- Faculty of Computing, Harbin Institute of Technology, Harbin, 150001, China
- Shiqi Liu
- Faculty of Computing, Harbin Institute of Technology, Harbin, 150001, China
- Yadong Wang
- Faculty of Computing, Harbin Institute of Technology, Harbin, 150001, China
- Zhe Cui
- Faculty of Computing, Harbin Institute of Technology, Harbin, 150001, China
30
Madadi Y, Sun J, Chen H, Williams R, Yousefi S. Detecting retinal neural and stromal cell classes and ganglion cell subtypes based on transcriptome data with deep transfer learning. Bioinformatics 2022; 38:4321-4329. [PMID: 35876552 PMCID: PMC9991888 DOI: 10.1093/bioinformatics/btac514] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/19/2021] [Revised: 07/11/2022] [Indexed: 12/24/2022]
Abstract
MOTIVATION To develop and assess the accuracy of deep learning models that identify different retinal cell types, as well as different retinal ganglion cell (RGC) subtypes, based on patterns of single-cell RNA sequencing (scRNA-seq) in multiple datasets. RESULTS Deep domain adaptation models were developed and tested using three different datasets. The first dataset included 44 808 single retinal cells from mice (39 cell types) with 24 658 genes, the second dataset included 6225 single RGCs from mice (41 subtypes) with 13 616 genes and the third dataset included 35 699 single RGCs from mice (45 subtypes) with 18 222 genes. We used four loss functions in the learning process to align the source and target distributions, reduce misclassification errors and maximize robustness. Models were evaluated based on classification accuracy and confusion matrix. The accuracy of the model for correctly classifying 39 different retinal cell types in the first dataset was ∼92%. Accuracy in the second and third datasets reached ∼97% and 97% in correctly classifying 40 and 45 different RGCs subtypes, respectively. Across a range of seven different batches in the first dataset, the accuracy of the lead model ranged from 74% to nearly 100%. The lead model provided high accuracy in identifying retinal cell types and RGC subtypes based on scRNA-seq data. The performance was reasonable based on data from different batches as well. The validated model could be readily applied to scRNA-seq data to identify different retinal cell types and subtypes. AVAILABILITY AND IMPLEMENTATION The code and datasets are available on https://github.com/DM2LL/Detecting-Retinal-Cell-Classes-and-Ganglion-Cell-Subtypes. We have also added the class labels of all samples to the datasets. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Yeganeh Madadi
- Department of Ophthalmology, University of Tennessee Health Science Center, Memphis, TN, USA
- University of Tehran, Tehran, Iran
- Jian Sun
- Department of Ophthalmology, University of Tennessee Health Science Center, Memphis, TN, USA
- Hao Chen
- Department of Pharmacology, Addiction Science and Toxicology, University of Tennessee Health Science Center, Memphis, TN, USA
- Robert Williams
- Department of Genetics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
- Siamak Yousefi
- Department of Ophthalmology, University of Tennessee Health Science Center, Memphis, TN, USA
- Department of Genetics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
31
Galdos FX, Xu S, Goodyer WR, Duan L, Huang YV, Lee S, Zhu H, Lee C, Wei N, Lee D, Wu SM. devCellPy is a machine learning-enabled pipeline for automated annotation of complex multilayered single-cell transcriptomic data. Nat Commun 2022; 13:5271. [PMID: 36071107 PMCID: PMC9452519 DOI: 10.1038/s41467-022-33045-x] [Citation(s) in RCA: 16] [Impact Index Per Article: 5.3] [Received: 11/14/2021] [Accepted: 08/31/2022] [Indexed: 11/09/2022]
Abstract
A major informatic challenge in single-cell RNA-sequencing analysis is the precise annotation of datasets where cells exhibit complex multilayered identities or transitory states. Here, we present devCellPy, a highly accurate and precise machine learning-enabled tool that enables automated prediction of cell types across complex annotation hierarchies. To demonstrate the power of devCellPy, we construct a murine cardiac developmental atlas from published datasets encompassing 104,199 cells from E6.5-E16.5 and train devCellPy to generate a cardiac prediction algorithm. Using this algorithm, we observe a high prediction accuracy (>90%) across multiple layers of annotation and across de novo murine developmental data. Furthermore, we conduct a cross-species prediction of cardiomyocyte subtypes from in vitro-derived human induced pluripotent stem cells and unexpectedly uncover a predominance of left ventricular (LV) identity, which we confirmed with an LV-specific TBX5 lineage tracing system. Together, our results show devCellPy to be a useful tool for automated cell prediction across complex cellular hierarchies, species, and experimental systems.
Affiliation(s)
- Francisco X Galdos
- Cardiovascular Institute, Stanford University School of Medicine, Stanford, CA, USA
- Institute for Stem Cell Biology and Regenerative Medicine, Stanford University School of Medicine, Palo Alto, USA
- Sidra Xu
- Cardiovascular Institute, Stanford University School of Medicine, Stanford, CA, USA
- William R Goodyer
- Cardiovascular Institute, Stanford University School of Medicine, Stanford, CA, USA
- Institute for Stem Cell Biology and Regenerative Medicine, Stanford University School of Medicine, Palo Alto, USA
- Division of Pediatric Cardiology, Department of Pediatrics, Stanford University School of Medicine, Palo Alto, USA
- Lauren Duan
- Cardiovascular Institute, Stanford University School of Medicine, Stanford, CA, USA
- Yuhsin V Huang
- Cardiovascular Institute, Stanford University School of Medicine, Stanford, CA, USA
- Soah Lee
- Biopharmaceutical Convergence, School of Pharmacy, Sungkyunkwan University, Suwon, South Korea
- Han Zhu
- Cardiovascular Institute, Stanford University School of Medicine, Stanford, CA, USA
- Division of Cardiovascular Medicine, Department of Medicine, Stanford University School of Medicine, Palo Alto, USA
- Carissa Lee
- Cardiovascular Institute, Stanford University School of Medicine, Stanford, CA, USA
- Nicholas Wei
- Cardiovascular Institute, Stanford University School of Medicine, Stanford, CA, USA
- Daniel Lee
- Cardiovascular Institute, Stanford University School of Medicine, Stanford, CA, USA
- Sean M Wu
- Cardiovascular Institute, Stanford University School of Medicine, Stanford, CA, USA
- Institute for Stem Cell Biology and Regenerative Medicine, Stanford University School of Medicine, Palo Alto, USA
- Division of Cardiovascular Medicine, Department of Medicine, Stanford University School of Medicine, Palo Alto, USA
32
Cao X, Xing L, Majd E, He H, Gu J, Zhang X. A Systematic Evaluation of Supervised Machine Learning Algorithms for Cell Phenotype Classification Using Single-Cell RNA Sequencing Data. Front Genet 2022; 13:836798. [PMID: 35281805 PMCID: PMC8905542 DOI: 10.3389/fgene.2022.836798] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 12/15/2021] [Accepted: 01/18/2022] [Indexed: 11/13/2022]
Abstract
The new technology of single-cell RNA sequencing (scRNA-seq) can yield valuable insights into gene expression and give critical information about the cellular compositions of complex tissues. In recent years, vast numbers of scRNA-seq datasets have been generated and made publicly available, and this has enabled researchers to train supervised machine learning models for predicting or classifying various cell-level phenotypes. This has led to the development of many new methods for analyzing scRNA-seq data. Despite the popularity of such applications, there has as yet been no systematic investigation of the performance of these supervised algorithms using predictors from various sizes of scRNA-seq datasets. In this study, 13 popular supervised machine learning algorithms for cell phenotype classification were evaluated using published real and simulated datasets with diverse cell sizes. This benchmark comprises two parts. In the first, real datasets were used to assess the computing speed and cell phenotype classification performance of popular supervised algorithms. The classification performances were evaluated using the area under the receiver operating characteristic curve, F1-score, Precision, Recall, and false-positive rate. In the second part, we evaluated gene-selection performance using published simulated datasets with a known list of real genes. The results showed that ElasticNet with interactions performed the best for small and medium-sized datasets. The NaiveBayes classifier was found to be another appropriate method for medium-sized datasets. With large datasets, the performance of the XGBoost algorithm was found to be excellent. Ensemble algorithms were not found to be significantly superior to individual machine learning methods. Including interactions in the ElasticNet algorithm caused a significant performance improvement for small datasets. The linear discriminant analysis algorithm was found to be the best choice when speed is critical; it is the fastest method, it can scale to handle large sample sizes, and its performance is not much worse than the top performers.
Affiliation(s)
- Xiaowen Cao
- School of Artificial Intelligence, Hebei University of Technology, Tianjin, China
- Department of Mathematics and Statistics, University of Victoria, Victoria, BC, Canada
- Li Xing
- Department of Mathematics and Statistics, University of Saskatchewan, Saskatoon, SK, Canada
- Elham Majd
- Department of Mathematics and Statistics, University of Victoria, Victoria, BC, Canada
- Hua He
- School of Science, Hebei University of Technology, Tianjin, China
- Junhua Gu
- School of Artificial Intelligence, Hebei University of Technology, Tianjin, China
- Xuekui Zhang
- Department of Mathematics and Statistics, University of Victoria, Victoria, BC, Canada
33
Ianevski A, Giri AK, Aittokallio T. Fully-automated and ultra-fast cell-type identification using specific marker combinations from single-cell transcriptomic data. Nat Commun 2022; 13:1246. [PMID: 35273156 PMCID: PMC8913782 DOI: 10.1038/s41467-022-28803-w] [Citation(s) in RCA: 298] [Impact Index Per Article: 99.3] [Received: 06/10/2021] [Accepted: 02/03/2022] [Indexed: 12/29/2022]
Abstract
Identification of cell populations often relies on manual annotation of cell clusters using established marker genes. However, the selection of marker genes is a time-consuming process that may lead to sub-optimal annotations as the markers must be informative of both the individual cell clusters and various cell types present in the sample. Here, we developed a computational platform, ScType, which enables a fully-automated and ultra-fast cell-type identification based solely on a given scRNA-seq data, along with a comprehensive cell marker database as background information. Using six scRNA-seq datasets from various human and mouse tissues, we show how ScType provides unbiased and accurate cell type annotations by guaranteeing the specificity of positive and negative marker genes across cell clusters and cell types. We also demonstrate how ScType distinguishes between healthy and malignant cell populations, based on single-cell calling of single-nucleotide variants, making it a versatile tool for anticancer applications. The widely applicable method is deployed both as an interactive web-tool (https://sctype.app), and as an open-source R-package. Cell types are typically identified in single cell transcriptomic data by manual annotation of cell clusters using established marker genes. Here the authors present a fully-automated computational platform that can quickly and accurately distinguish between cell types.
Affiliation(s)
- Aleksandr Ianevski
  - Institute for Molecular Medicine Finland (FIMM), HiLIFE, University of Helsinki, Helsinki, Finland
  - Helsinki Institute for Information Technology (HIIT), Aalto University, Helsinki, Finland
- Anil K Giri
  - Institute for Molecular Medicine Finland (FIMM), HiLIFE, University of Helsinki, Helsinki, Finland
- Tero Aittokallio
  - Institute for Molecular Medicine Finland (FIMM), HiLIFE, University of Helsinki, Helsinki, Finland
  - Helsinki Institute for Information Technology (HIIT), Aalto University, Helsinki, Finland
  - Institute for Cancer Research, Department of Cancer Genetics, Oslo University Hospital, Oslo, Norway
  - Centre for Biostatistics and Epidemiology (OCBE), Faculty of Medicine, University of Oslo, Oslo, Norway
|
34
|
Flores M, Liu Z, Zhang T, Hasib MM, Chiu YC, Ye Z, Paniagua K, Jo S, Zhang J, Gao SJ, Jin YF, Chen Y, Huang Y. Deep learning tackles single-cell analysis-a survey of deep learning for scRNA-seq analysis. Brief Bioinform 2022; 23:bbab531. [PMID: 34929734 PMCID: PMC8769926 DOI: 10.1093/bib/bbab531] [Citation(s) in RCA: 29] [Impact Index Per Article: 9.7] [Received: 09/16/2021] [Revised: 11/15/2021] [Accepted: 11/16/2021] [Indexed: 12/17/2022] Open
Abstract
Since its selection as the method of the year in 2013, single-cell technologies have become mature enough to provide answers to complex research questions. With the growth of single-cell profiling technologies, there has also been a significant increase in data collected from single-cell profilings, resulting in computational challenges to process these massive and complicated datasets. To address these challenges, deep learning (DL) is positioned as a competitive alternative for single-cell analyses besides the traditional machine learning approaches. Here, we survey a total of 25 DL algorithms and their applicability for a specific step in the single cell RNA-seq processing pipeline. Specifically, we establish a unified mathematical representation of variational autoencoder, autoencoder, generative adversarial network and supervised DL models, compare the training strategies and loss functions for these models, and relate the loss functions of these models to specific objectives of the data processing step. Such a presentation will allow readers to choose suitable algorithms for their particular objective at each step in the pipeline. We envision that this survey will serve as an important information portal for learning the application of DL for scRNA-seq analysis and inspire innovative uses of DL to address a broader range of new challenges in emerging multi-omics and spatial single-cell sequencing.
Affiliation(s)
- Mario Flores
  - Department of Electrical and Computer Engineering, the University of Texas at San Antonio, San Antonio, TX 78249, USA
- Zhentao Liu
  - Department of Electrical and Computer Engineering, the University of Texas at San Antonio, San Antonio, TX 78249, USA
- Tinghe Zhang
  - Department of Electrical and Computer Engineering, the University of Texas at San Antonio, San Antonio, TX 78249, USA
- Md Musaddaqui Hasib
  - Department of Electrical and Computer Engineering, the University of Texas at San Antonio, San Antonio, TX 78249, USA
- Yu-Chiao Chiu
  - Greehey Children’s Cancer Research Institute, University of Texas Health San Antonio, San Antonio, TX 78229, USA
- Zhenqing Ye
  - Greehey Children’s Cancer Research Institute, University of Texas Health San Antonio, San Antonio, TX 78229, USA
  - Department of Population Health Sciences, University of Texas Health San Antonio, San Antonio, TX 78229, USA
- Karla Paniagua
  - Department of Electrical and Computer Engineering, the University of Texas at San Antonio, San Antonio, TX 78249, USA
- Sumin Jo
  - Department of Electrical and Computer Engineering, the University of Texas at San Antonio, San Antonio, TX 78249, USA
- Jianqiu Zhang
  - Department of Electrical and Computer Engineering, the University of Texas at San Antonio, San Antonio, TX 78249, USA
- Shou-Jiang Gao
  - Department of Microbiology and Molecular Genetics, University of Pittsburgh, Pittsburgh, Pennsylvania, PA 15232, USA
  - UPMC Hillman Cancer Center, University of Pittsburgh, PA 15232, USA
- Yu-Fang Jin
  - Department of Electrical and Computer Engineering, the University of Texas at San Antonio, San Antonio, TX 78249, USA
- Yidong Chen
  - Greehey Children’s Cancer Research Institute, University of Texas Health San Antonio, San Antonio, TX 78229, USA
  - Department of Population Health Sciences, University of Texas Health San Antonio, San Antonio, TX 78229, USA
- Yufei Huang
  - Department of Medicine, School of Medicine, University of Pittsburgh, PA 15232, USA
  - UPMC Hillman Cancer Center, University of Pittsburgh, PA 15232, USA
|
35
|
Lotfollahi M, Naghipourfar M, Luecken MD, Khajavi M, Büttner M, Wagenstetter M, Avsec Ž, Gayoso A, Yosef N, Interlandi M, Rybakov S, Misharin AV, Theis FJ. Mapping single-cell data to reference atlases by transfer learning. Nat Biotechnol 2022; 40:121-130. [PMID: 34462589 PMCID: PMC8763644 DOI: 10.1038/s41587-021-01001-7] [Citation(s) in RCA: 233] [Impact Index Per Article: 77.7] [Received: 07/30/2020] [Accepted: 06/28/2021] [Indexed: 02/07/2023]
Abstract
Large single-cell atlases are now routinely generated to serve as references for analysis of smaller-scale studies. Yet learning from reference data is complicated by batch effects between datasets, limited availability of computational resources and sharing restrictions on raw data. Here we introduce a deep learning strategy for mapping query datasets on top of a reference called single-cell architectural surgery (scArches). scArches uses transfer learning and parameter optimization to enable efficient, decentralized, iterative reference building and contextualization of new datasets with existing references without sharing raw data. Using examples from mouse brain, pancreas, immune and whole-organism atlases, we show that scArches preserves biological state information while removing batch effects, despite using four orders of magnitude fewer parameters than de novo integration. scArches generalizes to multimodal reference mapping, allowing imputation of missing modalities. Finally, scArches retains coronavirus disease 2019 (COVID-19) disease variation when mapping to a healthy reference, enabling the discovery of disease-specific cell states. scArches will facilitate collaborative projects by enabling iterative construction, updating, sharing and efficient use of reference atlases.
Affiliation(s)
- Mohammad Lotfollahi
  - Helmholtz Center Munich-German Research Center for Environmental Health, Institute of Computational Biology, Neuherberg, Germany
  - School of Life Sciences Weihenstephan, Technical University of Munich, Munich, Germany
- Mohsen Naghipourfar
  - Helmholtz Center Munich-German Research Center for Environmental Health, Institute of Computational Biology, Neuherberg, Germany
- Malte D Luecken
  - Helmholtz Center Munich-German Research Center for Environmental Health, Institute of Computational Biology, Neuherberg, Germany
- Matin Khajavi
  - Helmholtz Center Munich-German Research Center for Environmental Health, Institute of Computational Biology, Neuherberg, Germany
- Maren Büttner
  - Helmholtz Center Munich-German Research Center for Environmental Health, Institute of Computational Biology, Neuherberg, Germany
- Marco Wagenstetter
  - Helmholtz Center Munich-German Research Center for Environmental Health, Institute of Computational Biology, Neuherberg, Germany
- Žiga Avsec
  - Department of Computer Science, Technical University of Munich, Munich, Germany
- Adam Gayoso
  - Center for Computational Biology, University of California, Berkeley, Berkeley, CA, USA
- Nir Yosef
  - Center for Computational Biology, University of California, Berkeley, Berkeley, CA, USA
  - Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, Berkeley, CA, USA
  - Chan Zuckerberg Biohub, San Francisco, CA, USA
  - Ragon Institute of MGH, MIT and Harvard, Cambridge, MA, USA
- Marta Interlandi
  - Institute of Medical Informatics, University of Münster, Münster, Germany
- Sergei Rybakov
  - Helmholtz Center Munich-German Research Center for Environmental Health, Institute of Computational Biology, Neuherberg, Germany
  - Department of Mathematics, Technical University of Munich, Munich, Germany
- Alexander V Misharin
  - Division of Pulmonary and Critical Care Medicine, Feinberg School of Medicine, Northwestern University, Chicago, IL, USA
- Fabian J Theis
  - Helmholtz Center Munich-German Research Center for Environmental Health, Institute of Computational Biology, Neuherberg, Germany
  - School of Life Sciences Weihenstephan, Technical University of Munich, Munich, Germany
  - Department of Mathematics, Technical University of Munich, Munich, Germany
|
37
|
Kim H, Lee J, Kang K, Yoon S. MarkerCount: A stable, count-based cell type identifier for single-cell RNA-seq experiments. Comput Struct Biotechnol J 2022; 20:3120-3132. [PMID: 35782735 PMCID: PMC9233224 DOI: 10.1016/j.csbj.2022.06.010] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 02/14/2022] [Revised: 06/05/2022] [Accepted: 06/05/2022] [Indexed: 11/21/2022] Open
Abstract
Cell type identification is a key step toward downstream analysis of single cell RNA-seq experiments. Although the primary objective is to identify known cell populations, good identifiers should also recognize unknown clusters which may represent a previously unidentified subpopulation of a known cell type or tumor cells of an unknown phenotype. Herein, we present MarkerCount, which utilizes the number of expressed markers, regardless of their expression level. MarkerCount works in both reference- and marker-based mode, where the latter utilizes existing lists of markers, while the former uses a pre-annotated dataset to find markers to be used for cell type identification. In both modes, MarkerCount first utilizes the “marker count” to identify cell populations and, after rejecting uncertain cells, reassigns cell type and/or makes corrections in cluster-basis. The performance of MarkerCount was evaluated and compared with existing identifiers, both marker- and reference-based, that can be customized using publicly available datasets and marker databases. The results show that MarkerCount performs better in the identification of known populations as well as of unknown ones, when compared to other reference- and marker-based cell type identifiers for most of the datasets analyzed.
|
38
|
Smolander J, Junttila S, Venäläinen MS, Elo LL. scShaper: an ensemble method for fast and accurate linear trajectory inference from single-cell RNA-seq data. Bioinformatics 2021; 38:1328-1335. [PMID: 34888622 PMCID: PMC8825760 DOI: 10.1093/bioinformatics/btab831] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 05/04/2021] [Revised: 11/30/2021] [Accepted: 12/03/2021] [Indexed: 01/05/2023] Open
Abstract
MOTIVATION: Computational models are needed to infer a representation of the cells, i.e. a trajectory, from single-cell RNA-sequencing data that model cell differentiation during a dynamic process. Although many trajectory inference methods exist, their performance varies greatly depending on the dataset and hence there is a need to establish more accurate, better generalizable methods. RESULTS: We introduce scShaper, a new trajectory inference method that enables accurate linear trajectory inference. The ensemble approach of scShaper generates a continuous smooth pseudotime based on a set of discrete pseudotimes. We demonstrate that scShaper is able to infer accurate trajectories for a variety of trigonometric trajectories, including many for which the commonly used principal curves method fails. A comprehensive benchmarking with state-of-the-art methods revealed that scShaper achieved superior accuracy of the cell ordering and, in particular, the differentially expressed genes. Moreover, scShaper is a fast method with few hyperparameters, making it a promising alternative to the principal curves method for linear pseudotemporal ordering. AVAILABILITY AND IMPLEMENTATION: scShaper is available as an R package at https://github.com/elolab/scshaper. The test data are available at https://doi.org/10.5281/zenodo.5734488. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Johannes Smolander
  - Turku Bioscience Centre, University of Turku and Åbo Akademi University, Tykistökatu 6, 20520 Turku, Finland
- Sini Junttila
  - Turku Bioscience Centre, University of Turku and Åbo Akademi University, Tykistökatu 6, 20520 Turku, Finland
- Mikko S Venäläinen
  - Turku Bioscience Centre, University of Turku and Åbo Akademi University, Tykistökatu 6, 20520 Turku, Finland
- Laura L Elo (to whom correspondence should be addressed)
  - Turku Bioscience Centre, University of Turku and Åbo Akademi University, Tykistökatu 6, 20520 Turku, Finland
  - Institute of Biomedicine, University of Turku, 20520 Turku, Finland
|
39
|
Xie B, Jiang Q, Mora A, Li X. Automatic cell type identification methods for single-cell RNA sequencing. Comput Struct Biotechnol J 2021; 19:5874-5887. [PMID: 34815832 PMCID: PMC8572862 DOI: 10.1016/j.csbj.2021.10.027] [Citation(s) in RCA: 32] [Impact Index Per Article: 8.0] [Received: 04/27/2021] [Revised: 09/23/2021] [Accepted: 10/18/2021] [Indexed: 11/24/2022] Open
Abstract
Single-cell RNA sequencing (scRNA-seq) has become a powerful tool for scientists of many research disciplines due to its ability to elucidate the heterogeneous and complex cell-type compositions of different tissues and cell populations. Traditional cell-type identification methods for scRNA-seq data analysis are time-consuming and knowledge-dependent for manual annotation. By contrast, automatic cell-type identification methods may have the advantages of being fast, accurate, and more user friendly. Here, we discuss and evaluate thirty-two published automatic methods for scRNA-seq data analysis in terms of their prediction accuracy, F1-score, unlabeling rate and running time. We highlight the advantages and disadvantages of these methods and provide recommendations of method choice depending on the available information. The challenges and future applications of these automatic methods are further discussed. In addition, we provide a free scRNA-seq data analysis package encompassing the discussed automatic methods to help the easy usage of them in real-world applications.
Affiliation(s)
- Bingbing Xie
  - State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-Sen University, Guangzhou 510060, Guangdong, China
- Qin Jiang
  - Affiliated Eye Hospital of Nanjing Medical University, Nanjing, China
- Antonio Mora
  - Joint School of Life Sciences, Guangzhou Medical University and Guangzhou Institutes of Biomedicine and Health (Chinese Academy of Sciences), Xinzao, Panyu District, Guangzhou 511436, Guangdong, China
- Xuri Li
  - State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-Sen University, Guangzhou 510060, Guangdong, China
|
40
|
Cortal A, Martignetti L, Six E, Rausell A. Gene signature extraction and cell identity recognition at the single-cell level with Cell-ID. Nat Biotechnol 2021; 39:1095-1102. [PMID: 33927417 DOI: 10.1038/s41587-021-00896-6] [Citation(s) in RCA: 82] [Impact Index Per Article: 20.5] [Received: 08/05/2020] [Accepted: 03/15/2021] [Indexed: 02/08/2023]
Abstract
Because of the stochasticity associated with high-throughput single-cell sequencing, current methods for exploring cell-type diversity rely on clustering-based computational approaches in which heterogeneity is characterized at cell subpopulation rather than at full single-cell resolution. Here we present Cell-ID, a clustering-free multivariate statistical method for the robust extraction of per-cell gene signatures from single-cell sequencing data. We applied Cell-ID to data from multiple human and mouse samples, including blood cells, pancreatic islets and airway, intestinal and olfactory epithelium, as well as to comprehensive mouse cell atlas datasets. We demonstrate that Cell-ID signatures are reproducible across different donors, tissues of origin, species and single-cell omics technologies, and can be used for automatic cell-type annotation and cell matching across datasets. Cell-ID improves biological interpretation at individual cell level, enabling discovery of previously uncharacterized rare cell types or cell states. Cell-ID is distributed as an open-source R software package.
Affiliation(s)
- Akira Cortal
  - Clinical Bioinformatics Laboratory, Université de Paris, INSERM UMR1163, Imagine Institute, Paris, France
- Loredana Martignetti
  - Clinical Bioinformatics Laboratory, Université de Paris, INSERM UMR1163, Imagine Institute, Paris, France
- Emmanuelle Six
  - Laboratory of Human Lymphohematopoiesis, Université de Paris, INSERM UMR1163, Imagine Institute, Paris, France
- Antonio Rausell
  - Clinical Bioinformatics Laboratory, Université de Paris, INSERM UMR1163, Imagine Institute, Paris, France
  - Molecular Genetics Service, AP-HP, Necker Hospital for Sick Children, Paris, France
|
41
|
Wang T, Bai J, Nabavi S. Single-cell classification using graph convolutional networks. BMC Bioinformatics 2021; 22:364. [PMID: 34238220 PMCID: PMC8268184 DOI: 10.1186/s12859-021-04278-2] [Citation(s) in RCA: 17] [Impact Index Per Article: 4.3] [Received: 12/24/2020] [Accepted: 06/24/2021] [Indexed: 12/12/2022] Open
Abstract
BACKGROUND: Analyzing single-cell RNA sequencing (scRNAseq) data plays an important role in understanding the intrinsic and extrinsic cellular processes in biological and biomedical research. One significant effort in this area is the identification of cell types. With the availability of a huge amount of single cell sequencing data and discovering more and more cell types, classifying cells into known cell types has become a priority nowadays. Several methods have been introduced to classify cells utilizing gene expression data. However, incorporating biological gene interaction networks has been proved valuable in cell classification procedures. RESULTS: In this study, we propose a multimodal end-to-end deep learning model, named sigGCN, for cell classification that combines a graph convolutional network (GCN) and a neural network to exploit gene interaction networks. We used standard classification metrics to evaluate the performance of the proposed method on the within-dataset classification and the cross-dataset classification. We compared the performance of the proposed method with those of the existing cell classification tools and traditional machine learning classification methods. CONCLUSIONS: Results indicate that the proposed method outperforms other commonly used methods in terms of classification accuracy and F1 scores. This study shows that the integration of prior knowledge about gene interactions with gene expressions using GCN methodologies can extract effective features improving the performance of cell classification.
Affiliation(s)
- Tianyu Wang
  - Computer Science and Engineering Department, University of Connecticut, Storrs, CT USA
- Jun Bai
  - Computer Science and Engineering Department, University of Connecticut, Storrs, CT USA
- Sheida Nabavi
  - Computer Science and Engineering Department, University of Connecticut, Storrs, CT USA
|
42
|
Computational principles and challenges in single-cell data integration. Nat Biotechnol 2021; 39:1202-1215. [PMID: 33941931 DOI: 10.1038/s41587-021-00895-7] [Citation(s) in RCA: 210] [Impact Index Per Article: 52.5] [Received: 11/06/2020] [Accepted: 03/16/2021] [Indexed: 02/07/2023]
Abstract
The development of single-cell multimodal assays provides a powerful tool for investigating multiple dimensions of cellular heterogeneity, enabling new insights into development, tissue homeostasis and disease. A key challenge in the analysis of single-cell multimodal data is to devise appropriate strategies for tying together data across different modalities. The term 'data integration' has been used to describe this task, encompassing a broad collection of approaches ranging from batch correction of individual omics datasets to association of chromatin accessibility and genetic variation with transcription. Although existing integration strategies exploit similar mathematical ideas, they typically have distinct goals and rely on different principles and assumptions. Consequently, new definitions and concepts are needed to contextualize existing methods and to enable development of new methods.
|
43
|
Jolly MK, Murphy RJ, Bhatia S, Whitfield HJ, Redfern A, Davis MJ, Thompson EW. Measuring and Modelling the Epithelial-Mesenchymal Hybrid State in Cancer: Clinical Implications. Cells Tissues Organs 2021; 211:110-133. [PMID: 33902034 DOI: 10.1159/000515289] [Citation(s) in RCA: 28] [Impact Index Per Article: 7.0] [Received: 08/04/2020] [Accepted: 01/25/2021] [Indexed: 11/19/2022] Open
Abstract
The epithelial-mesenchymal (E/M) hybrid state has emerged as an important mediator of elements of cancer progression, facilitated by epithelial mesenchymal plasticity (EMP). We review here evidence for the presence, prognostic significance, and therapeutic potential of the E/M hybrid state in carcinoma. We further assess modelling predictions and validation studies to demonstrate stabilised E/M hybrid states along the spectrum of EMP, as well as computational approaches for characterising and quantifying EMP phenotypes, with particular attention to the emerging realm of single-cell approaches through RNA sequencing and protein-based techniques.
Affiliation(s)
- Mohit Kumar Jolly
  - Centre for BioSystems Science and Engineering, Indian Institute of Science, Bangalore, India
- Ryan J Murphy
  - Queensland University of Technology, School of Mathematical Sciences, Brisbane, Queensland, Australia
- Sugandha Bhatia
  - Queensland University of Technology, Institute of Health and Biomedical Innovation and School of Biomedical Sciences, Brisbane, Queensland, Australia
  - Queensland University of Technology, Translational Research Institute, Brisbane, Queensland, Australia
  - The University of Queensland Diamantina Institute, The University of Queensland, Brisbane, Queensland, Australia
- Holly J Whitfield
  - Bioinformatics Division, Walter and Eliza Hall Institute of Medical Research, Parkville, Victoria, Australia
  - Department of Medical Biology, University of Melbourne, Parkville, Victoria, Australia
- Andrew Redfern
  - Department of Medicine, School of Medicine, University of Western Australia, Fiona Stanley Hospital Campus, Perth, Western Australia, Australia
- Melissa J Davis
  - Bioinformatics Division, Walter and Eliza Hall Institute of Medical Research, Parkville, Victoria, Australia
  - Department of Medical Biology, University of Melbourne, Parkville, Victoria, Australia
  - Department of Clinical Pathology, Faculty of Medicine, Dentistry and Health Sciences, University of Melbourne, Parkville, Victoria, Australia
- Erik W Thompson
  - Queensland University of Technology, Institute of Health and Biomedical Innovation and School of Biomedical Sciences, Brisbane, Queensland, Australia
  - Queensland University of Technology, Translational Research Institute, Brisbane, Queensland, Australia
|
44
|
Cui Z, Cui Y, Gao Y, Jiang T, Zang T, Wang Y. Enhancement and Imputation of Peak Signal Enables Accurate Cell-Type Classification in scATAC-seq. Front Genet 2021; 12:658352. [PMID: 33889181 PMCID: PMC8056015 DOI: 10.3389/fgene.2021.658352] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 01/25/2021] [Accepted: 02/22/2021] [Indexed: 11/16/2022] Open
Abstract
Single-cell Assay Transposase Accessible Chromatin sequencing (scATAC-seq) has been widely used in profiling genome-wide chromatin accessibility in thousands of individual cells. However, compared with single-cell RNA-seq, the peaks of scATAC-seq are much sparser due to the lower copy numbers (diploid in humans) and the inherent missing signals, which makes it more challenging to classify cell type based on specific expressed gene or other canonical markers. Here, we present svmATAC, a support vector machine (SVM)-based method for accurately identifying cell types in scATAC-seq datasets by enhancing peak signal strength and imputing signals through patterns of co-accessibility. We applied svmATAC to several scATAC-seq data from human immune cells, human hematopoietic system cells, and peripheral blood mononuclear cells. The benchmark results showed that svmATAC is free of literature-based markers and robust across datasets in different libraries and platforms. The source code of svmATAC is available at https://github.com/mrcuizhe/svmATAC under the MIT license.
Affiliation(s)
- Zhe Cui
  - Centre for Bioinformatics, School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
- Ya Cui
  - College of Life Science, University of Chinese Academy of Sciences, Beijing, China
- Yan Gao
  - Centre for Bioinformatics, School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
- Tao Jiang
  - Centre for Bioinformatics, School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
- Tianyi Zang
  - Centre for Bioinformatics, School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
- Yadong Wang
  - Centre for Bioinformatics, School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
|
45
|
Maseda F, Cang Z, Nie Q. DEEPsc: A Deep Learning-Based Map Connecting Single-Cell Transcriptomics and Spatial Imaging Data. Front Genet 2021; 12:636743. [PMID: 33833776 PMCID: PMC8021700 DOI: 10.3389/fgene.2021.636743] [Citation(s) in RCA: 24] [Impact Index Per Article: 6.0] [Received: 12/02/2020] [Accepted: 02/23/2021] [Indexed: 11/13/2022] Open
Abstract
Single-cell RNA sequencing (scRNA-seq) data provides unprecedented information on cell fate decisions; however, the spatial arrangement of cells is often lost. Several recent computational methods have been developed to impute spatial information onto a scRNA-seq dataset through analyzing known spatial expression patterns of a small subset of genes known as a reference atlas. However, there is a lack of comprehensive analysis of the accuracy, precision, and robustness of the mappings, along with the generalizability of these methods, which are often designed for specific systems. We present a system-adaptive deep learning-based method (DEEPsc) to impute spatial information onto a scRNA-seq dataset from a given spatial reference atlas. By introducing a comprehensive set of metrics that evaluate the spatial mapping methods, we compare DEEPsc with four existing methods on four biological systems. We find that while DEEPsc has comparable accuracy to other methods, an improved balance between precision and robustness is achieved. DEEPsc provides a data-adaptive tool to connect scRNA-seq datasets and spatial imaging datasets to analyze cell fate decisions. Our implementation with a uniform API can serve as a portal with access to all the methods investigated in this work for spatial exploration of cell fate decisions in scRNA-seq data. All methods evaluated in this work are implemented as an open-source software with a uniform interface.
Affiliation(s)
- Floyd Maseda
  - Department of Mathematics, University of California, Irvine, Irvine, CA, United States
- Zixuan Cang
  - Department of Mathematics, University of California, Irvine, Irvine, CA, United States
  - The NSF-Simons Center for Multiscale Cell Fate Research, University of California, Irvine, Irvine, CA, United States
- Qing Nie
  - Department of Mathematics, University of California, Irvine, Irvine, CA, United States
  - The NSF-Simons Center for Multiscale Cell Fate Research, University of California, Irvine, Irvine, CA, United States
  - Department of Developmental and Cell Biology, University of California, Irvine, Irvine, CA, United States
46
|
Hérault L, Poplineau M, Mazuel A, Platet N, Remy É, Duprez E. Single-cell RNA-seq reveals a concomitant delay in differentiation and cell cycle of aged hematopoietic stem cells. BMC Biol 2021; 19:19. [PMID: 33526011 PMCID: PMC7851934 DOI: 10.1186/s12915-021-00955-z] [Citation(s) in RCA: 31] [Impact Index Per Article: 7.8] [Received: 09/23/2020] [Accepted: 01/08/2021] [Indexed: 12/16/2022] Open
Abstract
BACKGROUND: Hematopoietic stem cells (HSCs) are the guarantors of the proper functioning of hematopoiesis owing to their incredible diversity of potential. During aging, the heterogeneity of HSCs changes, contributing to the deterioration of the immune system. In this study, we revisited the mouse HSC compartment and its transcriptional plasticity during aging at single-cell resolution. RESULTS: Through the analysis of 15,000 young and aged transcriptomes, we identified 15 groups of HSCs, revealing rare and new specific HSC abilities that change with age. The implantation of new trajectories, complemented with the analysis of transcription factor activities, pointed to consecutive states of HSC differentiation that were delayed by aging and explained the bias in differentiation of older HSCs. Moreover, reassigning cell cycle phases for each HSC clearly highlighted an imbalance of the cell cycle regulators in very immature aged HSCs that may contribute to their accumulation in an undifferentiated state. CONCLUSIONS: Our results establish a new reference map of HSC differentiation in young and aged mice and reveal a potential mechanism that delays the differentiation of aged HSCs and could promote the emergence of age-related hematologic diseases.
Affiliation(s)
- Léonard Hérault
- Epigenetic Factors in Normal and Malignant Hematopoiesis Team, Aix Marseille Université, CNRS, INSERM, Institut Paoli-Calmettes, CRCM, Marseille, France
- Aix Marseille Université, CNRS, Centrale Marseille, I2M, Marseille, France
- Mathilde Poplineau
- Epigenetic Factors in Normal and Malignant Hematopoiesis Team, Aix Marseille Université, CNRS, INSERM, Institut Paoli-Calmettes, CRCM, Marseille, France
- Adrien Mazuel
- Epigenetic Factors in Normal and Malignant Hematopoiesis Team, Aix Marseille Université, CNRS, INSERM, Institut Paoli-Calmettes, CRCM, Marseille, France
- Nadine Platet
- Epigenetic Factors in Normal and Malignant Hematopoiesis Team, Aix Marseille Université, CNRS, INSERM, Institut Paoli-Calmettes, CRCM, Marseille, France
- Élisabeth Remy
- Aix Marseille Université, CNRS, Centrale Marseille, I2M, Marseille, France
- Estelle Duprez
- Epigenetic Factors in Normal and Malignant Hematopoiesis Team, Aix Marseille Université, CNRS, INSERM, Institut Paoli-Calmettes, CRCM, Marseille, France
47
Ge S, Wang H, Alavi A, Xing E, Bar-Joseph Z. Supervised Adversarial Alignment of Single-Cell RNA-seq Data. J Comput Biol 2021; 28:501-513. [PMID: 33470876 DOI: 10.1089/cmb.2020.0439] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Indexed: 02/06/2023]
Abstract
Dimensionality reduction is an important first step in the analysis of single-cell RNA-sequencing (scRNA-seq) data. In addition to enabling the visualization of the profiled cells, such representations are used by many downstream analysis methods ranging from pseudo-time reconstruction to clustering to alignment of scRNA-seq data from different experiments, platforms, and laboratories. Both supervised and unsupervised methods have been proposed to reduce the dimension of scRNA-seq data. However, all methods to date are sensitive to batch effects. When batches correlate with cell types, as is often the case, their impact can lead to representations that are batch-specific rather than cell-type-specific. To overcome this, we developed a domain adversarial neural network model for learning a reduced dimension representation of scRNA-seq data. The adversarial model tries to simultaneously optimize two objectives. The first is the accuracy of cell-type assignment and the second is the inability to distinguish the batch (domain). We tested the method by using the resulting representation to align several different data sets. As we show, by overcoming batch effects our method was able to correctly separate cell types, improving on several prior methods suggested for this task. Analysis of the top features used by the network indicates that by taking the batch impact into account, the reduced representation is much better able to focus on key genes for each cell type.
Affiliation(s)
- Songwei Ge
- Computational Biology Department, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
- Haohan Wang
- Language Technologies Institute, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
- Amir Alavi
- Computational Biology Department, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
- Eric Xing
- Language Technologies Institute, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
- Machine Learning Department, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
- Ziv Bar-Joseph
- Computational Biology Department, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
- Machine Learning Department, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
48
Pasquini G, Rojo Arias JE, Schäfer P, Busskamp V. Automated methods for cell type annotation on scRNA-seq data. Comput Struct Biotechnol J 2021; 19:961-969. [PMID: 33613863 PMCID: PMC7873570 DOI: 10.1016/j.csbj.2021.01.015] [Citation(s) in RCA: 112] [Impact Index Per Article: 28.0] [Received: 10/30/2020] [Revised: 01/13/2021] [Accepted: 01/13/2021] [Indexed: 12/22/2022]
Abstract
The advent of single-cell sequencing started a new era of transcriptomic and genomic research, advancing our knowledge of cellular heterogeneity and dynamics. Cell type annotation is a crucial step in analyzing single-cell RNA sequencing data, yet manual annotation is time-consuming and partially subjective. As an alternative, tools have been developed for automatic cell type identification. Different strategies have emerged to associate the gene expression profiles of single cells with a cell type, either by using curated marker gene databases, correlating reference expression data, or transferring labels by supervised classification. In this review, we present an overview of the available tools and the underlying approaches to perform automated cell type annotations on scRNA-seq data.
Affiliation(s)
- Giovanni Pasquini
- Technische Universität Dresden, Center for Molecular and Cellular Bioengineering (CMCB), Center for Regenerative Therapies Dresden (CRTD), Dresden 01307, Germany
- Universitäts-Augenklinik Bonn, University of Bonn, Department of Ophthalmology, Bonn 53127, Germany
- Jesus Eduardo Rojo Arias
- Wellcome-MRC Cambridge Stem Cell Institute, Jeffrey Cheah Biomedical Centre, Cambridge Biomedical Campus, University of Cambridge, Cambridge, UK
- Patrick Schäfer
- Technische Universität Dresden, Center for Molecular and Cellular Bioengineering (CMCB), Center for Regenerative Therapies Dresden (CRTD), Dresden 01307, Germany
- Volker Busskamp
- Technische Universität Dresden, Center for Molecular and Cellular Bioengineering (CMCB), Center for Regenerative Therapies Dresden (CRTD), Dresden 01307, Germany
- Universitäts-Augenklinik Bonn, University of Bonn, Department of Ophthalmology, Bonn 53127, Germany
49
Forcato M, Romano O, Bicciato S. Computational methods for the integrative analysis of single-cell data. Brief Bioinform 2021; 22:20-29. [PMID: 32363378 PMCID: PMC7820847 DOI: 10.1093/bib/bbaa042] [Citation(s) in RCA: 28] [Impact Index Per Article: 7.0] [Received: 12/12/2019] [Revised: 03/05/2020] [Accepted: 01/03/2020] [Indexed: 01/05/2023]
Abstract
Recent advances in single-cell technologies are providing exciting opportunities for dissecting tissue heterogeneity and investigating cell identity, fate and function. This is a pristine, exploding field that is flooding biologists with a new wave of data, each with its own specificities in terms of complexity and information content. The integrative analysis of genomic data, collected at different molecular layers from diverse cell populations, holds promise to address the full-scale complexity of biological systems. However, the combination of different single-cell genomic signals is computationally challenging, as these data are intrinsically heterogeneous for experimental, technical and biological reasons. Here, we describe the computational methods for the integrative analysis of single-cell genomic data, with a focus on the integration of single-cell RNA sequencing datasets and on the joint analysis of multimodal signals from individual cells.
Affiliation(s)
- Mattia Forcato
- Molecular Biology and Bioinformatics at the University of Modena and Reggio Emilia. His research interests include the development and application of bioinformatics methods for the analysis of next-generation sequencing data
- Oriana Romano
- Molecular Biology and Bioinformatics at the University of Modena and Reggio Emilia. Her research activities are mainly focused on the integrative analysis of transcriptional and epigenomic bulk and single-cell data
- Silvio Bicciato
- Industrial Bioengineering at the University of Modena and Reggio Emilia. His research activity is the development and application of computational approaches for the analysis of multi-omics data
50
Bernstein MN, Ma Z, Gleicher M, Dewey CN. CellO: comprehensive and hierarchical cell type classification of human cells with the Cell Ontology. iScience 2020; 24:101913. [PMID: 33364592 PMCID: PMC7753962 DOI: 10.1016/j.isci.2020.101913] [Citation(s) in RCA: 23] [Impact Index Per Article: 4.6] [Received: 07/30/2020] [Revised: 10/28/2020] [Accepted: 12/02/2020] [Indexed: 12/15/2022]
Abstract
Cell type annotation is a fundamental task in the analysis of single-cell RNA-sequencing data. In this work, we present CellO, a machine learning-based tool for annotating human RNA-seq data with the Cell Ontology. CellO enables accurate and standardized cell type classification of cell clusters by considering the rich hierarchical structure of known cell types. Furthermore, CellO comes pre-trained on a comprehensive data set of human, healthy, untreated primary samples in the Sequence Read Archive. CellO's comprehensive training set enables it to run out of the box on diverse cell types and achieves competitive or even superior performance when compared to existing state-of-the-art methods. Lastly, CellO's linear models are easily interpreted, thereby enabling exploration of cell-type-specific expression signatures across the ontology. To this end, we also present the CellO Viewer: a web application for exploring CellO's models across the ontology.
Affiliation(s)
- Zhongjie Ma
- Department of Computer Sciences, University of Wisconsin - Madison, Madison, WI 53706, USA
- Michael Gleicher
- Department of Computer Sciences, University of Wisconsin - Madison, Madison, WI 53706, USA
- Colin N Dewey
- Department of Computer Sciences, University of Wisconsin - Madison, Madison, WI 53706, USA
- Department of Biostatistics and Medical Informatics, University of Wisconsin - Madison, Madison, WI 53792, USA