1
Gonzalez-Ferrer J, Lehrer J, O'Farrell A, Paten B, Teodorescu M, Haussler D, Jonsson VD, Mostajo-Radji MA. SIMS: A deep-learning label transfer tool for single-cell RNA sequencing analysis. Cell Genomics 2024; 4:100581. [PMID: 38823397 PMCID: PMC11228957 DOI: 10.1016/j.xgen.2024.100581] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/17/2023] [Revised: 04/02/2024] [Accepted: 05/09/2024] [Indexed: 06/03/2024]
Abstract
Cell atlases serve as vital references for automating cell labeling in new samples, yet existing classification algorithms struggle with accuracy. Here we introduce SIMS (scalable, interpretable machine learning for single cell), a low-code data-efficient pipeline for single-cell RNA classification. We benchmark SIMS against datasets from different tissues and species. We demonstrate SIMS's efficacy in classifying cells in the brain, achieving high accuracy even with small training sets (<3,500 cells) and across different samples. SIMS accurately predicts neuronal subtypes in the developing brain, shedding light on genetic changes during neuronal differentiation and postmitotic fate refinement. Finally, we apply SIMS to single-cell RNA datasets of cortical organoids to predict cell identities and uncover genetic variations between cell lines. SIMS identifies cell-line differences and misannotated cell lineages in human cortical organoids derived from different pluripotent stem cell lines. Altogether, we show that SIMS is a versatile and robust tool for cell-type classification from single-cell datasets.
Affiliation(s)
- Jesus Gonzalez-Ferrer
  - Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA 95060, USA; Live Cell Biotechnology Discovery Lab, University of California, Santa Cruz, Santa Cruz, CA 95060, USA; Department of Biomolecular Engineering, University of California, Santa Cruz, Santa Cruz, CA 95060, USA
- Julian Lehrer
  - Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA 95060, USA; Live Cell Biotechnology Discovery Lab, University of California, Santa Cruz, Santa Cruz, CA 95060, USA; Department of Applied Mathematics, University of California, Santa Cruz, Santa Cruz, CA 95060, USA
- Ash O'Farrell
  - Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA 95060, USA
- Benedict Paten
  - Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA 95060, USA; Department of Biomolecular Engineering, University of California, Santa Cruz, Santa Cruz, CA 95060, USA
- Mircea Teodorescu
  - Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA 95060, USA; Department of Biomolecular Engineering, University of California, Santa Cruz, Santa Cruz, CA 95060, USA; Department of Electrical and Computer Engineering, University of California, Santa Cruz, Santa Cruz, CA 95060, USA
- David Haussler
  - Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA 95060, USA; Department of Biomolecular Engineering, University of California, Santa Cruz, Santa Cruz, CA 95060, USA
- Vanessa D Jonsson
  - Department of Biomolecular Engineering, University of California, Santa Cruz, Santa Cruz, CA 95060, USA; Department of Applied Mathematics, University of California, Santa Cruz, Santa Cruz, CA 95060, USA
- Mohammed A Mostajo-Radji
  - Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA 95060, USA; Live Cell Biotechnology Discovery Lab, University of California, Santa Cruz, Santa Cruz, CA 95060, USA
2
Gonzalez-Ferrer J, Lehrer J, O’Farrell A, Paten B, Teodorescu M, Haussler D, Jonsson VD, Mostajo-Radji MA. Unraveling Neuronal Identities Using SIMS: A Deep Learning Label Transfer Tool for Single-Cell RNA Sequencing Analysis. bioRxiv 2023:2023.02.28.529615. [PMID: 36909548 PMCID: PMC10002667 DOI: 10.1101/2023.02.28.529615] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/06/2023]
Abstract
Large single-cell RNA datasets have contributed to unprecedented biological insight. Often, these take the form of cell atlases and serve as a reference for automating cell labeling of newly sequenced samples. Yet, classification algorithms have lacked the capacity to accurately annotate cells, particularly in complex datasets. Here we present SIMS (Scalable, Interpretable Machine Learning for Single-Cell), an end-to-end data-efficient machine learning pipeline for discrete classification of single-cell data that can be applied to new datasets with minimal coding. We benchmarked SIMS against common single-cell label transfer tools and demonstrated that it performs as well as or better than state-of-the-art algorithms. We then use SIMS to classify cells in one of the most complex tissues: the brain. We show that SIMS classifies cells of the adult cerebral cortex and hippocampus at a remarkably high accuracy. This accuracy is maintained in trans-sample label transfers of the adult human cerebral cortex. We then apply SIMS to classify cells in the developing brain and demonstrate a high level of accuracy at predicting neuronal subtypes, even in periods of fate refinement, shedding light on genetic changes affecting specific cell types across development. Finally, we apply SIMS to single-cell datasets of cortical organoids to predict cell identities and unveil genetic variations between cell lines. SIMS identifies cell-line differences and misannotated cell lineages in human cortical organoids derived from different pluripotent stem cell lines. When cell types are obscured by stress signals, label transfer from primary tissue improves the accuracy of cortical organoid annotations, serving as a reliable ground truth. Altogether, we show that SIMS is a versatile and robust tool for cell-type classification from single-cell datasets.
Affiliation(s)
- Jesus Gonzalez-Ferrer
  - These authors contributed equally to this work
  - Genomics Institute, University of California Santa Cruz, Santa Cruz, 95060, CA, USA
  - Live Cell Biotechnology Discovery Lab, University of California Santa Cruz, Santa Cruz, 95060, CA, USA
  - Department of Biomolecular Engineering, University of California Santa Cruz, Santa Cruz, 95060, CA, USA
- Julian Lehrer
  - These authors contributed equally to this work
  - Genomics Institute, University of California Santa Cruz, Santa Cruz, 95060, CA, USA
  - Live Cell Biotechnology Discovery Lab, University of California Santa Cruz, Santa Cruz, 95060, CA, USA
  - Department of Applied Mathematics, University of California Santa Cruz, Santa Cruz, 95060, CA, USA
- Ash O’Farrell
  - Genomics Institute, University of California Santa Cruz, Santa Cruz, 95060, CA, USA
- Benedict Paten
  - Genomics Institute, University of California Santa Cruz, Santa Cruz, 95060, CA, USA
  - Department of Biomolecular Engineering, University of California Santa Cruz, Santa Cruz, 95060, CA, USA
- Mircea Teodorescu
  - Genomics Institute, University of California Santa Cruz, Santa Cruz, 95060, CA, USA
  - Department of Biomolecular Engineering, University of California Santa Cruz, Santa Cruz, 95060, CA, USA
  - Department of Electrical and Computer Engineering, University of California Santa Cruz, Santa Cruz, 95060, CA, USA
- David Haussler
  - Genomics Institute, University of California Santa Cruz, Santa Cruz, 95060, CA, USA
  - Department of Biomolecular Engineering, University of California Santa Cruz, Santa Cruz, 95060, CA, USA
- Vanessa D. Jonsson
  - Department of Applied Mathematics, University of California Santa Cruz, Santa Cruz, 95060, CA, USA
  - Department of Biomolecular Engineering, University of California Santa Cruz, Santa Cruz, 95060, CA, USA
  - Co-senior authors
- Mohammed A. Mostajo-Radji
  - Genomics Institute, University of California Santa Cruz, Santa Cruz, 95060, CA, USA
  - Live Cell Biotechnology Discovery Lab, University of California Santa Cruz, Santa Cruz, 95060, CA, USA
  - Co-senior authors
3
Salama SR. The Complexity of the Mammalian Transcriptome. Advances in Experimental Medicine and Biology 2022; 1363:11-22. [PMID: 35220563 DOI: 10.1007/978-3-030-92034-0_2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/16/2022]
Abstract
Draft genome assemblies for multiple mammalian species combined with new technologies to map transcripts from diverse RNA samples to these genomes developed in the early 2000s revealed that the mammalian transcriptome was vastly larger and more complex than previously anticipated. Efforts to comprehensively catalog the identity and features of transcripts present in a variety of species, tissues and cell lines revealed that a large fraction of the mammalian genome is transcribed in at least some settings. A large number of these transcripts encode long non-coding RNAs (lncRNAs). Many lncRNAs overlap or are anti-sense to protein coding genes and others overlap small RNAs. However, a large number are independent of any previously known mRNA or small RNA. While the functions of a majority of these lncRNAs are unknown, many appear to play roles in gene regulation. Many lncRNAs have species-specific and cell type specific expression patterns and their evolutionary origins are varied. While technological challenges have hindered getting a full picture of the diversity and transcript structure of all of the transcripts arising from lncRNA loci, new technologies including single molecule nanopore sequencing and single cell RNA sequencing promise to generate a comprehensive picture of the mammalian transcriptome.
Affiliation(s)
- Sofie R Salama
  - UC Santa Cruz Genomics Institute, Department of Biomolecular Engineering and Howard Hughes Medical Institute, University of California, Santa Cruz, Santa Cruz, CA, USA
4
Leo L, Colonna Romano N. Emerging Single-Cell Technological Approaches to Investigate Chromatin Dynamics and Centromere Regulation in Human Health and Disease. Int J Mol Sci 2021; 22:ijms22168809. [PMID: 34445507 PMCID: PMC8395756 DOI: 10.3390/ijms22168809] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 06/10/2021] [Revised: 08/09/2021] [Accepted: 08/12/2021] [Indexed: 12/12/2022]
Abstract
Epigenetic regulators play a crucial role in establishing and maintaining gene expression states. To date, the main efforts to study cellular heterogeneity have focused on elucidating the variable nature of the chromatin landscape. Specific chromatin organisation is fundamental for normal organogenesis and developmental homeostasis and can be affected by different environmental factors. The latter can lead to detrimental alterations in gene transcription, as well as pathological conditions such as cancer. Epigenetic marks regulate the transcriptional output of cells. Centromeres are chromosome structures that are epigenetically regulated and are crucial for accurate segregation. The advent of single-cell epigenetic profiling has provided finer analytical resolution, exposing the intrinsic peculiarities of different cells within an apparently homogenous population. In this review, we discuss recent advances in methodologies applied to epigenetics, such as CUT&RUN and CUT&TAG. Then, we compare standard and emerging single-cell techniques and their relevance for investigating human diseases. Finally, we describe emerging methodologies that investigate centromeric chromatin specification and neocentromere formation.
5
Different Flavors of Astrocytes: Revising the Origins of Astrocyte Diversity and Epigenetic Signatures to Understand Heterogeneity after Injury. Int J Mol Sci 2021; 22:ijms22136867. [PMID: 34206710 PMCID: PMC8268487 DOI: 10.3390/ijms22136867] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 04/29/2021] [Revised: 05/31/2021] [Accepted: 06/02/2021] [Indexed: 12/11/2022]
Abstract
Astrocytes are a specific type of neuroglial cell that confers metabolic and structural support to neurons. Astrocytes populate all regions of the nervous system and adopt a variety of phenotypes depending on their location and their respective functions, which are also pleiotropic in nature. For example, astrocytes adapt to pathological conditions with a specific cellular response known as reactive astrogliosis, which includes extensive phenotypic and transcriptional changes. Reactive astrocytes may lose some of their homeostatic functions and gain protective or detrimental properties with great impact on damage propagation. Different astrocyte subpopulations seemingly coexist in reactive astrogliosis; however, the source of such heterogeneity is not completely understood. Altered cellular signaling in pathological compared to healthy conditions might be one source fueling astrocyte heterogeneity. Moreover, diversity might also be encoded cell-autonomously, for example as a result of astrocyte subtype specification during development. We hypothesize and propose here that elucidating the epigenetic signature underlying the phenotype of each astrocyte subtype is highly relevant to understanding another regulatory layer of astrocyte heterogeneity, in general as well as after injury or as a result of other pathological conditions. High-resolution methods should shed light on diverse astrocyte cell states and subtypes and their adaptation to pathological conditions, and ultimately allow astrocyte functions to be controlled and manipulated in disease states. Here, we review recent literature reporting on astrocyte diversity from a developmental perspective, and we focus on epigenetic signatures that might account for cell type specification.
6
Keo A, Mahfouz A, Ingrassia AMT, Meneboo JP, Villenet C, Mutez E, Comptdaer T, Lelieveldt BPF, Figeac M, Chartier-Harlin MC, van de Berg WDJ, van Hilten JJ, Reinders MJT. Transcriptomic signatures of brain regional vulnerability to Parkinson's disease. Commun Biol 2020; 3:101. [PMID: 32139796 PMCID: PMC7058608 DOI: 10.1038/s42003-020-0804-9] [Citation(s) in RCA: 51] [Impact Index Per Article: 10.2] [Received: 08/26/2019] [Accepted: 01/28/2020] [Indexed: 01/11/2023]
Abstract
The molecular mechanisms underlying caudal-to-rostral progression of Lewy body pathology in Parkinson's disease remain poorly understood. Here, we identified transcriptomic signatures across brain regions involved in Braak Lewy body stages in non-neurological adults from the Allen Human Brain Atlas. Among the genes that are indicative of regional vulnerability, we found known genetic risk factors for Parkinson's disease: SCARB2, ELOVL7, SH3GL2, SNCA, BAP1, and ZNF184. Results were confirmed in two datasets of non-neurological subjects, while in two datasets of Parkinson's disease patients we found altered expression patterns. Co-expression analysis across vulnerable regions identified a module enriched for genes associated with dopamine synthesis and microglia, and another module related to the immune system, blood-oxygen transport, and endothelial cells. Both were highly expressed in regions involved in the preclinical stages of the disease. Finally, alterations in genes underlying these region-specific functions may contribute to the selective regional vulnerability in Parkinson's disease brains.
Affiliation(s)
- Arlin Keo
  - Leiden Computational Biology Center, Leiden University Medical Center, Leiden, The Netherlands
  - Delft Bioinformatics Lab, Delft University of Technology, Delft, The Netherlands
- Ahmed Mahfouz
  - Leiden Computational Biology Center, Leiden University Medical Center, Leiden, The Netherlands
  - Delft Bioinformatics Lab, Delft University of Technology, Delft, The Netherlands
- Angela M T Ingrassia
  - Department of Anatomy and Neurosciences, Amsterdam Neuroscience, Amsterdam UMC, location VUmc, Amsterdam, The Netherlands
- Jean-Pascal Meneboo
  - University Lille, Plate-forme de génomique fonctionnelle et Structurale, F-59000, Lille, France
  - University Lille, Bilille, F-59000, Lille, France
- Celine Villenet
  - University Lille, Plate-forme de génomique fonctionnelle et Structurale, F-59000, Lille, France
- Eugénie Mutez
  - University Lille, Inserm, CHU Lille, UMR-S 1172 - JPArc - Centre de Recherche Jean-Pierre AUBERT Neurosciences et Cancer, F-59000, Lille, France
  - Inserm, UMR-S 1172, Early Stages of Parkinson's Disease, F-59000, Lille, France
  - University Lille, Service de Neurologie et Pathologie du mouvement, centre expert Parkinson, F-59000, Lille, France
- Thomas Comptdaer
  - University Lille, Inserm, CHU Lille, UMR-S 1172 - JPArc - Centre de Recherche Jean-Pierre AUBERT Neurosciences et Cancer, F-59000, Lille, France
  - Inserm, UMR-S 1172, Early Stages of Parkinson's Disease, F-59000, Lille, France
- Boudewijn P F Lelieveldt
  - Delft Bioinformatics Lab, Delft University of Technology, Delft, The Netherlands
  - Department of Radiology, Leiden University Medical Center, Leiden, The Netherlands
- Martin Figeac
  - University Lille, Plate-forme de génomique fonctionnelle et Structurale, F-59000, Lille, France
  - University Lille, Bilille, F-59000, Lille, France
- Marie-Christine Chartier-Harlin
  - University Lille, Inserm, CHU Lille, UMR-S 1172 - JPArc - Centre de Recherche Jean-Pierre AUBERT Neurosciences et Cancer, F-59000, Lille, France
  - Inserm, UMR-S 1172, Early Stages of Parkinson's Disease, F-59000, Lille, France
- Wilma D J van de Berg
  - Department of Anatomy and Neurosciences, Amsterdam Neuroscience, Amsterdam UMC, location VUmc, Amsterdam, The Netherlands
- Jacobus J van Hilten
  - Department of Neurology, Leiden University Medical Center, Leiden, The Netherlands
- Marcel J T Reinders
  - Leiden Computational Biology Center, Leiden University Medical Center, Leiden, The Netherlands
  - Delft Bioinformatics Lab, Delft University of Technology, Delft, The Netherlands