1
Sumanaweera D, Suo C, Cujba AM, Muraro D, Dann E, Polanski K, Steemers AS, Lee W, Oliver AJ, Park JE, Meyer KB, Dumitrascu B, Teichmann SA. Gene-level alignment of single-cell trajectories. Nat Methods 2025; 22:68-81. [PMID: 39300283 PMCID: PMC11725504 DOI: 10.1038/s41592-024-02378-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/18/2023] [Accepted: 07/12/2024] [Indexed: 09/22/2024]
Abstract
Single-cell data analysis can infer dynamic changes in cell populations, for example across time, space or in response to perturbation, thus deriving pseudotime trajectories. Current approaches comparing trajectories often use dynamic programming but are limited by assumptions such as the existence of a definitive match. Here we describe Genes2Genes, a Bayesian information-theoretic dynamic programming framework for aligning single-cell trajectories. It is able to capture sequential matches and mismatches of individual genes between a reference and query trajectory, highlighting distinct clusters of alignment patterns. Across both real world and simulated datasets, it accurately inferred alignments and demonstrated its utility in disease cell-state trajectory analysis. In a proof-of-concept application, Genes2Genes revealed that T cells differentiated in vitro match an immature in vivo state while lacking expression of genes associated with TNF signaling. This demonstrates that precise trajectory alignment can pinpoint divergence from the in vivo system, thus guiding the optimization of in vitro culture conditions.
Affiliation(s)
- Dinithi Sumanaweera
  - Wellcome Sanger Institute; Wellcome Genome Campus, Hinxton, Cambridge, UK
  - Theory of Condensed Matter, Cavendish Laboratory, Department of Physics, University of Cambridge, Cambridge, UK
- Chenqu Suo
  - Wellcome Sanger Institute; Wellcome Genome Campus, Hinxton, Cambridge, UK
  - Department of Paediatrics, Cambridge University Hospitals; Hills Road, Cambridge, UK
- Ana-Maria Cujba
  - Wellcome Sanger Institute; Wellcome Genome Campus, Hinxton, Cambridge, UK
- Daniele Muraro
  - Wellcome Sanger Institute; Wellcome Genome Campus, Hinxton, Cambridge, UK
- Emma Dann
  - Wellcome Sanger Institute; Wellcome Genome Campus, Hinxton, Cambridge, UK
- Krzysztof Polanski
  - Wellcome Sanger Institute; Wellcome Genome Campus, Hinxton, Cambridge, UK
- Alexander S Steemers
  - Wellcome Sanger Institute; Wellcome Genome Campus, Hinxton, Cambridge, UK
  - Princess Máxima Center for Pediatric Oncology, Utrecht, Netherlands
- Woochan Lee
  - Wellcome Sanger Institute; Wellcome Genome Campus, Hinxton, Cambridge, UK
  - Department of Biomedical Sciences, Seoul National University, Seoul, Korea
- Amanda J Oliver
  - Wellcome Sanger Institute; Wellcome Genome Campus, Hinxton, Cambridge, UK
- Jong-Eun Park
  - Wellcome Sanger Institute; Wellcome Genome Campus, Hinxton, Cambridge, UK
  - Graduate School of Medical Science and Engineering, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Korea
- Kerstin B Meyer
  - Wellcome Sanger Institute; Wellcome Genome Campus, Hinxton, Cambridge, UK
- Bianca Dumitrascu
  - Department of Statistics, Columbia University, New York, NY, USA
  - Irving Institute for Cancer Dynamics, Columbia University, New York, NY, USA
- Sarah A Teichmann
  - Wellcome Sanger Institute; Wellcome Genome Campus, Hinxton, Cambridge, UK
  - Cambridge Stem Cell Institute, Jeffrey Cheah Biomedical Centre, Cambridge Biomedical Campus, University of Cambridge, Cambridge, UK
  - Department of Medicine, University of Cambridge, Cambridge, UK
  - Co-director of CIFAR Macmillan Research Program, Toronto, Ontario, Canada
2
Sun Y, Kong L, Huang J, Deng H, Bian X, Li X, Cui F, Dou L, Cao C, Zou Q, Zhang Z. A comprehensive survey of dimensionality reduction and clustering methods for single-cell and spatial transcriptomics data. Brief Funct Genomics 2024; 23:733-744. [PMID: 38860675 DOI: 10.1093/bfgp/elae023] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/26/2023] [Revised: 02/29/2024] [Accepted: 05/27/2024] [Indexed: 06/12/2024] Open
Abstract
In recent years, the application of single-cell transcriptomics and spatial transcriptomics analysis techniques has become increasingly widespread. Whether dealing with single-cell transcriptomic or spatial transcriptomic data, dimensionality reduction and clustering are indispensable. Both single-cell and spatial transcriptomic data are often high-dimensional, making the analysis and visualization of such data challenging. Through dimensionality reduction, it becomes possible to visualize the data in a lower-dimensional space, allowing for the observation of relationships and differences between cell subpopulations. Clustering enables the grouping of similar cells into the same cluster, aiding in the identification of distinct cell subpopulations and revealing cellular diversity, providing guidance for downstream analyses. In this review, we systematically summarized the most widely recognized algorithms employed for the dimensionality reduction and clustering analysis of single-cell transcriptomic and spatial transcriptomic data. This endeavor provides valuable insights and ideas that can contribute to the development of novel tools in this rapidly evolving field.
Affiliation(s)
- Yidi Sun
  - School of Computer Science and Technology, Hainan University, Haikou 570228, China
- Lingling Kong
  - School of Computer Science and Technology, Hainan University, Haikou 570228, China
- Jiayi Huang
  - School of Computer Science and Technology, Hainan University, Haikou 570228, China
- Hongyan Deng
  - School of Computer Science and Technology, Hainan University, Haikou 570228, China
- Xinling Bian
  - School of Computer Science and Technology, Hainan University, Haikou 570228, China
- Xingfeng Li
  - School of Computer Science and Technology, Hainan University, Haikou 570228, China
- Feifei Cui
  - School of Computer Science and Technology, Hainan University, Haikou 570228, China
- Lijun Dou
  - Genomic Medicine Institute, Lerner Research Institute, Cleveland, OH 44106, United States
- Chen Cao
  - School of Biomedical Engineering and Informatics, Nanjing Medical University, Nanjing 210029, China
- Quan Zou
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610054, China
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324000, China
- Zilong Zhang
  - School of Computer Science and Technology, Hainan University, Haikou 570228, China
3
Jena SG, Verma A, Engelhardt BE. Answering open questions in biology using spatial genomics and structured methods. BMC Bioinformatics 2024; 25:291. [PMID: 39232666 PMCID: PMC11375982 DOI: 10.1186/s12859-024-05912-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/10/2023] [Accepted: 08/22/2024] [Indexed: 09/06/2024] Open
Abstract
Genomics methods have uncovered patterns in a range of biological systems, but obscure important aspects of cell behavior: the shapes, relative locations, movement, and interactions of cells in space. Spatial technologies that collect genomic or epigenomic data while preserving spatial information have begun to overcome these limitations. These new data promise a deeper understanding of the factors that affect cellular behavior, and in particular the ability to directly test existing theories about cell state and variation in the context of morphology, location, motility, and signaling that could not be tested before. Rapid advancements in resolution, ease-of-use, and scale of spatial genomics technologies to address these questions also require an updated toolkit of statistical methods with which to interrogate these data. We present a framework to respond to this new avenue of research: four open biological questions that can now be answered using spatial genomics data paired with methods for analysis. We outline spatial data modalities for each open question that may yield specific insights, discuss how conflicting theories may be tested by comparing the data to conceptual models of biological behavior, and highlight statistical and machine learning-based tools that may prove particularly helpful to recover biological understanding.
Affiliation(s)
- Siddhartha G Jena
  - Department of Stem Cell and Regenerative Biology, Harvard, 7 Divinity Ave, Cambridge, MA, USA
- Archit Verma
  - Gladstone Institutes, 1650 Owens Street, San Francisco, CA, 94158, USA
4
Schuster V, Krogh A. The Deep Generative Decoder: MAP estimation of representations improves modelling of single-cell RNA data. Bioinformatics 2023; 39:btad497. [PMID: 37572301 PMCID: PMC10483129 DOI: 10.1093/bioinformatics/btad497] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/29/2022] [Revised: 07/12/2023] [Accepted: 08/10/2023] [Indexed: 08/14/2023] Open
Abstract
MOTIVATION: Learning low-dimensional representations of single-cell transcriptomics has become instrumental to its downstream analysis. The state of the art is currently represented by neural network models, such as variational autoencoders, which use a variational approximation of the likelihood for inference. RESULTS: We here present the Deep Generative Decoder (DGD), a simple generative model that computes model parameters and representations directly via maximum a posteriori estimation. The DGD handles complex parameterized latent distributions naturally, unlike variational autoencoders, which typically use a fixed Gaussian distribution because of the complexity of adding other types. We first show its general functionality on a commonly used benchmark set, Fashion-MNIST. Secondly, we apply the model to multiple single-cell datasets. Here, the DGD learns low-dimensional, meaningful, and well-structured latent representations with sub-clustering beyond the provided labels. The advantages of this approach are its simplicity and its capability to provide representations of much smaller dimensionality than a comparable variational autoencoder. AVAILABILITY AND IMPLEMENTATION: scDGD is available as a Python package at https://github.com/Center-for-Health-Data-Science/scDGD. The remaining code is made available here: https://github.com/Center-for-Health-Data-Science/dgd.
Affiliation(s)
- Viktoria Schuster
  - Center for Health Data Science, University of Copenhagen, 2200 Copenhagen, Denmark
- Anders Krogh
  - Center for Health Data Science, University of Copenhagen, 2200 Copenhagen, Denmark
  - Department of Computer Science, University of Copenhagen, 2100 Copenhagen, Denmark
5
Xue A, Yazar S, Neavin D, Powell JE. Pitfalls and opportunities for applying latent variables in single-cell eQTL analyses. Genome Biol 2023; 24:33. [PMID: 36823676 PMCID: PMC9948363 DOI: 10.1186/s13059-023-02873-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/22/2022] [Accepted: 02/13/2023] [Indexed: 02/25/2023] Open
Abstract
Using latent variables in gene expression data can help correct unobserved confounders and increase statistical power for expression quantitative trait loci (eQTL) detection. The probabilistic estimation of expression residuals (PEER) and principal component analysis (PCA) are widely used methods that can remove unwanted variation and improve eQTL discovery power in bulk RNA-seq analysis. However, their performance has not been evaluated extensively in single-cell eQTL analysis, especially for different cell types. Potential challenges arise due to the structure of single-cell RNA-seq data, including sparsity, skewness, and mean-variance relationship. Here, we show by a series of analyses that PEER and PCA require additional quality control and data transformation steps on the pseudo-bulk matrix to obtain valid latent variables; otherwise, it can result in highly correlated factors (Pearson's correlation r = 0.63 to 0.99). Incorporating valid PFs/PCs in the eQTL association model would identify 1.7 to 13.3% more eGenes. Sensitivity analysis showed that the pattern of change between the number of eGenes detected and fitted PFs/PCs varied significantly in different cell types. In addition, using highly variable genes to generate latent variables could achieve similar eGenes discovery power as using all genes but save considerable computational resources (~6.2-fold faster).
Affiliation(s)
- Angli Xue
  - Garvan-Weizmann Centre for Cellular Genomics, Garvan Institute of Medical Research, Sydney, NSW, 2010, Australia
  - School of Biomedical Sciences, University of New South Wales, Sydney, NSW, 2052, Australia
- Seyhan Yazar
  - Garvan-Weizmann Centre for Cellular Genomics, Garvan Institute of Medical Research, Sydney, NSW, 2010, Australia
- Drew Neavin
  - Garvan-Weizmann Centre for Cellular Genomics, Garvan Institute of Medical Research, Sydney, NSW, 2010, Australia
- Joseph E Powell
  - Garvan-Weizmann Centre for Cellular Genomics, Garvan Institute of Medical Research, Sydney, NSW, 2010, Australia
  - UNSW Cellular Genomics Futures Institute, University of New South Wales, Sydney, NSW, 2052, Australia
6
Bhattacharya A, Hirbo JB, Zhou D, Zhou W, Zheng J, Kanai M, the Global Biobank Meta-analysis Initiative, Pasaniuc B, Gamazon ER, Cox NJ. Best practices for multi-ancestry, meta-analytic transcriptome-wide association studies: Lessons from the Global Biobank Meta-analysis Initiative. Cell Genomics 2022; 2:100180. [PMID: 36341024 PMCID: PMC9631681 DOI: 10.1016/j.xgen.2022.100180] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Received: 11/28/2021] [Revised: 08/09/2022] [Accepted: 09/01/2022] [Indexed: 12/13/2022]
Abstract
The Global Biobank Meta-analysis Initiative (GBMI), through its diversity, provides a valuable opportunity to study population-wide and ancestry-specific genetic associations. However, with multiple ascertainment strategies and multi-ancestry study populations across biobanks, GBMI presents unique challenges in implementing statistical genetics methods. Transcriptome-wide association studies (TWASs) boost detection power for and provide biological context to genetic associations by integrating genetic variant-to-trait associations from genome-wide association studies (GWASs) with predictive models of gene expression. TWASs present unique challenges beyond GWASs, especially in a multi-biobank, meta-analytic setting. Here, we present the GBMI TWAS pipeline, outlining practical considerations for ancestry and tissue specificity, meta-analytic strategies, and open challenges at every step of the framework. We advise conducting ancestry-stratified TWASs using ancestry-specific expression models and meta-analyzing results using inverse-variance weighting, showing the least test statistic inflation. Our work provides a foundation for adding transcriptomic context to biobank-linked GWASs, allowing for ancestry-aware discovery to accelerate genomic medicine.
Affiliation(s)
- Arjun Bhattacharya
  - Department of Pathology and Laboratory Medicine, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA, USA
  - Institute of Quantitative and Computational Biosciences, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA, USA
- Jibril B. Hirbo
  - Department of Medicine, Division of Genetic Medicine, Vanderbilt University School of Medicine, Nashville, TN, USA
  - Vanderbilt Genetics Institute, Vanderbilt University Medical Center, Nashville, TN, USA
- Dan Zhou
  - Department of Medicine, Division of Genetic Medicine, Vanderbilt University School of Medicine, Nashville, TN, USA
  - Vanderbilt Genetics Institute, Vanderbilt University Medical Center, Nashville, TN, USA
- Wei Zhou
  - Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA, USA
  - Program in Medical and Population Genetics, Broad Institute of Harvard and MIT, Cambridge, MA, USA
  - Stanley Center for Psychiatric Research, Broad Institute of Harvard and MIT, Cambridge, MA, USA
- Jie Zheng
  - MRC Integrative Epidemiology Unit (IEU), Bristol Medical School, University of Bristol, Oakfield House, Oakfield Grove, Bristol BS8 2BN, UK
- Masahiro Kanai
  - Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA, USA
  - Program in Medical and Population Genetics, Broad Institute of Harvard and MIT, Cambridge, MA, USA
  - Stanley Center for Psychiatric Research, Broad Institute of Harvard and MIT, Cambridge, MA, USA
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
  - Department of Statistical Genetics, Osaka University Graduate School of Medicine, Suita 565-0871, Japan
- the Global Biobank Meta-analysis Initiative
  - Department of Pathology and Laboratory Medicine, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA, USA
  - Institute of Quantitative and Computational Biosciences, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA, USA
  - Department of Medicine, Division of Genetic Medicine, Vanderbilt University School of Medicine, Nashville, TN, USA
  - Vanderbilt Genetics Institute, Vanderbilt University Medical Center, Nashville, TN, USA
  - Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, MA, USA
  - Program in Medical and Population Genetics, Broad Institute of Harvard and MIT, Cambridge, MA, USA
  - Stanley Center for Psychiatric Research, Broad Institute of Harvard and MIT, Cambridge, MA, USA
  - MRC Integrative Epidemiology Unit (IEU), Bristol Medical School, University of Bristol, Oakfield House, Oakfield Grove, Bristol BS8 2BN, UK
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
  - Department of Statistical Genetics, Osaka University Graduate School of Medicine, Suita 565-0871, Japan
  - Department of Human Genetics, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA, USA
  - Department of Computational Medicine, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA, USA
  - MRC Epidemiology Unit, University of Cambridge, Cambridge, UK
- Bogdan Pasaniuc
  - Department of Pathology and Laboratory Medicine, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA, USA
  - Department of Human Genetics, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA, USA
  - Department of Computational Medicine, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA, USA
- Eric R. Gamazon
  - Department of Medicine, Division of Genetic Medicine, Vanderbilt University School of Medicine, Nashville, TN, USA
  - Vanderbilt Genetics Institute, Vanderbilt University Medical Center, Nashville, TN, USA
  - MRC Epidemiology Unit, University of Cambridge, Cambridge, UK
- Nancy J. Cox
  - Department of Medicine, Division of Genetic Medicine, Vanderbilt University School of Medicine, Nashville, TN, USA
  - Vanderbilt Genetics Institute, Vanderbilt University Medical Center, Nashville, TN, USA
7
Cost function for low-dimensional manifold topology assessment. Sci Rep 2022; 12:14496. [PMID: 36008473 PMCID: PMC9411209 DOI: 10.1038/s41598-022-18655-1] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 05/02/2022] [Accepted: 08/17/2022] [Indexed: 12/02/2022] Open
Abstract
In reduced-order modeling, complex systems that exhibit high state-space dimensionality are described and evolved using a small number of parameters. These parameters can be obtained in a data-driven way, where a high-dimensional dataset is projected onto a lower-dimensional basis. A complex system is then restricted to states on a low-dimensional manifold where it can be efficiently modeled. While this approach brings computational benefits, obtaining a good quality of the manifold topology becomes a crucial aspect when models, such as nonlinear regression, are built on top of the manifold. Here, we present a quantitative metric for characterizing manifold topologies. Our metric pays attention to non-uniqueness and spatial gradients in physical quantities of interest, and can be applied to manifolds of arbitrary dimensionality. Using the metric as a cost function in optimization algorithms, we show that optimized low-dimensional projections can be found. We delineate a few applications of the cost function to datasets representing argon plasma, reacting flows and atmospheric pollutant dispersion. We demonstrate how the cost function can assess various dimensionality reduction and manifold learning techniques as well as data preprocessing strategies in their capacity to yield quality low-dimensional projections. We show that improved manifold topologies can facilitate building nonlinear regression models.
8
Sarkar A, Stephens M. Separating measurement and expression models clarifies confusion in single-cell RNA sequencing analysis. Nat Genet 2021; 53:770-777. [PMID: 34031584 PMCID: PMC8370014 DOI: 10.1038/s41588-021-00873-4] [Citation(s) in RCA: 109] [Impact Index Per Article: 27.3] [Received: 04/24/2020] [Accepted: 04/22/2021] [Indexed: 01/21/2023]
Abstract
The high proportion of zeros in typical single-cell RNA sequencing datasets has led to widespread but inconsistent use of terminology such as dropout and missing data. Here, we argue that much of this terminology is unhelpful and confusing, and outline simple ideas to help to reduce confusion. These include: (1) observed single-cell RNA sequencing counts reflect both true gene expression levels and measurement error, and carefully distinguishing between these contributions helps to clarify thinking; and (2) method development should start with a Poisson measurement model, rather than more complex models, because it is simple and generally consistent with existing data. We outline how several existing methods can be viewed within this framework and highlight how these methods differ in their assumptions about expression variation. We also illustrate how our perspective helps to address questions of biological interest, such as whether messenger RNA expression levels are multimodal among cells.
Affiliation(s)
- Abhishek Sarkar
  - Department of Human Genetics, University of Chicago, Chicago, IL, USA
- Matthew Stephens
  - Department of Human Genetics, University of Chicago, Chicago, IL, USA
  - Department of Statistics, University of Chicago, Chicago, IL, USA
9
Joint analysis of expression levels and histological images identifies genes associated with tissue morphology. Nat Commun 2021; 12:1609. [PMID: 33707455 PMCID: PMC7952575 DOI: 10.1038/s41467-021-21727-x] [Citation(s) in RCA: 27] [Impact Index Per Article: 6.8] [Received: 08/21/2017] [Accepted: 02/05/2021] [Indexed: 01/01/2023] Open
Abstract
Histopathological images are used to characterize complex phenotypes such as tumor stage. Our goal is to associate features of stained tissue images with high-dimensional genomic markers. We use convolutional autoencoders and sparse canonical correlation analysis (CCA) on paired histological images and bulk gene expression to identify subsets of genes whose expression levels in a tissue sample correlate with subsets of morphological features from the corresponding sample image. We apply our approach, ImageCCA, to two TCGA data sets, and find gene sets associated with the structure of the extracellular matrix and cell wall infrastructure, implicating uncharacterized genes in extracellular processes. We find sets of genes associated with specific cell types, including neuronal cells and cells of the immune system. We apply ImageCCA to the GTEx v6 data, and find image features that capture population variation in thyroid and in colon tissues associated with genetic variants (image morphology QTLs, or imQTLs), suggesting that genetic variation regulates population variation in tissue morphological traits.