1
González-Velasco O, Simon M, Yilmaz R, Parlato R, Weishaupt J, Imbusch C, Brors B. Identifying similar populations across independent single cell studies without data integration. NAR Genom Bioinform 2025; 7:lqaf042. [PMID: 40276039 PMCID: PMC12019640 DOI: 10.1093/nargab/lqaf042] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/24/2024] [Revised: 03/13/2025] [Accepted: 03/26/2025] [Indexed: 04/26/2025] Open
Abstract
Supervised and unsupervised methods have emerged to address the complexity of single cell data analysis in the context of large pools of independent studies. Here, we present ClusterFoldSimilarity (CFS), a novel statistical method designed to quantify the similarity between cell groups across any number of independent datasets, without the need for data correction or integration. By bypassing these processes, CFS avoids the introduction of artifacts and loss of information, offering a simple, efficient, and scalable solution. This method matches groups of cells that exhibit conserved phenotypes across datasets, including different tissues and species, and in a multimodal scenario, including single-cell RNA-Seq, ATAC-Seq, single-cell proteomics, or, more broadly, data exhibiting differential abundance effects among groups of cells. Additionally, CFS performs feature selection, obtaining cross-dataset markers of the similar phenotypes observed, providing an inherent interpretability of relationships between cell populations. To showcase the effectiveness of our methodology, we generated single-nuclei RNA-Seq data from the motor cortex and spinal cord of adult mice. By using CFS, we identified three distinct sub-populations of astrocytes conserved in both tissues. CFS includes various visualization methods for the interpretation of the similarity scores and similar cell populations.
Affiliation(s)
- Oscar González-Velasco
- Division Applied Bioinformatics, German Cancer Research Center (DKFZ), 69120 Heidelberg, Germany
- Division of Neurodegenerative Disorders, Department of Neurology, Medical Faculty Mannheim, Mannheim Center for Translational Neurosciences, Heidelberg University, 68167 Mannheim, Germany
- Malte Simon
- Division Applied Bioinformatics, German Cancer Research Center (DKFZ), 69120 Heidelberg, Germany
- Leibniz Institute for Immunotherapy, 93053 Regensburg, Germany
- Rüstem Yilmaz
- Division of Neurodegenerative Disorders, Department of Neurology, Medical Faculty Mannheim, Mannheim Center for Translational Neurosciences, Heidelberg University, 68167 Mannheim, Germany
- Rosanna Parlato
- Division of Neurodegenerative Disorders, Department of Neurology, Medical Faculty Mannheim, Mannheim Center for Translational Neurosciences, Heidelberg University, 68167 Mannheim, Germany
- Jochen Weishaupt
- Division of Neurodegenerative Disorders, Department of Neurology, Medical Faculty Mannheim, Mannheim Center for Translational Neurosciences, Heidelberg University, 68167 Mannheim, Germany
- Charles D Imbusch
- Division Applied Bioinformatics, German Cancer Research Center (DKFZ), 69120 Heidelberg, Germany
- Institute of Immunology, University Medical Center Mainz, 55131 Mainz, Germany
- Research Center for Immunotherapy, University Medical Center Mainz, 55131 Mainz, Germany
- Benedikt Brors
- Division Applied Bioinformatics, German Cancer Research Center (DKFZ), 69120 Heidelberg, Germany
- German Cancer Consortium (DKTK), Core Center Heidelberg, Im Neuenheimer Feld 280, 69120 Heidelberg, Germany
- Medical Faculty Heidelberg and Faculty of Biosciences, Heidelberg University, 69120 Heidelberg, Germany
2
Gao S, Li H, Wu Z, Mizumaki H, Kajigaya S, Young NS. GSNCASCR: An R Package to Identify Differentially Co-Expressed Curated Gene Sets with Single-Cell RNA-Seq Data. Int J Mol Sci 2025; 26:4771. [PMID: 40429912 PMCID: PMC12112291 DOI: 10.3390/ijms26104771] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/02/2025] [Revised: 05/06/2025] [Accepted: 05/13/2025] [Indexed: 05/29/2025] Open
Abstract
(1) Differential co-expression analysis between two phenotypes with a known gene set helps to uncover gene regulation alterations. (2) GSNCASCR uses CSCORE to estimate the gene pair correlations for network reconstruction and GSNCA to quantify the structure changes of co-expression networks of the predefined gene sets. It also ranks genes based on their "importance" in the weighted network. The method is implemented with free R software (version 0.1.0, available on GitHub), allowing users to analyze their data with the help of demo vignettes included in the package. (3) With analysis of both simulated and real datasets, we demonstrate that the statistical tests performed with GSNCASCR are able to identify differentially co-expressed gene sets with higher precision than tests with Gene Set Co-Expression Analysis (GSCA, version 1.1.1) and Gene Sets Net Correlations Analysis (GSNCA, version 1.42.0). Specifically, GSNCASCR achieved an AUC value of 0.985, while GSNCA and GSCA achieved 0.817 and 0.893, respectively, when positive and negative pathways are defined as having more than 40% and less than 20% co-expressed gene pairs in the simulated data, respectively. Furthermore, across simulated data with varying noise levels, pathway sizes, and positive/negative pathway definitions, GSNCASCR consistently performs best in over 90% of scenarios, as evaluated by AUC values. With an available COVID-19 dataset, we show CD4+ T cell dysfunction in severe COVID-19 as TNF-α/TNF receptor 1-dependent immune pathways. In the weighted network of a gene set of IFN-γ, IFITM3 was identified as a hub gene, which has been evidenced by a genome-wide association study and functional studies. (4) We developed a bioinformatics tool, GSNCASCR, that analyzes differentially co-expressed pathways with single-cell RNA-sequencing data and also evaluates the importance of the genes within pathways. 
This tool combines the advantages of two algorithms, enabling the quantification and examination of cell type-specific co-expression changes within pathways. The package allows for the analysis of shared and unique disease-affected pathways across different cell types.
Affiliation(s)
- Shouguo Gao
- Hematopoiesis and Bone Marrow Failure Laboratory, Hematology Branch, National Heart, Lung, and Blood Institute, National Institutes of Health, Bethesda, MD 20892, USA
3
Fu S, Li WV. Predicting and comparing transcription start sites in single cell populations. PLoS Comput Biol 2025; 21:e1012878. [PMID: 40179341 PMCID: PMC11968111 DOI: 10.1371/journal.pcbi.1012878] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/16/2024] [Accepted: 02/15/2025] [Indexed: 04/05/2025] Open
Abstract
The advent of 5' single-cell RNA sequencing (scRNA-seq) technologies offers unique opportunities to identify and analyze transcription start sites (TSSs) at a single-cell resolution. These technologies have the potential to uncover the complexities of transcription initiation and alternative TSS usage across different cell types and conditions. Despite the emergence of computational methods designed to analyze 5' RNA sequencing data, current methods often lack comparative evaluations in single-cell contexts and are predominantly tailored for paired-end data, neglecting the potential of single-end data. This study introduces scTSS, a computational pipeline developed to bridge this gap by accommodating both paired-end and single-end 5' scRNA-seq data. scTSS enables joint analysis of multiple single-cell samples, starting with TSS cluster prediction and quantification, followed by differential TSS usage analysis. It employs a Binomial generalized linear mixed model to accurately and efficiently detect differential TSS usage. We demonstrate the utility of scTSS through its application in analyzing transcriptional initiation from single-cell data of two distinct diseases. The results illustrate scTSS's ability to discern alternative TSS usage between different cell types or biological conditions and to identify cell subpopulations characterized by unique TSS-level expression profiles.
Affiliation(s)
- Shiwei Fu
- Department of Statistics, University of California, Riverside, Riverside, California, United States of America
- Wei Vivian Li
- Department of Statistics, University of California, Riverside, Riverside, California, United States of America
4
Liang X, Torkel M, Cao Y, Yang JYH. Multi-task benchmarking of spatially resolved gene expression simulation models. Genome Biol 2025; 26:57. [PMID: 40098171 PMCID: PMC11912772 DOI: 10.1186/s13059-025-03505-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/15/2024] [Accepted: 02/12/2025] [Indexed: 03/19/2025] Open
Abstract
BACKGROUND Computational methods for spatially resolved transcriptomics (SRT) are often developed and assessed using simulated data. The effectiveness of these evaluations relies on the ability of simulation methods to accurately reflect experimental data. However, a systematic evaluation framework for spatial simulators is currently lacking. RESULTS Here, we present SpatialSimBench, a comprehensive evaluation framework that assesses 13 simulation methods using ten distinct SRT datasets. We introduce simAdaptor, a tool that extends single-cell simulators by incorporating spatial variables, enabling them to simulate spatial data. SimAdaptor ensures SpatialSimBench is backwards compatible, facilitating direct comparisons between spatially aware simulators and existing non-spatial single-cell simulators through the adaption. Using SpatialSimBench, we demonstrate the feasibility of leveraging existing single-cell simulators for SRT data and highlight performance differences among methods. Additionally, we evaluate the simulation methods based on a total of 35 metrics across data property estimation, various downstream analyses, and scalability. In total, we generated 4550 results from 13 simulation methods, ten spatial datasets, and 35 metrics. CONCLUSIONS Our findings reveal that model estimation can be influenced by distribution assumptions and dataset characteristics. In summary, our evaluation framework provides guidelines for selecting appropriate methods for specific scenarios and informs future method development.
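The total of 4550 results quoted in the abstract is consistent with one result per (method, dataset, metric) combination. A quick check, assuming a full cross of the three counts (the variable names are illustrative only):

```python
# Counts taken from the abstract; a full method x dataset x metric cross is assumed.
n_methods = 13
n_datasets = 10
n_metrics = 35

n_results = n_methods * n_datasets * n_metrics
print(n_results)  # 4550
```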
Affiliation(s)
- Xiaoqi Liang
- School of Mathematics and Statistics, The University of Sydney, Sydney, NSW, 2006, Australia
- Sydney Precision Data Science Centre, The University of Sydney, Sydney, NSW, 2006, Australia
- Charles Perkins Centre, The University of Sydney, Sydney, NSW, 2006, Australia
- Marni Torkel
- School of Mathematics and Statistics, The University of Sydney, Sydney, NSW, 2006, Australia
- Sydney Precision Data Science Centre, The University of Sydney, Sydney, NSW, 2006, Australia
- Charles Perkins Centre, The University of Sydney, Sydney, NSW, 2006, Australia
- Yue Cao
- School of Mathematics and Statistics, The University of Sydney, Sydney, NSW, 2006, Australia
- Sydney Precision Data Science Centre, The University of Sydney, Sydney, NSW, 2006, Australia
- Charles Perkins Centre, The University of Sydney, Sydney, NSW, 2006, Australia
- Laboratory of Data Discovery for Health Limited (D24H), Science Park, Hong Kong SAR, China
- Jean Yee Hwa Yang
- School of Mathematics and Statistics, The University of Sydney, Sydney, NSW, 2006, Australia
- Sydney Precision Data Science Centre, The University of Sydney, Sydney, NSW, 2006, Australia
- Charles Perkins Centre, The University of Sydney, Sydney, NSW, 2006, Australia
- Laboratory of Data Discovery for Health Limited (D24H), Science Park, Hong Kong SAR, China
5
Song B, Liu D, Dai W, McMyn NF, Wang Q, Yang D, Krejci A, Vasilyev A, Untermoser N, Loregger A, Song D, Williams B, Rosen B, Cheng X, Chao L, Kale HT, Zhang H, Diao Y, Bürckstümmer T, Siliciano JD, Li JJ, Siliciano RF, Huangfu D, Li W. Decoding heterogeneous single-cell perturbation responses. Nat Cell Biol 2025; 27:493-504. [PMID: 40011559 PMCID: PMC11906366 DOI: 10.1038/s41556-025-01626-9] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 12/18/2023] [Accepted: 01/20/2025] [Indexed: 02/28/2025]
Abstract
Understanding how cells respond differently to perturbation is crucial in cell biology, but existing methods often fail to accurately quantify and interpret heterogeneous single-cell responses. Here we introduce the perturbation-response score (PS), a method to quantify diverse perturbation responses at a single-cell level. Applied to single-cell perturbation datasets such as Perturb-seq, PS outperforms existing methods in quantifying partial gene perturbations. PS further enables single-cell dosage analysis without needing to titrate perturbations, and identifies 'buffered' and 'sensitive' response patterns of essential genes, depending on whether their moderate perturbations lead to strong downstream effects. PS reveals differential cellular responses on perturbing key genes in contexts such as T cell stimulation, latent HIV-1 expression and pancreatic differentiation. Notably, we identified a previously unknown role for the coiled-coil domain containing 6 (CCDC6) in regulating liver and pancreatic cell fate decisions. PS provides a powerful method for dose-to-function analysis, offering deeper insights from single-cell perturbation data.
Affiliation(s)
- Bicna Song
- Center for Genetic Medicine Research, Children's National Hospital, Washington, DC, USA
- Department of Genomics and Precision Medicine, George Washington University, Washington, DC, USA
- Dingyu Liu
- Developmental Biology Program, Sloan Kettering Institute, New York City, NY, USA
- Louis V. Gerstner Jr. Graduate School of Biomedical Sciences, Memorial Sloan Kettering Cancer Center, New York City, NY, USA
- Weiwei Dai
- Department of Medicine, Johns Hopkins University School of Medicine, Baltimore, MD, USA
- Howard Hughes Medical Institute, Johns Hopkins University School of Medicine, Baltimore, MD, USA
- Natalie F McMyn
- Department of Medicine, Johns Hopkins University School of Medicine, Baltimore, MD, USA
- Qingyang Wang
- Department of Statistics and Data Science, University of California, Los Angeles, CA, USA
- Dapeng Yang
- Developmental Biology Program, Sloan Kettering Institute, New York City, NY, USA
- Dongyuan Song
- Bioinformatics Interdepartmental PhD Program, University of California, Los Angeles, CA, USA
- Department of Genetics and Genome Sciences, University of Connecticut Health Center, Farmington, CT, USA
- Breanna Williams
- Developmental Biology Program, Sloan Kettering Institute, New York City, NY, USA
- Bess Rosen
- Developmental Biology Program, Sloan Kettering Institute, New York City, NY, USA
- Weill Cornell Graduate School of Medical Sciences, Weill Cornell Medicine, New York, NY, USA
- Xiaolong Cheng
- Center for Genetic Medicine Research, Children's National Hospital, Washington, DC, USA
- Department of Genomics and Precision Medicine, George Washington University, Washington, DC, USA
- Lumen Chao
- Center for Genetic Medicine Research, Children's National Hospital, Washington, DC, USA
- Department of Genomics and Precision Medicine, George Washington University, Washington, DC, USA
- Hanuman T Kale
- Developmental Biology Program, Sloan Kettering Institute, New York City, NY, USA
- Hao Zhang
- Department of Medicine, Johns Hopkins University School of Medicine, Baltimore, MD, USA
- Yarui Diao
- Department of Cell Biology, Duke University Medical Center, Durham, NC, USA
- Janet D Siliciano
- Department of Medicine, Johns Hopkins University School of Medicine, Baltimore, MD, USA
- Jingyi Jessica Li
- Department of Statistics and Data Science, University of California, Los Angeles, CA, USA
- Bioinformatics Interdepartmental PhD Program, University of California, Los Angeles, CA, USA
- Department of Human Genetics, University of California, Los Angeles, CA, USA
- Department of Biostatistics, University of California, Los Angeles, CA, USA
- Department of Computational Medicine, University of California, Los Angeles, CA, USA
- Robert F Siliciano
- Department of Medicine, Johns Hopkins University School of Medicine, Baltimore, MD, USA
- Howard Hughes Medical Institute, Johns Hopkins University School of Medicine, Baltimore, MD, USA
- Danwei Huangfu
- Developmental Biology Program, Sloan Kettering Institute, New York City, NY, USA
- Wei Li
- Center for Genetic Medicine Research, Children's National Hospital, Washington, DC, USA
- Department of Genomics and Precision Medicine, George Washington University, Washington, DC, USA
6
Dong S, Cui Z, Liu D, Lei J. scRDiT: Generating Single-cell RNA-seq Data by Diffusion Transformers and Accelerating Sampling. Interdiscip Sci 2025:10.1007/s12539-025-00688-5. [PMID: 39982678 DOI: 10.1007/s12539-025-00688-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/04/2024] [Revised: 01/07/2025] [Accepted: 01/08/2025] [Indexed: 02/22/2025]
Abstract
Single-cell RNA sequencing (scRNA-seq) is a groundbreaking technology extensively utilized in biological research, facilitating the examination of gene expression at the individual cell level within a given tissue sample. While numerous tools have been developed for scRNA-seq data analysis, the challenge persists in capturing the distinct features of such data and replicating virtual datasets that share analogous statistical properties. Our study introduces a generative approach termed scRNA-seq Diffusion Transformer (scRDiT). This method generates virtual scRNA-seq data by leveraging a real dataset. The model is a neural network built on Denoising Diffusion Probabilistic Models (DDPMs) and Diffusion Transformers (DiTs). Gaussian noise is added to the real dataset through iterative noising steps, and the noised data are ultimately denoised to form scRNA-seq samples. This scheme allows us to learn data features from actual scRNA-seq samples during model training. Our experiments, conducted on two distinct scRNA-seq datasets, demonstrate superior performance. Additionally, the model sampling process is expedited by incorporating Denoising Diffusion Implicit Models (DDIMs). scRDiT presents a unified methodology empowering users to train neural network models with their unique scRNA-seq datasets, enabling the generation of numerous high-quality scRNA-seq samples.
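The iterative noising step mentioned in the abstract can be illustrated with the standard DDPM forward process, in which a noised sample at step t is drawn in closed form as x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps. The sketch below is a generic illustration under assumed settings, not the scRDiT implementation: the linear schedule, its endpoints, the number of steps, and the toy expression vector are all hypothetical.

```python
import math
import random

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta_1..beta_T (endpoints are assumptions)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def forward_noise(x0, t, alpha_bars, rng):
    """Sample x_t from N(sqrt(alpha_bar_t) * x0, (1 - alpha_bar_t) * I)."""
    a = alpha_bars[t]
    return [math.sqrt(a) * v + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0) for v in x0]

rng = random.Random(0)
alpha_bars = cumulative_alphas(linear_beta_schedule(1000))
x0 = [0.0, 1.5, 3.0, 0.2]  # hypothetical normalized expression values for one cell
x_early = forward_noise(x0, 0, alpha_bars, rng)    # almost unchanged
x_late = forward_noise(x0, 999, alpha_bars, rng)   # close to pure Gaussian noise
```

Because alpha_bar_t shrinks toward zero as t grows, late steps carry almost no signal from the original cell, which is what lets the reverse (denoising) model start from pure noise at sampling time.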
Affiliation(s)
- Shengze Dong
- School of Computer Science and Technology, Tiangong University, Tianjin, 300387, China
- Zhuorui Cui
- School of Computer Science and Technology, Tiangong University, Tianjin, 300387, China
- Ding Liu
- School of Computer Science and Technology, Tiangong University, Tianjin, 300387, China
- Jinzhi Lei
- School of Mathematical Sciences, Tiangong University, Tianjin, 300387, China
7
Yang J, Grant GR, Brooks TG. Generating Correlated Data for Omics Simulation. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2025:2025.01.31.634335. [PMID: 39975030 PMCID: PMC11838456 DOI: 10.1101/2025.01.31.634335] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/21/2025]
Abstract
Simulation of realistic omics data is a key input for benchmarking studies that help users obtain optimal computational pipelines. Omics data involves large numbers of measured features on each sample, and these measures are generally correlated with each other. However, simulation too often ignores these correlations, perhaps due to the inconvenience and computational hurdles of doing so. To alleviate this, we describe in detail three approaches for quickly generating omics-scale data with correlated measures which mimic real data sets. All three approaches are based on a Gaussian copula with a covariance matrix that decomposes into a diagonal part and a low-rank part. We use these approaches to demonstrate the importance of including correlation in two benchmarking applications. First, we show that variance of results from the popular DESeq2 method increases when dependence is included. Second, we demonstrate that CYCLOPS, a method for inferring circadian time of collection from transcriptomics, improves in performance when given gene-gene dependencies in some circumstances. We provide an R package, dependentsimr, that has efficient implementations of these methods and can generate dependent data with arbitrary distributions, including discrete (binary, ordered categorical, Poisson, negative binomial), continuous (normal), or with an empirical distribution.
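The core idea, a Gaussian copula whose covariance splits into a diagonal part plus a low-rank part, can be sketched in a few lines. This is a generic illustration and not the dependentsimr interface: the function names, the single-factor loadings, and the Poisson rates are assumptions. A latent vector z = sqrt(d) * e + U w has covariance diag(d) + U Uᵀ and can be drawn without ever forming the full matrix; each coordinate is then standardized, mapped to a uniform through the normal CDF, and pushed through the inverse CDF of the target marginal.

```python
import math
import random
from statistics import NormalDist

def poisson_quantile(u, lam, max_k=10_000):
    """Smallest k with P(X <= k) >= u for X ~ Poisson(lam)."""
    k = 0
    pmf = math.exp(-lam)
    cdf = pmf
    while cdf < u and k < max_k:
        k += 1
        pmf *= lam / k
        cdf += pmf
    return k

def sample_copula(diag, loadings, rates, rng):
    """One draw with Poisson marginals and latent covariance diag(diag) + U U^T."""
    n_factors = len(loadings[0])
    # Shared low-rank factors: these induce the correlation between features.
    w = [rng.gauss(0.0, 1.0) for _ in range(n_factors)]
    std_normal = NormalDist()
    draw = []
    for d, row, lam in zip(diag, loadings, rates):
        z = math.sqrt(d) * rng.gauss(0.0, 1.0) + sum(c * f for c, f in zip(row, w))
        sd = math.sqrt(d + sum(c * c for c in row))  # marginal std of z
        u = std_normal.cdf(z / sd)                   # uniform on (0, 1)
        u = min(max(u, 1e-12), 1.0 - 1e-12)          # keep the quantile search finite
        draw.append(poisson_quantile(u, lam))
    return draw

def pearson(xs, ys):
    """Sample Pearson correlation."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(1)
diag = [0.2, 0.2, 1.0]                 # hypothetical independent variances
loadings = [[0.9], [0.9], [0.0]]       # hypothetical rank-1 factor: features 0 and 1 co-vary
rates = [5.0, 5.0, 5.0]                # hypothetical Poisson means
samples = [sample_copula(diag, loadings, rates, rng) for _ in range(2000)]
cols = list(zip(*samples))
corr_01 = pearson(cols[0], cols[1])    # strongly positive
corr_02 = pearson(cols[0], cols[2])    # near zero
print(round(corr_01, 2), round(corr_02, 2))
```

With p features and k factors, each draw costs O(pk) rather than the O(p²) needed to work with a dense covariance matrix, which is the practical appeal of this decomposition at omics scale.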
Affiliation(s)
- Jianing Yang
- Institute for Translational Medicine and Therapeutics, University of Pennsylvania
- Chronobiology and Sleep Institute, University of Pennsylvania
- Gregory R Grant
- Institute for Translational Medicine and Therapeutics, University of Pennsylvania
- Thomas G Brooks
- Institute for Translational Medicine and Therapeutics, University of Pennsylvania
8
Van Hecke M, Beerenwinkel N, Lootens T, Fostier J, Raedt R, Marchal K. ELLIPSIS: robust quantification of splicing in scRNA-seq. Bioinformatics 2025; 41:btaf028. [PMID: 39936571 PMCID: PMC11878791 DOI: 10.1093/bioinformatics/btaf028] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/19/2024] [Revised: 12/09/2024] [Accepted: 02/10/2025] [Indexed: 02/13/2025] Open
Abstract
MOTIVATION Alternative splicing is a tightly regulated biological process that, due to its cell type-specific behavior, calls for analysis at the single cell level. However, quantifying differential splicing in scRNA-seq is challenging due to low and uneven coverage. To this end, we developed ELLIPSIS, a tool for robust quantification of splicing in scRNA-seq that leverages locally observed read coverage with conservation of flow and intra-cell type similarity properties. It can also quantify novel splicing events, which is particularly important in cancer cells, where many novel splicing events occur. RESULTS Application of ELLIPSIS to simulated data shows that our method robustly estimates Percent Spliced In values and reliably detects differential splicing between cell types. Using ELLIPSIS on glioblastoma scRNA-seq data, we identified genes that are differentially spliced between cancer cells in the tumor core and infiltrating cancer cells found in peripheral tissue. These genes were shown to play a role in, among other processes, cell migration and motility, cell projection organization, and neuron projection guidance. AVAILABILITY AND IMPLEMENTATION ELLIPSIS quantification tool: https://github.com/MarchalLab/ELLIPSIS.git.
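For readers unfamiliar with the quantity being estimated, Percent Spliced In (PSI) is conventionally the fraction of reads supporting inclusion of an exon among all reads informative for that event. The sketch below uses that textbook definition only; it is not the ELLIPSIS estimator, and the read counts are made up. It also shows why low per-cell coverage is a problem and why pooling cells of one type gives a steadier value.

```python
def percent_spliced_in(inclusion_reads, exclusion_reads):
    """PSI = inclusion / (inclusion + exclusion); None when the event has no coverage."""
    total = inclusion_reads + exclusion_reads
    if total == 0:
        return None
    return inclusion_reads / total

# Hypothetical (inclusion, exclusion) read counts for four cells of one cell type.
cells = [(3, 1), (0, 0), (1, 0), (2, 2)]

per_cell = [percent_spliced_in(inc, exc) for inc, exc in cells]
pooled = percent_spliced_in(sum(inc for inc, _ in cells), sum(exc for _, exc in cells))

print(per_cell)  # [0.75, None, 1.0, 0.5]
print(pooled)
```

Per-cell values swing between undefined and 1.0 on a handful of reads, while the pooled estimate uses all nine informative reads.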
Affiliation(s)
- Marie Van Hecke
- IDLab, Department of Information Technology, Ghent University-IMEC, 9052 Ghent, Belgium
- Department of Plant Biotechnology and Bioinformatics, Ghent University, 9052 Ghent, Belgium
- Cancer Research Institute Ghent (CRIG), Ghent University, 9000 Ghent, Belgium
- Niko Beerenwinkel
- Department of Biosystems Science and Engineering, ETH Zürich, 4056 Basel, Switzerland
- SIB Swiss Institute of Bioinformatics, 4051 Basel, Switzerland
- Thibault Lootens
- Cancer Research Institute Ghent (CRIG), Ghent University, 9000 Ghent, Belgium
- 4Brain, Department of Head and Skin, Ghent University, 9000 Ghent, Belgium
- Laboratory of Experimental Cancer Research, Department of Human Structure and Repair, Ghent University, 9000 Ghent, Belgium
- Jan Fostier
- Department of Plant Biotechnology and Bioinformatics, Ghent University, 9052 Ghent, Belgium
- Robrecht Raedt
- Cancer Research Institute Ghent (CRIG), Ghent University, 9000 Ghent, Belgium
- 4Brain, Department of Head and Skin, Ghent University, 9000 Ghent, Belgium
- Kathleen Marchal
- IDLab, Department of Information Technology, Ghent University-IMEC, 9052 Ghent, Belgium
- Department of Plant Biotechnology and Bioinformatics, Ghent University, 9052 Ghent, Belgium
- Cancer Research Institute Ghent (CRIG), Ghent University, 9000 Ghent, Belgium
9
Jiang H, Miao X, Thairu MW, Beebe M, Grupe DW, Davidson RJ, Handelsman J, Sankaran K. Multimedia: multimodal mediation analysis of microbiome data. Microbiol Spectr 2025; 13:e0113124. [PMID: 39688588 PMCID: PMC11792470 DOI: 10.1128/spectrum.01131-24] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/04/2024] [Accepted: 10/30/2024] [Indexed: 12/18/2024] Open
Abstract
Mediation analysis has emerged as a versatile tool for answering mechanistic questions in microbiome research because it provides a statistical framework for attributing treatment effects to alternative causal pathways. Using a series of linked regressions, this analysis quantifies how complementary data relate to one another and respond to treatments. Despite these advances, existing software's rigid assumptions often result in users viewing mediation analysis as a black box. We designed the multimedia R package to make advanced mediation analysis techniques accessible, ensuring that statistical components are interpretable and adaptable. The package provides a uniform interface to direct and indirect effect estimation, synthetic null hypothesis testing, bootstrap confidence interval construction, and sensitivity analysis, enabling experimentation with various mediator and outcome models while maintaining a simple overall workflow. The software includes modules for regularized linear, compositional, random forest, hierarchical, and hurdle modeling, making it well-suited to microbiome data. We illustrate the package through two case studies. The first re-analyzes a study of the microbiome and metabolome of Inflammatory Bowel Disease patients, uncovering potential mechanistic interactions between the microbiome and disease-associated metabolites, not found in the original study. The second analyzes new data about the influence of mindfulness practice on the microbiome. The mediation analysis highlights shifts in taxa previously associated with depression that cannot be explained indirectly by diet or sleep behaviors alone. A gallery of examples and further documentation can be found at https://go.wisc.edu/830110. IMPORTANCE Microbiome studies routinely gather complementary data to capture different aspects of a microbiome's response to a change, such as the introduction of a therapeutic. 
Mediation analysis clarifies the extent to which responses occur sequentially via mediators, thereby supporting causal, rather than purely descriptive, interpretation. Multimedia is a modular R package with close ties to the wider microbiome software ecosystem that makes statistically rigorous, flexible mediation analysis easily accessible, setting the stage for precise and causally informed microbiome engineering.
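The "series of linked regressions" behind mediation analysis can be shown in its simplest form: one regression of the mediator on the treatment, and one regression of the outcome on both treatment and mediator. Under linear models with no treatment-mediator interaction (an assumption of this sketch), the indirect effect is the product of the two path coefficients and the total effect is the sum of direct and indirect effects. This is a hand-rolled toy with noise-free synthetic data, not the multimedia package interface; all names and values are hypothetical.

```python
def centered_cross(xs, ys):
    """Sum of products of deviations from the mean (unnormalized covariance)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys))

def simple_slope(x, y):
    """OLS slope of y on a single predictor x (intercept included)."""
    return centered_cross(x, y) / centered_cross(x, x)

def two_predictor_slopes(x1, x2, y):
    """OLS slopes of y on x1 and x2 (intercept included), via the 2x2 normal equations."""
    s11, s22, s12 = centered_cross(x1, x1), centered_cross(x2, x2), centered_cross(x1, x2)
    s1y, s2y = centered_cross(x1, y), centered_cross(x2, y)
    det = s11 * s22 - s12 * s12
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det

# Synthetic example: treatment shifts the mediator by 2; the outcome depends on both.
treatment = [0, 0, 0, 0, 1, 1, 1, 1]
mediator = [0.1, -0.1, 0.2, -0.2, 2.1, 1.9, 2.2, 1.8]
outcome = [1 + 0.5 * t + 3 * m for t, m in zip(treatment, mediator)]

a = simple_slope(treatment, mediator)                          # treatment -> mediator
direct, b = two_predictor_slopes(treatment, mediator, outcome)  # direct path, mediator -> outcome
indirect = a * b
total = simple_slope(treatment, outcome)

print(round(direct, 3), round(indirect, 3), round(total, 3))  # 0.5 6.0 6.5
```

In real microbiome data the mediator and outcome models are rarely this simple, which is why the abstract stresses swapping in regularized, compositional, or hurdle models while keeping the same overall decomposition.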
Affiliation(s)
- Hanying Jiang
- Statistics Department, University of Wisconsin—Madison, Madison, Wisconsin, USA
- Xinran Miao
- Statistics Department, University of Wisconsin—Madison, Madison, Wisconsin, USA
- Margaret W. Thairu
- Wisconsin Institute for Discovery, University of Wisconsin—Madison, Madison, Wisconsin, USA
- Mara Beebe
- Wisconsin Institute for Discovery, University of Wisconsin—Madison, Madison, Wisconsin, USA
- Dan W. Grupe
- Center for Healthy Minds, University of Wisconsin—Madison, Madison, Wisconsin, USA
- Richard J. Davidson
- Center for Healthy Minds, University of Wisconsin—Madison, Madison, Wisconsin, USA
- Psychology Department, University of Wisconsin—Madison, Madison, Wisconsin, USA
- Psychiatry Department, University of Wisconsin—Madison, Madison, Wisconsin, USA
- Jo Handelsman
- Wisconsin Institute for Discovery, University of Wisconsin—Madison, Madison, Wisconsin, USA
- Plant Pathology Department, University of Wisconsin—Madison, Madison, Wisconsin, USA
- Kris Sankaran
- Statistics Department, University of Wisconsin—Madison, Madison, Wisconsin, USA
- Wisconsin Institute for Discovery, University of Wisconsin—Madison, Madison, Wisconsin, USA
10
Tian R, Yu Z, Xue Z, Wu J, Wu L, Cai S, Gao B, He B, Zhao Y, Yao J, Lu L, Liu W. Evaluation of T Cell Receptor Construction Methods from scRNA-Seq Data. GENOMICS, PROTEOMICS & BIOINFORMATICS 2025; 22:qzae086. [PMID: 39666949 PMCID: PMC11846667 DOI: 10.1093/gpbjnl/qzae086] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/08/2023] [Revised: 11/26/2024] [Accepted: 12/09/2024] [Indexed: 12/14/2024]
Abstract
T cell receptors (TCRs) serve key roles in the adaptive immune system by enabling recognition and response to pathogens and irregular cells. Various methods have been developed for TCR construction from single-cell RNA sequencing (scRNA-seq) datasets, each with its unique characteristics. Yet, a comprehensive evaluation of their relative performance under different conditions remains elusive. In this study, we conducted a benchmark analysis utilizing experimental single-cell immune profiling datasets. Additionally, we introduced a novel simulator, YASIM-scTCR (Yet Another SIMulator for single-cell TCR), capable of generating scTCR-seq reads containing diverse TCR-derived sequences with different sequencing depths and read lengths. Our results consistently showed that TRUST4 and MiXCR outperformed others across multiple datasets, while DeRR demonstrated considerable accuracy. We also discovered that the sequencing depth inherently imposes a critical constraint on successful TCR construction from scRNA-seq data. In summary, we present a benchmark study to aid researchers in choosing the appropriate method for reconstructing TCRs from scRNA-seq data.
Affiliation(s)
- Ruonan Tian
- Department of Rheumatology and Immunology of the Second Affiliated Hospital, and Centre of Biomedical Systems and Informatics of Zhejiang University-University of Edinburgh Institute, Zhejiang University School of Medicine, Hangzhou 310003, China
- Future Health Laboratory, Innovation Center of Yangtze River Delta, Zhejiang University, Jiaxing 314100, China
- Zhejian Yu
- Department of Rheumatology and Immunology of the Second Affiliated Hospital, and Centre of Biomedical Systems and Informatics of Zhejiang University-University of Edinburgh Institute, Zhejiang University School of Medicine, Hangzhou 310003, China
- Ziwei Xue
- Department of Rheumatology and Immunology of the Second Affiliated Hospital, and Centre of Biomedical Systems and Informatics of Zhejiang University-University of Edinburgh Institute, Zhejiang University School of Medicine, Hangzhou 310003, China
- Future Health Laboratory, Innovation Center of Yangtze River Delta, Zhejiang University, Jiaxing 314100, China
| | - Jiaxin Wu
- Department of Rheumatology and Immunology of the Second Affiliated Hospital, and Centre of Biomedical Systems and Informatics of Zhejiang University-University of Edinburgh Institute, Zhejiang University School of Medicine, Hangzhou 310003, China
| | - Lize Wu
- Future Health Laboratory, Innovation Center of Yangtze River Delta, Zhejiang University, Jiaxing 314100, China
- Institute of Immunology and Department of Dermatology and Rheumatology at Sir Run Run Shaw Hospital, Zhejiang University School of Medicine, Hangzhou 310058, China
| | - Shuo Cai
- Department of Rheumatology and Immunology of the Second Affiliated Hospital, and Centre of Biomedical Systems and Informatics of Zhejiang University-University of Edinburgh Institute, Zhejiang University School of Medicine, Hangzhou 310003, China
| | - Bing Gao
- Department of Rheumatology and Immunology of the Second Affiliated Hospital, and Centre of Biomedical Systems and Informatics of Zhejiang University-University of Edinburgh Institute, Zhejiang University School of Medicine, Hangzhou 310003, China
| | - Bing He
- AI Lab, Tencent, Shenzhen 518000, China
| | - Yu Zhao
- AI Lab, Tencent, Shenzhen 518000, China
| | | | - Linrong Lu
- Future Health Laboratory, Innovation Center of Yangtze River Delta, Zhejiang University, Jiaxing 314100, China
- Institute of Immunology and Department of Dermatology and Rheumatology at Sir Run Run Shaw Hospital, Zhejiang University School of Medicine, Hangzhou 310058, China
- Shanghai Immune Therapy Institute, Shanghai Jiao Tong University School of Medicine Affiliated Renji Hospital, Shanghai 200025, China
| | - Wanlu Liu
- Department of Rheumatology and Immunology of the Second Affiliated Hospital, and Centre of Biomedical Systems and Informatics of Zhejiang University-University of Edinburgh Institute, Zhejiang University School of Medicine, Hangzhou 310003, China
- Future Health Laboratory, Innovation Center of Yangtze River Delta, Zhejiang University, Jiaxing 314100, China
|
11
|
Song X, Chavez-Fuentes JC, Ma W, Fu W, Wang P, Yuan GC. sCCIgen: A high-fidelity spatially resolved transcriptomics data simulator for cell-cell interaction studies. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2025:2025.01.07.631830. [PMID: 39829773 PMCID: PMC11741276 DOI: 10.1101/2025.01.07.631830] [Indexed: 01/22/2025]
Abstract
Spatially resolved transcriptomics (SRT) provides an invaluable avenue for examining cell-cell interactions within native tissue environments. The development and evaluation of analytical tools for SRT data require synthetic datasets with known ground truth for cell-cell interaction-induced features. To address this gap, we introduce sCCIgen, a novel real-data-based simulator tailored to generate high-fidelity SRT data with a focus on cell-cell interactions. sCCIgen preserves transcriptomic and spatial characteristics in SRT data while comprehensively modeling various cell-cell interaction features, including cell colocalization, spatial dependence among gene expressions, and gene-gene interactions between nearby cells. We implemented sCCIgen as an interactive, easy-to-use, realistic, reproducible, and well-documented tool for studying cellular interactions and spatial biology.
|
12
|
Sun F, Li H, Sun D, Fu S, Gu L, Shao X, Wang Q, Dong X, Duan B, Xing F, Wu J, Xiao M, Zhao F, Han JDJ, Liu Q, Fan X, Li C, Wang C, Shi T. Single-cell omics: experimental workflow, data analyses and applications. SCIENCE CHINA. LIFE SCIENCES 2025; 68:5-102. [PMID: 39060615 DOI: 10.1007/s11427-023-2561-0] [Received: 12/07/2023] [Accepted: 04/18/2024] [Indexed: 07/28/2024]
Abstract
Cells are the fundamental units of biological systems and exhibit unique development trajectories and molecular features. Our exploration of how the genomes orchestrate the formation and maintenance of each cell, and control the cellular phenotypes of various organisms, is both captivating and intricate. Since the inception of the first single-cell RNA technology, technologies related to single-cell sequencing have advanced rapidly. These technologies have expanded horizontally to include single-cell genome, epigenome, proteome, and metabolome, while vertically, they have progressed to integrate multiple omics data and incorporate additional information such as spatial scRNA-seq and CRISPR screening. Single-cell omics represent a groundbreaking advancement in the biomedical field, offering profound insights into the understanding of complex diseases, including cancers. Here, we comprehensively summarize recent advances in single-cell omics technologies, with a specific focus on the methodology section. This overview aims to guide researchers in selecting appropriate methods for single-cell sequencing and related data analysis.
Affiliation(s)
- Fengying Sun
- Department of Clinical Laboratory, the Affiliated Wuhu Hospital of East China Normal University (The Second People's Hospital of Wuhu City), Wuhu, 241000, China
| | - Haoyan Li
- Pharmaceutical Informatics Institute, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058, China
| | - Dongqing Sun
- Key Laboratory of Spine and Spinal Cord Injury Repair and Regeneration (Tongji University), Ministry of Education, Orthopaedic Department, Tongji Hospital, Bioinformatics Department, School of Life Sciences and Technology, Tongji University, Shanghai, 200082, China
- Frontier Science Center for Stem Cells, School of Life Sciences and Technology, Tongji University, Shanghai, 200092, China
| | - Shaliu Fu
- Key Laboratory of Spine and Spinal Cord Injury Repair and Regeneration (Tongji University), Ministry of Education, Orthopaedic Department, Tongji Hospital, Bioinformatics Department, School of Life Sciences and Technology, Tongji University, Shanghai, 200082, China
- Translational Medical Center for Stem Cell Therapy and Institute for Regenerative Medicine, Shanghai East Hospital, Bioinformatics Department, School of Life Sciences and Technology, Tongji University, Shanghai, 200082, China
- Research Institute of Intelligent Computing, Zhejiang Lab, Hangzhou, 311121, China
- Shanghai Research Institute for Intelligent Autonomous Systems, Shanghai, 201210, China
| | - Lei Gu
- Center for Single-cell Omics, School of Public Health, Shanghai Jiao Tong University School of Medicine, Shanghai, 200025, China
| | - Xin Shao
- Pharmaceutical Informatics Institute, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058, China
- National Key Laboratory of Chinese Medicine Modernization, Innovation Center of Yangtze River Delta, Zhejiang University, Jiaxing, 314103, China
| | - Qinqin Wang
- Center for Single-cell Omics, School of Public Health, Shanghai Jiao Tong University School of Medicine, Shanghai, 200025, China
| | - Xin Dong
- Key Laboratory of Spine and Spinal Cord Injury Repair and Regeneration (Tongji University), Ministry of Education, Orthopaedic Department, Tongji Hospital, Bioinformatics Department, School of Life Sciences and Technology, Tongji University, Shanghai, 200082, China
- Frontier Science Center for Stem Cells, School of Life Sciences and Technology, Tongji University, Shanghai, 200092, China
| | - Bin Duan
- Key Laboratory of Spine and Spinal Cord Injury Repair and Regeneration (Tongji University), Ministry of Education, Orthopaedic Department, Tongji Hospital, Bioinformatics Department, School of Life Sciences and Technology, Tongji University, Shanghai, 200082, China
- Translational Medical Center for Stem Cell Therapy and Institute for Regenerative Medicine, Shanghai East Hospital, Bioinformatics Department, School of Life Sciences and Technology, Tongji University, Shanghai, 200082, China
- Research Institute of Intelligent Computing, Zhejiang Lab, Hangzhou, 311121, China
- Shanghai Research Institute for Intelligent Autonomous Systems, Shanghai, 201210, China
| | - Feiyang Xing
- Key Laboratory of Spine and Spinal Cord Injury Repair and Regeneration (Tongji University), Ministry of Education, Orthopaedic Department, Tongji Hospital, Bioinformatics Department, School of Life Sciences and Technology, Tongji University, Shanghai, 200082, China
- Frontier Science Center for Stem Cells, School of Life Sciences and Technology, Tongji University, Shanghai, 200092, China
| | - Jun Wu
- Center for Bioinformatics and Computational Biology, Shanghai Key Laboratory of Regulatory Biology, the Institute of Biomedical Sciences and School of Life Sciences, East China Normal University, Shanghai, 200241, China
| | - Minmin Xiao
- Department of Clinical Laboratory, the Affiliated Wuhu Hospital of East China Normal University (The Second People's Hospital of Wuhu City), Wuhu, 241000, China.
| | - Fangqing Zhao
- Beijing Institutes of Life Science, Chinese Academy of Sciences, Beijing, 100101, China.
| | - Jing-Dong J Han
- Peking-Tsinghua Center for Life Sciences, Academy for Advanced Interdisciplinary Studies, Center for Quantitative Biology (CQB), Peking University, Beijing, 100871, China.
| | - Qi Liu
- Key Laboratory of Spine and Spinal Cord Injury Repair and Regeneration (Tongji University), Ministry of Education, Orthopaedic Department, Tongji Hospital, Bioinformatics Department, School of Life Sciences and Technology, Tongji University, Shanghai, 200082, China.
- Translational Medical Center for Stem Cell Therapy and Institute for Regenerative Medicine, Shanghai East Hospital, Bioinformatics Department, School of Life Sciences and Technology, Tongji University, Shanghai, 200082, China.
- Research Institute of Intelligent Computing, Zhejiang Lab, Hangzhou, 311121, China.
- Shanghai Research Institute for Intelligent Autonomous Systems, Shanghai, 201210, China.
| | - Xiaohui Fan
- Pharmaceutical Informatics Institute, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058, China.
- National Key Laboratory of Chinese Medicine Modernization, Innovation Center of Yangtze River Delta, Zhejiang University, Jiaxing, 314103, China.
- Zhejiang Key Laboratory of Precision Diagnosis and Therapy for Major Gynecological Diseases, Women's Hospital, Zhejiang University School of Medicine, Hangzhou, 310006, China.
| | - Chen Li
- Center for Single-cell Omics, School of Public Health, Shanghai Jiao Tong University School of Medicine, Shanghai, 200025, China.
| | - Chenfei Wang
- Key Laboratory of Spine and Spinal Cord Injury Repair and Regeneration (Tongji University), Ministry of Education, Orthopaedic Department, Tongji Hospital, Bioinformatics Department, School of Life Sciences and Technology, Tongji University, Shanghai, 200082, China.
- Frontier Science Center for Stem Cells, School of Life Sciences and Technology, Tongji University, Shanghai, 200092, China.
| | - Tieliu Shi
- Department of Clinical Laboratory, the Affiliated Wuhu Hospital of East China Normal University (The Second People's Hospital of Wuhu City), Wuhu, 241000, China.
- Center for Bioinformatics and Computational Biology, Shanghai Key Laboratory of Regulatory Biology, the Institute of Biomedical Sciences and School of Life Sciences, East China Normal University, Shanghai, 200241, China.
- Key Laboratory of Advanced Theory and Application in Statistics and Data Science-MOE, School of Statistics, East China Normal University, Shanghai, 200062, China.
|
13
|
Song D, Chen S, Lee C, Li K, Ge X, Li JJ. Synthetic control removes spurious discoveries from double dipping in single-cell and spatial transcriptomics data analyses. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2023.07.21.550107. [PMID: 37546812 PMCID: PMC10401959 DOI: 10.1101/2023.07.21.550107] [Indexed: 08/08/2023]
Abstract
Double dipping is a well-known pitfall in single-cell and spatial transcriptomics data analysis: after a clustering algorithm finds clusters as putative cell types or spatial domains, statistical tests are applied to the same data to identify differentially expressed (DE) genes as potential cell-type or spatial-domain markers. Because the genes that contribute to clustering are inherently likely to be identified as DE genes, double dipping can result in false-positive cell-type or spatial-domain markers, especially when clusters are spurious, leading to ambiguously defined cell types or spatial domains. To address this challenge, we propose ClusterDE, a statistical method designed to identify post-clustering DE genes as reliable markers of cell types and spatial domains, while controlling the false discovery rate (FDR) regardless of clustering quality. The core of ClusterDE involves generating synthetic null data as an in silico negative control that contains only one cell type or spatial domain, allowing for the detection and removal of spurious discoveries caused by double dipping. We demonstrate that ClusterDE controls the FDR and identifies canonical cell-type and spatial-domain markers as top DE genes, distinguishing them from housekeeping genes. ClusterDE's ability to discover reliable markers, or the absence of such markers, can be used to determine whether two ambiguous clusters should be merged. Additionally, ClusterDE is compatible with state-of-the-art analysis pipelines like Seurat and Scanpy.
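The negative-control idea in this abstract can be illustrated with a contrast-score threshold. The following is a minimal sketch and not ClusterDE's implementation: it assumes each gene already has a DE score computed once on the real data and once on the synthetic null data, and it applies a Clipper-style rule in which large negative contrasts stand in for false positives. The function name `select_by_contrast` and the toy scores are hypothetical.

```python
def select_by_contrast(target_scores, null_scores, q):
    """Return indices of genes kept at nominal FDR level q.

    Contrast score = real-data DE score minus synthetic-null DE score.
    Genes with large negative contrasts serve as a proxy for the false
    positives hidden among genes with equally large positive contrasts.
    """
    contrasts = [t - n for t, n in zip(target_scores, null_scores)]
    # Candidate thresholds: the distinct non-zero contrast magnitudes.
    thresholds = sorted({abs(c) for c in contrasts if c != 0})
    for t in thresholds:
        est_false = 1 + sum(1 for c in contrasts if c <= -t)
        discoveries = sum(1 for c in contrasts if c >= t)
        if discoveries > 0 and est_false / discoveries <= q:
            return [j for j, c in enumerate(contrasts) if c >= t]
    return []


# Toy scores (hypothetical): 30 genes with a strong real-data signal,
# followed by 70 genes whose contrasts are symmetric around zero.
target = [10.0] * 30 + [0.5 if j % 2 == 0 else 0.0 for j in range(70)]
null = [0.0] * 30 + [0.0 if j % 2 == 0 else 0.5 for j in range(70)]
selected = select_by_contrast(target, null, q=0.1)
```

In this toy setting only the 30 strong genes survive, because the symmetric contrasts of the remaining genes cancel each other out in the estimated false discovery proportion.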
Affiliation(s)
- Dongyuan Song
- Department of Genetics and Genome Sciences, University of Connecticut Health Center, Farmington, CT 06030
- Interdepartmental Program of Bioinformatics, University of California, Los Angeles, CA 90095-7246
| | - Siqi Chen
- Department of Statistics and Data Science, University of California, Los Angeles, CA 90095-1554
| | - Christy Lee
- Department of Statistics and Data Science, University of California, Los Angeles, CA 90095-1554
| | - Kexin Li
- Department of Statistics and Data Science, University of California, Los Angeles, CA 90095-1554
| | - Xinzhou Ge
- Department of Statistics and Data Science, University of California, Los Angeles, CA 90095-1554
- Department of Statistics, Oregon State University, Corvallis, OR 97331-4606
| | - Jingyi Jessica Li
- Interdepartmental Program of Bioinformatics, University of California, Los Angeles, CA 90095-7246
- Department of Statistics and Data Science, University of California, Los Angeles, CA 90095-1554
- Department of Human Genetics, University of California, Los Angeles, CA 90095-7088
- Department of Computational Medicine, University of California, Los Angeles, CA 90095-1766
- Department of Biostatistics, University of California, Los Angeles, CA 90095-1772
|
14
|
Sankaran K, Kodikara S, Li JJ, Cao KAL. Semisynthetic simulation for microbiome data analysis. Brief Bioinform 2024; 26:bbaf051. [PMID: 39927858 PMCID: PMC11808806 DOI: 10.1093/bib/bbaf051] [Received: 10/15/2024] [Revised: 12/19/2024] [Accepted: 01/23/2025] [Indexed: 02/11/2025] Open
Abstract
High-throughput sequencing data lie at the heart of modern microbiome research. Effective analysis of these data requires careful preprocessing, modeling, and interpretation to detect subtle signals and avoid spurious associations. In this review, we discuss how simulation can serve as a sandbox to test candidate approaches, creating a setting that mimics real data while providing ground truth. This is particularly valuable for power analysis, methods benchmarking, and reliability analysis. We explain the probability, multivariate analysis, and regression concepts behind modern simulators and how different implementations make trade-offs between generality, faithfulness, and controllability. Recognizing that all simulators only approximate reality, we review methods to evaluate how accurately they reflect key properties. We also present case studies demonstrating the value of simulation in differential abundance testing, dimensionality reduction, network analysis, and data integration. Code for these examples is available in an online tutorial (https://go.wisc.edu/8994yz) that can be easily adapted to new problem settings.
Affiliation(s)
- Kris Sankaran
- Department of Statistics, University of Wisconsin-Madison, 1300 University Ave, Madison,WI 53703, United States
| | - Saritha Kodikara
- Melbourne Integrative Genomics, School of Mathematics and Statistics, University of Melbourne, Building 184/30 Royal Parade, Melbourne, VIC 3052, Australia
| | - Jingyi Jessica Li
- Department of Statistics and Data Science, University of California, Los Angeles, 520 Portola Plaza, Los Angeles, CA 90095, United States
- Department of Human Genetics, University of California, Los Angeles, 695 Charles E Young Dr S, Los Angeles, CA 90095, United States
- Department of Biostatistics, University of California, Los Angeles, 650 Charles E. Young Dr S, Los Angeles, CA 90095, United States
| | - Kim-Anh Lê Cao
- Melbourne Integrative Genomics, School of Mathematics and Statistics, University of Melbourne, Building 184/30 Royal Parade, Melbourne, VIC 3052, Australia
|
15
|
Shan X, Zhao H. Inferring Cell-Type-Specific Co-Expressed Genes from Single Cell Data. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.11.08.622700. [PMID: 39605403 PMCID: PMC11601408 DOI: 10.1101/2024.11.08.622700] [Indexed: 11/29/2024]
Abstract
Background Cell-type-specific gene co-expression networks are widely used to characterize gene relationships. Although many methods have been developed to infer such co-expression networks from single-cell data, the lack of consideration of false positive control in many evaluations may lead to incorrect conclusions because higher reproducibility, higher functional coherence, and a larger overlap with known biological networks may not imply better performance if the false positives are not well controlled. Results In this study, we have developed an efficient and effective simulation tool to derive empirical p-values in co-expression inference to appropriately control false positives in assessing method performance. We studied the power of the p-value-based approach in inferring cell-type-specific co-expressions from single-cell data using both simulated and real data. We also highlight the need to adjust for random overlaps between the inferred and known networks when the number of selected correlated gene pairs varies substantially across different methods. We further illustrate the expression level bias in known biological networks and the impact of such bias in method assessment. Conclusion Our study indicates the importance of controlling false positives in the inference of co-expressed genes to achieve more reliable results and proposes a simulation-based p-value method to achieve this.
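A simulation-based empirical p-value of the kind described here can be sketched as follows. This is an illustration under stated assumptions, not the authors' tool: permuting one gene's values is used as a simple stand-in for a simulator that generates independent genes, and the helper names (`pearson`, `null_correlations`, `empirical_p_value`) are hypothetical.

```python
import random


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5 if sxx > 0 and syy > 0 else 0.0


def null_correlations(x, y, n_sim, rng):
    """Correlations under a no-association null, obtained by shuffling y.

    Assumption: shuffling replaces the paper's simulator of null data.
    """
    shuffled = list(y)
    stats = []
    for _ in range(n_sim):
        rng.shuffle(shuffled)
        stats.append(pearson(x, shuffled))
    return stats


def empirical_p_value(observed, null_stats):
    """Two-sided empirical p-value with the +1 correction (never exactly 0)."""
    exceed = sum(1 for s in null_stats if abs(s) >= abs(observed))
    return (1 + exceed) / (1 + len(null_stats))


# Toy example: two perfectly co-expressed "genes" across 20 cells.
rng = random.Random(0)
gene_a = list(range(20))
gene_b = [2 * v + 1 for v in gene_a]
p = empirical_p_value(pearson(gene_a, gene_b),
                      null_correlations(gene_a, gene_b, 199, rng))
```

With 199 null draws the smallest attainable p-value is 1/200, which is why the number of simulations bounds the resolution of any downstream false-positive control.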
|
16
|
Jiang H, Miao X, Thairu MW, Beebe M, Grupe DW, Davidson RJ, Handelsman J, Sankaran K. multimedia: Multimodal Mediation Analysis of Microbiome Data. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.03.27.587024. [PMID: 38585817 PMCID: PMC10996591 DOI: 10.1101/2024.03.27.587024] [Indexed: 04/09/2024]
Abstract
Mediation analysis has emerged as a versatile tool for answering mechanistic questions in microbiome research because it provides a statistical framework for attributing treatment effects to alternative causal pathways. Using a series of linked regressions, this analysis quantifies how complementary data relate to one another and respond to treatments. Despite these advances, existing software's rigid assumptions often result in users viewing mediation analysis as a black box. We designed the multimedia R package to make advanced mediation analysis techniques accessible, ensuring that statistical components are interpretable and adaptable. The package provides a uniform interface to direct and indirect effect estimation, synthetic null hypothesis testing, bootstrap confidence interval construction, and sensitivity analysis, enabling experimentation with various mediator and outcome models while maintaining a simple overall workflow. The software includes modules for regularized linear, compositional, random forest, hierarchical, and hurdle modeling, making it well-suited to microbiome data. We illustrate the package through two case studies. The first re-analyzes a study of the microbiome and metabolome of Inflammatory Bowel Disease patients, uncovering potential mechanistic interactions between the microbiome and disease-associated metabolites, not found in the original study. The second analyzes new data about the influence of mindfulness practice on the microbiome. The mediation analysis highlights shifts in taxa previously associated with depression that cannot be explained indirectly by diet or sleep behaviors alone. A gallery of examples and further documentation can be found at https://go.wisc.edu/830110.
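The "series of linked regressions" can be made concrete with the classical product-of-coefficients decomposition for a single mediator. The sketch below is not the multimedia R package's API; it assumes plain linear models, one treatment, one mediator, and one outcome, and all function names and the toy data are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)


def slope(x, y):
    """Least-squares slope of y on x (with intercept)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def two_predictor_slopes(x1, x2, y):
    """Least-squares slopes of y on x1 and x2 (with intercept)."""
    m1, m2, my = mean(x1), mean(x2), mean(y)
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    det = s11 * s22 - s12 ** 2
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det


def mediation_effects(treatment, mediator, outcome):
    """Direct, indirect, and total effects under linear models."""
    a = slope(treatment, mediator)                      # mediator model
    direct, b = two_predictor_slopes(treatment, mediator, outcome)  # outcome model
    return {"indirect": a * b, "direct": direct,
            "total": slope(treatment, outcome)}


# Toy data (hypothetical): outcome = 0.5 * treatment + 3 * mediator.
treatment = [0, 0, 1, 1]
mediator = [-1, 1, 1, 3]
outcome = [0.5 * t + 3 * m for t, m in zip(treatment, mediator)]
effects = mediation_effects(treatment, mediator, outcome)
```

In the linear case the total effect equals the direct effect plus the indirect effect; more flexible mediator and outcome models, such as those listed in the abstract, generally need counterfactual simulation instead of a simple product.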
Affiliation(s)
| | - Xinran Miao
- Statistics Department, UW-Madison, Madison, WI, USA
| | | | - Mara Beebe
- Wisconsin Institute for Discovery, UW-Madison, Madison, WI, USA
| | - Dan W. Grupe
- Center for Healthy Minds, UW-Madison, Madison, WI, USA
| | - Richard J. Davidson
- Center for Healthy Minds, UW-Madison, Madison, WI, USA
- Psychology Department, UW-Madison, Madison, WI, USA
- Psychiatry Department, UW-Madison, Madison, WI, USA
| | - Jo Handelsman
- Wisconsin Institute for Discovery, UW-Madison, Madison, WI, USA
- Plant Pathology Department, UW-Madison, Madison, WI, USA
| | - Kris Sankaran
- Statistics Department, UW-Madison, Madison, WI, USA
- Wisconsin Institute for Discovery, UW-Madison, Madison, WI, USA
|
17
|
Subedi S, Sumida TS, Park YP. A scalable approach to topic modelling in single-cell data by approximate pseudobulk projection. Life Sci Alliance 2024; 7:e202402713. [PMID: 39107066 PMCID: PMC11303850 DOI: 10.26508/lsa.202402713] [Received: 03/12/2024] [Revised: 07/29/2024] [Accepted: 07/30/2024] [Indexed: 08/09/2024] Open
Abstract
Probabilistic topic modelling has become essential in many types of single-cell data analysis. Based on probabilistic topic assignments in each cell, we identify the latent representation of cellular states. A dictionary matrix, consisting of topic-specific gene frequency vectors, provides interpretable bases to be compared with known cell type-specific marker genes and other pathway annotations. However, fitting a topic model on a large number of cells would require heavy computational resources: specialized computing units, computing time, and memory. Here, we present a scalable approximation method customized for single-cell RNA-seq data analysis, termed ASAP, short for Annotating a Single-cell data matrix by Approximate Pseudobulk estimation. Our approach is more accurate than existing methods but requires orders of magnitude less computing time and consumes much less memory. We also show that our approach is widely applicable for atlas-scale data analysis; our method seamlessly integrates single-cell and bulk data in joint analysis, without requiring additional preprocessing or feature selection steps.
Affiliation(s)
- Sishir Subedi
- Graduate Program, University of British Columbia, Vancouver, Canada
- BC Cancer Research, Vancouver, Canada
| | - Tomokazu S Sumida
- Neurology, Program for Neuroinflammation, Yale School of Medicine, New Haven, CT, USA
| | - Yongjin P Park
- BC Cancer Research, Vancouver, Canada
- Department of Pathology and Laboratory Medicine, University of British Columbia, Vancouver, Canada
- Department of Statistics, University of British Columbia, Vancouver, Canada
|
18
|
Stomma P, Rudnicki WR. HCS-hierarchical algorithm for simulation of omics datasets. Bioinformatics 2024; 40:ii98-ii104. [PMID: 39230692 PMCID: PMC11373347 DOI: 10.1093/bioinformatics/btae392] [Indexed: 09/05/2024] Open
Abstract
MOTIVATION Analysis of omics data with the help of machine learning (ML) methods is limited by small sample sizes and a large number of variables. One possible approach to deal with such data is to use feature selection algorithms and reduce the dataset to include only those variables that are related to the studied phenomena. Existing simulators of omics data were mostly developed with the goal of generating high-quality data that correspond, with the highest possible fidelity, to the real levels of molecular markers in biological materials. The current study aims to simulate the data at a higher level of generalization. Such datasets can then be used to test feature selection and ML algorithms on systems that have structures mimicking those of real data, but where the ground truth may be implanted by design. They can also be used to generate contrast variables with the desired correlation structure for feature selection. RESULTS We propose an algorithm for the reconstruction of an omics dataset that, with high fidelity, preserves the correlation structure of the original data with a reduced number of parameters. It is based on the hierarchical clustering of variables and uses principal components of the clusters. It reproduces topological descriptors of the correlation structure well. The correlation structure of the principal components of the clusters is then used to obtain datasets with correlation structures similar to the original data but not correlated with the original variables. AVAILABILITY AND IMPLEMENTATION The code and data are available at: https://github.com/p100mma/hcrs_omics.
Affiliation(s)
- Piotr Stomma
- Faculty of Computer Science, University of Białystok, Białystok 15-245, Poland
- Computational Centre, University of Białystok, Białystok 15-245, Poland
| | - Witold R Rudnicki
- Faculty of Computer Science, University of Białystok, Białystok 15-245, Poland
- Computational Centre, University of Białystok, Białystok 15-245, Poland
|
19
|
Zhang J, Larschan E, Bigness J, Singh R. scNODE : generative model for temporal single cell transcriptomic data prediction. Bioinformatics 2024; 40:ii146-ii154. [PMID: 39230694 PMCID: PMC11373355 DOI: 10.1093/bioinformatics/btae393] [Indexed: 09/05/2024] Open
Abstract
SUMMARY Measurement of single-cell gene expression at different timepoints enables the study of cell development. However, due to the resource constraints and technical challenges associated with the single-cell experiments, researchers can only profile gene expression at discrete and sparsely sampled timepoints. This missing timepoint information impedes downstream cell developmental analyses. We propose scNODE, an end-to-end deep learning model that can predict in silico single-cell gene expression at unobserved timepoints. scNODE integrates a variational autoencoder with neural ordinary differential equations to predict gene expression using a continuous and nonlinear latent space. Importantly, we incorporate a dynamic regularization term to learn a latent space that is robust against distribution shifts when predicting single-cell gene expression at unobserved timepoints. Our evaluations on three real-world scRNA-seq datasets show that scNODE achieves higher predictive performance than state-of-the-art methods. We further demonstrate that scNODE's predictions help cell trajectory inference under the missing timepoint paradigm and the learned latent space is useful for in silico perturbation analysis of relevant genes along a developmental cell path. AVAILABILITY AND IMPLEMENTATION The data and code are publicly available at https://github.com/rsinghlab/scNODE.
Affiliation(s)
- Jiaqi Zhang
- Department of Computer Science, Brown University, Providence, RI 02906, United States
| | - Erica Larschan
- Center for Computational Molecular Biology, Brown University, Providence, RI 02912, United States
- Department of Molecular Biology, Cell Biology and Biochemistry, Brown University, Providence, RI 02912, United States
| | - Jeremy Bigness
- Center for Computational Molecular Biology, Brown University, Providence, RI 02912, United States
| | - Ritambhara Singh
- Department of Computer Science, Brown University, Providence, RI 02906, United States
- Center for Computational Molecular Biology, Brown University, Providence, RI 02912, United States
|
20
|
Chen Z, Wang C, Huang S, Shi Y, Xi R. Directly selecting cell-type marker genes for single-cell clustering analyses. CELL REPORTS METHODS 2024; 4:100810. [PMID: 38981475 PMCID: PMC11294843 DOI: 10.1016/j.crmeth.2024.100810] [Received: 11/14/2023] [Revised: 03/16/2024] [Accepted: 06/12/2024] [Indexed: 07/11/2024]
Abstract
In single-cell RNA sequencing (scRNA-seq) studies, cell types and their marker genes are often identified by clustering and differentially expressed gene (DEG) analysis. A common practice is to select genes using surrogate criteria such as variance and deviance, then cluster cells using the selected genes and detect markers by DEG analysis assuming known cell types. The surrogate criteria can miss important genes or select unimportant genes, while DEG analysis has the selection-bias problem. We present Festem, a statistical method for the direct selection of cell-type markers for downstream clustering. Festem distinguishes marker genes with heterogeneous distribution across cells that are cluster informative. Simulation and scRNA-seq applications demonstrate that Festem can sensitively select markers with high precision and enables the identification of cell types often missed by other methods. In a large intrahepatic cholangiocarcinoma dataset, we identify diverse CD8+ T cell types and potential prognostic marker genes.
Affiliation(s)
- Zihao Chen
- School of Mathematical Sciences and Center for Statistical Science, Peking University, Beijing 100871, China
| | - Changhu Wang
- School of Mathematical Sciences and Center for Statistical Science, Peking University, Beijing 100871, China
| | - Siyuan Huang
- Academy for Advanced Interdisciplinary Studies, Peking University, Beijing 100871, China
| | - Yang Shi
- BeiGene (Beijing) Co., Ltd., Beijing 100871, China
| | - Ruibin Xi
- School of Mathematical Sciences and Center for Statistical Science, Peking University, Beijing 100871, China.
|
21
|
Sarkar H, Chitra U, Gold J, Raphael BJ. A count-based model for delineating cell-cell interactions in spatial transcriptomics data. Bioinformatics 2024; 40:i481-i489. [PMID: 38940134 PMCID: PMC11211854 DOI: 10.1093/bioinformatics/btae219] [Indexed: 06/29/2024] Open
Abstract
MOTIVATION: Cell-cell interactions (CCIs) consist of cells exchanging signals with themselves and neighboring cells by expressing ligand and receptor molecules and play a key role in cellular development, tissue homeostasis, and other critical biological functions. Since direct measurement of CCIs is challenging, multiple methods have been developed to infer CCIs by quantifying correlations between the gene expression of the ligands and receptors that mediate CCIs, originally from bulk RNA-sequencing data and more recently from single-cell or spatially resolved transcriptomics (SRT) data. SRT has a particular advantage over single-cell approaches, since ligand-receptor correlations can be computed between cells or spots that are physically close in the tissue. However, the transcript counts of individual ligands and receptors in SRT data are generally low, complicating the inference of CCIs from expression correlations. RESULTS: We introduce Copulacci, a count-based model for inferring CCIs from SRT data. Copulacci uses a Gaussian copula to model dependencies between the expression of ligands and receptors from nearby spatial locations even when the transcript counts are low. On simulated data, Copulacci outperforms existing CCI inference methods based on the standard Spearman and Pearson correlation coefficients. Using several real SRT datasets, we show that Copulacci discovers biologically meaningful ligand-receptor interactions that are lowly expressed and undiscoverable by existing CCI inference methods. AVAILABILITY AND IMPLEMENTATION: Copulacci is implemented in Python and available at https://github.com/raphael-group/copulacci.
Affiliation(s)
- Hirak Sarkar
- Department of Computer Science, Princeton University, Princeton, NJ, 08540, United States
- Ludwig Cancer Institute, Princeton Branch, Princeton University, Princeton, NJ, 08540, United States
| | - Uthsav Chitra
- Department of Computer Science, Princeton University, Princeton, NJ, 08540, United States
| | - Julian Gold
- Department of Computer Science, Princeton University, Princeton, NJ, 08540, United States
- Center for Statistics and Machine Learning, Princeton University, Princeton, NJ, 08540, United States
| | - Benjamin J Raphael
- Department of Computer Science, Princeton University, Princeton, NJ, 08540, United States
| |
|
22
|
Qian J, Bao H, Shao X, Fang Y, Liao J, Chen Z, Li C, Guo W, Hu Y, Li A, Yao Y, Fan X, Cheng Y. Simulating multiple variability in spatially resolved transcriptomics with scCube. Nat Commun 2024; 15:5021. [PMID: 38866768 PMCID: PMC11169532 DOI: 10.1038/s41467-024-49445-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/28/2023] [Accepted: 06/03/2024] [Indexed: 06/14/2024] Open
Abstract
A pressing challenge in spatially resolved transcriptomics (SRT) is to benchmark the computational methods. A widely used approach involves utilizing simulated data. However, biases exist in the currently available simulated SRT data, which seriously affect the accuracy of method evaluation and validation. Herein, we present scCube (https://github.com/ZJUFanLab/scCube), a Python package for independent, reproducible, and technology-diverse simulation of SRT data. scCube not only enables the preservation of spatial expression patterns of genes in reference-based simulations, but also generates simulated data with different spatial variability (covering the spatial pattern type, the resolution, the spot arrangement, the targeted gene type, the tissue slice dimension, etc.) in reference-free simulations. We comprehensively benchmark scCube against existing single-cell or SRT simulators, and demonstrate the utility of scCube in benchmarking spot deconvolution, gene imputation, and resolution enhancement methods in detail through three applications.
Affiliation(s)
- Jingyang Qian
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058, China
- National Key Laboratory of Chinese Medicine Modernization, Innovation Center of Yangtze River Delta, Zhejiang University, 314100, Jiaxing, China
| | - Hudong Bao
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058, China
| | - Xin Shao
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058, China
- National Key Laboratory of Chinese Medicine Modernization, Innovation Center of Yangtze River Delta, Zhejiang University, 314100, Jiaxing, China
| | - Yin Fang
- College of Computer Science and Technology, Zhejiang University, Hangzhou, 310013, China
| | - Jie Liao
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058, China
- National Key Laboratory of Chinese Medicine Modernization, Innovation Center of Yangtze River Delta, Zhejiang University, 314100, Jiaxing, China
| | - Zhuo Chen
- College of Computer Science and Technology, Zhejiang University, Hangzhou, 310013, China
| | - Chengyu Li
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058, China
- National Key Laboratory of Chinese Medicine Modernization, Innovation Center of Yangtze River Delta, Zhejiang University, 314100, Jiaxing, China
| | - Wenbo Guo
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058, China
- National Key Laboratory of Chinese Medicine Modernization, Innovation Center of Yangtze River Delta, Zhejiang University, 314100, Jiaxing, China
| | - Yining Hu
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058, China
- National Key Laboratory of Chinese Medicine Modernization, Innovation Center of Yangtze River Delta, Zhejiang University, 314100, Jiaxing, China
| | - Anyao Li
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058, China
- National Key Laboratory of Chinese Medicine Modernization, Innovation Center of Yangtze River Delta, Zhejiang University, 314100, Jiaxing, China
| | - Yue Yao
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058, China
- National Key Laboratory of Chinese Medicine Modernization, Innovation Center of Yangtze River Delta, Zhejiang University, 314100, Jiaxing, China
| | - Xiaohui Fan
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058, China.
- National Key Laboratory of Chinese Medicine Modernization, Innovation Center of Yangtze River Delta, Zhejiang University, 314100, Jiaxing, China.
- Zhejiang Key Laboratory of Precision Diagnosis and Therapy for Major Gynecological Diseases, Women's Hospital, Zhejiang University School of Medicine, Hangzhou, 310006, China.
| | - Yiyu Cheng
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058, China.
- National Key Laboratory of Chinese Medicine Modernization, Innovation Center of Yangtze River Delta, Zhejiang University, 314100, Jiaxing, China.
| |
|
23
|
Wang W, Cen Y, Lu Z, Xu Y, Sun T, Xiao Y, Liu W, Li JJ, Wang C. scCDC: a computational method for gene-specific contamination detection and correction in single-cell and single-nucleus RNA-seq data. Genome Biol 2024; 25:136. [PMID: 38783325 PMCID: PMC11112958 DOI: 10.1186/s13059-024-03284-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/30/2023] [Accepted: 05/16/2024] [Indexed: 05/25/2024] Open
Abstract
In droplet-based single-cell and single-nucleus RNA-seq assays, systematic contamination of ambient RNA molecules biases the quantification of gene expression levels. Existing methods correct the contamination for all genes globally. However, a specific evaluation of correction efficacy across varying contamination levels has been lacking. Here, we show that DecontX and CellBender under-correct highly contaminating genes, while SoupX and scAR over-correct lowly/non-contaminating genes. We then develop scCDC as the first method to detect the contamination-causing genes and correct only the expression levels of these genes, some of which are cell-type markers. Compared with existing decontamination methods, scCDC excels in decontaminating highly contaminating genes while avoiding over-correction of other genes.
Affiliation(s)
- Weijian Wang
- Centre of Biomedical Systems and Informatics, International Campus, ZJU-UoE Institute, Zhejiang University School of Medicine, Zhejiang University, Haining, Zhejiang, 314400, China
| | - Yihui Cen
- Centre of Biomedical Systems and Informatics, International Campus, ZJU-UoE Institute, Zhejiang University School of Medicine, Zhejiang University, Haining, Zhejiang, 314400, China
| | - Zezhen Lu
- Centre of Biomedical Systems and Informatics, International Campus, ZJU-UoE Institute, Zhejiang University School of Medicine, Zhejiang University, Haining, Zhejiang, 314400, China
| | - Yueqing Xu
- Centre of Biomedical Systems and Informatics, International Campus, ZJU-UoE Institute, Zhejiang University School of Medicine, Zhejiang University, Haining, Zhejiang, 314400, China
| | - Tianyi Sun
- Department of Statistics and Data Science, University of California, Los Angeles, CA, 90095, USA
| | - Ying Xiao
- Sir Run Run Shaw Hospital, School of Medicine, Zhejiang University, Hangzhou, Zhejiang, 310020, China
| | - Wanlu Liu
- Centre of Biomedical Systems and Informatics, International Campus, ZJU-UoE Institute, Zhejiang University School of Medicine, Zhejiang University, Haining, Zhejiang, 314400, China
| | - Jingyi Jessica Li
- Department of Statistics and Data Science, University of California, Los Angeles, CA, 90095, USA.
| | - Chaochen Wang
- Centre of Biomedical Systems and Informatics, International Campus, ZJU-UoE Institute, Zhejiang University School of Medicine, Zhejiang University, Haining, Zhejiang, 314400, China.
- Department of Gynecology, The Second Affiliated Hospital, Zhejiang University School of Medicine, Zhejiang University, Hangzhou, Zhejiang, 310020, China.
- Biomedical and Health Translational Research Centre, Zhejiang University, Haining, Zhejiang, 314400, China.
| |
|
24
|
Kim H, Chang W, Chae SJ, Park JE, Seo M, Kim JK. scLENS: data-driven signal detection for unbiased scRNA-seq data analysis. Nat Commun 2024; 15:3575. [PMID: 38678050 PMCID: PMC11519519 DOI: 10.1038/s41467-024-47884-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/18/2023] [Accepted: 04/14/2024] [Indexed: 04/29/2024] Open
Abstract
High dimensionality and noise have limited the new biological insights that can be discovered in scRNA-seq data. While dimensionality reduction tools have been developed to extract biological signals from the data, they often require manual determination of signal dimension, introducing user bias. Furthermore, a common data preprocessing method, log normalization, can unintentionally distort signals in the data. Here, we develop scLENS, a dimensionality reduction tool that circumvents the long-standing issues of signal distortion and manual input. Specifically, we identify the primary cause of signal distortion during log normalization and effectively address it by uniformizing cell vector lengths with L2 normalization. Furthermore, we utilize random matrix theory-based noise filtering and a signal robustness test to enable data-driven determination of the threshold for signal dimensions. Our method outperforms 11 widely used dimensionality reduction tools and performs particularly well for challenging scRNA-seq datasets with high sparsity and variability. To facilitate the use of scLENS, we provide a user-friendly package that automates accurate signal detection of scRNA-seq data without manual time-consuming tuning.
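The normalization step mentioned in this abstract can be illustrated with a minimal sketch. It assumes a cells-by-genes count matrix and a conventional scale factor of 10,000; the function name `log_then_l2_normalize` is hypothetical, and this is not the scLENS implementation, only an illustration of giving every log-normalized cell vector the same L2 length.

```python
import numpy as np

def log_then_l2_normalize(counts, scale=1e4):
    """Library-size scale, log1p-transform, then L2-normalize each cell (row).

    `counts` is a cells x genes array of raw counts. The scale factor and the
    function name are illustrative assumptions, not taken from scLENS.
    """
    counts = np.asarray(counts, dtype=float)
    lib = counts.sum(axis=1, keepdims=True)
    lib[lib == 0] = 1.0                      # avoid division by zero for empty cells
    logged = np.log1p(counts / lib * scale)  # standard log normalization
    norms = np.linalg.norm(logged, axis=1, keepdims=True)
    norms[norms == 0] = 1.0                  # leave all-zero cells as zeros
    return logged / norms                    # every non-empty cell now has unit length

# Toy data: 6 cells, 50 genes; the first cell is sequenced 10x deeper.
rng = np.random.default_rng(0)
counts = rng.poisson(lam=rng.uniform(0.1, 5.0, size=(1, 50)), size=(6, 50))
counts[0] *= 10
normalized = log_then_l2_normalize(counts)
lengths = np.linalg.norm(normalized, axis=1)
print(lengths)
```

After log normalization alone, cells with different sparsity end up with different vector lengths; the final division removes that difference so that downstream similarity measures compare directions rather than magnitudes.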
Affiliation(s)
- Hyun Kim
- Biomedical Mathematics Group, Pioneer Research Center for Mathematical and Computational Sciences, Institute for Basic Science, Daejeon, 34126, Republic of Korea
| | - Won Chang
- Division of Statistics and Data Science, University of Cincinnati, Cincinnati, OH, 45221, USA
| | - Seok Joo Chae
- Biomedical Mathematics Group, Pioneer Research Center for Mathematical and Computational Sciences, Institute for Basic Science, Daejeon, 34126, Republic of Korea
- Department of Mathematical Sciences, KAIST, Daejeon, 34141, Republic of Korea
| | - Jong-Eun Park
- Graduate School of Medical Science and Engineering, KAIST, Daejeon, 34141, Republic of Korea
| | - Minseok Seo
- Department of Computer and Information Science, Korea University, Sejong, 30019, Republic of Korea
| | - Jae Kyoung Kim
- Biomedical Mathematics Group, Pioneer Research Center for Mathematical and Computational Sciences, Institute for Basic Science, Daejeon, 34126, Republic of Korea.
- Department of Mathematical Sciences, KAIST, Daejeon, 34141, Republic of Korea.
| |
|
25
|
Peng M, Lin B, Zhang J, Zhou Y, Lin B. scFSNN: a feature selection method based on neural network for single-cell RNA-seq data. BMC Genomics 2024; 25:264. [PMID: 38459442 PMCID: PMC10924397 DOI: 10.1186/s12864-024-10160-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/26/2023] [Accepted: 02/25/2024] [Indexed: 03/10/2024] Open
Abstract
While single-cell RNA sequencing (scRNA-seq) allows researchers to analyze gene expression in individual cells, its unique characteristics, such as over-dispersion, zero-inflation, high gene-gene correlation, and large data volume with many features, pose challenges for most existing feature selection methods. In this paper, we present a feature selection method based on a neural network (scFSNN) to solve the classification problem for scRNA-seq data. scFSNN is an embedded method that can automatically select features (genes) during model training, control the false discovery rate of selected features, and adaptively determine the number of features to be eliminated. Extensive simulation and real data studies demonstrate its excellent feature selection ability and predictive performance.
Affiliation(s)
- Minjiao Peng
- School of Mathematical Sciences, Shenzhen University, Nanshan, Shenzhen, 518060, Guangdong, China
- School of Mathematics and Statistics and KLAS, Northeast Normal University, Renmin Street, Changchun, 130000, Jilin, China
| | - Baoqin Lin
- Experimental Center, The First Affiliated Hospital of Guangzhou University of Chinese Medicine, Guangzhou, Guangdong, 510405, China
| | - Jun Zhang
- School of Mathematical Sciences, Shenzhen University, Nanshan, Shenzhen, 518060, Guangdong, China
| | - Yan Zhou
- School of Mathematical Sciences, Shenzhen University, Nanshan, Shenzhen, 518060, Guangdong, China
| | - Bingqing Lin
- School of Mathematical Sciences, Shenzhen University, Nanshan, Shenzhen, 518060, Guangdong, China.
| |
|
26
|
Singhal V, Chou N, Lee J, Yue Y, Liu J, Chock WK, Lin L, Chang YC, Teo EML, Aow J, Lee HK, Chen KH, Prabhakar S. BANKSY unifies cell typing and tissue domain segmentation for scalable spatial omics data analysis. Nat Genet 2024; 56:431-441. [PMID: 38413725 PMCID: PMC10937399 DOI: 10.1038/s41588-024-01664-3] [Citation(s) in RCA: 28] [Impact Index Per Article: 28.0] [Received: 04/03/2023] [Accepted: 01/16/2024] [Indexed: 02/29/2024]
Abstract
Spatial omics data are clustered to define both cell types and tissue domains. We present Building Aggregates with a Neighborhood Kernel and Spatial Yardstick (BANKSY), an algorithm that unifies these two spatial clustering problems by embedding cells in a product space of their own and the local neighborhood transcriptome, representing cell state and microenvironment, respectively. BANKSY's spatial feature augmentation strategy improved performance on both tasks when tested on diverse RNA (imaging, sequencing) and protein (imaging) datasets. BANKSY revealed unexpected niche-dependent cell states in the mouse brain and outperformed competing methods on domain segmentation and cell typing benchmarks. BANKSY can also be used for quality control of spatial transcriptomics data and for spatially aware batch effect correction. Importantly, it is substantially faster and more scalable than existing methods, enabling the processing of millions of cell datasets. In summary, BANKSY provides an accurate, biologically motivated, scalable and versatile framework for analyzing spatially resolved omics data.
Affiliation(s)
- Vipul Singhal
- Spatial and Single Cell Systems Domain, Genome Institute of Singapore (GIS), Agency for Science, Technology and Research (A*STAR), Singapore, Republic of Singapore
| | - Nigel Chou
- Spatial and Single Cell Systems Domain, Genome Institute of Singapore (GIS), Agency for Science, Technology and Research (A*STAR), Singapore, Republic of Singapore
| | - Joseph Lee
- Faculty of Science, National University of Singapore, Singapore, Republic of Singapore
| | - Yifei Yue
- Department of Chemical and Biomolecular Engineering, National University of Singapore, Singapore, Republic of Singapore
| | - Jinyue Liu
- Spatial and Single Cell Systems Domain, Genome Institute of Singapore (GIS), Agency for Science, Technology and Research (A*STAR), Singapore, Republic of Singapore
| | - Wan Kee Chock
- Spatial and Single Cell Systems Domain, Genome Institute of Singapore (GIS), Agency for Science, Technology and Research (A*STAR), Singapore, Republic of Singapore
| | - Li Lin
- Bioinformatics Institute (BII), Agency for Science, Technology and Research (A*STAR), Singapore, Republic of Singapore
| | | | | | - Jonathan Aow
- Spatial and Single Cell Systems Domain, Genome Institute of Singapore (GIS), Agency for Science, Technology and Research (A*STAR), Singapore, Republic of Singapore
| | - Hwee Kuan Lee
- Bioinformatics Institute (BII), Agency for Science, Technology and Research (A*STAR), Singapore, Republic of Singapore
- School of Computing, National University of Singapore, Singapore, Republic of Singapore
- Singapore Eye Research Institute, Singapore, Republic of Singapore
- International Research Laboratory on Artificial Intelligence, Singapore, Republic of Singapore
- School of Biological Sciences, Nanyang Technological University, Singapore, Republic of Singapore
- Singapore Institute for Clinical Sciences, Agency for Science, Technology and Research, Singapore, Republic of Singapore
| | - Kok Hao Chen
- Spatial and Single Cell Systems Domain, Genome Institute of Singapore (GIS), Agency for Science, Technology and Research (A*STAR), Singapore, Republic of Singapore.
| | - Shyam Prabhakar
- Spatial and Single Cell Systems Domain, Genome Institute of Singapore (GIS), Agency for Science, Technology and Research (A*STAR), Singapore, Republic of Singapore.
- Population and Global Health, Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore, Republic of Singapore.
- Cancer Science Institute of Singapore, National University of Singapore, Singapore, Republic of Singapore.
| |
|
27
|
Song D, Wang Q, Yan G, Liu T, Sun T, Li JJ. scDesign3 generates realistic in silico data for multimodal single-cell and spatial omics. Nat Biotechnol 2024; 42:247-252. [PMID: 37169966 PMCID: PMC11182337 DOI: 10.1038/s41587-023-01772-1] [Citation(s) in RCA: 34] [Impact Index Per Article: 34.0] [Received: 09/20/2022] [Accepted: 03/30/2023] [Indexed: 05/13/2023]
Abstract
We present a statistical simulator, scDesign3, to generate realistic single-cell and spatial omics data, including various cell states, experimental designs and feature modalities, by learning interpretable parameters from real data. Using a unified probabilistic model for single-cell and spatial omics data, scDesign3 infers biologically meaningful parameters; assesses the goodness-of-fit of inferred cell clusters, trajectories and spatial locations; and generates in silico negative and positive controls for benchmarking computational tools.
Affiliation(s)
- Dongyuan Song
- Bioinformatics Interdepartmental Ph.D. Program, University of California, Los Angeles, CA, USA
| | - Qingyang Wang
- Department of Statistics, University of California, Los Angeles, CA, USA
| | - Guanao Yan
- Department of Statistics, University of California, Los Angeles, CA, USA
| | - Tianyang Liu
- Department of Statistics, University of California, Los Angeles, CA, USA
| | - Tianyi Sun
- Department of Statistics, University of California, Los Angeles, CA, USA
| | - Jingyi Jessica Li
- Bioinformatics Interdepartmental Ph.D. Program, University of California, Los Angeles, CA, USA.
- Department of Statistics, University of California, Los Angeles, CA, USA.
- Department of Human Genetics, University of California, Los Angeles, CA, USA.
- Department of Computational Medicine, University of California, Los Angeles, CA, USA.
- Department of Biostatistics, University of California, Los Angeles, CA, USA.
- Radcliffe Institute for Advanced Study, Harvard University, Cambridge, MA, USA.
| |
|
28
|
Su C, Zhang J, Zhao H. Estimating cell-type-specific gene co-expression networks from bulk gene expression data with an application to Alzheimer's disease. J Am Stat Assoc 2024; 119:811-824. [PMID: 39280354 PMCID: PMC11394578 DOI: 10.1080/01621459.2023.2297467] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/17/2022] [Revised: 11/20/2023] [Accepted: 12/13/2023] [Indexed: 09/18/2024]
Abstract
Inferring and characterizing gene co-expression networks has led to important insights on the molecular mechanisms of complex diseases. Most co-expression analyses to date have been performed on gene expression data collected from bulk tissues with different cell type compositions across samples. As a result, the co-expression estimates only offer an aggregated view of the underlying gene regulations and can be confounded by heterogeneity in cell type compositions, failing to reveal gene coordination that may be distinct across different cell types. In this paper, we introduce a flexible framework for estimating cell-type-specific gene co-expression networks from bulk sample data, without making specific assumptions on the distributions of gene expression profiles in different cell types. We develop a novel sparse least squares estimator, referred to as CSNet, that is efficient to implement and has good theoretical properties. Using CSNet, we analyzed the bulk gene expression data from a cohort study on Alzheimer's disease and identified previously unknown cell-type-specific co-expressions among Alzheimer's disease risk genes, suggesting cell-type-specific disease mechanisms.
Affiliation(s)
- Chang Su
- Department of Biostatistics and Bioinformatics, Emory University
- Department of Biostatistics, Yale University
| | - Jingfei Zhang
- Information Systems and Operations Management, Emory University
| | - Hongyu Zhao
- Department of Biostatistics, Yale University
| |
|
29
|
Yang Y, Wang K, Lu Z, Wang T, Wang X. Cytomulate: accurate and efficient simulation of CyTOF data. Genome Biol 2023; 24:262. [PMID: 37974276 PMCID: PMC10652542 DOI: 10.1186/s13059-023-03099-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/05/2022] [Accepted: 10/24/2023] [Indexed: 11/19/2023] Open
Abstract
Recently, many analysis tools have been devised to offer insights into data generated via cytometry by time-of-flight (CyTOF). However, objective evaluations of these methods remain absent, as most evaluations are conducted against real data where the ground truth is generally unknown. In this paper, we develop Cytomulate, a reproducible and accurate simulation algorithm for CyTOF data, which could serve as a foundation for future method development and evaluation. We demonstrate that Cytomulate can capture various characteristics of CyTOF data and learns overall data distributions better than single-cell RNA-seq-oriented methods such as scDesign2 and Splatter, and generative models like LAMBDA.
Affiliation(s)
- Yuqiu Yang
- Department of Statistics and Data Science, Southern Methodist University, Dallas, TX, 75275, USA
- Quantitative Biomedical Research Center, Peter O'Donnell Jr. School of Public Health, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA
| | - Kaiwen Wang
- Department of Statistics and Data Science, Southern Methodist University, Dallas, TX, 75275, USA
| | - Zeyu Lu
- Department of Statistics and Data Science, Southern Methodist University, Dallas, TX, 75275, USA
- Quantitative Biomedical Research Center, Peter O'Donnell Jr. School of Public Health, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA
| | - Tao Wang
- Quantitative Biomedical Research Center, Peter O'Donnell Jr. School of Public Health, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA.
- Center for the Genetics of Host Defense, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA.
| | - Xinlei Wang
- Department of Statistics and Data Science, Southern Methodist University, Dallas, TX, 75275, USA.
- Department of Mathematics, University of Texas at Arlington, Arlington, 76019, USA.
- Center for Data Science Research and Education, College of Science, University of Texas at Arlington, Arlington, 76019, USA.
| |
|
30
|
Liu J, Kreimer A, Li WV. Differential variability analysis of single-cell gene expression data. Brief Bioinform 2023; 24:bbad294. [PMID: 37598422 PMCID: PMC10516347 DOI: 10.1093/bib/bbad294] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/10/2023] [Revised: 07/18/2023] [Accepted: 07/29/2023] [Indexed: 08/22/2023] Open
Abstract
The advent of single-cell RNA sequencing (scRNA-seq) technologies has enabled gene expression profiling at the single-cell resolution, thereby enabling the quantification and comparison of transcriptional variability among individual cells. Although alterations in transcriptional variability have been observed in various biological states, statistical methods for quantifying and testing differential variability between groups of cells are still lacking. To identify the best practices in differential variability analysis of single-cell gene expression data, we propose and compare 12 statistical pipelines using different combinations of methods for normalization, feature selection, dimensionality reduction and variability calculation. Using high-quality synthetic scRNA-seq datasets, we benchmarked the proposed pipelines and found that the most powerful and accurate pipeline performs simple library size normalization, retains all genes in analysis and uses denSNE-based distances to cluster medoids as the variability measure. By applying this pipeline to scRNA-seq datasets of COVID-19 and autism patients, we have identified cellular variability changes between patients with different severity status or between patients and healthy controls.
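The variability measure named in this abstract, distance to a cluster medoid, can be sketched as follows. This is a minimal illustration under stated assumptions: the cells are taken to be already embedded in a low-dimensional space (the paper uses denSNE; the toy 2D coordinates here are simulated), plain Euclidean distance is used, and the helper name `distance_to_medoid` is hypothetical.

```python
import numpy as np

def distance_to_medoid(points):
    """Return (medoid index, distance of every point to the medoid).

    The medoid is the observed point with the smallest summed Euclidean
    distance to all other points in the group.
    """
    points = np.asarray(points, dtype=float)
    diffs = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=-1))   # pairwise distance matrix
    medoid = int(np.argmin(dist.sum(axis=1)))
    return medoid, dist[:, medoid]

# Toy embedding (an assumption for illustration): two groups of 30 cells
# around the same center, one tight and one dispersed.
rng = np.random.default_rng(1)
tight = rng.normal(loc=0.0, scale=0.1, size=(30, 2))
loose = rng.normal(loc=0.0, scale=2.0, size=(30, 2))

tight_medoid, tight_d = distance_to_medoid(tight)
loose_medoid, loose_d = distance_to_medoid(loose)

# A per-group variability summary: mean distance to the group's medoid.
tight_var = tight_d.mean()
loose_var = loose_d.mean()
print(tight_var, loose_var)
```

Comparing the per-cell distances between two groups (for example with a rank-based test) is one way such a measure could feed a differential variability test; the choice of test is left open here.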
Affiliation(s)
- Jiayi Liu
- Graduate Programs in Molecular Biosciences, Rutgers, The State University of New Jersey, 604 Allison Rd, Piscataway, 08854, NJ, USA
- Department of Biochemistry and Molecular Biology, Rutgers, The State University of New Jersey, 604 Allison Road, Piscataway, 08854, NJ, USA
- Center for Advanced Biotechnology and Medicine, Rutgers, The State University of New Jersey, 679 Hoes Lane West, Piscataway, 08854, NJ, USA
| | - Anat Kreimer
- Department of Biochemistry and Molecular Biology, Rutgers, The State University of New Jersey, 604 Allison Road, Piscataway, 08854, NJ, USA
- Center for Advanced Biotechnology and Medicine, Rutgers, The State University of New Jersey, 679 Hoes Lane West, Piscataway, 08854, NJ, USA
| | - Wei Vivian Li
- Department of Statistics, University of California, Riverside, 900 University Ave, Riverside, 92521, CA, USA
- Previous affiliation where part of the work was completed: Department of Biostatistics and Epidemiology, Rutgers, The State University of New Jersey, 683 Hoes Lane West, Piscataway, 08854, NJ, USA
| |
|
31
|
Li H, Zhang Z, Squires M, Chen X, Zhang X. scMultiSim: simulation of single cell multi-omics and spatial data guided by gene regulatory networks and cell-cell interactions. Research Square 2023:rs.3.rs-3301625. [PMID: 37790516 PMCID: PMC10543280 DOI: 10.21203/rs.3.rs-3301625/v1] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Indexed: 10/05/2023]
Abstract
Simulated single-cell data is essential for designing and evaluating computational methods in the absence of experimental ground truth. Existing simulators typically focus on modeling one or two specific biological factors or mechanisms that affect the output data, which limits their capacity to simulate the complexity and multi-modality in real data. Here, we present scMultiSim, an in silico simulator that generates multi-modal single-cell data, including gene expression, chromatin accessibility, RNA velocity, and spatial cell locations, while accounting for the relationships between modalities. scMultiSim jointly models various biological factors that affect the output data, including cell identity, within-cell gene regulatory networks (GRNs), cell-cell interactions (CCIs), and chromatin accessibility, while also incorporating technical noise. Moreover, it allows users to adjust each factor's effect easily. We validated scMultiSim's simulated biological effects and demonstrated its applications by benchmarking a wide range of computational tasks, including multi-modal and multi-batch data integration, RNA velocity estimation, GRN inference, and CCI inference using spatially resolved gene expression data, many of which had not been benchmarked before due to the lack of proper tools. Compared to existing simulators, scMultiSim can benchmark a much broader range of existing computational problems and even new potential tasks.
Affiliation(s)
- Hechen Li
- Georgia Institute of Technology, Atlanta, USA
| | - Ziqi Zhang
- Georgia Institute of Technology, Atlanta, USA
| | | | - Xi Chen
- Southern University of Science and Technology, Shenzhen, China
| | | |
|
32
|
Ma Y, Deng C, Zhou Y, Zhang Y, Qiu F, Jiang D, Zheng G, Li J, Shuai J, Zhang Y, Yang J, Su J. Polygenic regression uncovers trait-relevant cellular contexts through pathway activation transformation of single-cell RNA sequencing data. Cell Genomics 2023; 3:100383. [PMID: 37719150 PMCID: PMC10504677 DOI: 10.1016/j.xgen.2023.100383] [Citation(s) in RCA: 15] [Impact Index Per Article: 7.5] [Received: 02/23/2023] [Revised: 05/26/2023] [Accepted: 07/25/2023] [Indexed: 09/19/2023]
Abstract
Advances in single-cell RNA sequencing (scRNA-seq) techniques have accelerated functional interpretation of disease-associated variants discovered from genome-wide association studies (GWASs). However, identification of trait-relevant cell populations is often impeded by inherent technical noise and high sparsity in scRNA-seq data. Here, we developed scPagwas, a computational approach that uncovers trait-relevant cellular context by integrating pathway activation transformation of scRNA-seq data and GWAS summary statistics. scPagwas effectively prioritizes trait-relevant genes, which facilitates identification of trait-relevant cell types/populations with high accuracy in extensive simulated and real datasets. Cellular-level association results identified a novel subpopulation of naive CD8+ T cells related to COVID-19 severity and oligodendrocyte progenitor cell and microglia subsets with critical pathways by which genetic variants influence Alzheimer's disease. Overall, our approach provides new insights for the discovery of trait-relevant cell types and improves the mechanistic understanding of disease variants from a pathway perspective.
Affiliation(s)
- Yunlong Ma
  - School of Biomedical Engineering, School of Ophthalmology & Optometry and Eye Hospital, Wenzhou Medical University, Wenzhou, Zhejiang 325027, China
  - Oujiang Laboratory, Zhejiang Lab for Regenerative Medicine, Vision and Brain Health, Wenzhou, Zhejiang 325101, China
- Chunyu Deng
  - School of Life Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang 150080, China
- Yijun Zhou
  - School of Biomedical Engineering, School of Ophthalmology & Optometry and Eye Hospital, Wenzhou Medical University, Wenzhou, Zhejiang 325027, China
  - Oujiang Laboratory, Zhejiang Lab for Regenerative Medicine, Vision and Brain Health, Wenzhou, Zhejiang 325101, China
- Yaru Zhang
  - School of Biomedical Engineering, School of Ophthalmology & Optometry and Eye Hospital, Wenzhou Medical University, Wenzhou, Zhejiang 325027, China
  - Oujiang Laboratory, Zhejiang Lab for Regenerative Medicine, Vision and Brain Health, Wenzhou, Zhejiang 325101, China
- Fei Qiu
  - School of Biomedical Engineering, School of Ophthalmology & Optometry and Eye Hospital, Wenzhou Medical University, Wenzhou, Zhejiang 325027, China
- Dingping Jiang
  - School of Biomedical Engineering, School of Ophthalmology & Optometry and Eye Hospital, Wenzhou Medical University, Wenzhou, Zhejiang 325027, China
- Gongwei Zheng
  - School of Biomedical Engineering, School of Ophthalmology & Optometry and Eye Hospital, Wenzhou Medical University, Wenzhou, Zhejiang 325027, China
- Jingjing Li
  - School of Biomedical Engineering, School of Ophthalmology & Optometry and Eye Hospital, Wenzhou Medical University, Wenzhou, Zhejiang 325027, China
- Jianwei Shuai
  - Oujiang Laboratory, Zhejiang Lab for Regenerative Medicine, Vision and Brain Health, Wenzhou, Zhejiang 325101, China
- Yan Zhang
  - School of Life Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang 150080, China
- Jian Yang
  - School of Life Sciences, Westlake University, Hangzhou, Zhejiang 310012, China
  - Westlake Laboratory of Life Sciences and Biomedicine, Hangzhou, Zhejiang 310024, China
- Jianzhong Su
  - School of Biomedical Engineering, School of Ophthalmology & Optometry and Eye Hospital, Wenzhou Medical University, Wenzhou, Zhejiang 325027, China
  - Oujiang Laboratory, Zhejiang Lab for Regenerative Medicine, Vision and Brain Health, Wenzhou, Zhejiang 325101, China
|
33
|
He X, Qian K, Wang Z, Zeng S, Li H, Li WV. scAce: an adaptive embedding and clustering method for single-cell gene expression data. Bioinformatics 2023; 39:btad546. [PMID: 37672035 PMCID: PMC10500084 DOI: 10.1093/bioinformatics/btad546] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/04/2023] [Revised: 08/01/2023] [Accepted: 09/05/2023] [Indexed: 09/07/2023] Open
Abstract
MOTIVATION: Since the development of single-cell RNA sequencing (scRNA-seq) technologies, clustering analysis of single-cell gene expression data has been an essential tool for distinguishing cell types and identifying novel cell types. Even though many methods have been available for scRNA-seq clustering analysis, the majority of them are constrained by the requirement of a predetermined number of clusters or the dependence on a selected initial cluster assignment. RESULTS: In this article, we propose an adaptive embedding and clustering method named scAce, which constructs a variational autoencoder to simultaneously learn cell embeddings and cluster assignments. In the scAce method, we develop an adaptive cluster merging approach which achieves improved clustering results without the need to estimate the number of clusters in advance. In addition, scAce provides an option to perform clustering enhancement, which can update and enhance cluster assignments based on previous clustering results from other methods. Based on computational analysis of both simulated and real datasets, we demonstrate that scAce outperforms state-of-the-art clustering methods for scRNA-seq data, and achieves better clustering accuracy and robustness. AVAILABILITY AND IMPLEMENTATION: The scAce package is implemented in Python 3.8 and is freely available from https://github.com/sldyns/scAce.
Affiliation(s)
- Xinwei He
  - School of Mathematics and Physics, China University of Geosciences, Wuhan 430074, China
- Kun Qian
  - School of Mathematics and Physics, China University of Geosciences, Wuhan 430074, China
- Ziqian Wang
  - School of Mathematics and Physics, China University of Geosciences, Wuhan 430074, China
- Shirou Zeng
  - School of Mathematics and Physics, China University of Geosciences, Wuhan 430074, China
- Hongwei Li
  - School of Mathematics and Physics, China University of Geosciences, Wuhan 430074, China
- Wei Vivian Li
  - Department of Statistics, University of California, Riverside, Riverside 92521, United States
|
34
|
Song D, Li K, Ge X, Li JJ. ClusterDE: a post-clustering differential expression (DE) method robust to false-positive inflation caused by double dipping. RESEARCH SQUARE 2023:rs.3.rs-3211191. [PMID: 37577698 PMCID: PMC10418557 DOI: 10.21203/rs.3.rs-3211191/v1] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 08/15/2023]
Abstract
In typical single-cell RNA-seq (scRNA-seq) data analysis, a clustering algorithm is applied to find putative cell types as clusters, and then a statistical differential expression (DE) test is employed to identify the differentially expressed (DE) genes between the cell clusters. However, this common procedure uses the same data twice, an issue known as "double dipping": the same data is used twice to define cell clusters as potential cell types and DE genes as potential cell-type marker genes, leading to false-positive cell-type marker genes even when the cell clusters are spurious. To overcome this challenge, we propose ClusterDE, a post-clustering DE method for controlling the false discovery rate (FDR) of identified DE genes regardless of clustering quality, which can work as an add-on to popular pipelines such as Seurat. The core idea of ClusterDE is to generate real-data-based synthetic null data containing only one cluster, as a contrast to the real data, for evaluating the whole procedure of clustering followed by a DE test. Using comprehensive simulation and real data analysis, we show that ClusterDE has not only solid FDR control but also the ability to identify cell-type marker genes as top DE genes and distinguish them from housekeeping genes. ClusterDE is fast, transparent, and adaptive to a wide range of clustering algorithms and DE tests. Besides scRNA-seq data, ClusterDE is generally applicable to post-clustering DE analysis, including single-cell multi-omics data analysis.
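The double-dipping problem described in this abstract can be illustrated with a small simulation. The sketch below is not ClusterDE and does not reproduce its synthetic-null procedure; it only shows, under stated assumptions, why testing on clusters derived from the same data inflates false positives. The data are pure Gaussian noise (one homogeneous population), the "clustering" is a toy median split on summed expression, and the helper names `welch_t` and `call_rate` are hypothetical.

```python
import random
import statistics
from math import sqrt


def welch_t(a, b):
    """Welch's t statistic for two independent samples."""
    va, vb = statistics.variance(a), statistics.variance(b)
    return (statistics.mean(a) - statistics.mean(b)) / sqrt(va / len(a) + vb / len(b))


def call_rate(data, labels, threshold=1.96):
    """Fraction of genes with |t| above the threshold between the two label groups."""
    hits = 0
    n_genes = len(data[0])
    for g in range(n_genes):
        group_a = [row[g] for row, lab in zip(data, labels) if lab]
        group_b = [row[g] for row, lab in zip(data, labels) if not lab]
        if abs(welch_t(group_a, group_b)) > threshold:
            hits += 1
    return hits / n_genes


rng = random.Random(0)
n_cells, n_genes = 400, 50

# Null data: one homogeneous population, so no gene is truly DE.
data = [[rng.gauss(0.0, 1.0) for _ in range(n_genes)] for _ in range(n_cells)]

# Toy "clustering" (an assumption): split cells at the median of summed expression.
scores = [sum(row) for row in data]
cut = statistics.median(scores)
clustered = [s > cut for s in scores]

# Control: same group sizes, but labels assigned independently of the data.
shuffled = clustered[:]
rng.shuffle(shuffled)

rate_clustered = call_rate(data, clustered)
rate_random = call_rate(data, shuffled)
print(f"called DE after clustering on the same data: {rate_clustered:.2f}")
print(f"called DE with data-independent labels:      {rate_random:.2f}")
```

Because the labels in the first case were derived from the very values being tested, many null genes cross the nominal 5% threshold, whereas the shuffled labels stay near the nominal rate.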
Affiliation(s)
- Dongyuan Song
  - Bioinformatics Interdepartmental Ph.D. Program, University of California, Los Angeles, CA 90095-7246
- Kexin Li
  - Department of Statistics, University of California, Los Angeles, CA 90095-1554
- Xinzhou Ge
  - Department of Statistics, University of California, Los Angeles, CA 90095-1554
- Jingyi Jessica Li
  - Bioinformatics Interdepartmental Ph.D. Program, University of California, Los Angeles, CA 90095-7246
  - Department of Statistics, University of California, Los Angeles, CA 90095-1554
  - Department of Human Genetics, University of California, Los Angeles, CA 90095-7088
  - Department of Computational Medicine, University of California, Los Angeles, CA 90095-1766
  - Department of Biostatistics, University of California, Los Angeles, CA 90095-1772
  - Radcliffe Institute for Advanced Study, Harvard University, Cambridge, MA 02138
|
35
|
Li C, Chen X, Chen S, Jiang R, Zhang X. simCAS: an embedding-based method for simulating single-cell chromatin accessibility sequencing data. Bioinformatics 2023; 39:btad453. [PMID: 37494428 PMCID: PMC10394124 DOI: 10.1093/bioinformatics/btad453] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 03/20/2023] [Revised: 06/25/2023] [Accepted: 07/25/2023] [Indexed: 07/28/2023] Open
Abstract
MOTIVATION: Single-cell chromatin accessibility sequencing (scCAS) technology provides an epigenomic perspective to characterize gene regulatory mechanisms at single-cell resolution. With an increasing number of computational methods proposed for analyzing scCAS data, a powerful simulation framework is desirable for evaluation and validation of these methods. However, existing simulators generate synthetic data by sampling reads from real data or mimicking existing cell states, which is inadequate to provide credible ground-truth labels for method evaluation. RESULTS: We present simCAS, an embedding-based simulator, for generating high-fidelity scCAS data from both cell- and peak-wise embeddings. We demonstrate that simCAS outperforms existing simulators in resembling real data and show that simCAS can generate cells of different states with user-defined cell populations and differentiation trajectories. Additionally, simCAS can simulate data from different batches and encode user-specified interactions of chromatin regions in the synthetic data, which provides ground-truth labels beyond cell states. We systematically demonstrate that simCAS facilitates the benchmarking of four core tasks in downstream analysis: cell clustering, trajectory inference, data integration, and cis-regulatory interaction inference. We anticipate simCAS will be a reliable and flexible simulator for evaluating the ongoing computational methods applied to scCAS data. AVAILABILITY AND IMPLEMENTATION: simCAS is freely available at https://github.com/Chen-Li-17/simCAS.
Affiliation(s)
- Chen Li
  - MOE Key Laboratory of Bioinformatics and Bioinformatics Division of BNRIST, Department of Automation, Tsinghua University, Beijing 100084, China
- Xiaoyang Chen
  - MOE Key Laboratory of Bioinformatics and Bioinformatics Division of BNRIST, Department of Automation, Tsinghua University, Beijing 100084, China
- Shengquan Chen
  - School of Mathematical Sciences and LPMC, Nankai University, Tianjin 300071, China
- Rui Jiang
  - MOE Key Laboratory of Bioinformatics and Bioinformatics Division of BNRIST, Department of Automation, Tsinghua University, Beijing 100084, China
- Xuegong Zhang
  - MOE Key Laboratory of Bioinformatics and Bioinformatics Division of BNRIST, Department of Automation, Tsinghua University, Beijing 100084, China
  - Center for Synthetic and Systems Biology, School of Life Sciences and School of Medicine, Tsinghua University, Beijing 100084, China
|
36
|
Mohammad-Taheri S, Tewari V, Kapre R, Rahiminasab E, Sachs K, Tapley Hoyt C, Zucker J, Vitek O. Optimal adjustment sets for causal query estimation in partially observed biomolecular networks. Bioinformatics 2023; 39:i494-i503. [PMID: 37387179 PMCID: PMC10311316 DOI: 10.1093/bioinformatics/btad270] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/01/2023] Open
Abstract
Causal query estimation in biomolecular networks commonly selects a 'valid adjustment set', i.e. a subset of network variables that eliminates the bias of the estimator. The same query may have multiple valid adjustment sets, each with a different variance. When networks are partially observed, current methods use graph-based criteria to find an adjustment set that minimizes asymptotic variance. Unfortunately, many models that share the same graph topology, and therefore the same functional dependencies, may differ in the processes that generate the observational data. In these cases, the topology-based criteria fail to distinguish the variances of the adjustment sets. This deficiency can lead to sub-optimal adjustment sets, and to mischaracterization of the effect of the intervention. We propose an approach for deriving 'optimal adjustment sets' that takes into account the nature of the data, bias and finite-sample variance of the estimator, and cost. It empirically learns the data generating processes from historical experimental data, and characterizes the properties of the estimators by simulation. We demonstrate the utility of the proposed approach in four biomolecular case studies with different topologies and different data generation processes. The implementation and reproducible case studies are at https://github.com/srtaheri/OptimalAdjustmentSet.
Affiliation(s)
- Sara Mohammad-Taheri
  - Khoury College of Computer Sciences, Northeastern University, Boston, MA 02115, USA
- Vartika Tewari
  - Khoury College of Computer Sciences, Northeastern University, Boston, MA 02115, USA
- Rohan Kapre
  - Khoury College of Computer Sciences, Northeastern University, Boston, MA 02115, USA
- Karen Sachs
  - Next Generation Analytics, Palo Alto, California, USA
  - Modulo Bio Inc, Los Altos, California, USA
  - Answer ALS, New Orleans, LA, USA
- Charles Tapley Hoyt
  - Laboratory of Systems Pharmacology, Harvard Medical School, Boston, Massachusetts, USA
- Jeremy Zucker
  - Pacific Northwest National Laboratory, Richland, Washington 99354, USA
- Olga Vitek
  - Khoury College of Computer Sciences, Northeastern University, Boston, MA 02115, USA
|
37
|
Lu S, Keleş S. Debiased personalized gene coexpression networks for population-scale scRNA-seq data. Genome Res 2023; 33:932-947. [PMID: 37295843 PMCID: PMC10519377 DOI: 10.1101/gr.277363.122] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/28/2022] [Accepted: 06/07/2023] [Indexed: 06/12/2023]
Abstract
Population-scale single-cell RNA-seq (scRNA-seq) data sets create unique opportunities for quantifying expression variation across individuals at the gene coexpression network level. Estimation of coexpression networks is well established for bulk RNA-seq; however, single-cell measurements pose novel challenges owing to technical limitations and noise levels of this technology. Gene-gene correlation estimates from scRNA-seq tend to be severely biased toward zero for genes with low and sparse expression. Here, we present Dozer to debias gene-gene correlation estimates from scRNA-seq data sets and accurately quantify network-level variation across individuals. Dozer corrects correlation estimates in the general Poisson measurement model and provides a metric to quantify genes measured with high noise. Computational experiments establish that Dozer estimates are robust to mean expression levels of the genes and the sequencing depths of the data sets. Compared with alternatives, Dozer results in fewer false-positive edges in the coexpression networks, yields more accurate estimates of network centrality measures and modules, and improves the faithfulness of networks estimated from separate batches of the data sets. We showcase unique analyses enabled by Dozer in two population-scale scRNA-seq applications. Coexpression network-based centrality analysis of multiple differentiating human induced pluripotent stem cell (iPSC) lines yields biologically coherent gene groups that are associated with iPSC differentiation efficiency. Application with population-scale scRNA-seq of oligodendrocytes from postmortem human tissues of Alzheimer's disease and controls uniquely reveals coexpression modules of innate immune response with distinct coexpression levels between the diagnoses. Dozer represents an important advance in estimating personalized coexpression networks from scRNA-seq data.
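The bias toward zero mentioned in this abstract follows from the Poisson measurement model: sampling noise inflates the variance of observed counts but leaves their covariance unchanged. The sketch below illustrates this with a generic moment-based correction that subtracts the Poisson noise (the mean) from each variance. It is an illustration under assumptions, not Dozer's estimator: sequencing depth is fixed at 1, the latent expression levels are built from exponential variables chosen only for convenience, and `poisson`, `mean`, and `cov` are hypothetical helpers.

```python
import random
from math import exp, sqrt


def poisson(rng, lam):
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    limit = exp(-lam)
    k, p = 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k


def mean(xs):
    return sum(xs) / len(xs)


def cov(xs, ys):
    """Unbiased sample covariance; cov(xs, xs) is the sample variance."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


rng = random.Random(1)
n_cells = 20000

# Latent expression of two genes sharing one component (true correlation 0.5).
shared = [rng.expovariate(1.0) for _ in range(n_cells)]
lam_a = [0.5 * (s + rng.expovariate(1.0)) for s in shared]
lam_b = [0.5 * (s + rng.expovariate(1.0)) for s in shared]

# Observed counts under the Poisson measurement model (depth fixed at 1).
x = [poisson(rng, lam) for lam in lam_a]
y = [poisson(rng, lam) for lam in lam_b]

latent = cov(lam_a, lam_b) / sqrt(cov(lam_a, lam_a) * cov(lam_b, lam_b))
naive = cov(x, y) / sqrt(cov(x, x) * cov(y, y))
# Var(count) = Var(latent) + E[count], so subtracting the mean removes Poisson noise.
corrected = cov(x, y) / sqrt((cov(x, x) - mean(x)) * (cov(y, y) - mean(y)))

print(f"latent {latent:.2f}  naive {naive:.2f}  corrected {corrected:.2f}")
```

With low mean expression, the naive count correlation lands well below the latent value, while the corrected estimate recovers it approximately. For very sparse genes the corrected variance can become unstable or negative, which is one reason a dedicated method also needs a per-gene noise metric.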
Affiliation(s)
- Shan Lu
  - Department of Statistics, University of Wisconsin, Madison, Wisconsin 53706, USA
- Sündüz Keleş
  - Department of Statistics, University of Wisconsin, Madison, Wisconsin 53706, USA
  - Department of Biostatistics and Medical Informatics, University of Wisconsin School of Medicine and Public Health, Madison, Wisconsin 53706, USA
|
38
|
Lu S, Keleş S. Dozer: Debiased personalized gene co-expression networks for population-scale scRNA-seq data. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.04.25.538290. [PMID: 37163070 PMCID: PMC10168282 DOI: 10.1101/2023.04.25.538290] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/11/2023]
Abstract
Population-scale single cell RNA-seq (scRNA-seq) datasets create unique opportunities for quantifying expression variation across individuals at the gene co-expression network level. Estimation of co-expression networks is well-established for bulk RNA-seq; however, single-cell measurements pose novel challenges due to technical limitations and noise levels of this technology. Gene-gene correlation estimates from scRNA-seq tend to be severely biased towards zero for genes with low and sparse expression. Here, we present Dozer to debias gene-gene correlation estimates from scRNA-seq datasets and accurately quantify network level variation across individuals. Dozer corrects correlation estimates in the general Poisson measurement model and provides a metric to quantify genes measured with high noise. Computational experiments establish that Dozer estimates are robust to mean expression levels of the genes and the sequencing depths of the datasets. Compared to alternatives, Dozer results in fewer false positive edges in the co-expression networks, yields more accurate estimates of network centrality measures and modules, and improves the faithfulness of networks estimated from separate batches of the datasets. We showcase unique analyses enabled by Dozer in two population-scale scRNA-seq applications. Co-expression network-based centrality analysis of multiple differentiating human induced pluripotent stem cell (iPSC) lines yields biologically coherent gene groups that are associated with iPSC differentiation efficiency. Application with population-scale scRNA-seq of oligodendrocytes from postmortem human tissues of Alzheimer disease and controls uniquely reveals co-expression modules of innate immune response with markedly different co-expression levels between the diagnoses. Dozer represents an important advance in estimating personalized co-expression networks from scRNA-seq data.
Affiliation(s)
- Shan Lu
  - Department of Statistics, University of Wisconsin, Madison, WI, USA
- Sündüz Keleş
  - Department of Statistics, University of Wisconsin, Madison, WI, USA
  - Department of Biostatistics and Medical Informatics, University of Wisconsin School of Medicine and Public Health, Madison, WI, USA
|
39
|
Crowell HL, Morillo Leonardo SX, Soneson C, Robinson MD. The shaky foundations of simulating single-cell RNA sequencing data. Genome Biol 2023; 24:62. [PMID: 36991470 PMCID: PMC10061781 DOI: 10.1186/s13059-023-02904-1] [Citation(s) in RCA: 13] [Impact Index Per Article: 6.5] [Received: 11/23/2021] [Accepted: 03/20/2023] [Indexed: 03/31/2023] Open
Abstract
BACKGROUND: With the emergence of hundreds of single-cell RNA-sequencing (scRNA-seq) datasets, the number of computational tools to analyze aspects of the generated data has grown rapidly. As a result, there is a recurring need to demonstrate whether newly developed methods are truly performant, on their own as well as in comparison to existing tools. Benchmark studies aim to consolidate the space of available methods for a given task and often use simulated data that provide a ground truth for evaluations, thus demanding a high quality standard to make results credible and transferable to real data. RESULTS: Here, we evaluated methods for synthetic scRNA-seq data generation in their ability to mimic experimental data. Besides comparing gene- and cell-level quality control summaries in both one- and two-dimensional settings, we further quantified these at the batch- and cluster-level. Secondly, we investigated the effect of simulators on clustering and batch correction method comparisons, and, thirdly, which and to what extent quality control summaries can capture reference-simulation similarity. CONCLUSIONS: Our results suggest that most simulators are unable to accommodate complex designs without introducing artificial effects, they yield over-optimistic performance of integration and potentially unreliable ranking of clustering methods, and it is generally unknown which summaries are important to ensure effective simulation-based method comparisons.
Affiliation(s)
- Helena L Crowell
  - Department of Molecular Life Sciences, University of Zurich, Zurich, Switzerland
  - SIB Swiss Institute of Bioinformatics, University of Zurich, Zurich, Switzerland
- Charlotte Soneson
  - Department of Molecular Life Sciences, University of Zurich, Zurich, Switzerland
  - SIB Swiss Institute of Bioinformatics, University of Zurich, Zurich, Switzerland
  - Current address: Friedrich Miescher Institute for Biomedical Research and SIB Swiss Institute of Bioinformatics, Basel, Switzerland
- Mark D Robinson
  - Department of Molecular Life Sciences, University of Zurich, Zurich, Switzerland
  - SIB Swiss Institute of Bioinformatics, University of Zurich, Zurich, Switzerland
|
40
|
Li H, Zhang Z, Squires M, Chen X, Zhang X. scMultiSim: simulation of multi-modality single cell data guided by cell-cell interactions and gene regulatory networks. RESEARCH SQUARE 2023:rs.3.rs-2675530. [PMID: 36993284 PMCID: PMC10055660 DOI: 10.21203/rs.3.rs-2675530/v1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/19/2023]
Abstract
Simulated single-cell data is essential for designing and evaluating computational methods in the absence of experimental ground truth. Existing simulators typically focus on modeling one or two specific biological factors or mechanisms that affect the output data, which limits their capacity to simulate the complexity and multi-modality in real data. Here, we present scMultiSim, an in silico simulator that generates multi-modal single-cell data, including gene expression, chromatin accessibility, RNA velocity, and spatial cell locations while accounting for the relationships between modalities. scMultiSim jointly models various biological factors that affect the output data, including cell identity, within-cell gene regulatory networks (GRNs), cell-cell interactions (CCIs), and chromatin accessibility, while also incorporating technical noise. Moreover, it allows users to adjust each factor's effect easily. We validated scMultiSim's simulated biological effects and demonstrated its applications by benchmarking a wide range of computational tasks, including cell clustering and trajectory inference, multi-modal and multi-batch data integration, RNA velocity estimation, GRN inference and CCI inference using spatially resolved gene expression data. Compared to existing simulators, scMultiSim can benchmark a much broader range of existing computational problems and even new potential tasks.
Affiliation(s)
- Hechen Li
  - Georgia Institute of Technology, Atlanta, USA
- Ziqi Zhang
  - Georgia Institute of Technology, Atlanta, USA
- Xi Chen
  - Southern University of Science and Technology, China
|
41
|
De Falco A, Caruso F, Su XD, Iavarone A, Ceccarelli M. A variational algorithm to detect the clonal copy number substructure of tumors from scRNA-seq data. Nat Commun 2023; 14:1074. [PMID: 36841879 PMCID: PMC9968345 DOI: 10.1038/s41467-023-36790-9] [Citation(s) in RCA: 34] [Impact Index Per Article: 17.0] [Received: 12/09/2021] [Accepted: 02/16/2023] [Indexed: 02/27/2023] Open
Abstract
Single-cell RNA sequencing is the reference technology to characterize the composition of the tumor microenvironment and to study tumor heterogeneity at high resolution. Here we report Single CEll Variational ANeuploidy analysis (SCEVAN), a fast variational algorithm for the deconvolution of the clonal substructure of tumors from single-cell RNA-seq data. It uses a multichannel segmentation algorithm exploiting the assumption that all the cells in a given copy number clone share the same breakpoints. Thus, the smoothed expression profile of every individual cell constitutes part of the evidence of the copy number profile in each subclone. SCEVAN can automatically and accurately discriminate between malignant and non-malignant cells, resulting in a practical framework to analyze tumors and their microenvironment. We apply SCEVAN to datasets encompassing 106 samples and 93,322 cells from different tumor types and technologies. We demonstrate its application to characterize the intratumor heterogeneity and geographic evolution of malignant brain tumors.
Affiliation(s)
- Antonio De Falco
  - Department of Electrical Engineering and Information Technology (DIETI), University of Naples 'Federico II', 80128, Naples, Italy
  - BIOGEM Institute of Molecular Biology and Genetics, 83031, Ariano Irpino, Italy
- Francesca Caruso
  - Department of Electrical Engineering and Information Technology (DIETI), University of Naples 'Federico II', 80128, Naples, Italy
  - BIOGEM Institute of Molecular Biology and Genetics, 83031, Ariano Irpino, Italy
- Xiao-Dong Su
  - Biomedical Pioneering Innovation Center (BIOPIC), School of Life Sciences, Peking University, 5 Yiheyuan Road, Haidian District, 100871, Beijing, China
- Antonio Iavarone
  - Sylvester Comprehensive Cancer Center, University of Miami, Miller School of Medicine, Miami, FL, USA
  - Department of Neurological Surgery, University of Miami, Miller School of Medicine, Miami, FL, USA
- Michele Ceccarelli
  - Department of Electrical Engineering and Information Technology (DIETI), University of Naples 'Federico II', 80128, Naples, Italy
  - BIOGEM Institute of Molecular Biology and Genetics, 83031, Ariano Irpino, Italy
|
42
|
Sun T, Song D, Li W, Li J. Author Correction: scDesign2: a transparent simulator that generates high-fidelity single-cell gene expression count data with gene correlations captured. Genome Biol 2023; 24:32. [PMID: 36814256 PMCID: PMC9945685 DOI: 10.1186/s13059-023-02884-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/24/2023] Open
Affiliation(s)
- Tianyi Sun
  - Department of Statistics, University of California, 90095-1554 Los Angeles, CA USA
- Dongyuan Song
  - Interdepartmental Program of Bioinformatics, University of California, 90095-7246 Los Angeles, CA USA
- Wei Vivian Li
  - Department of Biostatistics and Epidemiology, Rutgers School of Public Health, 08854 Piscataway, NJ USA
- Jingyi Jessica Li
  - Department of Statistics, University of California, 90095-1554 Los Angeles, CA USA
  - Department of Human Genetics, University of California, 90095-7088 Los Angeles, CA USA
  - Department of Computational Medicine, University of California, 90095-1766 Los Angeles, CA USA
  - Department of Biostatistics, University of California, 90095-1772 Los Angeles, CA USA
|
43
|
Sun L, Wang G, Zhang Z. SimCH: simulation of single-cell RNA sequencing data by modeling cellular heterogeneity at gene expression level. Brief Bioinform 2023; 24:6961608. [PMID: 36575569 DOI: 10.1093/bib/bbac590] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/25/2022] [Revised: 11/08/2022] [Accepted: 12/02/2022] [Indexed: 12/29/2022] Open
Abstract
Single-cell ribonucleic acid (RNA) sequencing (scRNA-seq) has been a powerful technology for transcriptome analysis. However, the systematic validation of diverse computational tools used in scRNA-seq analysis remains challenging. Here, we propose a novel simulation tool, termed Simulation of Cellular Heterogeneity (SimCH), for the flexible and comprehensive assessment of scRNA-seq computational methods. The Gaussian Copula framework is used to retain gene coexpression of experimental data shown to be associated with cellular heterogeneity. The synthetic count matrices generated by suitable SimCH modes closely match experimental data originating from either homogeneous or heterogeneous cell populations and either unique molecular identifier (UMI)-based or non-UMI-based techniques. We demonstrate how SimCH can benchmark several types of computational methods, including cell clustering, discovery of differentially expressed genes, trajectory inference, batch correction and imputation. Moreover, we show how SimCH can be used to conduct power evaluation of cell clustering methods. Given these merits, we believe that SimCH can accelerate single-cell research.
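The Gaussian copula idea named in this abstract can be sketched in a few lines: draw correlated standard normals, map them to uniforms with the normal CDF, then push each uniform through the inverse CDF of the desired count distribution. The example below is a minimal two-gene illustration, not SimCH itself. The Poisson marginals, the means 5 and 10, the correlation 0.8, and the function names `poisson_quantile`, `copula_counts`, and `pearson` are all assumptions made for the sketch.

```python
import random
from math import exp, sqrt
from statistics import NormalDist


def poisson_quantile(u, lam):
    """Smallest k with P(X <= k) >= u for X ~ Poisson(lam)."""
    u = min(u, 1.0 - 1e-12)  # guard against u rounding to exactly 1
    k = 0
    pmf = exp(-lam)
    cdf = pmf
    while cdf < u:
        k += 1
        pmf *= lam / k
        cdf += pmf
    return k


def copula_counts(n_cells, lam_a, lam_b, rho, seed=0):
    """Simulate count pairs whose dependence comes from a Gaussian copula."""
    rng = random.Random(seed)
    phi = NormalDist().cdf
    pairs = []
    for _ in range(n_cells):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        # Normal CDF gives uniforms; the Poisson quantile gives the counts.
        pairs.append((poisson_quantile(phi(z1), lam_a),
                      poisson_quantile(phi(z2), lam_b)))
    return pairs


def pearson(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


pairs = copula_counts(n_cells=5000, lam_a=5.0, lam_b=10.0, rho=0.8)
gene_a = [a for a, _ in pairs]
gene_b = [b for _, b in pairs]
mean_a = sum(gene_a) / len(gene_a)
mean_b = sum(gene_b) / len(gene_b)
corr = pearson(gene_a, gene_b)
print(f"mean A {mean_a:.2f}, mean B {mean_b:.2f}, count correlation {corr:.2f}")
```

The marginal means stay at their targets because the inverse-CDF step is exact, while the copula carries the co-expression. The correlation of the counts is generally somewhat lower than the latent `rho`, since discretization loses information.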
Affiliation(s)
- Lei Sun
  - School of Information Engineering, Yangzhou University, Yangzhou, P.R. China
  - School of Artificial Intelligence, Yangzhou University, Yangzhou, P.R. China
  - CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, and China National Center for Bioinformation, Beijing, P.R. China
- Gongming Wang
  - School of Information Engineering, Yangzhou University, Yangzhou, P.R. China
  - School of Artificial Intelligence, Yangzhou University, Yangzhou, P.R. China
  - China Unicom Software Research Institute Jinan Branch, Jinan, P.R. China
- Zhihua Zhang
  - CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, and China National Center for Bioinformation, Beijing, P.R. China
  - School of Life Science, University of Chinese Academy of Sciences, Beijing, P.R. China
|
44
|
Su C, Xu Z, Shan X, Cai B, Zhao H, Zhang J. Cell-type-specific co-expression inference from single cell RNA-sequencing data. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2022:2022.12.13.520181. [PMID: 36561173 DOI: 10.1101/2022.04.07.487499] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 05/26/2023]
Abstract
The inference of gene co-expressions from microarray and RNA-sequencing data has led to rich insights on biological processes and disease mechanisms. However, the bulk samples analyzed in most studies are a mixture of different cell types. As a result, the inferred co-expressions are confounded by varying cell type compositions across samples and only offer an aggregated view of gene regulations that may be distinct across different cell types. The advancement of single cell RNA-sequencing (scRNA-seq) technology has enabled the direct inference of co-expressions in specific cell types, facilitating our understanding of cell-type-specific biological functions. However, the high sequencing depth variations and measurement errors in scRNA-seq data present significant challenges in inferring cell-type-specific gene co-expressions, and these issues have not been adequately addressed in the existing methods. We propose a statistical approach, CS-CORE, for estimating and testing cell-type-specific co-expressions, built on a general expression-measurement model that explicitly accounts for sequencing depth variations and measurement errors in the observed single cell data. Systematic evaluations show that most existing methods suffer from inflated false positives and biased co-expression estimates and clustering analysis, whereas CS-CORE has appropriate false positive control, unbiased co-expression estimates, good statistical power and satisfactory performance in downstream co-expression analysis. When applied to analyze scRNA-seq data from postmortem brain samples from Alzheimer’s disease patients and controls and blood samples from COVID-19 patients and controls, CS-CORE identified cell-type-specific co-expressions and differential co-expressions that were more reproducible and/or more enriched for relevant biological pathways than those inferred from other methods.
45
Abstract
The single-cell revolution in the field of genomics is in full bloom, with clever new molecular biology tricks appearing regularly that allow researchers to explore new modalities or scale up their projects to millions of cells and beyond. Techniques abound to measure RNA expression, DNA alterations, protein abundance, chromatin accessibility, and more, all with single-cell resolution and often in combination. Despite such a rapidly changing technology landscape, there are several fundamental principles that are applicable to the majority of experimental workflows to help users avoid pitfalls and exploit the advantages of the chosen platform. In this overview article, we describe a variety of popular single-cell genomics technologies and address some common questions pertaining to study design, sample preparation, quality control, and sequencing strategy. As the majority of relevant publications currently revolve around single-cell RNA-seq, we will prioritize this genomics modality in our discussion. © 2022 Wiley Periodicals LLC.
Affiliation(s)
- Claire Regan
  - Cold Spring Harbor Laboratory, Cold Spring Harbor, New York
46
Wu G, Li Y. Distinct characteristics of correlation analysis at the single-cell and the population level. Stat Appl Genet Mol Biol 2022; 21:sagmb-2022-0015. [PMID: 35918809 DOI: 10.1515/sagmb-2022-0015] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/27/2022] [Accepted: 06/13/2022] [Indexed: 11/15/2022]
Abstract
Correlation analysis is widely used in biological studies to infer molecular relationships within biological networks. Recently, single-cell analysis has drawn tremendous interest for its ability to obtain high-resolution molecular phenotypes. It turns out that there is little overlap between the co-expressed genes identified in single-cell level investigations and those identified in population level investigations. However, the nature of the relationship between correlations at the single-cell and population levels remains unclear. In this manuscript, we aimed to unveil the origin of the differences between the correlation coefficients at the single-cell level and those at the population level, and to bridge the gap between them. By developing formulations that link correlations at the single-cell and the population level, we illustrated that aggregated correlations could be stronger than, weaker than, or equal to the corresponding individual correlations, depending on the variations and the correlations within the population. When the correlation within the population is weaker than the individual correlation, the aggregated correlation is stronger than the corresponding individual correlation. In addition, our data indicated that the aggregated correlation is more likely to be stronger than the corresponding individual correlation, and that it was rare to find gene pairs exclusively strongly correlated at the single-cell level. Through a bottom-up approach to modeling interactions between molecules in a signaling cascade or in multi-regulator-controlled gene expression, we surprisingly found that the existence of an interaction between two components could not be excluded simply on the basis of their low correlation coefficients, suggesting a reconsideration of connectivity within biological networks that was derived solely from correlation analysis. We also investigated the impact of technical random measurement errors on the correlation coefficients at the single-cell level and the population level. The results indicate that the aggregated correlation is relatively robust and less affected. Because of the heterogeneity among single cells, correlation coefficients calculated from single-cell level data might differ from those at the population level. Depending on the specific question being asked, proper sampling and normalization procedures should be applied before any conclusions are drawn.
Affiliation(s)
- Guoyu Wu
  - School of Clinical Pharmacy, Guangdong Pharmaceutical University, Guangzhou, China
  - Key Specialty of Clinical Pharmacy, The First Affiliated Hospital of Guangdong Pharmaceutical University, Guangzhou, China
  - NMPA Key Laboratory for Technology Research and Evaluation of Pharmacovigilance, Guangdong Pharmaceutical University, Guangzhou, China
- Yuchao Li
  - Max Planck Institute for Molecular Genetics, Berlin, Germany
  - MegaLab, MegaRobo Technologies Co., Ltd, Beijing, China
47
Cao Y, Yang P, Yang JYH. A benchmark study of simulation methods for single-cell RNA sequencing data. Nat Commun 2021; 12:6911. [PMID: 34824223 PMCID: PMC8617278 DOI: 10.1038/s41467-021-27130-w] [Citation(s) in RCA: 36] [Impact Index Per Article: 9.0] [Received: 05/27/2021] [Accepted: 10/26/2021] [Indexed: 11/09/2022] Open
Abstract
Single-cell RNA-seq (scRNA-seq) data simulation is critical for evaluating computational methods for analysing scRNA-seq data especially when ground truth is experimentally unattainable. The reliability of evaluation depends on the ability of simulation methods to capture properties of experimental data. However, while many scRNA-seq data simulation methods have been proposed, a systematic evaluation of these methods is lacking. We develop a comprehensive evaluation framework, SimBench, including a kernel density estimation measure to benchmark 12 simulation methods through 35 scRNA-seq experimental datasets. We evaluate the simulation methods on a panel of data properties, ability to maintain biological signals, scalability and applicability. Our benchmark uncovers performance differences among the methods and highlights the varying difficulties in simulating data characteristics. Furthermore, we identify several limitations including maintaining heterogeneity of distribution. These results, together with the framework and datasets made publicly available as R packages, will guide simulation methods selection and their future development.
Affiliation(s)
- Yue Cao
  - Charles Perkins Centre, The University of Sydney, Sydney, Australia
  - School of Mathematics and Statistics, The University of Sydney, Sydney, Australia
- Pengyi Yang
  - Charles Perkins Centre, The University of Sydney, Sydney, Australia
  - School of Mathematics and Statistics, The University of Sydney, Sydney, Australia
  - Computational Systems Biology Group, Children's Medical Research Institute, Westmead, NSW, Australia
- Jean Yee Hwa Yang
  - Charles Perkins Centre, The University of Sydney, Sydney, Australia
  - School of Mathematics and Statistics, The University of Sydney, Sydney, Australia
48
Ge X, Chen YE, Song D, McDermott M, Woyshner K, Manousopoulou A, Wang N, Li W, Wang LD, Li JJ. Clipper: p-value-free FDR control on high-throughput data from two conditions. Genome Biol 2021; 22:288. [PMID: 34635147 PMCID: PMC8504070 DOI: 10.1186/s13059-021-02506-9] [Citation(s) in RCA: 19] [Impact Index Per Article: 4.8] [Received: 03/16/2021] [Accepted: 09/21/2021] [Indexed: 12/12/2022] Open
Abstract
High-throughput biological data analysis commonly involves identifying features such as genes, genomic regions, and proteins, whose values differ between two conditions, from numerous features measured simultaneously. The most widely used criterion to ensure the analysis reliability is the false discovery rate (FDR), which is primarily controlled based on p-values. However, obtaining valid p-values relies on either reasonable assumptions of data distribution or large numbers of replicates under both conditions. Clipper is a general statistical framework for FDR control without relying on p-values or specific data distributions. Clipper outperforms existing methods for a broad range of applications in high-throughput data analysis.
Affiliation(s)
- Xinzhou Ge
  - Department of Statistics, University of California, Los Angeles, 90095, CA, USA
- Yiling Elaine Chen
  - Department of Statistics, University of California, Los Angeles, 90095, CA, USA
- Dongyuan Song
  - Interdepartmental Program in Bioinformatics, University of California, Los Angeles, 90095, CA, USA
- MeiLu McDermott
  - Beckman Research Institute, City of Hope National Medical Center, Duarte, 91010, CA, USA
  - The Quantitative and Computational Biology section, University of Southern California, Los Angeles, 90089, CA, USA
- Kyla Woyshner
  - Beckman Research Institute, City of Hope National Medical Center, Duarte, 91010, CA, USA
- Antigoni Manousopoulou
  - Beckman Research Institute, City of Hope National Medical Center, Duarte, 91010, CA, USA
- Ning Wang
  - Interdepartmental Program in Bioinformatics, University of California, Los Angeles, 90095, CA, USA
- Wei Li
  - Division of Computational Biomedicine, Department of Biological Chemistry, School of Medicine, University of California, Irvine, 92697, CA, USA
- Leo D Wang
  - Beckman Research Institute, City of Hope National Medical Center, Duarte, 91010, CA, USA
- Jingyi Jessica Li
  - Department of Statistics, University of California, Los Angeles, 90095, CA, USA
  - Interdepartmental Program in Bioinformatics, University of California, Los Angeles, 90095, CA, USA
  - Department of Human Genetics, University of California, Los Angeles, 90095, CA, USA
  - Department of Computational Medicine, University of California, Los Angeles, 90095, CA, USA
  - Department of Biostatistics, University of California, Los Angeles, 90095, CA, USA
49
Song D, Li K, Hemminger Z, Wollman R, Li JJ. scPNMF: sparse gene encoding of single cells to facilitate gene selection for targeted gene profiling. Bioinformatics 2021; 37:i358-i366. [PMID: 34252925 PMCID: PMC8275345 DOI: 10.1093/bioinformatics/btab273] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 11/15/2022] Open
Abstract
Motivation: Single-cell RNA sequencing (scRNA-seq) captures whole transcriptome information of individual cells. While scRNA-seq measures thousands of genes, researchers are often interested in only dozens to hundreds of genes for a closer study. Then, a question is how to select those informative genes from scRNA-seq data. Moreover, single-cell targeted gene profiling technologies are gaining popularity for their low costs, high sensitivity and extra (e.g. spatial) information; however, they typically can only measure up to a few hundred genes. Then another challenging question is how to select genes for targeted gene profiling based on existing scRNA-seq data.
Results: Here, we develop the single-cell Projective Non-negative Matrix Factorization (scPNMF) method to select informative genes from scRNA-seq data in an unsupervised way. Compared with existing gene selection methods, scPNMF has two advantages. First, its selected informative genes can better distinguish cell types. Second, it enables the alignment of new targeted gene profiling data with reference data in a low-dimensional space to facilitate the prediction of cell types in the new data. Technically, scPNMF modifies the PNMF algorithm for gene selection by changing the initialization and adding a basis selection step, which selects informative bases to distinguish cell types. We demonstrate that scPNMF outperforms the state-of-the-art gene selection methods on diverse scRNA-seq datasets. Moreover, we show that scPNMF can guide the design of targeted gene profiling experiments and the cell-type annotation on targeted gene profiling data.
Availability and implementation: The R package is open-access and available at https://github.com/JSB-UCLA/scPNMF. The data used in this work are available at Zenodo: https://doi.org/10.5281/zenodo.4797997.
Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Dongyuan Song
  - Bioinformatics Interdepartmental Ph.D. Program, University of California, Los Angeles, CA 90095-7246, USA
- Kexin Li
  - Department of Statistics, University of California, Los Angeles, CA 90095-1554, USA
- Zachary Hemminger
  - Institute for Quantitative and Computational Biosciences, University of California, Los Angeles, CA 90095, USA
  - Department of Integrative Biology and Physiology, University of California, Los Angeles, CA 90095-7239, USA
- Roy Wollman
  - Institute for Quantitative and Computational Biosciences, University of California, Los Angeles, CA 90095, USA
  - Department of Integrative Biology and Physiology, University of California, Los Angeles, CA 90095-7239, USA
  - Department of Chemistry and Biochemistry, University of California, Los Angeles, CA 90095-1569, USA
- Jingyi Jessica Li
  - Department of Statistics, University of California, Los Angeles, CA 90095-1554, USA
  - Department of Human Genetics, University of California, Los Angeles, CA 90095-7088, USA
  - Department of Computational Medicine, University of California, Los Angeles, CA 90095-1766, USA
  - Department of Biostatistics, University of California Los Angeles, CA 90095-1772, USA
50
Sun T, Song D, Li WV, Li JJ. Publisher Correction: scDesign2: a transparent simulator that generates high-fidelity single-cell gene expression count data with gene correlations captured. Genome Biol 2021; 22:177. [PMID: 34108038 PMCID: PMC8191178 DOI: 10.1186/s13059-021-02394-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/10/2022] Open
Affiliation(s)
- Tianyi Sun
  - Department of Statistics, University of California, Los Angeles, CA, 90095-1554, USA
- Dongyuan Song
  - Interdepartmental Program of Bioinformatics, University of California, Los Angeles, CA, 90095-7246, USA
- Wei Vivian Li
  - Department of Biostatistics and Epidemiology, Rutgers School of Public Health, Piscataway, NJ, 08854, USA
- Jingyi Jessica Li
  - Department of Statistics, University of California, Los Angeles, CA, 90095-1554, USA
  - Department of Human Genetics, University of California, Los Angeles, CA, 90095-7088, USA
  - Department of Computational Medicine, University of California, Los Angeles, CA, 90095-1766, USA
  - Department of Biostatistics, University of California, Los Angeles, CA, 90095-1772, USA