1. Ghosh T, Baxter RM, Seal S, Lui VG, Rudra P, Vu T, Hsieh EW, Ghosh D. cytoKernel: Robust kernel embeddings for assessing differential expression of single cell data. bioRxiv 2024:2024.08.16.608287. [PMID: 39229233 PMCID: PMC11370373 DOI: 10.1101/2024.08.16.608287] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/05/2024]
Abstract
High-throughput sequencing of single-cell data can be used to rigorously evaluate cell specification and enable the detection of intricate variations between groups or conditions. Many popular existing methods for differential expression target differences in aggregate measurements (mean, median, sum) and are therefore limited to detecting global differential changes. We present a robust method for differential expression of single-cell data using a kernel-based score test, cytoKernel. cytoKernel is specifically designed to assess the differential expression of single-cell RNA sequencing and high-dimensional flow or mass cytometry data using the full probability distribution pattern. cytoKernel is based on kernel embeddings of the probability distributions of the single-cell data, computed from the pairwise divergence/distance between the distributions of subjects. It can detect both patterns involving aggregate changes and more elusive variations that are often overlooked due to the multimodal characteristics of single-cell data. We performed extensive benchmarks across both simulated and real data sets from mass cytometry and single-cell RNA sequencing. The cytoKernel procedure effectively controls the False Discovery Rate (FDR) and shows favourable performance compared to existing methods. The method is able to identify more differential patterns than existing approaches. We apply cytoKernel to assess gene expression and protein marker expression differences from cell subpopulations in various publicly available single-cell RNA-seq and mass cytometry data sets. The methods described in this paper are implemented in the open-source R package cytoKernel, which is freely available from Bioconductor at http://bioconductor.org/packages/cytoKernel.
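The idea of comparing subjects through the pairwise divergence between their full expression distributions can be illustrated with a small sketch. This is not the cytoKernel implementation or its score test: the histogram density estimate, the Jensen-Shannon divergence, the median bandwidth, and all function names below are assumptions chosen for illustration. The simulated data contrast a unimodal group with a bimodal group that has the same mean, a pattern that a comparison of means would miss.

```python
import numpy as np

def subject_density(values, bin_edges):
    """Histogram estimate of one subject's expression distribution (assumed estimator)."""
    counts, _ = np.histogram(values, bins=bin_edges)
    probs = counts.astype(float) + 1e-12  # small constant avoids log(0)
    return probs / probs.sum()

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions (natural log)."""
    m = 0.5 * (p + q)
    return 0.5 * (np.sum(p * np.log(p / m)) + np.sum(q * np.log(q / m)))

def pairwise_divergence(samples, n_bins=20):
    """Symmetric matrix of divergences between every pair of subjects."""
    pooled = np.concatenate(samples)
    bin_edges = np.linspace(pooled.min(), pooled.max(), n_bins + 1)
    densities = [subject_density(s, bin_edges) for s in samples]
    n = len(samples)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = js_divergence(densities[i], densities[j])
    return dist

def gaussian_kernel(dist):
    """Turn divergences into similarities; median bandwidth is an assumed heuristic."""
    bandwidth = np.median(dist[np.triu_indices_from(dist, k=1)])
    if bandwidth <= 0:
        bandwidth = 1.0
    return np.exp(-dist / bandwidth)

# Hypothetical data: 3 subjects per group, 500 cells each, one marker.
rng = np.random.default_rng(0)
group_a = [rng.normal(0.0, 1.0, 500) for _ in range(3)]          # unimodal
group_b = [np.concatenate([rng.normal(-2.0, 0.5, 250),
                           rng.normal(2.0, 0.5, 250)])           # bimodal, same mean
           for _ in range(3)]

dist = pairwise_divergence(group_a + group_b)
kernel = gaussian_kernel(dist)
```

In this sketch the between-group entries of `dist` are larger than the within-group entries even though the group means coincide. A kernel matrix of this kind is the sort of input a kernel-based test could use, but the test itself is not shown here.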
Affiliation(s)
- Tusharkanti Ghosh
  - Department of Biostatistics and Informatics, Colorado School of Public Health, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
- Ryan M Baxter
  - Department of Immunology and Microbiology, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
- Souvik Seal
  - Department of Public Health Sciences, Medical University of South Carolina, Charleston, SC, USA
- Victor G Lui
  - Center for Translational Immunology, Benaroya Research Institute at Virginia Mason, Seattle, WA, USA
- Pratyaydipta Rudra
  - Department of Statistics, Oklahoma State University, Stillwater, OK, USA
- Thao Vu
  - Department of Biostatistics and Informatics, Colorado School of Public Health, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
- Elena Wy Hsieh
  - Department of Immunology and Microbiology, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
- Debashis Ghosh
  - Department of Biostatistics and Informatics, Colorado School of Public Health, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
2. Du M, Johnston K, Berrocal V, Li W, Xu X, Yu Z. ULV: A robust statistical method for clustered data, with applications to multi-subject, single-cell omics data. arXiv 2024:arXiv:2406.06767v1. [PMID: 38947924 PMCID: PMC11213121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/02/2024]
Abstract
Molecular and genomic technological advancements have greatly enhanced our understanding of biological processes by allowing us to quantify key biological variables such as gene expression, protein levels, and microbiome compositions. These breakthroughs have enabled us to achieve increasingly higher levels of resolution in our measurements, exemplified by our ability to comprehensively profile biological information at the single-cell level. However, the analysis of such data faces several critical challenges: limited number of individuals, non-normality, potential dropouts, outliers, and repeated measurements from the same individual. In this article, we propose a novel method, which we call U-statistic based latent variable (ULV). Our proposed method takes advantage of the robustness of rank-based statistics and exploits the statistical efficiency of parametric methods for small sample sizes. It is a computationally feasible framework that addresses all the issues mentioned above simultaneously. We show that our method controls false positives at desired significance levels. An additional advantage of ULV is its flexibility in modeling various types of single-cell data, including both RNA and protein abundance. The usefulness of our method is demonstrated in two studies: a single-cell proteomics study of acute myelogenous leukemia (AML) and a single-cell RNA study of COVID-19 symptoms. In the AML study, ULV successfully identified differentially expressed proteins that would have been missed by the pseudobulk version of the Wilcoxon rank-sum test. In the COVID-19 study, ULV identified genes associated with covariates such as age and gender, and genes that would be missed without adjusting for covariates. The differentially expressed genes identified by our method are less biased toward genes with high expression levels. Furthermore, ULV identified additional gene pathways likely contributing to the mechanisms of COVID-19 severity.
Affiliation(s)
- Mingyu Du
  - Center for Complex Biological Systems, University of California, Irvine, 92697, CA, USA
- Kevin Johnston
  - Department of Anatomy and Neurobiology, University of California, Irvine, 92697, CA, USA
- Veronica Berrocal
  - Department of Statistics, University of California, Irvine, 92697, CA, USA
- Wei Li
  - Division of Computational Biomedicine, Department of Biological Chemistry, School of Medicine, University of California, Irvine, 92697, CA, USA
- Xiangmin Xu
  - Department of Anatomy and Neurobiology, University of California, Irvine, 92697, CA, USA
  - Center for Neural Circuits Mapping, University of California, Irvine, 92697, CA, USA
- Zhaoxia Yu
  - Department of Statistics, University of California, Irvine, 92697, CA, USA
  - Center for Neural Circuits Mapping, University of California, Irvine, 92697, CA, USA
3. Guo X, Ning J, Chen Y, Liu G, Zhao L, Fan Y, Sun S. Recent advances in differential expression analysis for single-cell RNA-seq and spatially resolved transcriptomic studies. Brief Funct Genomics 2024; 23:95-109. [PMID: 37022699 DOI: 10.1093/bfgp/elad011] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 07/08/2022] [Revised: 12/09/2022] [Accepted: 03/10/2023] [Indexed: 04/07/2023]
Abstract
Differential expression (DE) analysis is a necessary step in the analysis of single-cell RNA sequencing (scRNA-seq) and spatially resolved transcriptomics (SRT) data. Unlike traditional bulk RNA-seq, DE analysis for scRNA-seq or SRT data has unique characteristics that may contribute to the difficulty of detecting DE genes. However, the plethora of DE tools that work with various assumptions makes it difficult to choose an appropriate one. Furthermore, a comprehensive review on detecting DE genes for scRNA-seq data or SRT data from multi-condition, multi-sample experimental designs is lacking. To bridge such a gap, here, we first focus on the challenges of DE detection, then highlight potential opportunities that facilitate further progress in scRNA-seq or SRT analysis, and finally provide insights and guidance in selecting appropriate DE tools or developing new computational DE methods.
Affiliation(s)
- Xiya Guo
  - School of Public Health, Xi'an Jiaotong University, Xi'an, Shaanxi 710061, P.R. China
  - Key Laboratory of Trace Elements and Endemic Diseases, Center for Single Cell Omics and Health, Xi'an Jiaotong University, Xi'an, Shaanxi 710061, P.R. China
- Jin Ning
  - School of Public Health, Xi'an Jiaotong University, Xi'an, Shaanxi 710061, P.R. China
  - Key Laboratory of Trace Elements and Endemic Diseases, Center for Single Cell Omics and Health, Xi'an Jiaotong University, Xi'an, Shaanxi 710061, P.R. China
- Yuanze Chen
  - School of Public Health, Xi'an Jiaotong University, Xi'an, Shaanxi 710061, P.R. China
  - Key Laboratory of Trace Elements and Endemic Diseases, Center for Single Cell Omics and Health, Xi'an Jiaotong University, Xi'an, Shaanxi 710061, P.R. China
- Guoliang Liu
  - School of Public Health, Xi'an Jiaotong University, Xi'an, Shaanxi 710061, P.R. China
  - Key Laboratory of Trace Elements and Endemic Diseases, Center for Single Cell Omics and Health, Xi'an Jiaotong University, Xi'an, Shaanxi 710061, P.R. China
- Liyan Zhao
  - School of Public Health, Xi'an Jiaotong University, Xi'an, Shaanxi 710061, P.R. China
  - Key Laboratory of Trace Elements and Endemic Diseases, Center for Single Cell Omics and Health, Xi'an Jiaotong University, Xi'an, Shaanxi 710061, P.R. China
- Yue Fan
  - School of Public Health, Xi'an Jiaotong University, Xi'an, Shaanxi 710061, P.R. China
  - Key Laboratory of Trace Elements and Endemic Diseases, Center for Single Cell Omics and Health, Xi'an Jiaotong University, Xi'an, Shaanxi 710061, P.R. China
- Shiquan Sun
  - School of Public Health, Xi'an Jiaotong University, Xi'an, Shaanxi 710061, P.R. China
  - Key Laboratory of Trace Elements and Endemic Diseases, Center for Single Cell Omics and Health, Xi'an Jiaotong University, Xi'an, Shaanxi 710061, P.R. China
4. Campbell I, Glinka M, Shaban F, Kirkwood KJ, Nadalin F, Adams D, Papatheodorou I, Burger A, Baldock RA, Arends MJ, Din S. The Promise of Single-Cell RNA Sequencing to Redefine the Understanding of Crohn's Disease Fibrosis Mechanisms. J Clin Med 2023; 12:3884. [PMID: 37373578 PMCID: PMC10299644 DOI: 10.3390/jcm12123884] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/08/2023] [Revised: 06/03/2023] [Accepted: 06/05/2023] [Indexed: 06/29/2023]
Abstract
Crohn's disease (CD) is a chronic inflammatory bowel disease with a high prevalence throughout the world. The development of Crohn's-related fibrosis, which leads to strictures in the gastrointestinal tract, presents a particular challenge and is associated with significant morbidity. There are currently no specific anti-fibrotic therapies available, and so treatment is aimed at managing the stricturing complications of fibrosis once it is established. This often requires invasive and repeated endoscopic or surgical intervention. The advent of single-cell sequencing has led to significant advances in our understanding of CD at a cellular level, and this has presented opportunities to develop new therapeutic agents with the aim of preventing or reversing fibrosis. In this paper, we discuss the current understanding of CD fibrosis pathogenesis, summarise current management strategies, and present the promise of single-cell sequencing as a tool for the development of effective anti-fibrotic therapies.
Affiliation(s)
- Iona Campbell
  - Edinburgh Inflammatory Bowel Disease Unit, Western General Hospital, NHS Lothian, Edinburgh EH4 2XU, UK
- Michael Glinka
  - Edinburgh Pathology, Centre for Comparative Pathology, Cancer Research UK Scotland Centre, Institute of Cancer and Genetics, University of Edinburgh, Crewe Road, Edinburgh EH4 2XU, UK
- Fadlo Shaban
  - Edinburgh Colorectal Unit, Western General Hospital, NHS Lothian, Edinburgh EH4 2XU, UK
- Kathryn J. Kirkwood
  - Department of Pathology, Western General Hospital, NHS Lothian, Edinburgh EH4 2XU, UK
- Francesca Nadalin
  - European Molecular Biology Laboratory, European Bioinformatics Institute, EMBL-EBI, Hinxton, Cambridge CB10 1SD, UK
- David Adams
  - Experimental Cancer Genetics, Wellcome Sanger Institute, Hinxton, Cambridge CB10 1SA, UK
- Irene Papatheodorou
  - European Molecular Biology Laboratory, European Bioinformatics Institute, EMBL-EBI, Hinxton, Cambridge CB10 1SD, UK
- Albert Burger
  - Department of Computer Science, School of Mathematical and Computer Sciences, Heriot-Watt University, Edinburgh EH14 4AS, UK
- Richard A. Baldock
  - Edinburgh Pathology, Centre for Comparative Pathology, Cancer Research UK Scotland Centre, Institute of Cancer and Genetics, University of Edinburgh, Crewe Road, Edinburgh EH4 2XU, UK
- Mark J. Arends
  - Edinburgh Pathology, Centre for Comparative Pathology, Cancer Research UK Scotland Centre, Institute of Cancer and Genetics, University of Edinburgh, Crewe Road, Edinburgh EH4 2XU, UK
- Shahida Din
  - Edinburgh Inflammatory Bowel Disease Unit, Western General Hospital, NHS Lothian, Edinburgh EH4 2XU, UK
5. Cao Y, Ghazanfar S, Yang P, Yang J. Benchmarking of analytical combinations for COVID-19 outcome prediction using single-cell RNA sequencing data. Brief Bioinform 2023; 24:7140296. [PMID: 37096588 DOI: 10.1093/bib/bbad159] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 01/17/2023] [Revised: 03/30/2023] [Accepted: 04/03/2023] [Indexed: 04/26/2023]
Abstract
Advances in single-cell transcriptomic technologies have led to increasing use of single-cell RNA sequencing (scRNA-seq) data in large-scale patient cohort studies. The resulting high-dimensional data can be summarized and incorporated into patient outcome prediction models in several ways; however, there is a pressing need to understand the impact of analytical decisions on such model quality. In this study, we evaluate the impact of analytical choices, including model choices, ensemble learning strategies and integration approaches, on patient outcome prediction using five scRNA-seq COVID-19 datasets. First, we examine the difference in performance between using single-view feature space versus multi-view feature space. Next, we survey multiple learning platforms from classical machine learning to modern deep learning methods. Lastly, we compare different integration approaches when combining datasets is necessary. Through benchmarking such analytical combinations, our study highlights the power of ensemble learning, consistency among different learning methods and robustness to dataset normalization when using multiple datasets as the model input.
Affiliation(s)
- Yue Cao
  - School of Mathematics and Statistics, The University of Sydney, NSW 2006, Australia
  - Charles Perkins Centre, The University of Sydney, NSW 2006, Australia
  - Sydney Precision Data Science Centre, University of Sydney, NSW 2006, Australia
  - Laboratory of Data Discovery for Health Limited (D24H), Science Park, Hong Kong SAR, China
- Shila Ghazanfar
  - School of Mathematics and Statistics, The University of Sydney, NSW 2006, Australia
  - Charles Perkins Centre, The University of Sydney, NSW 2006, Australia
  - Sydney Precision Data Science Centre, University of Sydney, NSW 2006, Australia
- Pengyi Yang
  - School of Mathematics and Statistics, The University of Sydney, NSW 2006, Australia
  - Charles Perkins Centre, The University of Sydney, NSW 2006, Australia
  - Children's Medical Research Institute, Faculty of Medicine and Health, The University of Sydney, NSW 2145, Australia
  - Sydney Precision Data Science Centre, University of Sydney, NSW 2006, Australia
  - Laboratory of Data Discovery for Health Limited (D24H), Science Park, Hong Kong SAR, China
- Jean Yang
  - School of Mathematics and Statistics, The University of Sydney, NSW 2006, Australia
  - Charles Perkins Centre, The University of Sydney, NSW 2006, Australia
  - Sydney Precision Data Science Centre, University of Sydney, NSW 2006, Australia
  - Laboratory of Data Discovery for Health Limited (D24H), Science Park, Hong Kong SAR, China