1. Pavel A, Grønberg MG, Clemmensen LH. The impact of dropouts in scRNAseq dense neighborhood analysis. Comput Struct Biotechnol J 2025; 27:1278-1285. [PMID: 40225837 PMCID: PMC11992407 DOI: 10.1016/j.csbj.2025.03.033] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/01/2024] [Revised: 03/19/2025] [Accepted: 03/20/2025] [Indexed: 04/15/2025] Open
Abstract
Single cell RNA sequencing (scRNAseq) makes it possible to investigate transcriptomic profiles at the single-cell level. However, the data pose unique challenges in comparison to bulk transcriptomic data, one being high dropout rates, which yield highly sparse data. Many classical analysis and preprocessing pipelines are based on the assumption that poor data can be counteracted by quantity and that similar cells (samples) are close to each other in space. Clustering is commonly used to detect clusters (dense local cell neighborhoods) under the assumption that similar cells are close to each other in space (where closeness depends on the distance metric used). The most commonly used clustering methodologies to detect dense local neighborhoods are based on graph clustering on a nearest neighbor graph. However, high dropout rates may break this assumption and make it difficult to reliably detect such dense local neighborhoods. We assess cluster homogeneity and stability under increasing degrees of dropout in one of the most popular clustering pipelines (dimensionality reduction + graph-based clustering), as provided by the scRNAseq analysis packages Seurat and Scanpy. Our study shows that while the default pipeline performs well in terms of cluster homogeneity (i.e., cells in a cluster are of the same type), even with increasing dropout rates, the stability of clusters (i.e., cell pairs consistently being in the same cluster) decreases. This implies that sub-populations within cell types are increasingly difficult to identify under increasing dropout rates because observations are not consistently close. Our results challenge the current practice of using default clustering pipelines and the general assumption of identifiable local neighborhoods on high dropout data. Hence, these results suggest that careful consideration in interpretation and downstream analysis is needed when relying on local neighborhoods and clusters in scRNAseq data. In addition, these results call for extensive benchmarking to identify and provide methods whose local neighborhood relationships are robust on data containing low to high dropout rates.
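The two quantities the abstract turns on, dropout-induced sparsity and the stability of cell pairs across clusterings, can be illustrated with a small sketch. This is not the authors' pipeline or their exact stability metric: the function names `apply_dropout`, `sparsity` and `pair_stability`, the toy count matrix, and the hard-coded cluster labels are all assumptions made for illustration, and a real analysis would obtain the labels from Seurat or Scanpy.

```python
import random
from itertools import combinations


def apply_dropout(counts, rate, rng):
    """Set each nonzero count to zero with probability `rate` (hypothetical dropout model)."""
    return [[0 if (v and rng.random() < rate) else v for v in row] for row in counts]


def sparsity(counts):
    """Fraction of zero entries in a cells-by-genes matrix."""
    total = sum(len(row) for row in counts)
    zeros = sum(v == 0 for row in counts for v in row)
    return zeros / total


def pair_stability(labels_a, labels_b):
    """Of the cell pairs sharing a cluster in `labels_a`, the fraction that also share one in `labels_b`."""
    together = [
        (i, j)
        for i, j in combinations(range(len(labels_a)), 2)
        if labels_a[i] == labels_a[j]
    ]
    if not together:
        return 1.0
    kept = sum(labels_b[i] == labels_b[j] for i, j in together)
    return kept / len(together)


# Toy cells-by-genes count matrix (made-up values, two obvious groups of cells).
counts = [[5, 0, 3, 1], [4, 0, 2, 1], [0, 7, 0, 6], [0, 8, 1, 5]]
rng = random.Random(0)
noisy = apply_dropout(counts, rate=0.5, rng=rng)

# Cluster labels before and after dropout, hard-coded here instead of being computed.
before = [0, 0, 1, 1]
after = [0, 1, 1, 1]

print("sparsity before:", sparsity(counts))
print("sparsity after: ", sparsity(noisy))
print("pair stability: ", pair_stability(before, after))
```

In this toy case a clustering can remain homogeneous with respect to cell type while individual pairs drift apart, which is the distinction between homogeneity and stability that the abstract draws.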
Affiliation(s)
- Alisa Pavel
  - Department of Applied Mathematics and Computer Science, Technical University of Denmark, 2800, Kongens Lyngby, Denmark
- Manja Gersholm Grønberg
  - Department of Applied Mathematics and Computer Science, Technical University of Denmark, 2800, Kongens Lyngby, Denmark
- Line H. Clemmensen
  - Department of Applied Mathematics and Computer Science, Technical University of Denmark, 2800, Kongens Lyngby, Denmark
  - Department of Mathematical Sciences, University of Copenhagen, 2100, Copenhagen, Denmark
2. Liang X, Torkel M, Cao Y, Yang JYH. Multi-task benchmarking of spatially resolved gene expression simulation models. Genome Biol 2025; 26:57. [PMID: 40098171 PMCID: PMC11912772 DOI: 10.1186/s13059-025-03505-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/15/2024] [Accepted: 02/12/2025] [Indexed: 03/19/2025] Open
Abstract
BACKGROUND: Computational methods for spatially resolved transcriptomics (SRT) are often developed and assessed using simulated data. The effectiveness of these evaluations relies on the ability of simulation methods to accurately reflect experimental data. However, a systematic evaluation framework for spatial simulators is currently lacking. RESULTS: Here, we present SpatialSimBench, a comprehensive evaluation framework that assesses 13 simulation methods using ten distinct SRT datasets. We introduce simAdaptor, a tool that extends single-cell simulators by incorporating spatial variables, enabling them to simulate spatial data. SimAdaptor ensures SpatialSimBench is backwards compatible, facilitating direct comparisons between spatially aware simulators and existing non-spatial single-cell simulators through this adaptation. Using SpatialSimBench, we demonstrate the feasibility of leveraging existing single-cell simulators for SRT data and highlight performance differences among methods. Additionally, we evaluate the simulation methods based on a total of 35 metrics across data property estimation, various downstream analyses, and scalability. In total, we generated 4550 results from 13 simulation methods, ten spatial datasets, and 35 metrics. CONCLUSIONS: Our findings reveal that model estimation can be influenced by distribution assumptions and dataset characteristics. In summary, our evaluation framework provides guidelines for selecting appropriate methods for specific scenarios and informs future method development.
Affiliation(s)
- Xiaoqi Liang
  - School of Mathematics and Statistics, The University of Sydney, Sydney, NSW, 2006, Australia
  - Sydney Precision Data Science Centre, The University of Sydney, Sydney, NSW, 2006, Australia
  - Charles Perkins Centre, The University of Sydney, Sydney, NSW, 2006, Australia
- Marni Torkel
  - School of Mathematics and Statistics, The University of Sydney, Sydney, NSW, 2006, Australia
  - Sydney Precision Data Science Centre, The University of Sydney, Sydney, NSW, 2006, Australia
  - Charles Perkins Centre, The University of Sydney, Sydney, NSW, 2006, Australia
- Yue Cao
  - School of Mathematics and Statistics, The University of Sydney, Sydney, NSW, 2006, Australia
  - Sydney Precision Data Science Centre, The University of Sydney, Sydney, NSW, 2006, Australia
  - Charles Perkins Centre, The University of Sydney, Sydney, NSW, 2006, Australia
  - Laboratory of Data Discovery for Health Limited (D24H), Science Park, Hong Kong SAR, China
- Jean Yee Hwa Yang
  - School of Mathematics and Statistics, The University of Sydney, Sydney, NSW, 2006, Australia
  - Sydney Precision Data Science Centre, The University of Sydney, Sydney, NSW, 2006, Australia
  - Charles Perkins Centre, The University of Sydney, Sydney, NSW, 2006, Australia
  - Laboratory of Data Discovery for Health Limited (D24H), Science Park, Hong Kong SAR, China
3. Ge S, Sun S, Xu H, Cheng Q, Ren Z. Deep learning in single-cell and spatial transcriptomics data analysis: advances and challenges from a data science perspective. Brief Bioinform 2025; 26:bbaf136. [PMID: 40185158 PMCID: PMC11970898 DOI: 10.1093/bib/bbaf136] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/16/2024] [Revised: 02/17/2025] [Accepted: 03/05/2025] [Indexed: 04/07/2025] Open
Abstract
The development of single-cell and spatial transcriptomics has revolutionized our capacity to investigate cellular properties, functions, and interactions in both cellular and spatial contexts. Despite this progress, the analysis of single-cell and spatial omics data remains challenging. First, single-cell sequencing data are high-dimensional and sparse, and are often contaminated by noise and uncertainty, obscuring the underlying biological signal. Second, these data often encompass multiple modalities, including gene expression, epigenetic modifications, metabolite levels, and spatial locations. Integrating these diverse data modalities is crucial for enhancing prediction accuracy and biological interpretability. Third, while the scale of single-cell sequencing has expanded to millions of cells, high-quality annotated datasets are still limited. Fourth, the complex correlations of biological tissues make it difficult to accurately reconstruct cellular states and spatial contexts. Traditional feature engineering approaches struggle with the complexity of biological networks, while deep learning, with its ability to handle high-dimensional data and automatically identify meaningful patterns, has shown great promise in overcoming these challenges. Besides systematically reviewing the strengths and weaknesses of advanced deep learning methods, we have curated 21 datasets from nine benchmarks to evaluate the performance of 58 computational methods. Our analysis reveals that model performance can vary significantly across different benchmark datasets and evaluation metrics, providing a useful perspective for selecting the most appropriate approach based on a specific application scenario. We highlight three key areas for future development, offering valuable insights into how deep learning can be effectively applied to transcriptomic data analysis in biological, medical, and clinical settings.
Affiliation(s)
- Shuang Ge
  - Shenzhen International Graduate School, Tsinghua University, 2279 Lishui Road, Nanshan District, Shenzhen 518055, Guangdong, China
  - Pengcheng Laboratory, 6001 Shahe West Road, Nanshan District, Shenzhen 518055, Guangdong, China
- Shuqing Sun
  - Shenzhen International Graduate School, Tsinghua University, 2279 Lishui Road, Nanshan District, Shenzhen 518055, Guangdong, China
- Huan Xu
  - School of Public Health, Anhui University of Science and Technology, 15 Fengxia Road, Changfeng County, Hefei 231131, Anhui, China
- Qiang Cheng
  - Department of Computer Science, University of Kentucky, 329 Rose Street, Lexington 40506, Kentucky, USA
  - Institute for Biomedical Informatics, University of Kentucky, 800 Rose Street, Lexington 40506, Kentucky, USA
- Zhixiang Ren
  - Pengcheng Laboratory, 6001 Shahe West Road, Nanshan District, Shenzhen 518055, Guangdong, China
4. Monzó C, Aguerralde-Martin M, Martínez-Mira C, Arzalluz-Luque Á, Conesa A, Tarazona S. MOSim: bulk and single-cell multilayer regulatory network simulator. Brief Bioinform 2025; 26:bbaf110. [PMID: 40116657 PMCID: PMC11926980 DOI: 10.1093/bib/bbaf110] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/05/2024] [Revised: 02/13/2025] [Accepted: 02/21/2025] [Indexed: 03/23/2025] Open
Abstract
As multi-omics sequencing technologies advance, the need for simulation tools capable of generating realistic and diverse (bulk and single-cell) multi-omics datasets for method testing and benchmarking becomes increasingly important. We present MOSim, an R package that simulates both bulk (via the mosim function) and single-cell (via the sc_mosim function) multi-omics data. The mosim function generates bulk transcriptomics data (RNA-seq) and additional regulatory omics layers (ATAC-seq, miRNA-seq, ChIP-seq, Methyl-seq, and transcription factors), while sc_mosim simulates single-cell transcriptomics data (scRNA-seq) with scATAC-seq and transcription factors as regulatory layers. The tool supports various experimental designs, including simulation of gene co-expression patterns, biological replicates, and differential expression between conditions. MOSim enables users to generate quantification matrices for each simulated omics data type, capturing the heterogeneity and complexity of bulk and single-cell multi-omics datasets. Furthermore, MOSim provides differentially abundant features within each omics layer and elucidates the active regulatory relationships between regulatory omics and gene expression data at both bulk and single-cell levels. By leveraging MOSim, researchers will be able to generate realistic and customizable bulk and single-cell multi-omics datasets to benchmark and validate analytical methods specifically designed for the integrative analysis of diverse regulatory omics data.
Affiliation(s)
- Carolina Monzó
  - Genomics of Gene Expression Lab, Institute for Integrative Systems Biology, Spanish National Research Council (CSIC-UV), C/ Catedràtic Agustín Escardino Benlloch, Paterna 46980, Spain
- Maider Aguerralde-Martin
  - Applied Statistics, Operational Research and Quality Department, Universitat Politècnica de València, Camí de Vera s/n, València 46022, Spain
- Carlos Martínez-Mira
  - Biobam Bioinformatics S.L., Marina de Valencia Base 5, BioHub, C/ de la Travesía, s/n, Sector Puerto 14 E, València 46024, Spain
- Ángeles Arzalluz-Luque
  - Genomics of Gene Expression Lab, Institute for Integrative Systems Biology, Spanish National Research Council (CSIC-UV), C/ Catedràtic Agustín Escardino Benlloch, Paterna 46980, Spain
  - Applied Statistics, Operational Research and Quality Department, Universitat Politècnica de València, Camí de Vera s/n, València 46022, Spain
- Ana Conesa
  - Genomics of Gene Expression Lab, Institute for Integrative Systems Biology, Spanish National Research Council (CSIC-UV), C/ Catedràtic Agustín Escardino Benlloch, Paterna 46980, Spain
- Sonia Tarazona
  - Applied Statistics, Operational Research and Quality Department, Universitat Politècnica de València, Camí de Vera s/n, València 46022, Spain
5. Liu S, Corcoran D, Garcia-Recio S, Marron JS, Perou C. Crafted experiments to evaluate feature selection methods for single-cell RNA-seq data. NAR Genom Bioinform 2025; 7:lqaf023. [PMID: 40109353 PMCID: PMC11920870 DOI: 10.1093/nargab/lqaf023] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/02/2024] [Revised: 01/17/2025] [Accepted: 02/24/2025] [Indexed: 03/22/2025] Open
Abstract
While numerous methods have been developed for analyzing scRNA-seq data, benchmarking them remains challenging. There is a lack of ground truth datasets for evaluating novel gene selection and/or clustering methods. We propose the use of crafted experiments, a new approach based upon perturbing signals in a real dataset, for comparing analysis methods. We demonstrate the effectiveness of crafted experiments for evaluating a new univariate distribution-oriented suite of feature selection methods, called GOF. We show that GOF selects features that robustly identify crafted features and performs well on real non-crafted data sets. Using varying ways of crafting, we also show the context in which each GOF method performs best. GOF is implemented as an open-source R package and is freely available under the GPL-2 license at https://github.com/siyao-liu/GOF. Source code, including all functions for constructing crafted experiments and benchmarking feature selection methods, is publicly available at https://github.com/siyao-liu/CraftedExperiment.
Affiliation(s)
- Siyao Liu
  - Lineberger Comprehensive Cancer Center, University of North Carolina, Chapel Hill, NC 27599, United States
  - Department of Genetics, School of Medicine, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, United States
- David L Corcoran
  - Lineberger Comprehensive Cancer Center, University of North Carolina, Chapel Hill, NC 27599, United States
  - Department of Genetics, School of Medicine, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, United States
- Susana Garcia-Recio
  - Lineberger Comprehensive Cancer Center, University of North Carolina, Chapel Hill, NC 27599, United States
  - Department of Genetics, School of Medicine, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, United States
- James S Marron
  - Lineberger Comprehensive Cancer Center, University of North Carolina, Chapel Hill, NC 27599, United States
  - Department of Statistics and Operation Research, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, United States
- Charles M Perou
  - Lineberger Comprehensive Cancer Center, University of North Carolina, Chapel Hill, NC 27599, United States
  - Department of Genetics, School of Medicine, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, United States
  - Department of Pathology and Laboratory Medicine, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, United States
6. Pouyabahar D, Andrews T, Bader GD. Interpretable single-cell factor decomposition using sciRED. Nat Commun 2025; 16:1878. [PMID: 39987196 PMCID: PMC11846867 DOI: 10.1038/s41467-025-57157-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/29/2024] [Accepted: 02/10/2025] [Indexed: 02/24/2025] Open
Abstract
Single-cell RNA sequencing maps gene expression heterogeneity within a tissue. However, identifying biological signals in these data is challenging due to confounding technical factors, sparsity, and high dimensionality. Data factorization methods address this by separating and identifying signals in the data, such as gene expression programs, but the resulting factors must be manually interpreted. We developed Single-Cell Interpretable REsidual Decomposition (sciRED) to improve the interpretation of scRNA-seq factor analysis. sciRED removes known confounding effects, uses rotations to improve factor interpretability, maps factors to known covariates, identifies unexplained factors that may capture hidden biological phenomena, and determines the genes and biological processes represented by the resulting factors. We apply sciRED to multiple scRNA-seq datasets and identify sex-specific variation in a kidney map, discern strong and weak immune stimulation signals in a PBMC dataset, reduce ambient RNA contamination in a rat liver atlas to help identify strain variation, and reveal rare cell type signatures and anatomical zonation gene programs in a healthy human liver map. These applications demonstrate that sciRED is useful in characterizing diverse biological signals within scRNA-seq data.
Affiliation(s)
- Delaram Pouyabahar
  - Department of Molecular Genetics, University of Toronto, Toronto, ON, Canada
  - The Donnelly Centre, University of Toronto, Toronto, ON, Canada
- Tallulah Andrews
  - Department of Biochemistry, Schulich School of Medicine and Dentistry, University of Western Ontario, London, ON, Canada
  - Department of Computer Science, University of Western Ontario, London, ON, Canada
- Gary D Bader
  - Department of Molecular Genetics, University of Toronto, Toronto, ON, Canada
  - The Donnelly Centre, University of Toronto, Toronto, ON, Canada
  - Department of Computer Science, University of Toronto, Toronto, ON, Canada
  - Lunenfeld-Tanenbaum Research Institute, Toronto, ON, Canada
  - Princess Margaret Research Institute, University Health Network, Toronto, ON, Canada
  - CIFAR Multiscale Human Program, CIFAR, Toronto, ON, Canada
7. Zhao B, Song K, Wei DQ, Xiong Y, Ding J. scCobra allows contrastive cell embedding learning with domain adaptation for single cell data integration and harmonization. Commun Biol 2025; 8:233. [PMID: 39948393 PMCID: PMC11825689 DOI: 10.1038/s42003-025-07692-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/12/2024] [Accepted: 02/06/2025] [Indexed: 02/16/2025] Open
Abstract
The rapid advancement of single-cell technologies has created an urgent need for effective methods to integrate and harmonize single-cell data. Technical and biological variations across studies complicate data integration, while conventional tools often struggle with reliance on gene expression distribution assumptions and over-correction. Here, we present scCobra, a deep generative neural network designed to overcome these challenges through contrastive learning with domain adaptation. scCobra effectively mitigates batch effects, minimizes over-correction, and ensures biologically meaningful data integration without assuming specific gene expression distributions. It enables online label transfer across datasets with batch effects, allowing continuous integration of new data without retraining. Additionally, scCobra supports batch effect simulation, advanced multi-omic integration, and scalable processing of large datasets. By integrating and harmonizing datasets from similar studies, scCobra expands the available data for investigating specific biological problems, improving cross-study comparability, and revealing insights that may be obscured in isolated datasets.
Affiliation(s)
- Bowen Zhao
  - State Key Laboratory of Microbial Metabolism, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, China
  - Meakins-Christie Laboratories, Department of Medicine, McGill University Health Centre, Montreal, QC, Canada
  - Division of Experimental Medicine, Department of Medicine, McGill University, Montreal, QC, Canada
- Kailu Song
  - Meakins-Christie Laboratories, Department of Medicine, McGill University Health Centre, Montreal, QC, Canada
  - Quantitative Life Sciences, McGill University, Montreal, QC, Canada
- Dong-Qing Wei
  - State Key Laboratory of Microbial Metabolism, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, China
- Yi Xiong
  - State Key Laboratory of Microbial Metabolism, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, China
- Jun Ding
  - Meakins-Christie Laboratories, Department of Medicine, McGill University Health Centre, Montreal, QC, Canada
  - Division of Experimental Medicine, Department of Medicine, McGill University, Montreal, QC, Canada
  - Quantitative Life Sciences, McGill University, Montreal, QC, Canada
  - School of Computer Science, McGill University, Montreal, QC, Canada
  - Mila-Quebec AI Institute, Montreal, QC, Canada
8. Brombacher E, Schilling O, Kreutz C. Characterizing the omics landscape based on 10,000+ datasets. Sci Rep 2025; 15:3189. [PMID: 39863642 PMCID: PMC11762699 DOI: 10.1038/s41598-025-87256-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/22/2024] [Accepted: 01/17/2025] [Indexed: 01/27/2025] Open
Abstract
The characteristics of data produced by omics technologies are pivotal, as they critically influence the feasibility and effectiveness of computational methods applied in downstream analyses, such as data harmonization and differential abundance analyses. Furthermore, variability in these data characteristics across datasets plays a crucial role, leading to diverging outcomes in benchmarking studies, which are essential for guiding the selection of appropriate analysis methods in all omics fields. Additionally, downstream analysis tools are often developed and applied within specific omics communities due to the presumed differences in data characteristics attributed to each omics technology. In this study, we investigate over ten thousand datasets to understand how proteomics, metabolomics, lipidomics, transcriptomics, and microbiome data vary in specific data characteristics. We show patterns of data characteristics specific to the investigated omics types and provide a tool that enables researchers to assess how representative a given omics dataset is for its respective discipline. Moreover, we illustrate how data characteristics can impact analyses, using the example of normalization in the presence of sample-dependent proportions of missing values. Given the variability of omics data characteristics, we encourage the systematic inspection of these characteristics in benchmark studies and for downstream analyses to prevent suboptimal method selection and unintended bias.
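The interaction between normalization and sample-dependent missingness can be illustrated with a toy calculation. The sketch below is not taken from the paper: it assumes median centering as the normalization, assumes that the lowest-abundance features are the ones that go missing, and uses made-up log-abundance values. The function name `median_center` is hypothetical.

```python
from statistics import median


def median_center(sample):
    """Subtract the median of the observed values; missing values (None) stay missing."""
    observed = [v for v in sample if v is not None]
    m = median(observed)
    return [None if v is None else v - m for v in sample]


# Two samples with identical true log-abundances (hypothetical values).
truth = [10.0, 12.0, 14.0, 16.0, 18.0]
sample_a = list(truth)                      # nothing missing
sample_b = [None, None, 14.0, 16.0, 18.0]   # the two lowest features are missing

norm_a = median_center(sample_a)
norm_b = median_center(sample_b)

# The last feature has the same true value in both samples,
# yet the normalized values differ because the medians differ.
shift = norm_a[4] - norm_b[4]
print(shift)  # 2.0
```

Because the missing values in `sample_b` are concentrated at the low end, its observed median is higher, and centering introduces an apparent difference between two samples that are identical by construction.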
Affiliation(s)
- Eva Brombacher
  - Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center-University of Freiburg, Freiburg, Germany
  - Centre for Integrative Biological Signaling Studies (CIBSS), University of Freiburg, Freiburg, Germany
  - Spemann Graduate School of Biology and Medicine (SGBM), University of Freiburg, Freiburg, Germany
  - Faculty of Biology, University of Freiburg, Freiburg, Germany
- Oliver Schilling
  - Institute for Surgical Pathology, Medical Center-University of Freiburg, Faculty of Medicine, University of Freiburg, Freiburg, Germany
  - German Cancer Consortium (DKTK) and German Cancer Research Center (DKFZ), Heidelberg, Germany
  - BIOSS Centre for Biological Signaling Studies, University of Freiburg, Freiburg, Germany
- Clemens Kreutz
  - Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center-University of Freiburg, Freiburg, Germany
  - Centre for Integrative Biological Signaling Studies (CIBSS), University of Freiburg, Freiburg, Germany
9. CZI Cell Science Program, Abdulla S, Aevermann B, Assis P, Badajoz S, Bell SM, Bezzi E, Cakir B, Chaffer J, Chambers S, Cherry J, Chi T, Chien J, Dorman L, Garcia-Nieto P, Gloria N, Hastie M, Hegeman D, Hilton J, Huang T, Infeld A, Istrate AM, Jelic I, Katsuya K, Kim YJ, Liang K, Lin M, Lombardo M, Marshall B, Martin B, McDade F, Megill C, Patel N, Predeus A, Raymor B, Robatmili B, Rogers D, Rutherford E, Sadgat D, Shin A, Small C, Smith T, Sridharan P, Tarashansky A, Tavares N, Thomas H, Tolopko A, Urisko M, Yan J, Yeretssian G, Zamanian J, Mani A, Cool J, Carr A. CZ CELLxGENE Discover: a single-cell data platform for scalable exploration, analysis and modeling of aggregated data. Nucleic Acids Res 2025; 53:D886-D900. [PMID: 39607691 PMCID: PMC11701654 DOI: 10.1093/nar/gkae1142] [Citation(s) in RCA: 35] [Impact Index Per Article: 35.0] [Received: 10/05/2024] [Revised: 10/28/2024] [Accepted: 11/01/2024] [Indexed: 11/29/2024] Open
Abstract
Hundreds of millions of single cells have been analyzed using high-throughput transcriptomic methods. The cumulative knowledge within these datasets provides an exciting opportunity for unlocking insights into health and disease at the level of single cells. Meta-analyses that span diverse datasets, building on recent advances in large language models and other machine-learning approaches, open exciting new directions to model and extract insight from single-cell data. Despite the promise of these and emerging analytical tools for analyzing large amounts of data, the sheer number of datasets, data models and accessibility remain a challenge. Here, we present CZ CELLxGENE Discover (cellxgene.cziscience.com), a data platform that provides curated and interoperable single-cell data. Available via a free-to-use online data portal, CZ CELLxGENE hosts a growing corpus of community-contributed data of over 93 million unique cells. Curated, standardized and associated with consistent cell-level metadata, this collection of single-cell transcriptomic data is the largest of its kind and growing rapidly via community contributions. A suite of tools and features enables accessibility and reusability of the data via both computational and visual interfaces to allow researchers to explore individual datasets, perform cross-corpus analysis, and run meta-analyses of tens of millions of cells across studies and tissues at the resolution of single cells.
Collapse
Affiliation(s)
| | - Shibla Abdulla
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton CB10 1SA, UK
| | - Brian Aevermann
- Chan Zuckerberg Initiative, 1180 Main Street, Redwood City, CA 94063, USA
| | - Pedro Assis
- Department of Genetics, Stanford University School of Medicine, 291 Campus Drive, Li Ka Shing Building, Stanford, CA 94305, USA
| | - Seve Badajoz
- Chan Zuckerberg Initiative, 1180 Main Street, Redwood City, CA 94063, USA
| | - Sidney M Bell
- Chan Zuckerberg Initiative, 1180 Main Street, Redwood City, CA 94063, USA
| | - Emanuele Bezzi
- Chan Zuckerberg Initiative, 1180 Main Street, Redwood City, CA 94063, USA
| | - Batuhan Cakir
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton CB10 1SA, UK
| | - Jim Chaffer
- Department of Genetics, Stanford University School of Medicine, 291 Campus Drive, Li Ka Shing Building, Stanford, CA 94305, USA
| | - Signe Chambers
- Chan Zuckerberg Initiative, 1180 Main Street, Redwood City, CA 94063, USA
| | - J Michael Cherry
- Department of Genetics, Stanford University School of Medicine, 291 Campus Drive, Li Ka Shing Building, Stanford, CA 94305, USA
| | - Tiffany Chi
- Chan Zuckerberg Initiative, 1180 Main Street, Redwood City, CA 94063, USA
| | - Jennifer Chien
- Department of Genetics, Stanford University School of Medicine, 291 Campus Drive, Li Ka Shing Building, Stanford, CA 94305, USA
| | - Leah Dorman
- Chan Zuckerberg, Biohub, SF, 499 Illinois St, San Francisco, CA 94158, USA
| | - Pablo Garcia-Nieto
- Chan Zuckerberg Initiative, 1180 Main Street, Redwood City, CA 94063, USA
| | - Nayib Gloria
- Chan Zuckerberg Initiative, 1180 Main Street, Redwood City, CA 94063, USA
| | - Mim Hastie
- Clever Canary, 850 Front St. #1491, Santa Cruz, CA, USA
| | - Daniel Hegeman
- Chan Zuckerberg Initiative, 1180 Main Street, Redwood City, CA 94063, USA
| | - Jason Hilton
- Department of Genetics, Stanford University School of Medicine, 291 Campus Drive, Li Ka Shing Building, Stanford, CA 94305, USA
| | - Timmy Huang
- Chan Zuckerberg Initiative, 1180 Main Street, Redwood City, CA 94063, USA
| | - Amanda Infeld
- Chan Zuckerberg Initiative, 1180 Main Street, Redwood City, CA 94063, USA
| | - Ana-Maria Istrate
- Chan Zuckerberg Initiative, 1180 Main Street, Redwood City, CA 94063, USA
| | - Ivana Jelic
- Chan Zuckerberg Initiative, 1180 Main Street, Redwood City, CA 94063, USA
| | - Kuni Katsuya
- Chan Zuckerberg Initiative, 1180 Main Street, Redwood City, CA 94063, USA
| | - Yang Joon Kim
- Chan Zuckerberg, Biohub, SF, 499 Illinois St, San Francisco, CA 94158, USA
| | - Karen Liang
- Chan Zuckerberg Initiative, 1180 Main Street, Redwood City, CA 94063, USA
| | - Mike Lin
- Chan Zuckerberg Initiative, 1180 Main Street, Redwood City, CA 94063, USA
| | | | - Bailey Marshall
- Chan Zuckerberg Initiative, 1180 Main Street, Redwood City, CA 94063, USA
| | - Bruce Martin
- Chan Zuckerberg Initiative, 1180 Main Street, Redwood City, CA 94063, USA
| | - Fran McDade
- Clever Canary, 850 Front St. #1491, Santa Cruz, CA, USA
| | - Colin Megill
- Chan Zuckerberg Initiative, 1180 Main Street, Redwood City, CA 94063, USA
| | - Nikhil Patel
- Chan Zuckerberg Initiative, 1180 Main Street, Redwood City, CA 94063, USA
| | - Alexander Predeus
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton CB10 1SA, UK
| | - Brian Raymor
- Chan Zuckerberg Initiative, 1180 Main Street, Redwood City, CA 94063, USA
| | - Behnam Robatmili
- Chan Zuckerberg Initiative, 1180 Main Street, Redwood City, CA 94063, USA
| | - Dave Rogers
- Clever Canary, 850 Front St. #1491, Santa Cruz, CA, USA
| | - Erica Rutherford
- Department of Genetics, Stanford University School of Medicine, 291 Campus Drive, Li Ka Shing Building, Stanford, CA 94305, USA
| | - Dana Sadgat
- Chan Zuckerberg Initiative, 1180 Main Street, Redwood City, CA 94063, USA
| | - Andrew Shin
- Chan Zuckerberg Initiative, 1180 Main Street, Redwood City, CA 94063, USA
| | - Corinn Small
- Department of Genetics, Stanford University School of Medicine, 291 Campus Drive, Li Ka Shing Building, Stanford, CA 94305, USA
| | - Trent Smith
- Chan Zuckerberg Initiative, 1180 Main Street, Redwood City, CA 94063, USA
| | - Prathap Sridharan
- Chan Zuckerberg Initiative, 1180 Main Street, Redwood City, CA 94063, USA
| | | | - Norbert Tavares
- Chan Zuckerberg Initiative, 1180 Main Street, Redwood City, CA 94063, USA
| | - Harley Thomas
- Chan Zuckerberg Initiative, 1180 Main Street, Redwood City, CA 94063, USA
| | - Andrew Tolopko
- Chan Zuckerberg Initiative, 1180 Main Street, Redwood City, CA 94063, USA
| | - Meghan Urisko
- Chan Zuckerberg Initiative, 1180 Main Street, Redwood City, CA 94063, USA
| | - Joyce Yan
- Chan Zuckerberg Initiative, 1180 Main Street, Redwood City, CA 94063, USA
| | - Garabet Yeretssian
- Chan Zuckerberg Initiative, 1180 Main Street, Redwood City, CA 94063, USA
| | - Jennifer Zamanian
- Department of Genetics, Stanford University School of Medicine, 291 Campus Drive, Li Ka Shing Building, Stanford, CA 94305, USA
| | - Arathi Mani
- Chan Zuckerberg Initiative, 1180 Main Street, Redwood City, CA 94063, USA
| | - Jonah Cool
- Chan Zuckerberg Initiative, 1180 Main Street, Redwood City, CA 94063, USA
| | - Ambrose Carr
- Chan Zuckerberg Initiative, 1180 Main Street, Redwood City, CA 94063, USA
|
10
|
Pouyabahar D, Andrews T, Bader GD. Interpretable single-cell factor decomposition using sciRED. bioRxiv 2024:2024.08.01.605536. [PMID: 39149356 PMCID: PMC11326131 DOI: 10.1101/2024.08.01.605536] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/17/2024]
Abstract
Single-cell RNA sequencing (scRNA-seq) maps gene expression heterogeneity within a tissue. However, identifying biological signals in this data is challenging due to confounding technical factors, sparsity, and high dimensionality. Data factorization methods address this by separating and identifying signals in the data, such as gene expression programs, but the resulting factors must be manually interpreted. We developed Single-Cell Interpretable REsidual Decomposition (sciRED) to improve the interpretation of scRNA-seq factor analysis. sciRED removes known confounding effects, uses rotations to improve factor interpretability, maps factors to known covariates, identifies unexplained factors that may capture hidden biological phenomena and determines the genes and biological processes represented by the resulting factors. We apply sciRED to multiple scRNA-seq datasets and identify sex-specific variation in a kidney map, discern strong and weak immune stimulation signals in a PBMC dataset, reduce ambient RNA contamination in a rat liver atlas to help identify strain variation, and reveal rare cell type signatures and anatomical zonation gene programs in a healthy human liver map. These demonstrate that sciRED is useful in characterizing diverse biological signals within scRNA-seq data.
Affiliation(s)
- Delaram Pouyabahar
- Department of Molecular Genetics, University of Toronto, Toronto, Ontario, Canada
- The Donnelly Centre, University of Toronto, Toronto, Ontario, Canada
| | - Tallulah Andrews
- Department of Biochemistry, Schulich School of Medicine and Dentistry, University of Western Ontario, London, Ontario, Canada
- Department of Computer Science, University of Western Ontario, London, Ontario, Canada
| | - Gary D Bader
- Department of Molecular Genetics, University of Toronto, Toronto, Ontario, Canada
- The Donnelly Centre, University of Toronto, Toronto, Ontario, Canada
- Department of Computer Science, University of Toronto, Toronto, Ontario, Canada
- Lunenfeld-Tanenbaum Research Institute, Toronto, Ontario, Canada
- Princess Margaret Research Institute, University Health Network, Toronto, Ontario, Canada
- CIFAR Multiscale Human Program, CIFAR, Toronto, Ontario, Canada
|
11
|
Shan X, Zhao H. Inferring Cell-Type-Specific Co-Expressed Genes from Single Cell Data. bioRxiv 2024:2024.11.08.622700. [PMID: 39605403 PMCID: PMC11601408 DOI: 10.1101/2024.11.08.622700] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2024]
Abstract
Background: Cell-type-specific gene co-expression networks are widely used to characterize gene relationships. Although many methods have been developed to infer such co-expression networks from single-cell data, the lack of consideration of false positive control in many evaluations may lead to incorrect conclusions because higher reproducibility, higher functional coherence, and a larger overlap with known biological networks may not imply better performance if the false positives are not well controlled. Results: In this study, we have developed an efficient and effective simulation tool to derive empirical p-values in co-expression inference to appropriately control false positives in assessing method performance. We studied the power of the p-value-based approach in inferring cell-type-specific co-expressions from single-cell data using both simulated and real data. We also highlight the need to adjust for random overlaps between the inferred and known networks when the number of selected correlated gene pairs varies substantially across different methods. We further illustrate the expression level bias in known biological networks and the impact of such bias in method assessment. Conclusion: Our study indicates the importance of controlling false positives in the inference of co-expressed genes to achieve more reliable results and proposes a simulation-based p-value method to achieve this.
|
12
|
Cui S, Nassiri S, Zakeri I. Mcadet: A feature selection method for fine-resolution single-cell RNA-seq data based on multiple correspondence analysis and community detection. PLoS Comput Biol 2024; 20:e1012560. [PMID: 39466833 PMCID: PMC11542852 DOI: 10.1371/journal.pcbi.1012560] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/01/2023] [Revised: 11/07/2024] [Accepted: 10/15/2024] [Indexed: 10/30/2024] Open
Abstract
Single-cell RNA sequencing (scRNA-seq) data analysis faces numerous challenges, including high sparsity, a high-dimensional feature space, and biological noise. These challenges hinder downstream analysis, necessitating the use of feature selection methods to identify informative genes, and reduce data dimensionality. However, existing methods for selecting highly variable genes (HVGs) exhibit limited overlap and inconsistent clustering performance across benchmark datasets. Moreover, these methods often struggle to accurately select HVGs from fine-resolution scRNA-seq datasets and minority cell types, which are more difficult to distinguish, raising concerns about the reliability of their results. To overcome these limitations, we propose a novel feature selection framework for scRNA-seq data called Mcadet. Mcadet integrates Multiple Correspondence Analysis (MCA), graph-based community detection, and a novel statistical testing approach. To assess the effectiveness of Mcadet, we conducted extensive evaluations using both simulated and real-world data, employing unbiased metrics for comparison. Our results demonstrate the superior performance of Mcadet in the selection of HVGs in scenarios involving fine-resolution scRNA-seq datasets and datasets containing minority cell populations. Overall, we demonstrate that Mcadet enhances the reliability of selected HVGs, although the impact of HVG selection on various downstream analyses varies and needs to be further investigated.
Affiliation(s)
- Saishi Cui
- Department of Epidemiology and Biostatistics, Dornsife School of Public Health, Drexel University, Philadelphia, Pennsylvania, United States of America
| | - Sina Nassiri
- Roche Pharma Research and Early Development, Roche Innovation Center Basel, Basel, Switzerland
| | - Issa Zakeri
- Department of Epidemiology and Biostatistics, Dornsife School of Public Health, Drexel University, Philadelphia, Pennsylvania, United States of America
|
13
|
Zhang J, Larschan E, Bigness J, Singh R. scNODE: generative model for temporal single cell transcriptomic data prediction. Bioinformatics 2024; 40:ii146-ii154. [PMID: 39230694 PMCID: PMC11373355 DOI: 10.1093/bioinformatics/btae393] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/05/2024] Open
Abstract
Summary: Measurement of single-cell gene expression at different timepoints enables the study of cell development. However, due to the resource constraints and technical challenges associated with single-cell experiments, researchers can only profile gene expression at discrete and sparsely sampled timepoints. This missing timepoint information impedes downstream cell developmental analyses. We propose scNODE, an end-to-end deep learning model that can predict in silico single-cell gene expression at unobserved timepoints. scNODE integrates a variational autoencoder with neural ordinary differential equations to predict gene expression using a continuous and nonlinear latent space. Importantly, we incorporate a dynamic regularization term to learn a latent space that is robust against distribution shifts when predicting single-cell gene expression at unobserved timepoints. Our evaluations on three real-world scRNA-seq datasets show that scNODE achieves higher predictive performance than state-of-the-art methods. We further demonstrate that scNODE's predictions help cell trajectory inference under the missing timepoint paradigm and the learned latent space is useful for in silico perturbation analysis of relevant genes along a developmental cell path. Availability and implementation: The data and code are publicly available at https://github.com/rsinghlab/scNODE.
Affiliation(s)
- Jiaqi Zhang
- Department of Computer Science, Brown University, Providence, RI 02906, United States
| | - Erica Larschan
- Center for Computational Molecular Biology, Brown University, Providence, RI 02912, United States
- Department of Molecular Biology, Cell Biology and Biochemistry, Brown University, Providence, RI 02912, United States
| | - Jeremy Bigness
- Center for Computational Molecular Biology, Brown University, Providence, RI 02912, United States
| | - Ritambhara Singh
- Department of Computer Science, Brown University, Providence, RI 02906, United States
- Center for Computational Molecular Biology, Brown University, Providence, RI 02912, United States
|
14
|
Garbulowski M, Hillerton T, Morgan D, Seçilmiş D, Sonnhammer L, Tjärnberg A, Nordling TEM, Sonnhammer ELL. GeneSPIDER2: large scale GRN simulation and benchmarking with perturbed single-cell data. NAR Genom Bioinform 2024; 6:lqae121. [PMID: 39296931 PMCID: PMC11409065 DOI: 10.1093/nargab/lqae121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/09/2024] [Revised: 08/20/2024] [Accepted: 09/02/2024] [Indexed: 09/21/2024] Open
Abstract
Single-cell data is increasingly used for gene regulatory network (GRN) inference, and benchmarks for this have been developed based on simulated data. However, existing single-cell simulators cannot model the effects of gene perturbations. A further challenge lies in generating large-scale GRNs, a task that often runs into computational and stability issues. We present GeneSPIDER2, an update of the GeneSPIDER MATLAB toolbox for GRN benchmarking, inference, and analysis. Several software modules have improved capabilities and performance, and new functionalities have been added. A major improvement is the ability to generate large GRNs with biologically realistic topological properties in terms of scale-free degree distribution and modularity. Another major addition is a simulation of single-cell data, which is becoming increasingly popular as input for GRN inference. Specifically, we introduced a unique feature for generating single-cell data based on genetic perturbations. Finally, the simulated single-cell data was compared to real single-cell Perturb-seq data from two cell lines, showing that the synthetic and real data exhibit similar properties.
Affiliation(s)
- Mateusz Garbulowski
- Department of Biochemistry and Biophysics, Stockholm University, Science for Life Laboratory, Box 1031, Solna 171 21, Sweden
- Department of Immunology, Genetics and Pathology, Uppsala University, Uppsala 751 85, Sweden
| | - Thomas Hillerton
- Department of Biochemistry and Biophysics, Stockholm University, Science for Life Laboratory, Box 1031, Solna 171 21, Sweden
| | - Daniel Morgan
- Department of Biochemistry and Biophysics, Stockholm University, Science for Life Laboratory, Box 1031, Solna 171 21, Sweden
| | - Deniz Seçilmiş
- Department of Biochemistry and Biophysics, Stockholm University, Science for Life Laboratory, Box 1031, Solna 171 21, Sweden
- Department of Cell and Molecular Biology, Karolinska Institutet, Solna 171 77, Sweden
| | - Lisbet Sonnhammer
- Department of Biochemistry and Biophysics, Stockholm University, Science for Life Laboratory, Box 1031, Solna 171 21, Sweden
| | - Andreas Tjärnberg
- Department of Neuro-Science, University of Wisconsin-Madison, Waisman Center, WI 53705, USA
| | - Torbjörn E M Nordling
- Department of Mechanical Engineering, National Cheng Kung University, No. 1 University Road, Tainan City 701, Taiwan
| | - Erik L L Sonnhammer
- Department of Biochemistry and Biophysics, Stockholm University, Science for Life Laboratory, Box 1031, Solna 171 21, Sweden
|
15
|
Pouyabahar D, Andrews T, Bader GD. Interpretable single-cell factor decomposition using sciRED. Research Square 2024:rs.3.rs-4819117. [PMID: 39149508 PMCID: PMC11326389 DOI: 10.21203/rs.3.rs-4819117/v1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/17/2024]
Abstract
Single-cell RNA sequencing (scRNA-seq) maps gene expression heterogeneity within a tissue. However, identifying biological signals in this data is challenging due to confounding technical factors, sparsity, and high dimensionality. Data factorization methods address this by separating and identifying signals in the data, such as gene expression programs, but the resulting factors must be manually interpreted. We developed Single-Cell Interpretable Residual Decomposition (sciRED) to improve the interpretation of scRNA-seq factor analysis. sciRED removes known confounding effects, uses rotations to improve factor interpretability, maps factors to known covariates, identifies unexplained factors that may capture hidden biological phenomena and determines the genes and biological processes represented by the resulting factors. We apply sciRED to multiple scRNA-seq datasets and identify sex-specific variation in a kidney map, discern strong and weak immune stimulation signals in a PBMC dataset, reduce ambient RNA contamination in a rat liver atlas to help identify strain variation, and reveal rare cell type signatures and anatomical zonation gene programs in a healthy human liver map. These demonstrate that sciRED is useful in characterizing diverse biological signals within scRNA-seq data.
Affiliation(s)
- Delaram Pouyabahar
- Department of Molecular Genetics, University of Toronto, Toronto, Ontario, Canada
- The Donnelly Centre, University of Toronto, Toronto, Ontario, Canada
| | - Tallulah Andrews
- Department of Biochemistry, Schulich School of Medicine and Dentistry, University of Western Ontario, London, Ontario, Canada
- Department of Computer Science, University of Western Ontario, London, Ontario, Canada
| | - Gary D Bader
- Department of Molecular Genetics, University of Toronto, Toronto, Ontario, Canada
- The Donnelly Centre, University of Toronto, Toronto, Ontario, Canada
- Department of Computer Science, University of Toronto, Toronto, Ontario, Canada
- Lunenfeld-Tanenbaum Research Institute, Toronto, Ontario, Canada
- Princess Margaret Research Institute, University Health Network, Toronto, Ontario, Canada
- CIFAR Multiscale Human Program, CIFAR, Toronto, Ontario, Canada
|
16
|
Singh A, Khiabanian H. Feature selection followed by a novel residuals-based normalization that includes variance stabilization simplifies and improves single-cell gene expression analysis. BMC Bioinformatics 2024; 25:248. [PMID: 39080559 PMCID: PMC11290295 DOI: 10.1186/s12859-024-05872-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/04/2024] [Accepted: 07/16/2024] [Indexed: 08/02/2024] Open
Abstract
Normalization is a crucial step in the analysis of single-cell RNA-sequencing (scRNA-seq) count data. Its principal objectives are reduction of systematic biases primarily introduced through technical sources and transformation of counts to make them more amenable to the application of established statistical frameworks. In the standard workflows, normalization is followed by feature selection to identify highly variable genes (HVGs) that capture most of the biologically meaningful variation across the cells. Here, we make the case for a revised workflow by proposing a simple feature selection method and showing that we can perform feature selection before normalization by relying on observed counts. We highlight that the feature selection step can be used not only to select HVGs but also to identify stable genes. We further propose a novel residuals-based normalization method that includes a variance stabilization transformation and relies on the stable genes to inform the reduction of systematic biases. We demonstrate significant improvements in downstream clustering analyses by applying our proposed methods to count datasets with known biological truth as well as to simulated count datasets. We have implemented this novel workflow for analyzing high-throughput scRNA-seq data in an R package called Piccolo.
Affiliation(s)
- Amartya Singh
- Center for Systems and Computational Biology, Rutgers Cancer Institute of New Jersey, Rutgers University, New Brunswick, NJ, USA
| | - Hossein Khiabanian
- Center for Systems and Computational Biology, Rutgers Cancer Institute of New Jersey, Rutgers University, New Brunswick, NJ, USA
- Department of Pathology and Laboratory Medicine, Rutgers Robert Wood Johnson Medical School, Rutgers University, New Brunswick, NJ, USA
- Regeneron Genetics Center, Regeneron Pharmaceuticals, Tarrytown, NY, USA
|
17
|
Han G, Yan D, Sun Z, Fang J, Chang X, Wilson L, Liu Y. Bayesian-frequentist hybrid inference framework for single cell RNA-seq analyses. Hum Genomics 2024; 18:69. [PMID: 38902839 PMCID: PMC11575015 DOI: 10.1186/s40246-024-00638-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/25/2023] [Accepted: 06/12/2024] [Indexed: 06/22/2024] Open
Abstract
Background: Single-cell RNA sequencing (scRNA-seq) technology has proven useful in understanding cell-specific disease mechanisms. However, identifying genes of interest remains a key challenge. Pseudo-bulk methods that pool scRNA-seq counts within the same biological replicate are commonly used to identify differentially expressed genes. However, such methods may lack power because of the limited sample size of scRNA-seq datasets, which can be prohibitively expensive. Results: Motivated by this, we proposed using the Bayesian-frequentist hybrid (BFH) framework to increase power, and we showed in simulated scenarios that the proposed BFH would be an optimal method compared with other popular single-cell differential expression methods when both FDR and power were considered. As an example, the method was applied to an idiopathic pulmonary fibrosis (IPF) case study. Conclusion: In our IPF example, we demonstrated that, with a proper informative prior, the BFH approach identified more genes of interest. Furthermore, these genes were reasonable based on current knowledge of IPF. Thus, the BFH offers a unique and flexible framework for future scRNA-seq analyses.
Affiliation(s)
- Gang Han
- Department of Epidemiology and Biostatistics, School of Public Health, Texas A&M University, College Station, TX, USA
| | - Dongyan Yan
- Eli Lilly and Company, Lilly Corporate Center, 893 Delaware St, Indianapolis, IN, 46225, USA
| | - Zhe Sun
- Eli Lilly and Company, Lilly Corporate Center, 893 Delaware St, Indianapolis, IN, 46225, USA
| | - Jiyuan Fang
- Eli Lilly and Company, Lilly Corporate Center, 893 Delaware St, Indianapolis, IN, 46225, USA
| | - Xinyue Chang
- Eli Lilly and Company, Lilly Corporate Center, 893 Delaware St, Indianapolis, IN, 46225, USA
| | - Lucas Wilson
- Department of Epidemiology and Biostatistics, School of Public Health, Texas A&M University, College Station, TX, USA
| | - Yushi Liu
- Eli Lilly and Company, Lilly Corporate Center, 893 Delaware St, Indianapolis, IN, 46225, USA
|
18
|
Duo H, Li Y, Lan Y, Tao J, Yang Q, Xiao Y, Sun J, Li L, Nie X, Zhang X, Liang G, Liu M, Hao Y, Li B. Systematic evaluation with practical guidelines for single-cell and spatially resolved transcriptomics data simulation under multiple scenarios. Genome Biol 2024; 25:145. [PMID: 38831386 PMCID: PMC11149245 DOI: 10.1186/s13059-024-03290-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/27/2023] [Accepted: 05/28/2024] [Indexed: 06/05/2024] Open
Abstract
Background: Single-cell RNA sequencing (scRNA-seq) and spatially resolved transcriptomics (SRT) have led to groundbreaking advancements in life sciences. To develop bioinformatics tools for scRNA-seq and SRT data and perform unbiased benchmarks, data simulation has been widely adopted by providing explicit ground truth and generating customized datasets. However, the performance of simulation methods under multiple scenarios has not been comprehensively assessed, making it challenging to choose suitable methods without practical guidelines. Results: We systematically evaluated 49 simulation methods developed for scRNA-seq and/or SRT data in terms of accuracy, functionality, scalability, and usability using 152 reference datasets derived from 24 platforms. SRTsim, scDesign3, ZINB-WaVE, and scDesign2 have the best accuracy performance across various platforms. Unexpectedly, some methods tailored to scRNA-seq data have potential compatibility for simulating SRT data. Lun, SPARSim, and scDesign3-tree outperform other methods under corresponding simulation scenarios. Phenopath, Lun, Simple, and MFA yield high scalability scores but they cannot generate realistic simulated data. Users should consider the trade-offs between method accuracy and scalability (or functionality) when making decisions. Additionally, execution errors are mainly caused by failed parameter estimations and appearance of missing or infinite values in calculations. We provide practical guidelines for method selection, a standard pipeline Simpipe (https://github.com/duohongrui/simpipe; https://doi.org/10.5281/zenodo.11178409), and an online tool Simsite (https://www.ciblab.net/software/simshiny/) for data simulation. Conclusions: No method performs best on all criteria, thus a good-yet-not-the-best method is recommended if it solves problems effectively and reasonably. Our comprehensive work provides crucial insights for developers on modeling gene expression data and fosters the simulation process for users.
Affiliation(s)
- Hongrui Duo
- College of Life Sciences, Chongqing Normal University, Chongqing, 401331, People's Republic of China
| | - Yinghong Li
- Chongqing Key Laboratory of Big Data for Bio Intelligence, Chongqing University of Posts and Telecommunications, Chongqing, 400065, People's Republic of China
| | - Yang Lan
- Institute of Pathology and Southwest Cancer Center, Southwest Hospital, Army Medical University, Chongqing, 400038, People's Republic of China
| | - Jingxin Tao
- College of Life Sciences, Chongqing Normal University, Chongqing, 401331, People's Republic of China
| | - Qingxia Yang
- Zhejiang Provincial Key Laboratory of Precision Diagnosis and Therapy for Major Gynecological Diseases, Women's Hospital, Zhejiang University School of Medicine, Hangzhou, 310058, People's Republic of China
| | - Yingxue Xiao
- College of Life Sciences, Chongqing Normal University, Chongqing, 401331, People's Republic of China
| | - Jing Sun
- College of Life Sciences, Chongqing Normal University, Chongqing, 401331, People's Republic of China
| | - Lei Li
- College of Life Sciences, Chongqing Normal University, Chongqing, 401331, People's Republic of China
| | - Xiner Nie
- Key Laboratory of Biorheological Science and Technology, Ministry of Education, Bioengineering College, Chongqing University, Chongqing, 400044, People's Republic of China
| | - Xiaoxi Zhang
- College of Life Sciences, Chongqing Normal University, Chongqing, 401331, People's Republic of China
| | - Guizhao Liang
- Key Laboratory of Biorheological Science and Technology, Ministry of Education, Bioengineering College, Chongqing University, Chongqing, 400044, People's Republic of China
| | - Mingwei Liu
- Key Laboratory of Clinical Laboratory Diagnostics, College of Laboratory Medicine, Chongqing Medical University, Chongqing, 400016, People's Republic of China
| | - Youjin Hao
- College of Life Sciences, Chongqing Normal University, Chongqing, 401331, People's Republic of China
| | - Bo Li
- College of Life Sciences, Chongqing Normal University, Chongqing, 401331, People's Republic of China
|
19
|
Brooks TG, Lahens NF, Mrčela A, Grant GR. Challenges and best practices in omics benchmarking. Nat Rev Genet 2024; 25:326-339. [PMID: 38216661 DOI: 10.1038/s41576-023-00679-6] [Citation(s) in RCA: 9] [Impact Index Per Article: 9.0] [Accepted: 11/14/2023] [Indexed: 01/14/2024]
Abstract
Technological advances enabling massively parallel measurement of biological features - such as microarrays, high-throughput sequencing and mass spectrometry - have ushered in the omics era, now in its third decade. The resulting complex landscape of analytical methods has naturally fostered the growth of an omics benchmarking industry. Benchmarking refers to the process of objectively comparing and evaluating the performance of different computational or analytical techniques when processing and analysing large-scale biological data sets, such as transcriptomics, proteomics and metabolomics. With thousands of omics benchmarking studies published over the past 25 years, the field has matured to the point where the foundations of benchmarking have been established and well described. However, generating meaningful benchmarking data and properly evaluating performance in this complex domain remains challenging. In this Review, we highlight some common oversights and pitfalls in omics benchmarking. We also establish a methodology to bring the issues that can be addressed into focus and to be transparent about those that cannot: this takes the form of a spreadsheet template of guidelines for comprehensive reporting, intended to accompany publications. In addition, a survey of recent developments in benchmarking is provided as well as specific guidance for commonly encountered difficulties.
Affiliation(s)
- Thomas G Brooks
- Institute for Translational Medicine and Therapeutics, University of Pennsylvania, Philadelphia, PA, USA
| | - Nicholas F Lahens
- Institute for Translational Medicine and Therapeutics, University of Pennsylvania, Philadelphia, PA, USA
| | - Antonijo Mrčela
- Institute for Translational Medicine and Therapeutics, University of Pennsylvania, Philadelphia, PA, USA
| | - Gregory R Grant
- Institute for Translational Medicine and Therapeutics, University of Pennsylvania, Philadelphia, PA, USA
- Department of Genetics, University of Pennsylvania, Philadelphia, PA, USA
|
20
|
Liu F, Yang Y, Xu XS, Yuan M. MESBC: A novel mutually exclusive spectral biclustering method for cancer subtyping. Comput Biol Chem 2024; 109:108009. [PMID: 38219419 DOI: 10.1016/j.compbiolchem.2023.108009] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/11/2023] [Revised: 12/22/2023] [Accepted: 12/24/2023] [Indexed: 01/16/2024]
Abstract
Many soft biclustering algorithms have been developed and applied to various biological and biomedical data analyses. However, few mutually exclusive (hard) biclustering algorithms have been proposed, which could better identify disease or molecular subtypes with survival significance based on genomic or transcriptomic data. In this study, we developed a novel mutually exclusive spectral biclustering (MESBC) algorithm based on a spectral method to detect mutually exclusive biclusters. MESBC simultaneously detects relevant feature (gene) and corresponding condition (patient) subgroups and, therefore, automatically uses the signature features for each subtype to perform the clustering. Extensive simulations revealed that MESBC provided superior accuracy in detecting pre-specified biclusters compared with non-negative matrix factorization (NMF) and Dhillon's algorithm, particularly in very noisy data. Further analysis of the algorithm on real datasets obtained from the TCGA database showed that MESBC provided more accurate (i.e., smaller p-value) overall survival prediction in patients with lung adenocarcinoma (LUAD) and lung squamous cell carcinoma (LUSC) when compared to the existing, gold-standard subtypes for lung cancers (integrative clustering). Furthermore, MESBC detected several genes with significant prognostic value in both LUAD and LUSC patients. External validation on an independent, unseen GEO dataset of LUAD showed that MESBC-derived clusters based on TCGA data still exhibited clear biclustering patterns and consistent, outstanding prognostic predictability, demonstrating robust generalizability of MESBC. Therefore, MESBC could potentially be used as a risk stratification tool to optimize the treatment for the patient, improve the selection of patients for clinical trials, and contribute to the development of novel therapeutic agents.
Affiliation(s)
- Fengrong Liu
- Department of Statistics and Finance, University of Science and Technology of China, Hefei 230026, China
| | - Yaning Yang
- Department of Statistics and Finance, University of Science and Technology of China, Hefei 230026, China
| | | | - Min Yuan
- School of Public Health Administration, Anhui Medical University, Hefei 230032, China
|
21
|
Ranek JS, Stallaert W, Milner JJ, Redick M, Wolff SC, Beltran AS, Stanley N, Purvis JE. DELVE: feature selection for preserving biological trajectories in single-cell data. Nat Commun 2024; 15:2765. [PMID: 38553455 PMCID: PMC10980758 DOI: 10.1038/s41467-024-46773-z] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 05/18/2023] [Accepted: 03/07/2024] [Indexed: 04/02/2024] Open
Abstract
Single-cell technologies can measure the expression of thousands of molecular features in individual cells undergoing dynamic biological processes. While examining cells along a computationally ordered pseudotime trajectory can reveal how changes in gene or protein expression impact cell fate, identifying such dynamic features is challenging due to the inherent noise in single-cell data. Here, we present DELVE, an unsupervised feature selection method for identifying a representative subset of molecular features that robustly recapitulate cellular trajectories. In contrast to previous work, DELVE uses a bottom-up approach to mitigate the effects of confounding sources of variation, modeling cell states from dynamic gene or protein modules based on core regulatory complexes. Using simulations, single-cell RNA sequencing, and iterative immunofluorescence imaging data in the context of cell cycle and cellular differentiation, we demonstrate how DELVE selects features that better define cell types and cell-type transitions. DELVE is available as an open-source Python package: https://github.com/jranek/delve.
Affiliation(s)
- Jolene S Ranek
- Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
- Computational Medicine Program, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
| | - Wayne Stallaert
- Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, PA, USA
| | - J Justin Milner
- Department of Microbiology and Immunology, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
- Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill School of Medicine, Chapel Hill, NC, USA
| | - Margaret Redick
- Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
- Computational Medicine Program, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
| | - Samuel C Wolff
- Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
- Computational Medicine Program, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
| | - Adriana S Beltran
- Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
- Human Pluripotent Cell Core, University of North Carolina at Chapel Hill School of Medicine, Chapel Hill, NC, USA
| | - Natalie Stanley
- Computational Medicine Program, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA.
- Department of Computer Science, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA.
| | - Jeremy E Purvis
- Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA.
- Computational Medicine Program, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA.
|
22
|
Brooks TG, Lahens NF, Mrčela A, Sarantopoulou D, Nayak S, Naik A, Sengupta S, Choi PS, Grant GR. BEERS2: RNA-Seq simulation through high fidelity in silico modeling. Brief Bioinform 2024; 25:bbae164. [PMID: 38605641 PMCID: PMC11009461 DOI: 10.1093/bib/bbae164] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/12/2023] [Revised: 01/26/2024] [Accepted: 03/26/2024] [Indexed: 04/13/2024] Open
Abstract
Simulation of RNA-seq reads is critical in the assessment, comparison, benchmarking and development of bioinformatics tools. Yet the field of RNA-seq simulators has progressed little in the last decade. To address this need we have developed BEERS2, which combines a flexible and highly configurable design with detailed simulation of the entire library preparation and sequencing pipeline. BEERS2 takes input transcripts (typically full-length messenger RNA transcripts with polyA tails) from either customizable input or from CAMPAREE simulated RNA samples. It produces realistic reads of these transcripts in FASTQ, SAM or BAM format, with the SAM and BAM formats containing the true alignment to the reference genome. It also produces true transcript-level quantification values. BEERS2 is designed to include the effects of polyA selection and RiboZero for ribosomal depletion, hexamer priming sequence biases, GC-content biases in polymerase chain reaction (PCR) amplification, barcode read errors and errors during PCR amplification. These characteristics combine to make BEERS2 the most complete simulation of RNA-seq to date. Finally, we demonstrate the use of BEERS2 by measuring the effect of several settings on the popular Salmon pseudoalignment algorithm.
Affiliation(s)
- Thomas G Brooks
- Institute for Translational Medicine and Therapeutics, University of Pennsylvania, PA, USA
| | - Nicholas F Lahens
- Institute for Translational Medicine and Therapeutics, University of Pennsylvania, PA, USA
| | - Antonijo Mrčela
- Institute for Translational Medicine and Therapeutics, University of Pennsylvania, PA, USA
| | - Dimitra Sarantopoulou
- Institute for Translational Medicine and Therapeutics, University of Pennsylvania, PA, USA
- Current address: National Institute on Aging, National Institutes of Health, Baltimore, MD, USA
| | - Soumyashant Nayak
- Institute for Translational Medicine and Therapeutics, University of Pennsylvania, PA, USA
- Current address: Statistics and Mathematics Unit, Indian Statistical Institute, Bengaluru, Karnataka, India
| | - Amruta Naik
- Institute for Translational Medicine and Therapeutics, University of Pennsylvania, PA, USA
- Children’s Hospital of Philadelphia, Philadelphia, PA, USA
| | - Shaon Sengupta
- Institute for Translational Medicine and Therapeutics, University of Pennsylvania, PA, USA
- Children’s Hospital of Philadelphia, Philadelphia, PA, USA
- Department of Pediatrics, University of Pennsylvania Perelman School of Medicine, Philadelphia, Pennsylvania, USA
| | - Peter S Choi
- Division of Cancer Pathobiology, Children’s Hospital of Philadelphia, Philadelphia, PA, USA
- Department of Pathology & Laboratory Medicine, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA
| | - Gregory R Grant
- Institute for Translational Medicine and Therapeutics, University of Pennsylvania, PA, USA
- Department of Genetics, University of Pennsylvania, Philadelphia, PA, USA
|
23
|
Garmire LX, Li Y, Huang Q, Xu C, Teichmann SA, Kaminski N, Pellegrini M, Nguyen Q, Teschendorff AE. Challenges and perspectives in computational deconvolution of genomics data. Nat Methods 2024; 21:391-400. [PMID: 38374264 DOI: 10.1038/s41592-023-02166-6] [Citation(s) in RCA: 15] [Impact Index Per Article: 15.0] [Received: 11/04/2022] [Accepted: 12/26/2023] [Indexed: 02/21/2024]
Abstract
Deciphering cell-type heterogeneity is crucial for systematically understanding tissue homeostasis and its dysregulation in diseases. Computational deconvolution is an efficient approach for estimating cell-type abundances from a variety of omics data. Despite substantial methodological progress in computational deconvolution in recent years, challenges remain. Here we list four important challenges related to computational deconvolution: the quality of the reference data, generation of ground truth data, limitations of computational methodologies, and benchmarking design and implementation. Finally, we make recommendations on reference data generation, new directions for computational methodologies, and strategies to promote rigorous benchmarking.
Affiliation(s)
- Lana X Garmire
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA.
| | - Yijun Li
- Department of Biostatistics, University of Michigan, Ann Arbor, MI, USA
| | - Qianhui Huang
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA
| | - Chuan Xu
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, UK
| | | | - Naftali Kaminski
- Pulmonary, Critical Care & Sleep Medicine, Yale University School of Medicine, New Haven, CT, USA
| | - Matteo Pellegrini
- Molecular, Cell and Developmental Biology, University of California, Los Angeles, Los Angeles, CA, USA
| | - Quan Nguyen
- Institute for Molecular Bioscience, The University of Queensland and QIMR Berghofer Medical Research Institute, Brisbane, Queensland, Australia
| | - Andrew E Teschendorff
- CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, China
- UCL Cancer Institute, University College London, London, UK
|
24
|
Song D, Wang Q, Yan G, Liu T, Sun T, Li JJ. scDesign3 generates realistic in silico data for multimodal single-cell and spatial omics. Nat Biotechnol 2024; 42:247-252. [PMID: 37169966 PMCID: PMC11182337 DOI: 10.1038/s41587-023-01772-1] [Citation(s) in RCA: 34] [Impact Index Per Article: 34.0] [Received: 09/20/2022] [Accepted: 03/30/2023] [Indexed: 05/13/2023]
Abstract
We present a statistical simulator, scDesign3, to generate realistic single-cell and spatial omics data, including various cell states, experimental designs and feature modalities, by learning interpretable parameters from real data. Using a unified probabilistic model for single-cell and spatial omics data, scDesign3 infers biologically meaningful parameters; assesses the goodness-of-fit of inferred cell clusters, trajectories and spatial locations; and generates in silico negative and positive controls for benchmarking computational tools.
Affiliation(s)
- Dongyuan Song
- Bioinformatics Interdepartmental Ph.D. Program, University of California, Los Angeles, CA, USA
| | - Qingyang Wang
- Department of Statistics, University of California, Los Angeles, CA, USA
| | - Guanao Yan
- Department of Statistics, University of California, Los Angeles, CA, USA
| | - Tianyang Liu
- Department of Statistics, University of California, Los Angeles, CA, USA
| | - Tianyi Sun
- Department of Statistics, University of California, Los Angeles, CA, USA
| | - Jingyi Jessica Li
- Bioinformatics Interdepartmental Ph.D. Program, University of California, Los Angeles, CA, USA.
- Department of Statistics, University of California, Los Angeles, CA, USA.
- Department of Human Genetics, University of California, Los Angeles, CA, USA.
- Department of Computational Medicine, University of California, Los Angeles, CA, USA.
- Department of Biostatistics, University of California, Los Angeles, CA, USA.
- Radcliffe Institute for Advanced Study, Harvard University, Cambridge, MA, USA.
|
25
|
Fu X, Lin Y, Lin DM, Mechtersheimer D, Wang C, Ameen F, Ghazanfar S, Patrick E, Kim J, Yang JYH. BIDCell: Biologically-informed self-supervised learning for segmentation of subcellular spatial transcriptomics data. Nat Commun 2024; 15:509. [PMID: 38218939 PMCID: PMC10787788 DOI: 10.1038/s41467-023-44560-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/13/2023] [Accepted: 12/13/2023] [Indexed: 01/15/2024] Open
Abstract
Recent advances in subcellular imaging transcriptomics platforms have enabled high-resolution spatial mapping of gene expression, while also introducing significant analytical challenges in accurately identifying cells and assigning transcripts. Existing methods struggle with cell segmentation, frequently producing fragmented cells or oversized cells that capture contaminated expression. To this end, we present BIDCell, a self-supervised deep learning-based framework with biologically-informed loss functions that learn relationships between spatially resolved gene expression and cell morphology. BIDCell incorporates cell-type data, including single-cell transcriptomics data from public repositories, with cell morphology information. Using a comprehensive evaluation framework consisting of metrics in five complementary categories for cell segmentation performance, we demonstrate that BIDCell outperforms other state-of-the-art methods according to many metrics across a variety of tissue types and technology platforms. Our findings underscore the potential of BIDCell to significantly enhance single-cell spatial expression analyses and to support biological discovery.
Affiliation(s)
- Xiaohang Fu
- School of Mathematics and Statistics, The University of Sydney, Sydney, NSW, 2006, Australia
- School of Computer Science, The University of Sydney, Sydney, NSW, 2006, Australia
- Sydney Precision Data Science Centre, University of Sydney, Sydney, NSW, 2006, Australia
- Charles Perkins Centre, The University of Sydney, Sydney, NSW, 2006, Australia
- Laboratory of Data Discovery for Health Limited (D24H), Science Park, Hong Kong SAR, China
| | - Yingxin Lin
- School of Mathematics and Statistics, The University of Sydney, Sydney, NSW, 2006, Australia
- Sydney Precision Data Science Centre, University of Sydney, Sydney, NSW, 2006, Australia
- Charles Perkins Centre, The University of Sydney, Sydney, NSW, 2006, Australia
- Laboratory of Data Discovery for Health Limited (D24H), Science Park, Hong Kong SAR, China
| | - David M Lin
- Department of Biomedical Sciences, Cornell University, Ithaca, NY, 14850, USA
| | - Daniel Mechtersheimer
- School of Mathematics and Statistics, The University of Sydney, Sydney, NSW, 2006, Australia
- Sydney Precision Data Science Centre, University of Sydney, Sydney, NSW, 2006, Australia
- Charles Perkins Centre, The University of Sydney, Sydney, NSW, 2006, Australia
| | - Chuhan Wang
- School of Computer Science, The University of Sydney, Sydney, NSW, 2006, Australia
- Sydney Precision Data Science Centre, University of Sydney, Sydney, NSW, 2006, Australia
- Laboratory of Data Discovery for Health Limited (D24H), Science Park, Hong Kong SAR, China
| | - Farhan Ameen
- School of Mathematics and Statistics, The University of Sydney, Sydney, NSW, 2006, Australia
- Sydney Precision Data Science Centre, University of Sydney, Sydney, NSW, 2006, Australia
- Charles Perkins Centre, The University of Sydney, Sydney, NSW, 2006, Australia
| | - Shila Ghazanfar
- School of Mathematics and Statistics, The University of Sydney, Sydney, NSW, 2006, Australia
- Sydney Precision Data Science Centre, University of Sydney, Sydney, NSW, 2006, Australia
- Charles Perkins Centre, The University of Sydney, Sydney, NSW, 2006, Australia
| | - Ellis Patrick
- School of Mathematics and Statistics, The University of Sydney, Sydney, NSW, 2006, Australia
- Sydney Precision Data Science Centre, University of Sydney, Sydney, NSW, 2006, Australia
- Charles Perkins Centre, The University of Sydney, Sydney, NSW, 2006, Australia
- Laboratory of Data Discovery for Health Limited (D24H), Science Park, Hong Kong SAR, China
- The Westmead Institute for Medical Research, Sydney, NSW, 2145, Australia
| | - Jinman Kim
- School of Computer Science, The University of Sydney, Sydney, NSW, 2006, Australia
- Sydney Precision Data Science Centre, University of Sydney, Sydney, NSW, 2006, Australia
- Laboratory of Data Discovery for Health Limited (D24H), Science Park, Hong Kong SAR, China
| | - Jean Y H Yang
- School of Mathematics and Statistics, The University of Sydney, Sydney, NSW, 2006, Australia.
- Sydney Precision Data Science Centre, University of Sydney, Sydney, NSW, 2006, Australia.
- Charles Perkins Centre, The University of Sydney, Sydney, NSW, 2006, Australia.
- Laboratory of Data Discovery for Health Limited (D24H), Science Park, Hong Kong SAR, China.
|
26
|
Feng Y, Wang S, Liu X, Han Y, Xu H, Duan X, Xie W, Tian Z, Yuan Z, Wan Z, Xu L, Qin S, He K, Huang J. Geometric constraint-triggered collagen expression mediates bacterial-host adhesion. Nat Commun 2023; 14:8165. [PMID: 38071397 PMCID: PMC10710423 DOI: 10.1038/s41467-023-43827-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/20/2023] [Accepted: 11/21/2023] [Indexed: 12/18/2023] Open
Abstract
Cells living in geometrically confined microenvironments are ubiquitous in various physiological processes, e.g., wound closure. However, it remains unclear whether and how spatially geometric constraints on host cells regulate bacteria-host interactions. Here, we reveal that interactions between bacteria and spatially constrained cell monolayers exhibit strong spatial heterogeneity, and that bacteria tend to adhere to these cells near the outer edges of confined monolayers. The bacterial adhesion force near the edges of the micropatterned monolayers is up to 75 nN, which is ~3 times higher than that at the centers, depending on the underlying substrate rigidities. Single-cell RNA sequencing experiments indicate that spatially heterogeneous expression of collagen IV with significant edge effects is responsible for the location-dependent bacterial adhesion. Finally, we show that collagen IV inhibitors can potentially be utilized as adjuvants to reduce bacterial adhesion and thus markedly enhance the efficacy of antibiotics, as demonstrated in animal experiments.
Affiliation(s)
- Yuting Feng
- Department of Mechanics and Engineering Science, College of Engineering, Peking University, 100871, Beijing, China
| | - Shuyi Wang
- Department of Mechanics and Engineering Science, College of Engineering, Peking University, 100871, Beijing, China
| | - Xiaoye Liu
- Beijing Traditional Chinese Veterinary Engineering Center and Beijing Key Laboratory of Traditional Chinese Veterinary Medicine, Beijing University of Agriculture, 102206, Beijing, China
| | - Yiming Han
- Department of Mechanics and Engineering Science, College of Engineering, Peking University, 100871, Beijing, China
| | - Hongwei Xu
- Department of Mechanics and Engineering Science, College of Engineering, Peking University, 100871, Beijing, China
| | - Xiaocen Duan
- Department of Mechanics and Engineering Science, College of Engineering, Peking University, 100871, Beijing, China
| | - Wenyue Xie
- Department of Mechanics and Engineering Science, College of Engineering, Peking University, 100871, Beijing, China
| | - Zhuoling Tian
- Department of Mechanics and Engineering Science, College of Engineering, Peking University, 100871, Beijing, China
- Academy for Advanced Interdisciplinary Studies, Peking University, 100871, Beijing, China
| | - Zuoying Yuan
- Department of Mechanics and Engineering Science, College of Engineering, Peking University, 100871, Beijing, China
| | - Zhuo Wan
- Department of Mechanics and Engineering Science, College of Engineering, Peking University, 100871, Beijing, China
| | - Liang Xu
- Department of Mechanics and Engineering Science, College of Engineering, Peking University, 100871, Beijing, China
- Academy for Advanced Interdisciplinary Studies, Peking University, 100871, Beijing, China
| | - Siying Qin
- School of Life Sciences, Peking University, 100871, Beijing, China
| | - Kangmin He
- State Key Laboratory of Molecular Developmental Biology, Institute of Genetics and Developmental Biology, Chinese Academy of Sciences, 100101, Beijing, China
- University of Chinese Academy of Sciences, 100049, Beijing, China
| | - Jianyong Huang
- Department of Mechanics and Engineering Science, College of Engineering, Peking University, 100871, Beijing, China.
|
27
|
Yang Y, Wang K, Lu Z, Wang T, Wang X. Cytomulate: accurate and efficient simulation of CyTOF data. Genome Biol 2023; 24:262. [PMID: 37974276 PMCID: PMC10652542 DOI: 10.1186/s13059-023-03099-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/05/2022] [Accepted: 10/24/2023] [Indexed: 11/19/2023] Open
Abstract
Recently, many analysis tools have been devised to offer insights into data generated via cytometry by time-of-flight (CyTOF). However, objective evaluations of these methods remain absent, as most evaluations are conducted against real data where the ground truth is generally unknown. In this paper, we develop Cytomulate, a reproducible and accurate simulation algorithm for CyTOF data, which could serve as a foundation for future method development and evaluation. We demonstrate that Cytomulate can capture various characteristics of CyTOF data and is superior in learning overall data distributions to single-cell RNA-seq-oriented methods such as scDesign2 and Splatter, and to generative models like LAMBDA.
Affiliation(s)
- Yuqiu Yang
- Department of Statistics and Data Science, Southern Methodist University, Dallas, TX, 75275, USA
- Quantitative Biomedical Research Center, Peter O'Donnell Jr. School of Public Health, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA
| | - Kaiwen Wang
- Department of Statistics and Data Science, Southern Methodist University, Dallas, TX, 75275, USA
| | - Zeyu Lu
- Department of Statistics and Data Science, Southern Methodist University, Dallas, TX, 75275, USA
- Quantitative Biomedical Research Center, Peter O'Donnell Jr. School of Public Health, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA
| | - Tao Wang
- Quantitative Biomedical Research Center, Peter O'Donnell Jr. School of Public Health, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA.
- Center for the Genetics of Host Defense, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA.
| | - Xinlei Wang
- Department of Statistics and Data Science, Southern Methodist University, Dallas, TX, 75275, USA.
- Department of Mathematics, University of Texas at Arlington, Arlington, 76019, USA.
- Center for Data Science Research and Education, College of Science, University of Texas at Arlington, Arlington, 76019, USA.
|
28
|
Li C, Chen X, Chen S, Jiang R, Zhang X. simCAS: an embedding-based method for simulating single-cell chromatin accessibility sequencing data. Bioinformatics 2023; 39:btad453. [PMID: 37494428 PMCID: PMC10394124 DOI: 10.1093/bioinformatics/btad453] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 03/20/2023] [Revised: 06/25/2023] [Accepted: 07/25/2023] [Indexed: 07/28/2023] Open
Abstract
MOTIVATION: Single-cell chromatin accessibility sequencing (scCAS) technology provides an epigenomic perspective to characterize gene regulatory mechanisms at single-cell resolution. With an increasing number of computational methods proposed for analyzing scCAS data, a powerful simulation framework is desirable for evaluation and validation of these methods. However, existing simulators generate synthetic data by sampling reads from real data or mimicking existing cell states, which is inadequate to provide credible ground-truth labels for method evaluation.
RESULTS: We present simCAS, an embedding-based simulator for generating high-fidelity scCAS data from both cell- and peak-wise embeddings. We demonstrate that simCAS outperforms existing simulators in resembling real data and show that simCAS can generate cells of different states with user-defined cell populations and differentiation trajectories. Additionally, simCAS can simulate data from different batches and encode user-specified interactions of chromatin regions in the synthetic data, which provides ground-truth labels beyond cell states. We systematically demonstrate that simCAS facilitates the benchmarking of four core tasks in downstream analysis: cell clustering, trajectory inference, data integration, and cis-regulatory interaction inference. We anticipate that simCAS will be a reliable and flexible simulator for evaluating computational methods applied to scCAS data.
AVAILABILITY AND IMPLEMENTATION: simCAS is freely available at https://github.com/Chen-Li-17/simCAS.
Affiliation(s)
- Chen Li
- MOE Key Laboratory of Bioinformatics and Bioinformatics Division of BNRIST, Department of Automation, Tsinghua University, Beijing 100084, China
| | - Xiaoyang Chen
- MOE Key Laboratory of Bioinformatics and Bioinformatics Division of BNRIST, Department of Automation, Tsinghua University, Beijing 100084, China
| | - Shengquan Chen
- School of Mathematical Sciences and LPMC, Nankai University, Tianjin 300071, China
| | - Rui Jiang
- MOE Key Laboratory of Bioinformatics and Bioinformatics Division of BNRIST, Department of Automation, Tsinghua University, Beijing 100084, China
| | - Xuegong Zhang
- MOE Key Laboratory of Bioinformatics and Bioinformatics Division of BNRIST, Department of Automation, Tsinghua University, Beijing 100084, China
- Center for Synthetic and Systems Biology, School of Life Sciences and School of Medicine, Tsinghua University, Beijing 100084, China
|
29
|
Mohammad-Taheri S, Tewari V, Kapre R, Rahiminasab E, Sachs K, Tapley Hoyt C, Zucker J, Vitek O. Optimal adjustment sets for causal query estimation in partially observed biomolecular networks. Bioinformatics 2023; 39:i494-i503. [PMID: 37387179 PMCID: PMC10311316 DOI: 10.1093/bioinformatics/btad270] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/01/2023] Open
Abstract
Causal query estimation in biomolecular networks commonly selects a 'valid adjustment set', i.e. a subset of network variables that eliminates the bias of the estimator. The same query may have multiple valid adjustment sets, each with a different variance. When networks are partially observed, current methods use graph-based criteria to find an adjustment set that minimizes asymptotic variance. Unfortunately, many models that share the same graph topology, and therefore the same functional dependencies, may differ in the processes that generate the observational data. In these cases, the topology-based criteria fail to distinguish the variances of the adjustment sets. This deficiency can lead to sub-optimal adjustment sets and to mischaracterization of the effect of the intervention. We propose an approach for deriving 'optimal adjustment sets' that takes into account the nature of the data, the bias and finite-sample variance of the estimator, and cost. It empirically learns the data generating processes from historical experimental data, and characterizes the properties of the estimators by simulation. We demonstrate the utility of the proposed approach in four biomolecular case studies with different topologies and different data generation processes. The implementation and reproducible case studies are at https://github.com/srtaheri/OptimalAdjustmentSet.
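As a toy illustration of characterizing estimators by simulation, the sketch below compares two valid adjustment sets in a small linear model. The graph (Z confounds X and Y; W affects only Y), the coefficients, and the function names `ols` and `simulate_estimates` are assumptions made for this example; they are not taken from the paper or its repository. Both {Z} and {Z, W} yield an unbiased estimate of the effect of X on Y, but their finite-sample variances differ.

```python
import random
import statistics


def ols(columns, y):
    """Least-squares coefficients for y ~ columns, solved via the normal equations."""
    k = len(columns)
    # Augmented matrix [X'X | X'y].
    a = [[sum(ci * cj for ci, cj in zip(columns[i], columns[j])) for j in range(k)]
         + [sum(ci * yi for ci, yi in zip(columns[i], y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [vr - f * vc for vr, vc in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]


def simulate_estimates(adjust_for_w, n_samples=100, n_reps=300, true_effect=1.0, seed=0):
    """Mean and variance of the estimated X -> Y effect over repeated simulated datasets."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_reps):
        z = [rng.gauss(0, 1) for _ in range(n_samples)]   # confounder of X and Y
        w = [rng.gauss(0, 1) for _ in range(n_samples)]   # affects Y only (assumed)
        x = [0.8 * zi + rng.gauss(0, 1) for zi in z]
        y = [true_effect * xi + 1.0 * zi + 2.0 * wi + rng.gauss(0, 1)
             for xi, zi, wi in zip(x, z, w)]
        cols = [[1.0] * n_samples, x, z] + ([w] if adjust_for_w else [])
        estimates.append(ols(cols, y)[1])                 # index 1 is the coefficient of X
    return statistics.mean(estimates), statistics.variance(estimates)


mean_z, var_z = simulate_estimates(adjust_for_w=False)    # adjustment set {Z}
mean_zw, var_zw = simulate_estimates(adjust_for_w=True)   # adjustment set {Z, W}
print(f"{{Z}}:    mean={mean_z:.3f}, variance={var_z:.4f}")
print(f"{{Z, W}}: mean={mean_zw:.3f}, variance={var_zw:.4f}")
```

In this toy setting, adding W removes outcome noise and tightens the estimate; a cost-aware choice would weigh that gain against the expense of measuring W.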
Affiliation(s)
- Sara Mohammad-Taheri
- Khoury College of Computer Sciences, Northeastern University, Boston, MA 02115, USA
| | - Vartika Tewari
- Khoury College of Computer Sciences, Northeastern University, Boston, MA 02115, USA
| | - Rohan Kapre
- Khoury College of Computer Sciences, Northeastern University, Boston, MA 02115, USA
| | | | - Karen Sachs
- Next Generation Analytics, Palo Alto California, USA
- Modulo Bio Inc, Los Altos, California, USA
- Answer ALS, New Orleans, LA, USA
| | - Charles Tapley Hoyt
- Laboratory of Systems Pharmacology, Harvard Medical School, Boston, Massachusetts, USA
| | - Jeremy Zucker
- Pacific Northwest National Laboratory, Richland, Washington 99354, USA
| | - Olga Vitek
- Khoury College of Computer Sciences, Northeastern University, Boston, MA 02115, USA
|
30
|
Yan D, Sun Z, Fang J, Cao S, Wang W, Chang X, Badirli S, Fu H, Liu Y. scRAA: the development of a robust and automatic annotation procedure for single-cell RNA sequencing data. J Biopharm Stat 2023:1-14. [PMID: 37162278 DOI: 10.1080/10543406.2023.2208671] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/11/2023]
Abstract
A critical task in single-cell RNA sequencing (scRNA-Seq) data analysis is to identify cell types from heterogeneous tissues. While the majority of classification methods have demonstrated high performance in scRNA-Seq annotation problems, a robust and accurate solution is desired to generate reliable outcomes for downstream analyses, such as marker gene identification, differential expression analysis, and pathway analysis. It is hard to establish a universally good metric, and thus no classification method is universally good across all scenarios. In addition, reference and query data in cell classification are usually from different experimental batches, and failure to consider batch effects may result in misleading conclusions. To overcome this bottleneck, we propose a robust ensemble approach to classify cells and utilize a batch correction method between reference and query data. We simulated four scenarios that range from simple to complex batch effects and account for varying cell-type proportions. We further tested our approach on both lung and pancreas data. We found improved prediction accuracy and robust performance across simulation scenarios and real data. The incorporation of batch effect correction between reference and query, together with the ensemble approach, improves cell-type prediction accuracy while maintaining robustness. We demonstrated these through simulated and real scRNA-Seq data.
Affiliation(s)
- Dongyan Yan
- Global Statistical Science, Eli Lilly & Co, Indianapolis, Indiana, USA
| | - Zhe Sun
- Global Statistical Science, Eli Lilly & Co, Indianapolis, Indiana, USA
| | - Jiyuan Fang
- Global Statistical Science, Eli Lilly & Co, Indianapolis, Indiana, USA
| | - Shanshan Cao
- Global Statistical Science, Eli Lilly & Co, Indianapolis, Indiana, USA
| | - Wenjie Wang
- Advance Analytics and Data Science, Eli Lilly & Co, Indianapolis, Indiana, USA
| | - Xinyue Chang
- Advance Analytics and Data Science, Eli Lilly & Co, Indianapolis, Indiana, USA
| | - Sarkhan Badirli
- Advance Analytics and Data Science, Eli Lilly & Co, Indianapolis, Indiana, USA
| | - Haoda Fu
- Advance Analytics and Data Science, Eli Lilly & Co, Indianapolis, Indiana, USA
| | - Yushi Liu
- Global Statistical Science, Eli Lilly & Co, Indianapolis, Indiana, USA
|
31
|
Liu C, Huang H, Yang P. Multi-task learning from multimodal single-cell omics with Matilda. Nucleic Acids Res 2023; 51:e45. [PMID: 36912104 PMCID: PMC10164589 DOI: 10.1093/nar/gkad157] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/20/2022] [Revised: 01/28/2023] [Accepted: 02/21/2023] [Indexed: 03/14/2023] Open
Abstract
Multimodal single-cell omics technologies enable multiple molecular programs to be simultaneously profiled at a global scale in individual cells, creating opportunities to study biological systems at a resolution that was previously inaccessible. However, the analysis of multimodal single-cell omics data is challenging due to the lack of methods that can integrate across multiple data modalities generated from such technologies. Here, we present Matilda, a multi-task learning method for integrative analysis of multimodal single-cell omics data. By leveraging the interrelationship among tasks, Matilda learns to perform data simulation, dimension reduction, cell type classification, and feature selection in a single unified framework. We compare Matilda with other state-of-the-art methods on datasets generated from some of the most popular multimodal single-cell omics technologies. Our results demonstrate the utility of Matilda for addressing multiple key tasks in integrative multimodal single-cell omics data analysis. Matilda is implemented in PyTorch and is freely available from https://github.com/PYangLab/Matilda.
Affiliation(s)
- Chunlei Liu
  - Computational Systems Biology Group, Children's Medical Research Institute, Faculty of Medicine and Health, The University of Sydney, Westmead, NSW 2145, Australia
- Hao Huang
  - Computational Systems Biology Group, Children's Medical Research Institute, Faculty of Medicine and Health, The University of Sydney, Westmead, NSW 2145, Australia
  - School of Mathematics and Statistics, The University of Sydney, Sydney, NSW 2006, Australia
- Pengyi Yang
  - Computational Systems Biology Group, Children's Medical Research Institute, Faculty of Medicine and Health, The University of Sydney, Westmead, NSW 2145, Australia
  - School of Mathematics and Statistics, The University of Sydney, Sydney, NSW 2006, Australia
  - Charles Perkins Centre, The University of Sydney, Sydney, NSW 2006, Australia
32. Sun L, Wang G, Zhang Z. SimCH: simulation of single-cell RNA sequencing data by modeling cellular heterogeneity at gene expression level. Brief Bioinform 2023; 24:6961608. [PMID: 36575569 DOI: 10.1093/bib/bbac590] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/25/2022] [Revised: 11/08/2022] [Accepted: 12/02/2022] [Indexed: 12/29/2022] [Open Access]
Abstract
Single-cell ribonucleic acid (RNA) sequencing (scRNA-seq) has been a powerful technology for transcriptome analysis. However, the systematic validation of diverse computational tools used in scRNA-seq analysis remains challenging. Here, we propose a novel simulation tool, termed as Simulation of Cellular Heterogeneity (SimCH), for the flexible and comprehensive assessment of scRNA-seq computational methods. The Gaussian Copula framework is recruited to retain gene coexpression of experimental data shown to be associated with cellular heterogeneity. The synthetic count matrices generated by suitable SimCH modes closely match experimental data originating from either homogeneous or heterogeneous cell populations and either unique molecular identifier (UMI)-based or non-UMI-based techniques. We demonstrate how SimCH can benchmark several types of computational methods, including cell clustering, discovery of differentially expressed genes, trajectory inference, batch correction and imputation. Moreover, we show how SimCH can be used to conduct power evaluation of cell clustering methods. Given these merits, we believe that SimCH can accelerate single-cell research.
Affiliation(s)
- Lei Sun
  - School of Information Engineering, Yangzhou University, Yangzhou, P.R. China
  - School of Artificial Intelligence, Yangzhou University, Yangzhou, P.R. China
  - CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, and China National Center for Bioinformation, Beijing, P.R. China
- Gongming Wang
  - School of Information Engineering, Yangzhou University, Yangzhou, P.R. China
  - School of Artificial Intelligence, Yangzhou University, Yangzhou, P.R. China
  - China Unicom Software Research Institute Jinan Branch, Jinan, P.R. China
- Zhihua Zhang
  - CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, and China National Center for Bioinformation, Beijing, P.R. China
  - School of Life Science, University of Chinese Academy of Sciences, Beijing, P.R. China
33. Shakola F, Palejev D, Ivanov I. A Framework for Comparison and Assessment of Synthetic RNA-Seq Data. Genes (Basel) 2022; 13:2362. [PMID: 36553629 PMCID: PMC9778097 DOI: 10.3390/genes13122362] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 11/09/2022] [Revised: 12/05/2022] [Accepted: 12/06/2022] [Indexed: 12/16/2022] [Open Access]
Abstract
The ever-growing number of methods for the generation of synthetic bulk and single cell RNA-seq data have multiple and diverse applications. They are often aimed at benchmarking bioinformatics algorithms for purposes such as sample classification, differential expression analysis, correlation and network studies and the optimization of data integration and normalization techniques. Here, we propose a general framework to compare synthetically generated RNA-seq data and select a data-generating tool that is suitable for a set of specific study goals. As there are multiple methods for synthetic RNA-seq data generation, researchers can use the proposed framework to make an informed choice of an RNA-seq data simulation algorithm and software that are best suited for their specific scientific questions of interest.
Affiliation(s)
- Felitsiya Shakola
  - GATE Institute, Sofia University, 125 Tsarigradsko Shosse, Bl. 2, 1113 Sofia, Bulgaria
- Dean Palejev
  - Institute of Mathematics and Informatics, Bulgarian Academy of Sciences, Acad. G. Bonchev St., Bl. 8, 1113 Sofia, Bulgaria
- Ivan Ivanov
  - Department of Veterinary Physiology and Pharmacology, Texas A&M University, College Station, TX 77843, USA
34. Robert PA, Akbar R, Frank R, Pavlović M, Widrich M, Snapkov I, Slabodkin A, Chernigovskaya M, Scheffer L, Smorodina E, Rawat P, Mehta BB, Vu MH, Mathisen IF, Prósz A, Abram K, Olar A, Miho E, Haug DTT, Lund-Johansen F, Hochreiter S, Haff IH, Klambauer G, Sandve GK, Greiff V. Unconstrained generation of synthetic antibody-antigen structures to guide machine learning methodology for antibody specificity prediction. Nat Comput Sci 2022; 2:845-865. [PMID: 38177393 DOI: 10.1038/s43588-022-00372-4] [Citation(s) in RCA: 17] [Impact Index Per Article: 5.7] [Received: 07/16/2021] [Accepted: 11/09/2022] [Indexed: 01/06/2024]
Abstract
Machine learning (ML) is a key technology for accurate prediction of antibody-antigen binding. Two orthogonal problems hinder the application of ML to antibody-specificity prediction and the benchmarking thereof: the lack of a unified ML formalization of immunological antibody-specificity prediction problems and the unavailability of large-scale synthetic datasets to benchmark real-world relevant ML methods and dataset design. Here we developed the Absolut! software suite that enables parameter-based unconstrained generation of synthetic lattice-based three-dimensional antibody-antigen-binding structures with ground-truth access to conformational paratope, epitope and affinity. We formalized common immunological antibody-specificity prediction problems as ML tasks and confirmed that for both sequence- and structure-based tasks, accuracy-based rankings of ML methods trained on experimental data hold for ML methods trained on Absolut!-generated data. The Absolut! framework has the potential to enable real-world relevant development and benchmarking of ML strategies for biotherapeutics design.
Affiliation(s)
- Philippe A Robert
  - Department of Immunology, University of Oslo and Oslo University Hospital, Oslo, Norway
- Rahmad Akbar
  - Department of Immunology, University of Oslo and Oslo University Hospital, Oslo, Norway
- Robert Frank
  - Department of Immunology, University of Oslo and Oslo University Hospital, Oslo, Norway
- Michael Widrich
  - ELLIS Unit Linz and LIT AI Lab, Institute for Machine Learning, Johannes Kepler University Linz, Linz, Austria
- Igor Snapkov
  - Department of Immunology, University of Oslo and Oslo University Hospital, Oslo, Norway
- Andrei Slabodkin
  - Department of Immunology, University of Oslo and Oslo University Hospital, Oslo, Norway
- Maria Chernigovskaya
  - Department of Immunology, University of Oslo and Oslo University Hospital, Oslo, Norway
- Eva Smorodina
  - Department of Immunology, University of Oslo and Oslo University Hospital, Oslo, Norway
- Puneet Rawat
  - Department of Immunology, University of Oslo and Oslo University Hospital, Oslo, Norway
- Brij Bhushan Mehta
  - Department of Immunology, University of Oslo and Oslo University Hospital, Oslo, Norway
- Mai Ha Vu
  - Department of Linguistics and Scandinavian Studies, University of Oslo, Oslo, Norway
- Aurél Prósz
  - Danish Cancer Society Research Center, Translational Cancer Genomics, Copenhagen, Denmark
- Krzysztof Abram
  - The Novo Nordisk Foundation Center for Biosustainability, Autoflow, DTU Biosustain and IT University of Copenhagen, Copenhagen, Denmark
- Alex Olar
  - Department of Complex Systems in Physics, Eötvös Loránd University, Budapest, Hungary
- Enkelejda Miho
  - Institute of Medical Engineering and Medical Informatics, School of Life Sciences, FHNW University of Applied Sciences and Arts Northwestern Switzerland, Muttenz, Switzerland
  - aiNET GmbH, Basel, Switzerland
  - Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Sepp Hochreiter
  - ELLIS Unit Linz and LIT AI Lab, Institute for Machine Learning, Johannes Kepler University Linz, Linz, Austria
  - Institute of Advanced Research in Artificial Intelligence (IARAI), Vienna, Austria
- Günter Klambauer
  - ELLIS Unit Linz and LIT AI Lab, Institute for Machine Learning, Johannes Kepler University Linz, Linz, Austria
- Victor Greiff
  - Department of Immunology, University of Oslo and Oslo University Hospital, Oslo, Norway
35. Sandve GK, Greiff V. Access to ground truth at unconstrained size makes simulated data as indispensable as experimental data for bioinformatics methods development and benchmarking. Bioinformatics 2022; 38:4994-4996. [PMID: 36073940 PMCID: PMC9620827 DOI: 10.1093/bioinformatics/btac612] [Citation(s) in RCA: 17] [Impact Index Per Article: 5.7] [Received: 05/04/2021] [Revised: 02/18/2022] [Accepted: 09/08/2022] [Indexed: 11/14/2022] [Open Access]
Affiliation(s)
- Geir Kjetil Sandve
  - Department of Informatics, University of Oslo, 0316 Oslo, Norway
  - Centre of Bioinformatics, University of Oslo, 0316 Oslo, Norway
  - UiORealArt convergence environment, University of Oslo, 0316 Oslo, Norway
- Victor Greiff
  - Department of Immunology, University of Oslo and Oslo University Hospital, 0316 Oslo, Norway
36. Azodi CB, Zappia L, Oshlack A, McCarthy DJ. splatPop: simulating population scale single-cell RNA sequencing data. Genome Biol 2021; 22:341. [PMID: 34911537 DOI: 10.1186/s13059-021-02546-1] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 06/26/2021] [Accepted: 11/19/2021] [Indexed: 11/10/2022] [Open Access]
Abstract
Population-scale single-cell RNA sequencing (scRNA-seq) is now viable, enabling finer resolution functional genomics studies and leading to a rush to adapt bulk methods and develop new single-cell-specific methods to perform these studies. Simulations are useful for developing, testing, and benchmarking methods but current scRNA-seq simulation frameworks do not simulate population-scale data with genetic effects. Here, we present splatPop, a model for flexible, reproducible, and well-documented simulation of population-scale scRNA-seq data with known expression quantitative trait loci. splatPop can also simulate complex batch, cell group, and conditional effects between individuals from different cohorts as well as genetically-driven co-expression.
Affiliation(s)
- Christina B Azodi
  - St. Vincent's Institute of Medical Research, 9 Princes Street, Fitzroy, 3065, VIC, Australia
  - University of Melbourne, Royal Parade, Parkville, 3010, VIC, Australia
- Luke Zappia
  - Department of Mathematics, Technical University of Munich, Boltzmannstraße 3, Garching bei München, 85748, Germany
  - Institute of Computational Biology, Helmholtz Zentrum München, Ingolstädter Landstraße 1, Neuherberg, 85764, Germany
- Alicia Oshlack
  - University of Melbourne, Royal Parade, Parkville, 3010, VIC, Australia
  - Peter MacCallum Cancer Centre, Grattan Street, Melbourne, 3000, VIC, Australia
- Davis J McCarthy
  - St. Vincent's Institute of Medical Research, 9 Princes Street, Fitzroy, 3065, VIC, Australia
  - University of Melbourne, Royal Parade, Parkville, 3010, VIC, Australia