1. Nguyen H, Nguyen H, Tran D, Draghici S, Nguyen T. Fourteen years of cellular deconvolution: methodology, applications, technical evaluation and outstanding challenges. Nucleic Acids Res 2024; 52:4761-4783. [PMID: 38619038 PMCID: PMC11109966 DOI: 10.1093/nar/gkae267] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/26/2023] [Revised: 03/01/2024] [Accepted: 04/02/2024] [Indexed: 04/16/2024]
Abstract
Single-cell RNA sequencing (scRNA-Seq) is a recent technology that allows for the measurement of the expression of all genes in each individual cell contained in a sample. Information at the single-cell level has been shown to be extremely useful in many areas. However, performing single-cell experiments is expensive. Although cellular deconvolution cannot provide the same comprehensive information as single-cell experiments, it can extract cell-type information from bulk RNA data, and therefore it allows researchers to conduct studies at cell-type resolution from existing bulk datasets. For these reasons, a great effort has been made to develop such methods for cellular deconvolution. The large number of methods available, the requirement of coding skills, inadequate documentation, and lack of performance assessment all make it extremely difficult for life scientists to choose a suitable method for their experiment. This paper aims to fill this gap by providing a comprehensive review of 53 deconvolution methods regarding their methodology, applications, performance, and outstanding challenges. More importantly, the article presents a benchmarking of all these 53 methods using 283 cell types from 30 tissues of 63 individuals. We also provide an R package named DeconBenchmark that allows readers to execute and benchmark the reviewed methods (https://github.com/tinnlab/DeconBenchmark).
Affiliation(s)
- Hung Nguyen
  - Department of Computer Science and Software Engineering, Auburn University, Auburn, AL, USA
- Ha Nguyen
  - Department of Computer Science and Software Engineering, Auburn University, Auburn, AL, USA
- Duc Tran
  - Department of Medicine, Washington University School of Medicine, St. Louis, MO, USA
- Sorin Draghici
  - Department of Computer Science, Wayne State University, Detroit, MI, USA
  - Advaita Bioinformatics, Ann Arbor, MI, USA
- Tin Nguyen
  - Department of Computer Science and Software Engineering, Auburn University, Auburn, AL, USA
2. Vathrakokoili Pournara A, Miao Z, Beker OY, Nolte N, Brazma A, Papatheodorou I. CATD: a reproducible pipeline for selecting cell-type deconvolution methods across tissues. Bioinformatics Advances 2024; 4:vbae048. [PMID: 38638280 PMCID: PMC11023940 DOI: 10.1093/bioadv/vbae048] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/09/2023] [Revised: 02/20/2024] [Accepted: 03/21/2024] [Indexed: 04/20/2024]
Abstract
Motivation: Cell-type deconvolution methods aim to infer cell composition from bulk transcriptomic data. The proliferation of developed methods, coupled with the inconsistent results obtained in many cases, highlights the pressing need for guidance in selecting appropriate methods. Additionally, the growing accessibility of single-cell RNA sequencing datasets, often accompanied by bulk expression from related samples, enables the benchmarking of existing methods. Results: In this study, we conduct a comprehensive assessment of 31 methods, utilizing single-cell RNA-sequencing data from diverse human and mouse tissues. Employing various simulation scenarios, we reveal the efficacy of regression-based deconvolution methods, highlighting their sensitivity to reference choices. We investigate the impact of bulk-reference differences, incorporating variables such as sample, study and technology. We provide validation using a gold standard dataset from mononuclear cells and suggest a consensus prediction of proportions when ground truth is not available. We validated the consensus method on data from the stomach and studied its spillover effect. Importantly, we propose the use of the critical assessment of transcriptomic deconvolution (CATD) pipeline, which encompasses functionalities for generating references and pseudo-bulks and for running implemented deconvolution methods. CATD streamlines simultaneous deconvolution of numerous bulk samples, providing a practical solution for speeding up the evaluation of newly developed methods. Availability and implementation: https://github.com/Papatheodorou-Group/CATD_snakemake.
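The pseudo-bulk idea mentioned in this abstract can be illustrated with a minimal sketch. This is not the CATD implementation: the function name `make_pseudobulk`, the toy cell data, and the choice to sample cells with replacement are all assumptions made for illustration. The sketch sums the counts of randomly sampled single cells into one simulated bulk profile and records the true cell-type proportions, which a benchmark can then compare against a method's estimates.

```python
import random
from collections import Counter


def make_pseudobulk(cells, n_cells, rng):
    """Build one pseudo-bulk sample from annotated single cells.

    cells   : list of (cell_type, counts) pairs; counts is a list of per-gene counts
    n_cells : number of cells to draw (with replacement; an assumption of this sketch)
    rng     : random.Random instance, passed in for reproducibility
    Returns (bulk_counts, true_proportions).
    """
    sampled = [rng.choice(cells) for _ in range(n_cells)]
    n_genes = len(cells[0][1])

    # The pseudo-bulk profile is the gene-wise sum over the sampled cells.
    bulk = [0] * n_genes
    for _, counts in sampled:
        for gene_index, count in enumerate(counts):
            bulk[gene_index] += count

    # Ground-truth proportions are the fraction of sampled cells per type.
    type_counts = Counter(cell_type for cell_type, _ in sampled)
    proportions = {t: k / n_cells for t, k in type_counts.items()}
    return bulk, proportions


# Hypothetical toy data: two cell types, three genes.
toy_cells = [
    ("T", [5, 0, 1]),
    ("B", [0, 4, 2]),
]
bulk, truth = make_pseudobulk(toy_cells, 100, random.Random(0))
```

Because the true proportions are known by construction, any deconvolution method can be scored on such samples without an experimental gold standard.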
Affiliation(s)
- Anna Vathrakokoili Pournara
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, United Kingdom
- Zhichao Miao
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, United Kingdom
  - Open Targets, Wellcome Genome Campus, Hinxton CB10 1SD, United Kingdom
  - GMU-GIBH Joint School of Life Sciences, Guangzhou Laboratory, Guangzhou Medical University, Guangzhou, 511436, China
- Ozgur Yilimaz Beker
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, United Kingdom
  - Faculty of Engineering and Natural Sciences, Sabanci University, Tuzla 34956, Turkey
- Nadja Nolte
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, United Kingdom
  - Department of Biotechnology and Systems Biology, National Institute of Biology, Ljubljana, 121-1000, Slovenia
- Alvis Brazma
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, United Kingdom
- Irene Papatheodorou
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, United Kingdom
  - Open Targets, Wellcome Genome Campus, Hinxton CB10 1SD, United Kingdom
  - Earlham Institute, Norwich Research Park, Norwich NR4 7UZ, United Kingdom
3. Morita K, Mizuno T, Azuma I, Suzuki Y, Kusuhara H. Rat Deconvolution as Knowledge Miner for Immune Cell Trafficking from Toxicogenomics Databases. Toxicol Sci 2023; 197:kfad117. [PMID: 37941435 PMCID: PMC10823770 DOI: 10.1093/toxsci/kfad117] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/10/2023]
Abstract
Toxicogenomics databases are useful for understanding biological responses in individuals because they include a diverse spectrum of biological responses. Although these databases contain no information regarding immune cells in the liver, which are important in the progression of liver injury, deconvolution that estimates cell-type proportions from the bulk transcriptome could extend immune information. However, deconvolution has been mainly applied to humans and mice and less often to rats, which are the main target of toxicogenomics databases. Here, we developed a deconvolution method for rats to retrieve information regarding immune cells from toxicogenomics databases. The rat-specific deconvolution showed high correlations for several types of immune cells between spleen and blood, and between liver treated with toxicants compared with those based on human and mouse data. Additionally, we found 4 clusters of compounds in the Open TG-GATEs database based on estimated immune cell trafficking, which are different from those based on the transcriptome data itself. The contributions of this work are three-fold. First, we obtained the gene expression profiles of 6 rat immune cells necessary for deconvolution. Second, we clarified the importance of species differences in deconvolution. Third, we retrieved immune cell trafficking from toxicogenomics databases. Accumulated and comparable immune cell profiles from massive data on immune cell trafficking in rats could deepen our understanding and enable us to clarify the relationship between the order and the contribution rate of immune cells, chemokines and cytokines, and pathologies. Ultimately, these findings will lead to the evaluation of organ responses in the Adverse Outcome Pathway.
Affiliation(s)
- Katsuhisa Morita
  - Department of Pharmaceutical Sciences, The University of Tokyo, Bunkyo, Tokyo, Japan
- Tadahaya Mizuno
  - Department of Pharmaceutical Sciences, The University of Tokyo, Bunkyo, Tokyo, Japan
- Iori Azuma
  - Department of Pharmaceutical Sciences, The University of Tokyo, Bunkyo, Tokyo, Japan
- Yutaka Suzuki
  - Graduate School of Frontier Sciences, The University of Tokyo, Chiba, Japan
- Hiroyuki Kusuhara
  - Department of Pharmaceutical Sciences, The University of Tokyo, Bunkyo, Tokyo, Japan
4. Boldina G, Fogel P, Rocher C, Bettembourg C, Luta G, Augé F. A2Sign: Agnostic Algorithms for Signatures-a universal method for identifying molecular signatures from transcriptomic datasets prior to cell-type deconvolution. Bioinformatics 2022; 38:1015-1021. [PMID: 34788798 DOI: 10.1093/bioinformatics/btab773] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/30/2021] [Revised: 09/17/2021] [Accepted: 11/09/2021] [Indexed: 02/03/2023]
Abstract
Motivation: Molecular signatures are critical for inferring the proportions of cell types from bulk transcriptomics data. However, the identification of these signatures is based on a methodology that relies on prior biological knowledge of the cell types being studied. When working with less known biological material, a data-driven approach is required to uncover the underlying classes and generate ad hoc signatures from healthy or pathogenic tissue. Results: We present a new approach, A2Sign: Agnostic Algorithms for Signatures, based on a non-negative tensor factorization (NTF) strategy that allows us to identify cell-type-specific molecular signatures, greatly reduce collinearities and also account for inter-individual variability. We propose a global framework that can be applied to uncover molecular signatures for cell-type deconvolution in arbitrary tissues using bulk transcriptome data. We also present two new molecular signatures for deconvolution of up to 16 immune cell types using microarray or RNA-seq data. Availability and implementation: All steps of our analysis were implemented in annotated Python notebooks (https://github.com/paulfogel/A2SIGN). To perform NTF, we used the NMTF package, which can be installed with pip. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Galina Boldina
  - Sanofi, R&D Translational Sciences France, Bioinformatics, Sanofi, F-91385 Chilly-Mazarin Cedex, France
- Paul Fogel
  - Consultant, F-75006 Paris, France
  - Advestis, F-75008 Paris, France
  - Quinten, F-75017 Paris, France
- Corinne Rocher
  - Sanofi, R&D Translational Sciences France, Bioinformatics, Sanofi, F-91385 Chilly-Mazarin Cedex, France
- Charles Bettembourg
  - Sanofi, R&D Translational Sciences France, Bioinformatics, Sanofi, F-91385 Chilly-Mazarin Cedex, France
- George Luta
  - Department of Biostatistics, Bioinformatics and Biomathematics, Georgetown University, Washington, DC 20057, USA
- Franck Augé
  - Sanofi, R&D Translational Sciences France, Bioinformatics, Sanofi, F-91385 Chilly-Mazarin Cedex, France
5. Doostparast Torshizi A, Duan J, Wang K. A computational method for direct imputation of cell type-specific expression profiles and cellular compositions from bulk-tissue RNA-Seq in brain disorders. NAR Genom Bioinform 2021; 3:lqab056. [PMID: 34169279 PMCID: PMC8219045 DOI: 10.1093/nargab/lqab056] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 11/30/2020] [Revised: 05/24/2021] [Accepted: 06/21/2021] [Indexed: 02/06/2023]
Abstract
The importance of cell type-specific gene expression in disease-relevant tissues is increasingly recognized in genetic studies of complex diseases. However, most gene expression studies are conducted on bulk tissues, without examining cell type-specific expression profiles. Several computational methods are available for cell type deconvolution (i.e. inference of cellular composition) from bulk RNA-Seq data, but few of them impute cell type-specific expression profiles. We hypothesize that with external prior information such as single cell RNA-seq and population-wide expression profiles, it can be computationally tractable to estimate both cellular composition and cell type-specific expression from bulk RNA-Seq data. Here we introduce CellR, which addresses cross-individual gene expression variations to adjust the weights of cell-specific gene markers. It then transforms the deconvolution problem into a linear programming model while taking into account inter/intra cellular correlations and uses a multi-variate stochastic search algorithm to estimate the cell type-specific expression profiles. Analyses on several complex diseases such as schizophrenia, Alzheimer’s disease, Huntington’s disease and type 2 diabetes validated the efficiency of CellR, while revealing how specific cell types contribute to different diseases. In summary, CellR compares favorably against competing approaches, enabling cell type-specific re-analysis of gene expression data on bulk tissues in complex diseases.
Affiliation(s)
- Abolfazl Doostparast Torshizi
  - Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children's Hospital of Philadelphia, Philadelphia, PA 19104, USA
- Jubao Duan
  - Center for Psychiatric Genetics, NorthShore University HealthSystem, Evanston, IL 60201, USA
- Kai Wang
  - Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children's Hospital of Philadelphia, Philadelphia, PA 19104, USA
6. Jaakkola MK, Elo LL. Computational deconvolution to estimate cell type-specific gene expression from bulk data. NAR Genom Bioinform 2021; 3:lqaa110. [PMID: 33575652 PMCID: PMC7803005 DOI: 10.1093/nargab/lqaa110] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 08/20/2020] [Revised: 12/14/2020] [Accepted: 12/17/2020] [Indexed: 12/24/2022]
Abstract
Computational deconvolution is a time- and cost-efficient approach to obtain cell type-specific information from bulk gene expression of heterogeneous tissues like blood. Deconvolution can aim to estimate cell type proportions or abundances in samples, to estimate how strongly each present cell type expresses different genes, or to perform both tasks simultaneously. Of the two separate goals, the estimation of cell type proportions/abundances is widely studied, but less attention has been paid to defining the cell type-specific expression profiles. Here, we address this gap by introducing a novel method, Rodeo, and empirically evaluating it and the other available tools from multiple perspectives utilizing diverse datasets.
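The two goals named in this abstract are commonly framed with a linear mixing model. The block below is a generic sketch of that widely used formulation, not Rodeo's specific model; the symbols $B$, $S$, $P$ and the indices $g$ (gene), $j$ (sample) and $k$ (cell type) are notation assumed here for illustration.

```latex
% B: bulk expression (genes x samples)
% S: cell type-specific expression profiles (genes x cell types)
% P: cell type proportions (cell types x samples)
B_{gj} \approx \sum_{k=1}^{K} S_{gk}\, P_{kj},
\qquad P_{kj} \ge 0,
\qquad \sum_{k=1}^{K} P_{kj} = 1 \quad \text{for every sample } j .
```

Under this notation, estimating $P$ given $S$ corresponds to the proportion task, estimating $S$ given $P$ corresponds to the expression-profile task, and treating both as unknown corresponds to solving the two tasks simultaneously.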
Affiliation(s)
- Maria K Jaakkola
  - Turku Bioscience Centre, University of Turku and Åbo Akademi University, Tykistökatu 6, FI-20520 Turku, Finland
- Laura L Elo
  - Turku Bioscience Centre, University of Turku and Åbo Akademi University, Tykistökatu 6, FI-20520 Turku, Finland
7. Chen Z, Wu A. Progress and challenge for computational quantification of tissue immune cells. Brief Bioinform 2021; 22:6065002. [PMID: 33401306 DOI: 10.1093/bib/bbaa358] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 09/28/2020] [Revised: 10/23/2020] [Accepted: 11/07/2020] [Indexed: 12/28/2022]
Abstract
Tissue immune cells have long been recognized as important regulators for the maintenance of balance in the body system. Quantification of the abundance of different immune cells will provide enhanced understanding of the correlation between immune cells and normal or abnormal situations. Many computational methods to predict tissue immune cell compositions from bulk transcriptomes have now been developed. Therefore, summarizing their advantages and disadvantages is appropriate. In addition, an examination of the challenges and possible solutions for these computational models will assist the development of this field. The common hypothesis of these models is that the expression of signature genes for immune cell types might represent the proportion of immune cells that contribute to the tissue transcriptome. In general, we grouped all reported tools into three groups: reference-free, reference-based scoring and reference-based deconvolution methods. In this review, a summary of all the currently reported computational immune cell quantification tools and their applications, limitations, and perspectives is presented. Furthermore, some critical problems are found that have limited the performance and application of these models, including inadequate coverage of immune cell types, the collinearity problem, the impact of the tissue environment on the immune cell expression level, and the lack of standard datasets for model validation. To address these issues, tissue-specific training datasets that include all known immune cells, a hierarchical computational framework, and benchmark datasets including both tissue expression profiles and the abundances of all the immune cells are proposed to further promote the development of this field.
Affiliation(s)
- Ziyi Chen
  - Suzhou Institute of Systems Medicine, Center for Systems Medicine, Chinese Academy of Medical Sciences & Peking Union Medical College, Jiangsu, Suzhou, China
- Aiping Wu
  - Suzhou Institute of Systems Medicine, Center for Systems Medicine, Chinese Academy of Medical Sciences & Peking Union Medical College, Jiangsu, Suzhou, China
8. Devaraj V, Bose B. DEBay: A computational tool for deconvolution of quantitative PCR data for estimation of cell type-specific gene expression in a mixed population. Heliyon 2020; 6:e04489. [PMID: 32728643 PMCID: PMC7381708 DOI: 10.1016/j.heliyon.2020.e04489] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/21/2020] [Revised: 07/12/2020] [Accepted: 07/14/2020] [Indexed: 11/30/2022]
Abstract
The expression of a gene is commonly estimated by quantitative PCR (qPCR) using RNA isolated from a large number of pooled cells. Such pooled samples often have subpopulations of cells with different levels of expression of the target gene. Estimation of gene expression from an ensemble of cells obscures the pattern of expression in different subpopulations. Physical separation of various subpopulations is a demanding task. We have developed a computational tool, Deconvolution of Ensemble through Bayes-approach (DEBay), to estimate cell type-specific gene expression from qPCR data of a mixed population. DEBay estimates Normalized Gene Expression Coefficient (NGEC), which is a relative measure of the expression of the target gene in each cell type in a population. NGEC has a direct algebraic correspondence with the normalized fold change in gene expression measured by qPCR. DEBay can deconvolute both time-dependent and -independent gene expression profiles. It uses the Bayesian method of model selection and parameter estimation. We have evaluated DEBay using synthetic and real experimental data. DEBay is implemented in Python. A GUI of DEBay and its source code are available for download at SourceForge (https://sourceforge.net/projects/debay).
Affiliation(s)
- Vimalathithan Devaraj
  - Department of Biosciences and Bioengineering, Indian Institute of Technology Guwahati, Guwahati, 781039, India
- Biplab Bose
  - Department of Biosciences and Bioengineering, Indian Institute of Technology Guwahati, Guwahati, 781039, India
9. Kang K, Meng Q, Shats I, Umbach DM, Li M, Li Y, Li X, Li L. CDSeq: A novel complete deconvolution method for dissecting heterogeneous samples using gene expression data. PLoS Comput Biol 2019; 15:e1007510. [PMID: 31790389 PMCID: PMC6907860 DOI: 10.1371/journal.pcbi.1007510] [Citation(s) in RCA: 28] [Impact Index Per Article: 4.7] [Received: 07/03/2019] [Revised: 12/12/2019] [Accepted: 10/25/2019] [Indexed: 11/18/2022]
Abstract
Quantifying cell-type proportions and their corresponding gene expression profiles in tissue samples would enhance understanding of the contributions of individual cell types to the physiological states of the tissue. Current approaches that address tissue heterogeneity have drawbacks. Experimental techniques, such as fluorescence-activated cell sorting and single cell RNA sequencing, are expensive. Computational approaches that use expression data from heterogeneous samples are promising, but most of the current methods estimate either cell-type proportions or cell-type-specific expression profiles by requiring the other as input. Although such partial deconvolution methods have been successfully applied to tumor samples, the additional input required may be unavailable. We introduce a novel complete deconvolution method, CDSeq, that uses only RNA-Seq data from bulk tissue samples to simultaneously estimate both cell-type proportions and cell-type-specific expression profiles. Using several synthetic and real experimental datasets with known cell-type composition and cell-type-specific expression profiles, we compared CDSeq’s complete deconvolution performance with seven other established deconvolution methods. Complete deconvolution using CDSeq represents a substantial technical advance over partial deconvolution approaches and will be useful for studying cell mixtures in tissue samples. CDSeq is available at GitHub repository (MATLAB and Octave code): https://github.com/kkang7/CDSeq.
Understanding the cellular composition of bulk tissues is critical to investigate the underlying mechanisms of many biological processes. Single cell sequencing is a promising technique; however, it is expensive and the analysis of single cell data is non-trivial. Therefore, tissue samples are still routinely processed in bulk. To estimate cell-type composition using bulk gene expression data, computational deconvolution methods are needed. Many deconvolution methods have been proposed; however, they often estimate only cell type proportions using a reference cell type gene expression profile, which in many cases may not be available. We present a novel complete deconvolution method that uses only bulk gene expression data to simultaneously estimate cell-type-specific gene expression profiles and sample-specific cell-type proportions. We showed that, using multiple RNA-Seq and microarray datasets where the cell-type composition was previously known, our method could accurately determine the cell-type composition. By providing a method that requires a single input to determine both cell-type proportion and cell-type-specific expression profiles, we expect that our method will be beneficial to biologists and facilitate the research and identification of mechanisms underlying many biological processes.
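The partial deconvolution that this abstract contrasts with CDSeq can be sketched for the simplest case of two cell types. This is not CDSeq and not any of the benchmarked tools: the function name `estimate_fraction` and the toy signatures are hypothetical, and the sketch assumes a known reference profile for each cell type plus a linear mixture whose proportions sum to one. Under those assumptions the least-squares estimate has a closed form.

```python
def estimate_fraction(bulk, sig_a, sig_b):
    """Least-squares fraction of cell type A in a two-type mixture.

    Assumes bulk ~= p * sig_a + (1 - p) * sig_b with 0 <= p <= 1, where
    sig_a and sig_b are known reference profiles (the extra input that
    partial deconvolution requires). All arguments are per-gene lists.
    """
    diff = [a - b for a, b in zip(sig_a, sig_b)]
    denom = sum(d * d for d in diff)
    if denom == 0:
        raise ValueError("signatures are identical; the fraction is not identifiable")

    # Minimizing ||(bulk - sig_b) - p * (sig_a - sig_b)||^2 over p gives a
    # projection onto the difference vector.
    numer = sum((x - b) * d for x, b, d in zip(bulk, sig_b, diff))

    # The problem is one-dimensional and convex, so clipping the
    # unconstrained optimum to [0, 1] yields the constrained optimum.
    return min(1.0, max(0.0, numer / denom))


# Hypothetical reference profiles and a noiseless 30% / 70% mixture.
signature_a = [10.0, 0.0, 5.0]
signature_b = [0.0, 8.0, 5.0]
mixture = [0.3 * a + 0.7 * b for a, b in zip(signature_a, signature_b)]
fraction_a = estimate_fraction(mixture, signature_a, signature_b)
```

A complete deconvolution method has to recover both the reference profiles and the fraction from bulk data alone, which is why it needs many bulk samples rather than a single mixture.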
Affiliation(s)
- Kai Kang
  - Biostatistics and Computational Biology Branch, National Institute of Environmental Health Sciences, Durham, North Carolina, United States of America
  - * E-mail: (KK); (LL)
- Qian Meng
  - Biostatistics and Computational Biology Branch, National Institute of Environmental Health Sciences, Durham, North Carolina, United States of America
- Igor Shats
  - Signal Transduction Laboratory, National Institute of Environmental Health Sciences, Durham, North Carolina, United States of America
- David M. Umbach
  - Biostatistics and Computational Biology Branch, National Institute of Environmental Health Sciences, Durham, North Carolina, United States of America
- Melissa Li
  - Biostatistics and Computational Biology Branch, National Institute of Environmental Health Sciences, Durham, North Carolina, United States of America
- Yuanyuan Li
  - Biostatistics and Computational Biology Branch, National Institute of Environmental Health Sciences, Durham, North Carolina, United States of America
- Xiaoling Li
  - Signal Transduction Laboratory, National Institute of Environmental Health Sciences, Durham, North Carolina, United States of America
- Leping Li
  - Biostatistics and Computational Biology Branch, National Institute of Environmental Health Sciences, Durham, North Carolina, United States of America
  - * E-mail: (KK); (LL)