1
Jin YW, Hu P, Liu Q. NNICE: a deep quantile neural network algorithm for expression deconvolution. Sci Rep 2024; 14:14040. [PMID: 38890415 PMCID: PMC11189483 DOI: 10.1038/s41598-024-65053-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/06/2024] [Accepted: 06/17/2024] [Indexed: 06/20/2024] Open
Abstract
Cell-type composition is a key indicator of health. Advancements in bulk gene expression data curation, single-cell RNA-sequencing technologies, and computational deconvolution approaches offer a new way to learn about the composition of different cell types quickly and affordably. In this study, we developed a quantile regression and deep learning-based method called Neural Network Immune Contexture Estimator (NNICE) to estimate cell type abundance and its uncertainty by automatically deconvolving bulk RNA-seq data. The proposed NNICE model successfully recovered ground-truth cell type fraction values given unseen bulk mixture gene expression profiles from the same dataset it was trained on. Compared with baseline methods, NNICE achieved better performance in deconvolving both pseudo-bulk gene expression data (Pearson correlation R = 0.9) and real bulk gene expression data (Pearson correlation R = 0.9) across all cell types. In conclusion, NNICE combines statistical inference with deep learning to provide accurate and interpretable cell type deconvolution from bulk gene expression.
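The abstract above says NNICE uses quantile regression to estimate both an abundance and its uncertainty. The abstract does not give the loss or architecture, so the following is only a minimal sketch of the standard quantile (pinball) loss that quantile regression models commonly minimize; the function name `pinball_loss` and the toy numbers are illustrative assumptions, not taken from the paper.

```python
from typing import Sequence


def pinball_loss(y_true: Sequence[float], y_pred: Sequence[float], tau: float) -> float:
    """Mean quantile (pinball) loss at quantile level tau, with 0 < tau < 1.

    Under-prediction (y > q) is weighted by tau and over-prediction by
    (1 - tau), so minimizing the loss pushes q toward the tau-th quantile.
    """
    if not 0.0 < tau < 1.0:
        raise ValueError("tau must lie strictly between 0 and 1")
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for y, q in zip(y_true, y_pred):
        diff = y - q
        total += max(tau * diff, (tau - 1.0) * diff)
    return total / len(y_true)


# Hypothetical cell-type fractions: one true value, two candidate predictions.
under = pinball_loss([0.30], [0.20], tau=0.9)  # prediction too low
over = pinball_loss([0.30], [0.40], tau=0.9)   # prediction too high
```

At `tau = 0.9` the under-prediction is penalized nine times as heavily as the over-prediction of the same size. Training one output per quantile level (for example 0.1, 0.5, and 0.9) would give a point estimate together with an interval, which is one common way to report uncertainty.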
Affiliation(s)
- Yong Won Jin
- Department of Biochemistry & Medical Genetics, Rady Faculty of Health Sciences, University of Manitoba, Winnipeg, MB, R3E 0J9, Canada
- Pingzhao Hu
- Department of Biochemistry & Medical Genetics, Rady Faculty of Health Sciences, University of Manitoba, Winnipeg, MB, R3E 0J9, Canada
- Department of Biochemistry, Schulich School of Medicine & Dentistry, Western University, London, ON, N6A 5C1, Canada
- Qian Liu
- Department of Applied Computer Science, University of Winnipeg, Winnipeg, MB, R3B 2E9, Canada
2
Tiwari A, Trivedi R, Lin SY. Tumor microenvironment: barrier or opportunity towards effective cancer therapy. J Biomed Sci 2022; 29:83. [PMID: 36253762 PMCID: PMC9575280 DOI: 10.1186/s12929-022-00866-3] [Citation(s) in RCA: 183] [Impact Index Per Article: 61.0] [Received: 04/21/2022] [Accepted: 10/01/2022] [Indexed: 12/24/2022] Open
Abstract
The tumor microenvironment (TME) is a specialized ecosystem of host components, designed by tumor cells for successful development and metastasis of the tumor. With the advent of 3D culture and advanced bioinformatic methodologies, it is now possible to study the TME’s individual components and their interplay at higher resolution. A deeper understanding of immune cell diversity, stromal constituents, repertoire profiling, and neoantigen prediction in TMEs has provided the opportunity to explore the spatial and temporal regulation of immune therapeutic interventions. The variation of TME composition among patients plays an important role in determining responders and non-responders to cancer immunotherapy. Therefore, it may be possible to reprogram TME components to overcome the widely prevailing issue of immunotherapeutic resistance. The focus of the present review is to understand the complexity of the TME and to consider the future prospects of its components as potential therapeutic targets. The latter part of the review describes the sophisticated 3D models emerging as valuable means to study TME components and gives an extensive account of advanced bioinformatic tools to profile TME components and predict neoantigens. Overall, this review provides a comprehensive account of the current knowledge available to target the TME.
Affiliation(s)
- Aadhya Tiwari
- Department of Systems Biology, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
- Rakesh Trivedi
- Department of Translational Molecular Pathology, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
- Shiaw-Yih Lin
- Department of Systems Biology, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
3
Zhang Y, Sun H, Mandava A, Aevermann BD, Kollmann TR, Scheuermann RH, Qiu X, Qian Y. FastMix: a versatile data integration pipeline for cell type-specific biomarker inference. Bioinformatics 2022; 38:4735-4744. [PMID: 36018232 PMCID: PMC9801972 DOI: 10.1093/bioinformatics/btac585] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/26/2022] [Revised: 08/18/2022] [Accepted: 08/25/2022] [Indexed: 01/07/2023] Open
Abstract
Motivation: Flow cytometry (FCM) and transcription profiling are two widely used assays in translational immunology research. However, there is no data integration pipeline for analyzing these two types of assays together with experiment variables for biomarker inference. Current FCM data analysis mainly relies on subjective manual gating analysis, which is difficult to integrate directly with other automated computational methods. Existing deconvolution analysis of bulk transcriptomics relies on predefined marker genes in the transcriptomics data, which are unavailable for novel cell types, and does not utilize the FCM data that provide canonical phenotypic definitions of the cell types. Results: We developed a novel analytics pipeline, FastMix, for computational immunology, which integrates flow cytometry, bulk transcriptomics and clinical covariates for identifying cell type-specific gene expression signatures and biomarker genes. FastMix addresses the 'large p, small n' problem in the gene expression and flow cytometry integration analysis via a linear mixed effects model (LMER) for both cross-sectional and longitudinal studies. Its novel moment-based estimator not only reduces bias in parameter estimation but also is more efficient than iterative optimization. The FastMix pipeline also includes a cutting-edge flow cytometry data analysis method, DAFi, for identifying cell populations of interest and their characteristics. Simulation studies showed that FastMix produced smaller type I/II errors than competing methods. Validation using real data from two vaccine studies showed that FastMix identified a set of signature genes consistent with independent single-cell RNA-seq analysis and produced additional interesting findings. Availability and implementation: Source code of FastMix is publicly available at https://github.com/terrysun0302/FastMix. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Aishwarya Mandava
- Department of Informatics, J. Craig Venter Institute, La Jolla, CA 92037, USA
- Brian D Aevermann
- Department of Informatics, J. Craig Venter Institute, La Jolla, CA 92037, USA
- Tobias R Kollmann
- Systems Vaccinology, Telethon Kids Institute, Perth Children’s Hospital, University of Western Australia, Nedlands, WA 6009, Australia
- Richard H Scheuermann
- Department of Informatics, J. Craig Venter Institute, La Jolla, CA 92037, USA; Department of Pathology, University of California, San Diego, La Jolla, CA 92093, USA
- Xing Qiu
- To whom correspondence should be addressed
- Yu Qian
- To whom correspondence should be addressed
4
McDonald RC. Development of a pO2-Guided Fine Needle Tumor Biopsy Device. J Med Device 2022; 16:021003. [PMID: 35154556 PMCID: PMC8822461 DOI: 10.1115/1.4052900] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 03/11/2021] [Revised: 10/24/2021] [Indexed: 10/10/2023] Open
Abstract
Tumor biopsies are an important aspect of oncology, providing a guide for medical treatment and evaluation of disease progression. Highly heterogeneous tumors have complex regions of active cancer cells interdigitated with necrotic tissue and healthy noncancerous tissue. Reliable access to tumor tissue pathology is therefore challenging and usually requires multiple needle insertions with accompanying patient discomfort and risk of infection. Oxygen levels provide a means of detecting and evaluating tumor tissue, with levels reduced by 2-fold to 22-fold, depending on the type of organ. However, if the biopsy needle is placed in an area of normal tissue, there is always a chance that no diagnostic cells will be acquired for meaningful pathology and molecular analysis. While not the case in all tumors, there are cases where the in vivo oxygen levels differ, with tumor cells having a pO2 value lying between that of anoxic necrotic tissue and normoxic normal tissue. The level of oxygen in tumor cells can also vary with time as related to complex biochemical pathways. The efficacy of radiation therapy is also sensitive to oxygen levels in tumors: lower levels of oxygen present greater resistance to treatment. To address these concerns, a pO2-guided biopsy needle (OGBN) was developed to determine oxygen levels and fluctuations in highly resolved regions of tumors, in order to aid in determining the optimal region for cell sampling and to help in determining medical treatment options.
5
Ma J, Tran G, Wan AMD, Young EWK, Kumacheva E, Iscove NN, Zandstra PW. Microdroplet-based one-step RT-PCR for ultrahigh throughput single-cell multiplex gene expression analysis and rare cell detection. Sci Rep 2021; 11:6777. [PMID: 33762663 PMCID: PMC7990930 DOI: 10.1038/s41598-021-86087-4] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 12/03/2020] [Accepted: 03/10/2021] [Indexed: 01/31/2023] Open
Abstract
Gene expression analysis of individual cells enables characterization of heterogeneous and rare cell populations, yet widespread implementation of existing single-cell gene analysis techniques has been hindered due to limitations in scale, ease, and cost. Here, we present a novel microdroplet-based, one-step reverse-transcriptase polymerase chain reaction (RT-PCR) platform and demonstrate the detection of three targets simultaneously in over 100,000 single cells in a single experiment with a rapid read-out. Our customized reagent cocktail incorporates the bacteriophage T7 gene 2.5 protein to overcome cell lysate-mediated inhibition and allows for one-step RT-PCR of single cells encapsulated in nanoliter droplets. Fluorescent signals indicative of gene expressions are analyzed using a probabilistic deconvolution method to account for ambient RNA and cell doublets and produce single-cell gene signature profiles, as well as predict cell frequencies within heterogeneous samples. We also developed a simulation model to guide experimental design and optimize the accuracy and precision of the assay. Using mixtures of in vitro transcripts and murine cell lines, we demonstrated the detection of single RNA molecules and rare cell populations at a frequency of 0.1%. This low cost, sensitive, and adaptable technique will provide an accessible platform for high throughput single-cell analysis and enable a wide range of research and clinical applications.
Affiliation(s)
- Jennifer Ma
- Institute of Biomedical Engineering, University of Toronto, Toronto, ON, M5S 3G9, Canada
- Gary Tran
- Department of Medical Biophysics, University of Toronto, Toronto, ON, M5G 1L7, Canada
- Alwin M D Wan
- Department of Mechanical and Industrial Engineering, University of Toronto, Toronto, ON, M5S 3G8, Canada
- Edmond W K Young
- Institute of Biomedical Engineering, University of Toronto, Toronto, ON, M5S 3G9, Canada
- Department of Mechanical and Industrial Engineering, University of Toronto, Toronto, ON, M5S 3G8, Canada
- Eugenia Kumacheva
- Institute of Biomedical Engineering, University of Toronto, Toronto, ON, M5S 3G9, Canada
- Department of Mechanical and Industrial Engineering, University of Toronto, Toronto, ON, M5S 3G8, Canada
- Department of Chemistry, University of Toronto, Toronto, ON, M5S 3H6, Canada
- Norman N Iscove
- Department of Medical Biophysics, University of Toronto, Toronto, ON, M5G 1L7, Canada
- Princess Margaret Cancer Centre, University Health Network, Toronto, ON, M5G 1L7, Canada
- Peter W Zandstra
- School of Biomedical Engineering, University of British Columbia, 2222 Health Sciences Mall, Vancouver, BC, V6T 1Z3, Canada
- Michael Smith Laboratories, University of British Columbia, Vancouver, BC, V6T 1Z4, Canada
6
Dong L, Kollipara A, Darville T, Zou F, Zheng X. Semi-CAM: A semi-supervised deconvolution method for bulk transcriptomic data with partial marker gene information. Sci Rep 2020; 10:5434. [PMID: 32214192 PMCID: PMC7096458 DOI: 10.1038/s41598-020-62330-2] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 08/23/2019] [Accepted: 02/26/2020] [Indexed: 01/03/2023] Open
Abstract
Deconvolution of bulk transcriptomics data from mixed cell populations is vital to identify the cellular mechanism of complex diseases. Existing deconvolution approaches can be divided into two major groups: supervised and unsupervised methods. Supervised deconvolution methods use cell type-specific prior information including cell proportions, reference cell type-specific gene signatures, or marker genes for each cell type, which may not be available in practice. Unsupervised methods, such as non-negative matrix factorization (NMF) and Convex Analysis of Mixtures (CAM), in contrast, completely disregard prior information and thus are not efficient for data with partial cell type-specific information. In this paper, we propose a semi-supervised deconvolution method, semi-CAM, that extends CAM by utilizing marker information from partial cell types. Analyses of simulated data and two benchmark datasets have demonstrated that semi-CAM outperforms CAM by yielding more accurate cell proportion estimations when markers from partial/all cell types are available. In addition, when markers from all cell types are available, semi-CAM achieves better or similar accuracy compared to the supervised method using signature genes, CIBERSORT, and the marker-based supervised methods semi-NMF and DSA. Furthermore, analysis of human chlamydia-infection data with bulk expression profiles from six cell types and prior marker information of only three cell types suggests that semi-CAM achieves more accurate cell proportion estimations than CAM.
Affiliation(s)
- Li Dong
- Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, USA
- Avinash Kollipara
- Department of Pediatrics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, USA
- Toni Darville
- Department of Pediatrics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, USA
- Fei Zou
- Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, USA
- Xiaojing Zheng
- Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, USA
- Department of Pediatrics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, USA
7
Kang K, Meng Q, Shats I, Umbach DM, Li M, Li Y, Li X, Li L. CDSeq: A novel complete deconvolution method for dissecting heterogeneous samples using gene expression data. PLoS Comput Biol 2019; 15:e1007510. [PMID: 31790389 PMCID: PMC6907860 DOI: 10.1371/journal.pcbi.1007510] [Citation(s) in RCA: 38] [Impact Index Per Article: 6.3] [Received: 07/03/2019] [Revised: 12/12/2019] [Accepted: 10/25/2019] [Indexed: 11/18/2022] Open
Abstract
Quantifying cell-type proportions and their corresponding gene expression profiles in tissue samples would enhance understanding of the contributions of individual cell types to the physiological states of the tissue. Current approaches that address tissue heterogeneity have drawbacks. Experimental techniques, such as fluorescence-activated cell sorting and single-cell RNA sequencing, are expensive. Computational approaches that use expression data from heterogeneous samples are promising, but most of the current methods estimate either cell-type proportions or cell-type-specific expression profiles by requiring the other as input. Although such partial deconvolution methods have been successfully applied to tumor samples, the additional input required may be unavailable. We introduce a novel complete deconvolution method, CDSeq, that uses only RNA-Seq data from bulk tissue samples to simultaneously estimate both cell-type proportions and cell-type-specific expression profiles. Using several synthetic and real experimental datasets with known cell-type composition and cell-type-specific expression profiles, we compared CDSeq’s complete deconvolution performance with seven other established deconvolution methods. Complete deconvolution using CDSeq represents a substantial technical advance over partial deconvolution approaches and will be useful for studying cell mixtures in tissue samples. CDSeq is available in a GitHub repository (MATLAB and Octave code): https://github.com/kkang7/CDSeq.
Understanding the cellular composition of bulk tissues is critical to investigate the underlying mechanisms of many biological processes. Single-cell sequencing is a promising technique; however, it is expensive and the analysis of single-cell data is non-trivial. Therefore, tissue samples are still routinely processed in bulk. To estimate cell-type composition using bulk gene expression data, computational deconvolution methods are needed. Many deconvolution methods have been proposed; however, they often estimate only cell type proportions using a reference cell type gene expression profile, which in many cases may not be available. We present a novel complete deconvolution method that uses only bulk gene expression data to simultaneously estimate cell-type-specific gene expression profiles and sample-specific cell-type proportions. We showed that, using multiple RNA-Seq and microarray datasets where the cell-type composition was previously known, our method could accurately determine the cell-type composition. By providing a method that requires a single input to determine both cell-type proportion and cell-type-specific expression profiles, we expect that our method will be beneficial to biologists and facilitate the research and identification of mechanisms underlying many biological processes.
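The distinction between partial and complete deconvolution drawn in this abstract can be written down with a generic linear mixing model. The notation below ($X$, $S$, $P$, $G$, $N$, $K$) is an assumption introduced here for illustration; the abstract does not describe CDSeq's own model, so this should not be read as its formulation.

```latex
% Generic linear mixing model (illustrative notation, not CDSeq's model):
%   X : G x N bulk expression matrix (G genes, N samples)
%   S : G x K cell-type-specific expression profiles (K cell types)
%   P : K x N cell-type proportions, one column per sample
\[
  X \approx S P, \qquad
  P_{kn} \ge 0, \qquad
  \sum_{k=1}^{K} P_{kn} = 1 \quad \text{for each sample } n .
\]
% Partial deconvolution: one factor is supplied as input, e.g.
\[
  \hat{P} = \arg\min_{P \ge 0} \; \lVert X - S P \rVert_F^2
  \quad \text{with } S \text{ given.}
\]
% Complete deconvolution: both factors are estimated from X alone,
\[
  (\hat{S}, \hat{P}) = \arg\min_{S \ge 0,\; P \ge 0} \; \lVert X - S P \rVert_F^2 .
\]
```

Without further constraints the complete problem does not have a unique solution (any invertible rescaling of $S$ and $P$ that preserves non-negativity fits equally well), which is one reason complete methods need additional structure or priors.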
Affiliation(s)
- Kai Kang
- Biostatistics and Computational Biology Branch, National Institute of Environmental Health Sciences, Durham, North Carolina, United States of America
- * E-mail: (KK); (LL)
- Qian Meng
- Biostatistics and Computational Biology Branch, National Institute of Environmental Health Sciences, Durham, North Carolina, United States of America
- Igor Shats
- Signal Transduction Laboratory, National Institute of Environmental Health Sciences, Durham, North Carolina, United States of America
- David M. Umbach
- Biostatistics and Computational Biology Branch, National Institute of Environmental Health Sciences, Durham, North Carolina, United States of America
- Melissa Li
- Biostatistics and Computational Biology Branch, National Institute of Environmental Health Sciences, Durham, North Carolina, United States of America
- Yuanyuan Li
- Biostatistics and Computational Biology Branch, National Institute of Environmental Health Sciences, Durham, North Carolina, United States of America
- Xiaoling Li
- Signal Transduction Laboratory, National Institute of Environmental Health Sciences, Durham, North Carolina, United States of America
- Leping Li
- Biostatistics and Computational Biology Branch, National Institute of Environmental Health Sciences, Durham, North Carolina, United States of America
- * E-mail: (KK); (LL)
8
Avila Cobos F, Vandesompele J, Mestdagh P, De Preter K. Computational deconvolution of transcriptomics data from mixed cell populations. Bioinformatics 2019; 34:1969-1979. [PMID: 29351586 DOI: 10.1093/bioinformatics/bty019] [Citation(s) in RCA: 146] [Impact Index Per Article: 24.3] [Received: 08/18/2017] [Accepted: 01/10/2018] [Indexed: 12/22/2022] Open
Abstract
Summary: Gene expression analyses of bulk tissues often ignore cell type composition as an important confounding factor, resulting in a loss of signal from lowly abundant cell types. In this review, we highlight the importance and value of computational deconvolution methods to infer the abundance of different cell types and/or cell type-specific expression profiles in heterogeneous samples without performing physical cell sorting. We also explain the various deconvolution scenarios, the mathematical approaches used to solve them and the effect of data processing and different confounding factors on the accuracy of the deconvolution results. Contact: katleen.depreter@ugent.be. Supplementary information: Supplementary data are available at Bioinformatics online.
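As a concrete instance of inferring cell type abundance from a bulk profile, the sketch below estimates proportions for one sample by non-negative least squares against a reference signature matrix. It is an illustration under stated assumptions, not a method taken from the review: the function name `estimate_proportions`, the signature values, and the choice of projected gradient descent as the solver are all hypothetical.

```python
from typing import List


def estimate_proportions(signature: List[List[float]], bulk: List[float],
                         n_iter: int = 2000) -> List[float]:
    """Estimate cell-type proportions p >= 0 minimizing ||S p - b||^2.

    signature: genes x cell-types matrix S (reference expression profiles).
    bulk:      expression vector b of one mixed sample, one value per gene.
    Uses projected gradient descent, then rescales p to sum to 1.
    """
    n_genes, n_types = len(signature), len(signature[0])
    if len(bulk) != n_genes:
        raise ValueError("bulk must have one value per gene")
    # Precompute G = S^T S and c = S^T b; the gradient is then G p - c.
    gram = [[sum(signature[g][i] * signature[g][j] for g in range(n_genes))
             for j in range(n_types)] for i in range(n_types)]
    c = [sum(signature[g][i] * bulk[g] for g in range(n_genes))
         for i in range(n_types)]
    # The largest absolute row sum bounds the largest eigenvalue of G,
    # so 1 / (that bound) is a safe step size.
    step = 1.0 / max(sum(abs(v) for v in row) for row in gram)
    p = [1.0 / n_types] * n_types
    for _ in range(n_iter):
        grad = [sum(gram[i][j] * p[j] for j in range(n_types)) - c[i]
                for i in range(n_types)]
        # Gradient step followed by projection onto the non-negative orthant.
        p = [max(0.0, p[i] - step * grad[i]) for i in range(n_types)]
    total = sum(p)
    return [v / total for v in p] if total > 0 else p


# Made-up signature: 4 genes x 3 cell types (three marker-like genes, one shared).
S = [[10.0, 1.0, 0.0],
     [0.0, 8.0, 1.0],
     [1.0, 0.0, 9.0],
     [5.0, 5.0, 5.0]]
true_p = [0.6, 0.3, 0.1]
# Noise-free synthetic bulk sample: b = S p.
b = [sum(S[g][k] * true_p[k] for k in range(3)) for g in range(4)]
estimated = estimate_proportions(S, b)
```

On this noise-free toy input the estimate should match `true_p` closely; with real data, normalization, gene selection, and noise all affect accuracy, which is the kind of confounding the review discusses.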
Affiliation(s)
- Francisco Avila Cobos
- Center for Medical Genetics Ghent (CMGG), Ghent University, 9000 Ghent, Belgium; Cancer Research Institute Ghent (CRIG), 9000 Ghent, Belgium; Bioinformatics Institute Ghent from Nucleotides to Networks (BIG N2N), 9000 Ghent, Belgium
- Jo Vandesompele
- Center for Medical Genetics Ghent (CMGG), Ghent University, 9000 Ghent, Belgium; Cancer Research Institute Ghent (CRIG), 9000 Ghent, Belgium; Bioinformatics Institute Ghent from Nucleotides to Networks (BIG N2N), 9000 Ghent, Belgium
- Pieter Mestdagh
- Center for Medical Genetics Ghent (CMGG), Ghent University, 9000 Ghent, Belgium; Cancer Research Institute Ghent (CRIG), 9000 Ghent, Belgium; Bioinformatics Institute Ghent from Nucleotides to Networks (BIG N2N), 9000 Ghent, Belgium
- Katleen De Preter
- Center for Medical Genetics Ghent (CMGG), Ghent University, 9000 Ghent, Belgium; Cancer Research Institute Ghent (CRIG), 9000 Ghent, Belgium; Bioinformatics Institute Ghent from Nucleotides to Networks (BIG N2N), 9000 Ghent, Belgium
9
Wei S, Zang J, Jia Y, Chen A, Xie Y, Huang J, Li Z, Nie G, Liu H, Liu F, Gao W. A Gene-Related Nomogram for Preoperative Prediction of Lymph Node Metastasis in Colorectal Cancer. J Invest Surg 2019; 33:715-722. [PMID: 30907189 DOI: 10.1080/08941939.2019.1569738] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Indexed: 12/24/2022]
Abstract
Purpose: To develop and validate a gene-related nomogram for predicting the risk of lymph node (LN) metastasis preoperatively in patients with colorectal cancer (CRC). Methods: RNA-seq data of 581 CRC and 51 normal cases with clinical features were downloaded from the TCGA database. In the evaluation cohort of 381 CRC cases, LASSO regression was used to reduce the dimensionality of the extracted gene signatures and build a gene score. A gene-related nomogram was constructed based on multivariable logistic regression analysis. The performance of the nomogram was assessed by discrimination, calibration, and clinical usefulness in both the evaluation cohort and the validation cohort of 200 CRC cases. Results: A total of 12,590 differentially expressed genes were selected, of which 59 candidates associated with LN metastasis were screened by LASSO to form the gene score. Based on the multivariable logistic regression analysis, the gene-related nomogram showed good calibration and discrimination not only in the evaluation cohort (concordance-index 0.93; 95%CI 0.91-0.96), but also in the validation cohort (concordance-index 0.70; 95%CI 0.63-0.78). The decision curve analysis of the gene-related nomogram also provides constructive guidance for preoperative design of the operation plan. Conclusions: The presented gene-related nomogram may predict LN metastasis in CRC patients preoperatively. In addition, 59 hub genes related to LN metastasis of CRC were defined, which may serve as treatment targets in further studies. Preoperative biopsy and gene analysis are needed to develop the operation plan in clinical practice.
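The abstract reports discrimination as a concordance index. For a binary outcome such as LN metastasis, the concordance index is the fraction of (positive, negative) case pairs in which the positive case receives the higher predicted risk, with ties counted as one half. The sketch below computes it directly from that definition; the function name `concordance_index` and the example scores are illustrative assumptions, not data from the study.

```python
from typing import Sequence


def concordance_index(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Concordance index for a binary outcome (labels are 0 or 1).

    Counts the share of positive/negative pairs where the positive case
    has the higher score; tied scores contribute 0.5.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have equal length")
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    concordant = 0.0
    for sp in positives:
        for sn in negatives:
            if sp > sn:
                concordant += 1.0
            elif sp == sn:
                concordant += 0.5
    return concordant / (len(positives) * len(negatives))


# Hypothetical predicted risks for four patients; the first two have LN metastasis.
c_index = concordance_index([0.9, 0.4, 0.6, 0.2], [1, 1, 0, 0])
```

Here three of the four positive/negative pairs are ordered correctly, so `c_index` is 0.75. A value of 0.5 corresponds to random ordering and 1.0 to perfect separation, which gives a scale for reading the 0.93 and 0.70 reported for the two cohorts.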
Affiliation(s)
- Shuxun Wei
- The First Department of General Surgery, Changzheng Hospital, Second Military Medical University, Shanghai, China
- Jia Zang
- The First Department of General Surgery, Changzheng Hospital, Second Military Medical University, Shanghai, China
- Youpeng Jia
- General Surgery Department, Dalian Municipal Center Hospital, Liaoning Province, Dalian, China
- Aona Chen
- The First Department of General Surgery, Changzheng Hospital, Second Military Medical University, Shanghai, China
- Yayun Xie
- The First Department of General Surgery, Changzheng Hospital, Second Military Medical University, Shanghai, China
- Jian Huang
- The Third Department of Hepatic Surgery, Eastern Hepatobiliary Surgery Hospital, Second Military Medical University, Shanghai, China
- Zheng Li
- The Third Department of Hepatic Surgery, Eastern Hepatobiliary Surgery Hospital, Second Military Medical University, Shanghai, China
- Gang Nie
- The Department of Hepatobiliary Pancreatic Surgery, Changhai Hospital, Second Military Medical University, Shanghai, China
- Hui Liu
- The Third Department of Hepatic Surgery, Eastern Hepatobiliary Surgery Hospital, Second Military Medical University, Shanghai, China
- Fuchen Liu
- The Third Department of Hepatic Surgery, Eastern Hepatobiliary Surgery Hospital, Second Military Medical University, Shanghai, China
- Wenchao Gao
- The First Department of General Surgery, Changzheng Hospital, Second Military Medical University, Shanghai, China
10
Roman T, Xie L, Schwartz R. Automated deconvolution of structured mixtures from heterogeneous tumor genomic data. PLoS Comput Biol 2017; 13:e1005815. [PMID: 29059177 PMCID: PMC5695636 DOI: 10.1371/journal.pcbi.1005815] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.0] [Received: 03/09/2017] [Revised: 11/02/2017] [Accepted: 10/10/2017] [Indexed: 11/23/2022] Open
Abstract
With increasing appreciation for the extent and importance of intratumor heterogeneity, much attention in cancer research has focused on profiling heterogeneity on a single patient level. Although true single-cell genomic technologies are rapidly improving, they remain too noisy and costly at present for population-level studies. Bulk sequencing remains the standard for population-scale tumor genomics, creating a need for computational tools to separate contributions of multiple tumor clones and assorted stromal and infiltrating cell populations to pooled genomic data. All such methods are limited to coarse approximations of only a few cell subpopulations, however. In prior work, we demonstrated the feasibility of improving cell type deconvolution by taking advantage of substructure in genomic mixtures via a strategy called simplicial complex unmixing. We improve on past work by introducing enhancements to automate learning of substructured genomic mixtures, with specific emphasis on genome-wide copy number variation (CNV) data, as well as the ability to process quantitative RNA expression data, and heterogeneous combinations of RNA and CNV data. We introduce methods for dimensionality estimation to better decompose mixture model substructure; fuzzy clustering to better identify substructure in sparse, noisy data; and automated model inference methods for other key model parameters. We further demonstrate their effectiveness in identifying mixture substructure in true breast cancer CNV data from the Cancer Genome Atlas (TCGA). Source code is available at https://github.com/tedroman/WSCUnmix.
Affiliation(s)
- Theodore Roman
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania, United States of America
- Lu Xie
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania, United States of America
- Russell Schwartz
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania, United States of America
- Biological Sciences Department, Mellon College of Science, Carnegie Mellon University, Pittsburgh, Pennsylvania, United States of America
11
Complex Sources of Variation in Tissue Expression Data: Analysis of the GTEx Lung Transcriptome. Am J Hum Genet 2016; 99:624-635. [PMID: 27588449 DOI: 10.1016/j.ajhg.2016.07.007] [Citation(s) in RCA: 53] [Impact Index Per Article: 5.9] [Received: 01/28/2016] [Accepted: 07/08/2016] [Indexed: 01/10/2023] Open
Abstract
The sources of gene expression variability in human tissues are thought to be a complex interplay of technical, compositional, and disease-related factors. To better understand these contributions, we investigated expression variability in a relatively homogeneous tissue expression dataset from the Genotype-Tissue Expression (GTEx) resource. In addition to identifying technical sources, such as sequencing date and post-mortem interval, we also identified several biological sources of variation. An in-depth analysis of the 175 genes with the greatest variation among 133 lung tissue samples identified five distinct clusters of highly correlated genes. One large cluster included surfactant genes (SFTPA1, SFTPA2, and SFTPC), which are expressed exclusively in type II pneumocytes, cells that proliferate in ventilator associated lung injury. High surfactant expression was strongly associated with death on a ventilator and type II pneumocyte hyperplasia. A second large cluster included dynein (DNAH9 and DNAH12) and mucin (MUC5B and MUC16) genes, which are exclusive to the respiratory epithelium and goblet cells of bronchial structures. This indicates heterogeneous bronchiole sampling due to the harvesting location in the lung. A small cluster included acute-phase reactant genes (SAA1, SAA2, and SAA2-SAA4). The final two small clusters were technical and gender related. To summarize, in a collection of normal lung samples, we found that tissue heterogeneity caused by harvesting location (medial or lateral lung) and late therapeutic intervention (mechanical ventilation) were major contributors to expression variation. These unexpected sources of variation were the result of altered cell ratios in the tissue samples, an underappreciated source of expression variation.
12
Wang M, Tsai TH, Di Poto C, Ferrarini A, Yu G, Ressom HW. Topic model-based mass spectrometric data analysis in cancer biomarker discovery studies. BMC Genomics 2016; 17 Suppl 4:545. [PMID: 27535232 PMCID: PMC5001243 DOI: 10.1186/s12864-016-2796-x] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/19/2022] Open
Abstract
Background A fundamental challenge in quantitation of biomolecules for cancer biomarker discovery arises from the heterogeneous nature of human biospecimens. Although this issue has been a subject of discussion in cancer genomic studies, it has not yet been rigorously investigated in mass spectrometry based proteomic and metabolomic studies. Purification of mass spectrometric data is highly desired prior to subsequent analysis, e.g., quantitative comparison of the abundance of biomolecules in biological samples. Methods We investigated topic models to computationally analyze mass spectrometric data considering both integrated peak intensities and scan-level features, i.e., extracted ion chromatograms (EICs). Probabilistic generative models enable flexible representation in data structure and infer sample-specific pure resources. Scan-level modeling helps alleviate information loss during data preprocessing. We evaluated the capability of the proposed models in capturing mixture proportions of contaminants and cancer profiles on LC-MS based serum proteomic and GC-MS based tissue metabolomic datasets acquired from patients with hepatocellular carcinoma (HCC) and liver cirrhosis as well as synthetic data we generated based on the serum proteomic data. Results The results we obtained by analysis of the synthetic data demonstrated that both intensity-level and scan-level purification models can accurately infer the mixture proportions and the underlying true cancerous sources with small average error ratios (<7 %) between estimation and ground truth. By applying the topic model-based purification to mass spectrometric data, we found more proteins and metabolites with significant changes between HCC cases and cirrhotic controls. Candidate biomarkers selected after purification yielded biologically meaningful pathway analysis results and improved disease discrimination power in terms of the area under ROC curve compared to the results found prior to purification. Conclusions We investigated topic model-based inference methods to computationally address the heterogeneity issue in samples analyzed by LC/GC-MS. We observed that incorporation of scan-level features has the potential to lead to more accurate purification results by alleviating the loss in information as a result of integrating peaks. We believe cancer biomarker discovery studies that use mass spectrometric analysis of human biospecimens can greatly benefit from topic model-based purification of the data prior to statistical and pathway analyses.
Affiliation(s)
- Minkun Wang
- Department of Oncology, Georgetown University, 4000 Reservoir Rd NW, Washington D.C., USA; Department of Electrical and Computer Engineering, Virginia Tech, 900 N Glebe Rd, Arlington, VA, USA
| | - Tsung-Heng Tsai
- Department of Oncology, Georgetown University, 4000 Reservoir Rd NW, Washington D.C., USA
| | - Cristina Di Poto
- Department of Oncology, Georgetown University, 4000 Reservoir Rd NW, Washington D.C., USA
| | - Alessia Ferrarini
- Department of Oncology, Georgetown University, 4000 Reservoir Rd NW, Washington D.C., USA
| | - Guoqiang Yu
- Department of Electrical and Computer Engineering, Virginia Tech, 900 N Glebe Rd, Arlington, VA, USA
| | - Habtom W Ressom
- Department of Oncology, Georgetown University, 4000 Reservoir Rd NW, Washington D.C., USA.
|
13
|
Cui A, Quon G, Rosenberg AM, Yeung RSM, Morris Q, BBOP Study Consortium. Gene Expression Deconvolution for Uncovering Molecular Signatures in Response to Therapy in Juvenile Idiopathic Arthritis. PLoS One 2016; 11:e0156055. [PMID: 27244050 PMCID: PMC4887077 DOI: 10.1371/journal.pone.0156055] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.9] [Received: 08/19/2015] [Accepted: 05/09/2016] [Indexed: 01/10/2023] Open
Abstract
Gene expression-based signatures help identify pathways relevant to diseases and treatments, but are challenging to construct when there is a diversity of disease mechanisms and treatments in patients with complex diseases. To overcome this challenge, we present a new application of an in silico gene expression deconvolution method, ISOpure-S1, and apply it to identify a common gene expression signature corresponding to response to treatment in 33 juvenile idiopathic arthritis (JIA) patients. Using pre- and post-treatment gene expression profiles only, we found a gene expression signature that significantly correlated with a reduction in the number of joints with active arthritis, a measure of clinical outcome (Spearman rho = 0.44, p = 0.040, Bonferroni correction). This signature may be associated with a decrease in T-cells, monocytes, neutrophils and platelets. The products of most differentially expressed genes include known biomarkers for JIA such as major histocompatibility complexes and interleukins, as well as novel biomarkers including α-defensins. This method is readily applicable to expression datasets of other complex diseases to uncover shared mechanistic patterns in heterogeneous samples.
Affiliation(s)
- Ang Cui
- Division of Engineering Science, University of Toronto, Toronto, ON, Canada
| | - Gerald Quon
- Department of Computer Science, University of Toronto, Toronto, ON, Canada
| | - Alan M. Rosenberg
- Department of Pediatrics, Division of Rheumatology, University of Saskatchewan, Saskatoon, SK, Canada
| | - Rae S. M. Yeung
- Divisions of Rheumatology and Cell Biology, The Hospital for Sick Children, Toronto, ON, Canada
- Departments of Paediatrics, Immunology and Medical Sciences, University of Toronto, Toronto, ON, Canada
| | - Quaid Morris
- Department of Computer Science, University of Toronto, Toronto, ON, Canada
- The Donnelly Centre, University of Toronto, Toronto, ON, Canada
- Department of Molecular Genetics, University of Toronto, Toronto, ON, Canada
|
14
|
Abstract
Despite the enormous medical impact of cancers and intensive study of their biology, detailed characterization of tumor growth and development remains elusive. This difficulty occurs in large part because of enormous heterogeneity in the molecular mechanisms of cancer progression, both tumor-to-tumor and cell-to-cell in single tumors. Advances in genomic technologies, especially at the single-cell level, are improving the situation, but these approaches are held back by limitations of the biotechnologies for gathering genomic data from heterogeneous cell populations and the computational methods for making sense of those data. One popular way to gain the advantages of whole-genome methods without the cost of single-cell genomics has been the use of computational deconvolution (unmixing) methods to reconstruct clonal heterogeneity from bulk genomic data. These methods, too, are limited by the difficulty of inferring genomic profiles of rare or subtly varying clonal subpopulations from bulk data, a problem that can be computationally reduced to that of reconstructing the geometry of point clouds of tumor samples in a genome space. Here, we present a new method to improve that reconstruction by better identifying subspaces corresponding to tumors produced from mixtures of distinct combinations of clonal subpopulations. We develop a nonparametric clustering method based on medoidshift clustering for identifying subgroups of tumors expected to correspond to distinct trajectories of evolutionary progression. We show on synthetic and real tumor copy-number data that this new method substantially improves our ability to resolve discrete tumor subgroups, a key step in the process of accurately deconvolving tumor genomic data and inferring clonal heterogeneity from bulk data.
Affiliation(s)
- Theodore Roman
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, 5000 Forbes Ave, Pittsburgh, 15213, PA, USA; Joint Carnegie Mellon/University of Pittsburgh Ph.D. Program in Computational Biology, 5000 Forbes Ave, Pittsburgh, 15213, PA, USA.
| | - Lu Xie
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, 5000 Forbes Ave, Pittsburgh, 15213, PA, USA; Joint Carnegie Mellon/University of Pittsburgh Ph.D. Program in Computational Biology, 5000 Forbes Ave, Pittsburgh, 15213, PA, USA.
| | - Russell Schwartz
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, 5000 Forbes Ave, Pittsburgh, 15213, PA, USA; Department of Biological Sciences, Mellon College of Science, Carnegie Mellon University, 4400 Fifth Avenue, Pittsburgh, 15213, PA, USA.
|
15
|
Parsons J, Munro S, Pine PS, McDaniel J, Mehaffey M, Salit M. Using mixtures of biological samples as process controls for RNA-sequencing experiments. BMC Genomics 2015; 16:708. [PMID: 26383878 PMCID: PMC4574543 DOI: 10.1186/s12864-015-1912-7] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.1] [Received: 04/20/2015] [Accepted: 09/09/2015] [Indexed: 12/02/2022] Open
Abstract
Background Genome-scale “-omics” measurements are challenging to benchmark due to the enormous variety of unique biological molecules involved. Mixtures of previously-characterized samples can be used to benchmark repeatability and reproducibility using component proportions as truth for the measurement. We describe and evaluate experiments characterizing the performance of RNA-sequencing (RNA-Seq) measurements, and discuss cases where mixtures can serve as effective process controls. Results We apply a linear model to total RNA mixture samples in RNA-seq experiments. This model provides a context for performance benchmarking. The parameters of the model fit to experimental results can be evaluated to assess bias and variability of the measurement of a mixture. A linear model describes the behavior of mixture expression measures and provides a context for performance benchmarking. Residuals from fitting the model to experimental data can be used as a metric for evaluating the effect that an individual step in an experimental process has on the linear response function and precision of the underlying measurement while identifying signals affected by interference from other sources. Effective benchmarking requires well-defined mixtures, which for RNA-Seq requires knowledge of the post-enrichment ‘target RNA’ content of the individual total RNA components. We demonstrate and evaluate an experimental method suitable for use in genome-scale process control and lay out a method utilizing spike-in controls to determine enriched RNA content of total RNA in samples. Conclusions Genome-scale process controls can be derived from mixtures. These controls relate prior knowledge of individual components to a complex mixture, allowing assessment of measurement performance. The target RNA fraction accounts for differential selection of RNA out of variable total RNA samples. Spike-in controls can be utilized to measure this relationship between target RNA content and input total RNA. Our mixture analysis method also enables estimation of the proportions of an unknown mixture, even when component-specific markers are not previously known, whenever pure components are measured alongside the mixture. Electronic supplementary material The online version of this article (doi:10.1186/s12864-015-1912-7) contains supplementary material, which is available to authorized users.
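As a rough illustration of the mixture analysis described in this abstract, the sketch below estimates the proportion of a two-component mixture by least squares when both pure components are measured alongside the mixture. It assumes the simplest linear model, mixture ≈ p·A + (1 − p)·B, applied to expression values that have already been scaled for target RNA content. The function names and all numbers are hypothetical and are not taken from the paper.

```python
def estimate_mix_fraction(mixture, comp_a, comp_b):
    """Least-squares estimate of p in: mixture ~ p * comp_a + (1 - p) * comp_b."""
    num = 0.0
    den = 0.0
    for m, a, b in zip(mixture, comp_a, comp_b):
        # Rearranged model: (m - b) ~ p * (a - b), a one-parameter regression.
        num += (m - b) * (a - b)
        den += (a - b) ** 2
    if den == 0.0:
        raise ValueError("components are identical; the fraction is not identifiable")
    return num / den


def model_residuals(mixture, comp_a, comp_b, p):
    """Per-gene residuals of the linear mixture model at fraction p."""
    return [m - (p * a + (1 - p) * b) for m, a, b in zip(mixture, comp_a, comp_b)]


# Hypothetical expression values for four genes in two pure components.
comp_a = [10.0, 200.0, 50.0, 0.0]
comp_b = [100.0, 20.0, 50.0, 40.0]

# Simulated noise-free mixture containing 25% of component A.
mixture = [0.25 * a + 0.75 * b for a, b in zip(comp_a, comp_b)]

p_hat = estimate_mix_fraction(mixture, comp_a, comp_b)
resid = model_residuals(mixture, comp_a, comp_b, p_hat)
print(f"estimated fraction of A: {p_hat:.3f}")
```

With real data the residuals would not vanish; under the assumptions above, large per-gene residuals are the kind of signal the abstract proposes using to flag bias or interference in a processing step.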
Affiliation(s)
- Jerod Parsons
- Material Measurement Laboratory, National Institute of Standards and Technology, 100 Bureau Drive, Gaithersburg, MD, 20899, USA; Department of Bioengineering, Stanford University, 443 Via Ortega, Stanford, CA, 94305, USA.
| | - Sarah Munro
- Material Measurement Laboratory, National Institute of Standards and Technology, 100 Bureau Drive, Gaithersburg, MD, 20899, USA; Department of Bioengineering, Stanford University, 443 Via Ortega, Stanford, CA, 94305, USA.
| | - P Scott Pine
- Material Measurement Laboratory, National Institute of Standards and Technology, 100 Bureau Drive, Gaithersburg, MD, 20899, USA; Department of Bioengineering, Stanford University, 443 Via Ortega, Stanford, CA, 94305, USA.
| | - Jennifer McDaniel
- Material Measurement Laboratory, National Institute of Standards and Technology, 100 Bureau Drive, Gaithersburg, MD, 20899, USA.
| | - Michele Mehaffey
- Leidos Biomedical Research Inc., P.O. Box B Bldg 428, Frederick, MD, 21702, USA.
| | - Marc Salit
- Material Measurement Laboratory, National Institute of Standards and Technology, 100 Bureau Drive, Gaithersburg, MD, 20899, USA; Department of Bioengineering, Stanford University, 443 Via Ortega, Stanford, CA, 94305, USA.
|
16
|
Anghel CV, Quon G, Haider S, Nguyen F, Deshwar AG, Morris QD, Boutros PC. ISOpureR: an R implementation of a computational purification algorithm of mixed tumour profiles. BMC Bioinformatics 2015; 16:156. [PMID: 25972088 PMCID: PMC4429941 DOI: 10.1186/s12859-015-0597-x] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.9] [Received: 11/21/2014] [Accepted: 04/27/2015] [Indexed: 01/23/2023] Open
Abstract
Background Tumour samples containing distinct sub-populations of cancer and normal cells present challenges in the development of reproducible biomarkers, as these biomarkers are based on bulk signals from mixed tumour profiles. ISOpure is the only mRNA computational purification method to date that does not require a paired tumour-normal sample, provides a personalized cancer profile for each patient, and has been tested on clinical data. Replacing mixed tumour profiles with ISOpure-preprocessed cancer profiles led to better prognostic gene signatures for lung and prostate cancer. Results To simplify the integration of ISOpure into standard R-based bioinformatics analysis pipelines, the algorithm has been implemented as an R package. The ISOpureR package performs analogously to the original code in estimating the fraction of cancer cells and the patient cancer mRNA abundance profile from tumour samples in four cancer datasets. Conclusions The ISOpureR package estimates the fraction of cancer cells and personalized patient cancer mRNA abundance profile from a mixed tumour profile. This open-source R implementation enables integration into existing computational pipelines, as well as easy testing, modification and extension of the model. Electronic supplementary material The online version of this article (doi:10.1186/s12859-015-0597-x) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Catalina V Anghel
- Informatics and Biocomputing Program, Ontario Institute for Cancer Research, 661 University Avenue, Toronto, Suite 510, M5G 0A3, ON, Canada.
| | - Gerald Quon
- Department of Computer Science, University of Toronto, 10 King's College Road, Room 3303, M5S 3G4, Toronto, ON, Canada.
| | - Syed Haider
- Informatics and Biocomputing Program, Ontario Institute for Cancer Research, 661 University Avenue, Toronto, Suite 510, M5G 0A3, ON, Canada; Department of Oncology, University of Oxford, Old Road Campus Research Building, Roosevelt Drive, Oxford, OX3 7DQ, United Kingdom.
| | - Francis Nguyen
- Informatics and Biocomputing Program, Ontario Institute for Cancer Research, 661 University Avenue, Toronto, Suite 510, M5G 0A3, ON, Canada.
| | - Amit G Deshwar
- Edward S. Rogers Sr. Department of Electrical and Computer Engineering, University of Toronto, 10 King's College, Room SFB540, Toronto, M5S 3G4, ON, Canada.
| | - Quaid D Morris
- Department of Computer Science, University of Toronto, 10 King's College Road, Room 3303, M5S 3G4, Toronto, ON, Canada; Edward S. Rogers Sr. Department of Electrical and Computer Engineering, University of Toronto, 10 King's College, Room SFB540, Toronto, M5S 3G4, ON, Canada; Department of Molecular Genetics, University of Toronto, 1 King's College Circle, Room 4396, Toronto, M4S 1A8, ON, Canada; The Donnelly Centre, 160 College Street, Room 230, Toronto, M5S 3E1, ON, Canada.
| | - Paul C Boutros
- Informatics and Biocomputing Program, Ontario Institute for Cancer Research, 661 University Avenue, Toronto, Suite 510, M5G 0A3, ON, Canada; Department of Medical Biophysics, University of Toronto, 101 College Street, Toronto, M5G 1L7, ON, Canada; Department of Pharmacology and Toxicology, University of Toronto, 1 King's College Circle, Toronto, M5S 1A8, ON, Canada.
|
17
|
Wei IH, Shi Y, Jiang H, Kumar-Sinha C, Chinnaiyan AM. RNA-Seq accurately identifies cancer biomarker signatures to distinguish tissue of origin. Neoplasia 2014; 16:918-27. [PMID: 25425966 PMCID: PMC4240918 DOI: 10.1016/j.neo.2014.09.007] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.1] [Received: 08/12/2014] [Revised: 09/23/2014] [Accepted: 09/23/2014] [Indexed: 12/27/2022] Open
Abstract
Metastatic cancer of unknown primary (CUP) accounts for up to 5% of all new cancer cases, with a 5-year survival rate of only 10%. Accurate identification of tissue of origin would allow for directed, personalized therapies to improve clinical outcomes. Our objective was to use transcriptome sequencing (RNA-Seq) to identify lineage-specific biomarker signatures for the cancer types that most commonly metastasize as CUP (colorectum, kidney, liver, lung, ovary, pancreas, prostate, and stomach). RNA-Seq data of 17,471 transcripts from a total of 3,244 cancer samples across 26 different tissue types were compiled from in-house sequencing data and publicly available International Cancer Genome Consortium and The Cancer Genome Atlas datasets. Robust cancer biomarker signatures were extracted using a 10-fold cross-validation method of log transformation, quantile normalization, transcript ranking by area under the receiver operating characteristic curve, and stepwise logistic regression. The entire algorithm was then repeated with a new set of randomly generated training and test sets, yielding highly concordant biomarker signatures. External validation of the cancer-specific signatures yielded high sensitivity (92.0% ± 3.15%; mean ± standard deviation) and specificity (97.7% ± 2.99%) for each cancer biomarker signature. The overall performance of this RNA-Seq biomarker-generating algorithm yielded an accuracy of 90.5%. In conclusion, we demonstrate a computational model for producing highly sensitive and specific cancer biomarker signatures from RNA-Seq data, generating signatures for the top eight cancer types responsible for CUP to accurately identify tumor origin.
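To make two of the pipeline steps named in this abstract concrete, the sketch below log-transforms expression values and ranks transcripts by the area under the ROC curve for separating one cancer type from the rest. It is a simplified illustration only: quantile normalization, cross-validation and stepwise logistic regression are omitted, the pseudocount of 1 is an assumption, and the transcript names and values are made up.

```python
import math


def log_transform(values, pseudocount=1.0):
    """log2(x + pseudocount); the pseudocount is an assumed choice."""
    return [math.log2(v + pseudocount) for v in values]


def auc(pos, neg):
    """Probability that a random positive exceeds a random negative (ties count 0.5)."""
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def rank_transcripts(expr, labels):
    """Rank transcripts by AUC, highest first.

    expr   : dict mapping transcript name -> expression value per sample
    labels : list of bool, True for samples of the target cancer type
    """
    scores = {}
    for name, values in expr.items():
        logged = log_transform(values)
        pos = [v for v, y in zip(logged, labels) if y]
        neg = [v for v, y in zip(logged, labels) if not y]
        scores[name] = auc(pos, neg)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical data: three target-type samples followed by three other samples.
labels = [True, True, True, False, False, False]
expr = {
    "tx_high": [90.0, 120.0, 80.0, 5.0, 10.0, 2.0],
    "tx_flat": [10.0, 12.0, 11.0, 10.0, 12.0, 11.0],
    "tx_low": [1.0, 2.0, 0.0, 50.0, 60.0, 70.0],
}

ranking = rank_transcripts(expr, labels)
for name, score in ranking:
    print(name, round(score, 2))
```

Because AUC depends only on the ordering of values, the log transform does not change this ranking; it matters for the later regression step, which is not shown here.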
Affiliation(s)
- Iris H Wei
- University of Michigan Department of Surgery, University of Michigan Medical School, Ann Arbor, MI, USA 48109
| | - Yang Shi
- University of Michigan Department of Biostatistics, University of Michigan Medical School, Ann Arbor, MI, USA 48109
| | - Hui Jiang
- University of Michigan Department of Biostatistics, University of Michigan Medical School, Ann Arbor, MI, USA 48109
| | - Chandan Kumar-Sinha
- Michigan Center for Translational Pathology, University of Michigan Medical School, Ann Arbor, MI, USA 48109 ; University of Michigan Department of Pathology, University of Michigan Medical School, Ann Arbor, MI, USA 48109
| | - Arul M Chinnaiyan
- Michigan Center for Translational Pathology, University of Michigan Medical School, Ann Arbor, MI, USA 48109 ; University of Michigan Department of Pathology, University of Michigan Medical School, Ann Arbor, MI, USA 48109 ; Comprehensive Cancer Center, University of Michigan Medical School, Ann Arbor, MI, USA 48109 ; University of Michigan Department of Urology, University of Michigan Medical School, Ann Arbor, MI, USA 48109 ; Howard Hughes Medical Institute, University of Michigan Medical School, Ann Arbor, MI, USA 48109
|
18
|
Clarke B, Clarke J. Estimating the proportions in a mixed sample using transcriptomics. Stat (Int Stat Inst) 2014. [DOI: 10.1002/sta4.65] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 11/08/2022]
Affiliation(s)
- Bertrand Clarke
- Department of Statistics University of Nebraska–Lincoln Lincoln NE 68583 USA
| | - Jennifer Clarke
- Department of Statistics and the Department of Food Science and Technology University of Nebraska‐Lincoln Lincoln NE 68583 USA
|
19
|
Yadav VK, De S. An assessment of computational methods for estimating purity and clonality using genomic data derived from heterogeneous tumor tissue samples. Brief Bioinform 2014; 16:232-41. [PMID: 24562872 DOI: 10.1093/bib/bbu002] [Citation(s) in RCA: 61] [Impact Index Per Article: 5.5] [Indexed: 12/31/2022] Open
Abstract
Solid tumor samples typically contain multiple distinct clonal populations of cancer cells, and also stromal and immune cell contamination. A majority of the cancer genomics and transcriptomics studies do not explicitly consider genetic heterogeneity and impurity, and draw inferences based on mixed populations of cells. Deconvolution of genomic data from heterogeneous samples provides a powerful tool to address this limitation. We discuss several computational tools that enable deconvolution of genomic and transcriptomic data from heterogeneous samples. We also performed a systematic comparative assessment of these tools. If properly used, these tools have the potential to complement single-cell genomics and immunoFISH analyses, and provide novel insights into tumor heterogeneity.
|
20
|
Listgarten J, Stegle O, Morris Q, Brenner SE, Parts L. Personalized medicine: from genotypes and molecular phenotypes towards therapy - session introduction. Pacific Symposium on Biocomputing 2014; 19:224-228. [PMID: 24297549 PMCID: PMC5215523 DOI: 10.1142/9789814583220_0022] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/02/2023]
|
21
|
Deshwar AG, Morris Q. PLIDA: cross-platform gene expression normalization using perturbed topic models. Bioinformatics 2013; 30:956-61. [PMID: 24123674 DOI: 10.1093/bioinformatics/btt574] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.4] [Indexed: 11/14/2022]
Abstract
MOTIVATION Gene expression data are currently collected on a wide range of platforms. Differences between platforms make it challenging to combine and compare data collected on different platforms. We propose a new method of cross-platform normalization that uses topic models to summarize the expression patterns in each dataset before normalizing the topics learned from each dataset using per-gene multiplicative weights. RESULTS This method allows for cross-platform normalization even when samples profiled on different platforms have systematic differences, allows the simultaneous normalization of data from an arbitrary number of platforms and, after suitable training, allows for online normalization of expression data collected individually or in small batches. In addition, our method outperforms existing state-of-the-art platform normalization tools. AVAILABILITY AND IMPLEMENTATION MATLAB code is available at http://morrislab.med.utoronto.ca/plida/.
Affiliation(s)
- Amit G Deshwar
- Edward S. Rogers Sr. Department of Electrical and Computer Engineering, Department of Molecular Genetics, Department of Computer Science and Terrence Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, Ontario M5S 1A1, Canada
|
22
|
Strino F, Parisi F, Micsinai M, Kluger Y. TrAp: a tree approach for fingerprinting subclonal tumor composition. Nucleic Acids Res 2013; 41:e165. [PMID: 23892400 PMCID: PMC3783191 DOI: 10.1093/nar/gkt641] [Citation(s) in RCA: 78] [Impact Index Per Article: 6.5] [Received: 01/24/2013] [Revised: 06/11/2013] [Accepted: 07/02/2013] [Indexed: 01/01/2023] Open
Abstract
Revealing the clonal composition of a single tumor is essential for identifying cell subpopulations with metastatic potential in primary tumors or with resistance to therapies in metastatic tumors. Sequencing technologies provide only an overview of the aggregate of numerous cells. Computational approaches to de-mix a collective signal composed of the aberrations of a mixed cell population of a tumor sample into its individual components are not available. We propose an evolutionary framework for deconvolving data from a single genome-wide experiment to infer the composition, abundance and evolutionary paths of the underlying cell subpopulations of a tumor. We have developed an algorithm (TrAp) for solving this mixture problem. In silico analyses show that TrAp correctly deconvolves mixed subpopulations when the number of subpopulations and the measurement errors are moderate. We demonstrate the applicability of the method using tumor karyotypes and somatic hypermutation data sets. We applied TrAp to Exome-Seq experiment of a renal cell carcinoma tumor sample and compared the mutational profile of the inferred subpopulations to the mutational profiles of single cells of the same tumor. Finally, we deconvolve sequencing data from eight acute myeloid leukemia patients and three distinct metastases of one melanoma patient to exhibit the evolutionary relationships of their subpopulations.
Affiliation(s)
- Francesco Strino
- Department of Pathology, Yale University School of Medicine, New Haven, CT 06520, USA, NYU Center for Health Informatics and Bioinformatics, New York University Langone Medical Center, 227 East 30th Street, New York, NY 10016, USA and Yale Cancer Center, New Haven, CT 06520, USA
| | - Fabio Parisi
- Department of Pathology, Yale University School of Medicine, New Haven, CT 06520, USA, NYU Center for Health Informatics and Bioinformatics, New York University Langone Medical Center, 227 East 30th Street, New York, NY 10016, USA and Yale Cancer Center, New Haven, CT 06520, USA
| | - Mariann Micsinai
- Department of Pathology, Yale University School of Medicine, New Haven, CT 06520, USA, NYU Center for Health Informatics and Bioinformatics, New York University Langone Medical Center, 227 East 30th Street, New York, NY 10016, USA and Yale Cancer Center, New Haven, CT 06520, USA
| | - Yuval Kluger
- Department of Pathology, Yale University School of Medicine, New Haven, CT 06520, USA, NYU Center for Health Informatics and Bioinformatics, New York University Langone Medical Center, 227 East 30th Street, New York, NY 10016, USA and Yale Cancer Center, New Haven, CT 06520, USA
|
23
|
Oesper L, Mahmoody A, Raphael BJ. THetA: inferring intra-tumor heterogeneity from high-throughput DNA sequencing data. Genome Biol 2013; 14:R80. [PMID: 23895164 PMCID: PMC4054893 DOI: 10.1186/gb-2013-14-7-r80] [Citation(s) in RCA: 150] [Impact Index Per Article: 12.5] [Received: 04/26/2013] [Accepted: 07/29/2013] [Indexed: 12/11/2022] Open
Abstract
Tumor samples are typically heterogeneous, containing admixture by normal, non-cancerous cells and one or more subpopulations of cancerous cells. Whole-genome sequencing of a tumor sample yields reads from this mixture, but does not directly reveal the cell of origin for each read. We introduce THetA (Tumor Heterogeneity Analysis), an algorithm that infers the most likely collection of genomes and their proportions in a sample, for the case where copy number aberrations distinguish subpopulations. THetA successfully estimates normal admixture and recovers clonal and subclonal copy number aberrations in real and simulated sequencing data. THetA is available at http://compbio.cs.brown.edu/software/.
|
24
|
Burdick JT, Murray JI. Deconvolution of gene expression from cell populations across the C. elegans lineage. BMC Bioinformatics 2013; 14:204. [PMID: 23800200 PMCID: PMC3704917 DOI: 10.1186/1471-2105-14-204] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 01/23/2013] [Accepted: 06/11/2013] [Indexed: 11/11/2022] Open
Abstract
Background Knowledge of when and in which cells each gene is expressed across multicellular organisms is critical in understanding both gene function and regulation of cell type diversity. However, methods for measuring expression typically involve a trade-off between imaging-based methods, which give the precise location of a limited number of genes, and higher throughput methods such as RNA-seq, which include all genes, but are more limited in their resolution to apply to many tissues. We propose an intermediate method, which estimates expression in individual cells, based on high-throughput measurements of expression from multiple overlapping groups of cells. This approach has particular benefits in organisms such as C. elegans where invariant developmental patterns make it possible to define these overlapping populations of cells at single-cell resolution. Result We implement several methods to deconvolve the gene expression in individual cells from population-level data and determine the accuracy of these estimates on simulated data from the C. elegans embryo. Conclusion These simulations suggest that a high-resolution map of expression in the C. elegans embryo may be possible with expression data from as few as 30 cell populations.
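The idea of recovering per-cell expression from overlapping cell populations can be sketched as a linear system. The example below assumes, purely for illustration, that the measured expression of each population is the sum of the expression in its member cells, and solves for the per-cell values by ordinary least squares with NumPy. The membership matrix and expression values are hypothetical, and the paper evaluates several methods that may differ from this one.

```python
import numpy as np

# Rows = measured cell populations, columns = individual cells.
# A 1 means the cell belongs to that population (hypothetical layout).
membership = np.array(
    [
        [1, 1, 0, 0],
        [0, 1, 1, 0],
        [0, 0, 1, 1],
        [1, 0, 0, 1],
        [1, 1, 1, 1],
        [1, 0, 1, 0],
    ],
    dtype=float,
)

# Hypothetical true expression of one gene in each of the four cells.
true_expr = np.array([5.0, 0.0, 2.0, 8.0])

# Simulated noise-free population measurements under the additive assumption.
measured = membership @ true_expr

# Least-squares solution of membership @ x = measured.
estimate, _, rank, _ = np.linalg.lstsq(membership, measured, rcond=None)

# Expression cannot be negative, so clip small negative values from the solver.
estimate = np.clip(estimate, 0.0, None)

print("matrix rank:", rank)
print("estimated per-cell expression:", np.round(estimate, 3))
```

The per-cell values are uniquely determined only when the membership matrix has full column rank, which is why the choice and number of overlapping populations matters in this kind of design.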
Affiliation(s)
- Joshua T Burdick
- Genomics and Computational Biology Group, University of Pennsylvania, 440 Clinical Research Building, 415 Curie Boulevard, Philadelphia, PA 19104, USA
|
25
|
Oien KA, Dennis JL. Diagnostic work-up of carcinoma of unknown primary: from immunohistochemistry to molecular profiling. Ann Oncol 2013; 23 Suppl 10:x271-7. [PMID: 22987975 DOI: 10.1093/annonc/mds357] [Citation(s) in RCA: 80] [Impact Index Per Article: 6.7] [Indexed: 12/20/2022] Open
Abstract
Carcinoma of unknown primary (CUP) remains a common and challenging clinical problem. The aim of diagnostic work-up in CUP is to classify as specifically as possible the cancer affecting the patient, according to the broad tumour type, subtype and, where possible, site of origin. This classification currently best predicts patient outcome and guides optimal treatment. A stepwise approach to diagnostic work-up is described. Although pathology is based on morphology, the assessment of tissue-specific genes through immunohistochemistry (IHC) substantially helps tumour classification at each diagnostic step. For IHC in CUP, recent improvements include more standardised approaches and marker panels plus new markers. Tissue-specific genes are also being used in CUP work-up through molecular profiling. Large-scale profiles of hundreds of tumours of different types have been generated, compared and used to generate diagnostic algorithms. Commercial tests for CUP classification have been developed at the mRNA and microRNA (miRNA) levels and validated in metastatic tumours and CUPs. While currently optimal pathology and IHC remain the 'gold standard' for CUP diagnostic work-up, and full clinical correlation is vital, the molecular tests appear to perform well: in the main diagnostic challenge of undifferentiated or poorly differentiated tumours, molecular profiling performs as well as or better than IHC.
Affiliation(s)
- K A Oien
- University of Glasgow, Institute of Cancer Sciences, Glasgow, UK.
|
26
|
Quon G, Haider S, Deshwar AG, Cui A, Boutros PC, Morris Q. Computational purification of individual tumor gene expression profiles leads to significant improvements in prognostic prediction. Genome Med 2013; 5:29. [PMID: 23537167 PMCID: PMC3706990 DOI: 10.1186/gm433] [Citation(s) in RCA: 72] [Impact Index Per Article: 6.0] [Received: 09/13/2012] [Accepted: 03/28/2013] [Indexed: 11/10/2022] Open
Abstract
Tumor heterogeneity is a limiting factor in cancer treatment and in the discovery of biomarkers to personalize it. We describe a computational purification tool, ISOpure, to directly address the effects of variable normal tissue contamination in clinical tumor specimens. ISOpure uses a set of tumor expression profiles and a panel of healthy tissue expression profiles to generate a purified cancer profile for each tumor sample and an estimate of the proportion of RNA originating from cancerous cells. Applying ISOpure before identifying gene signatures leads to significant improvements in the prediction of prognosis and other clinical variables in lung and prostate cancer.
|
27
|
Zhong Y, Wan YW, Pang K, Chow LML, Liu Z. Digital sorting of complex tissues for cell type-specific gene expression profiles. BMC Bioinformatics 2013; 14:89. [PMID: 23497278 PMCID: PMC3626856 DOI: 10.1186/1471-2105-14-89] [Citation(s) in RCA: 149] [Impact Index Per Article: 12.4] [Received: 09/13/2012] [Accepted: 02/14/2013] [Indexed: 11/29/2022] Open
Abstract
Background: Cellular heterogeneity is present in almost all gene expression profiles. However, transcriptome analysis of tissue specimens often ignores the cellular heterogeneity present in these samples. Standard deconvolution algorithms require prior knowledge of the cell type frequencies within a tissue or their in vitro expression profiles. Furthermore, these algorithms tend to report biased estimations. Results: Here, we describe a Digital Sorting Algorithm (DSA) for extracting cell-type specific gene expression profiles from mixed tissue samples that is unbiased and does not require prior knowledge of cell type frequencies. Conclusions: The results suggest that DSA is a specific and sensitive algorithm for gene expression profile deconvolution and will be useful in studying individual cell types of complex tissues.
Affiliation(s)
- Yi Zhong
- Department of Pediatrics, Neurological Research Institute, Baylor College of Medicine, Houston, TX, USA
|
28
|
Gong T, Szustakowski JD. DeconRNASeq: a statistical framework for deconvolution of heterogeneous tissue samples based on mRNA-Seq data. Bioinformatics 2013; 29:1083-5. [PMID: 23428642 DOI: 10.1093/bioinformatics/btt090] [Citation(s) in RCA: 179] [Impact Index Per Article: 14.9] [Indexed: 11/14/2022]
Abstract
Summary: For heterogeneous tissues, measurements of gene expression through mRNA-Seq data are confounded by relative proportions of cell types involved. In this note, we introduce an efficient pipeline: DeconRNASeq, an R package for deconvolution of heterogeneous tissues based on mRNA-Seq data. It adopts a globally optimized non-negative decomposition algorithm through quadratic programming for estimating the mixing proportions of distinctive tissue types in next-generation sequencing data. We demonstrated the feasibility and validity of DeconRNASeq across a range of mixing levels and sources using mRNA-Seq data mixed in silico at known concentrations. We validated our computational approach for various benchmark data, with high correlation between our predicted cell proportions and the real fractions of tissues. Our study provides a rigorous, quantitative and high-resolution tool as a prerequisite to use mRNA-Seq data. The modularity of package design allows an easy deployment of custom analytical pipelines for data from other high-throughput platforms. Availability: DeconRNASeq is written in R, and is freely available at http://bioconductor.org/packages. Supplementary information: Supplementary data are available at Bioinformatics online.
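The non-negative decomposition mentioned in this abstract can be written as a small quadratic program. The formulation below is a sketch only: the symbols S (signature matrix, genes by K tissue types), b (observed mixture profile) and x (mixing proportions) are notation introduced here, and the sum-to-one constraint is an assumption that the abstract does not state.

```latex
\[
\hat{x} \;=\; \arg\min_{x \in \mathbb{R}^{K}} \; \lVert S x - b \rVert_2^2
\quad \text{subject to} \quad x_k \ge 0 \;\; (k = 1, \dots, K), \qquad \sum_{k=1}^{K} x_k = 1 .
\]
% Expanding the squared norm gives the standard QP form
\[
\lVert S x - b \rVert_2^2 \;=\; \tfrac{1}{2}\, x^{\top} Q\, x + c^{\top} x + b^{\top} b,
\qquad Q = 2\, S^{\top} S, \quad c = -2\, S^{\top} b .
\]
```

Because Q is positive semidefinite, the problem is convex, which is consistent with the abstract's description of a globally optimized solution.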
Affiliation(s)
- Ting Gong
- Biomarker Development, Translational Medicine, Novartis Institutes for BioMedical Research, Cambridge, MA 02139, USA.
|
29
|
PERT: a method for expression deconvolution of human blood samples from varied microenvironmental and developmental conditions. PLoS Comput Biol 2012; 8:e1002838. [PMID: 23284283 PMCID: PMC3527275 DOI: 10.1371/journal.pcbi.1002838] [Citation(s) in RCA: 84] [Impact Index Per Article: 6.5] [Received: 06/05/2012] [Accepted: 10/26/2012] [Indexed: 12/30/2022] Open
Abstract
The cellular composition of heterogeneous samples can be predicted using an expression deconvolution algorithm to decompose their gene expression profiles based on pre-defined, reference gene expression profiles of the constituent populations in these samples. However, the expression profiles of the actual constituent populations are often perturbed from those of the reference profiles due to gene expression changes in cells associated with microenvironmental or developmental effects. Existing deconvolution algorithms do not account for these changes and give incorrect results when benchmarked against those measured by well-established flow cytometry, even after batch correction was applied. We introduce PERT, a new probabilistic expression deconvolution method that detects and accounts for a shared, multiplicative perturbation in the reference profiles when performing expression deconvolution. We applied PERT and three other state-of-the-art expression deconvolution methods to predict cell frequencies within heterogeneous human blood samples that were collected under several conditions (uncultured mono-nucleated and lineage-depleted cells, and culture-derived lineage-depleted cells). Only PERT's predicted proportions of the constituent populations matched those assigned by flow cytometry. Genes associated with cell cycle processes were highly enriched among those with the largest predicted expression changes between the cultured and uncultured conditions. We anticipate that PERT will be widely applicable to expression deconvolution strategies that use profiles from reference populations that vary from the corresponding constituent populations in cellular state but not cellular phenotypic identity.
|
30
|
Optimal deconvolution of transcriptional profiling data using quadratic programming with application to complex clinical blood samples. PLoS One 2011; 6:e27156. [PMID: 22110609 PMCID: PMC3217948 DOI: 10.1371/journal.pone.0027156] [Citation(s) in RCA: 96] [Impact Index Per Article: 6.9] [Received: 04/12/2011] [Accepted: 10/11/2011] [Indexed: 11/19/2022] Open
Abstract
Large-scale molecular profiling technologies have assisted the identification of disease biomarkers and facilitated the basic understanding of cellular processes. However, samples collected from human subjects in clinical trials possess a level of complexity, arising from multiple cell types, that can obfuscate the analysis of data derived from them. Failure to identify, quantify, and incorporate sources of heterogeneity into an analysis can have widespread and detrimental effects on subsequent statistical studies. We describe an approach that builds upon a linear latent variable model, in which expression levels from mixed cell populations are modeled as the weighted average of expression from different cell types. We solve these equations using quadratic programming, which efficiently identifies the globally optimal solution while preserving non-negativity of the fraction of the cells. We applied our method to various existing platforms to estimate proportions of different pure cell or tissue types and gene expression profilings of distinct phenotypes, with a focus on complex samples collected in clinical trials. We tested our methods on several well controlled benchmark data sets with known mixing fractions of pure cell or tissue types and mRNA expression profiling data from samples collected in a clinical trial. Accurate agreement between predicted and actual mixing fractions was observed. In addition, our method was able to predict mixing fractions for more than ten species of circulating cells and to provide accurate estimates for relatively rare cell types (<10% total population). Furthermore, accurate changes in leukocyte trafficking associated with Fingolimod (FTY720) treatment were identified that were consistent with previous results generated by both cell counts and flow cytometry. These data suggest that our method can solve one of the open questions regarding the analysis of complex transcriptional data: namely, how to identify the optimal mixing fractions in a given experiment.
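The weighted-average model described in this abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: it swaps in non-negative least squares (`scipy.optimize.nnls`), which is a quadratic program with only non-negativity constraints, and then rescales the weights to sum to one. The function name `estimate_fractions`, the synthetic signature matrix and the chosen fractions are all hypothetical and exist only for illustration.

```python
import numpy as np
from scipy.optimize import nnls


def estimate_fractions(signatures, mixture):
    """Estimate mixing fractions under a linear mixture model.

    signatures: (genes x cell types) matrix of pure expression profiles.
    mixture:    (genes,) expression vector of the mixed sample.
    Solves min ||signatures @ w - mixture||^2 with w >= 0, then rescales
    w to sum to one (the rescaling step is an assumption of this sketch).
    """
    weights, _residual_norm = nnls(signatures, mixture)
    total = weights.sum()
    if total == 0:
        raise ValueError("all estimated weights are zero; cannot normalise")
    return weights / total


# Synthetic, noise-free example: 50 genes, 3 hypothetical cell types.
rng = np.random.default_rng(0)
signatures = rng.uniform(1.0, 10.0, size=(50, 3))
true_fractions = np.array([0.6, 0.3, 0.1])
mixture = signatures @ true_fractions  # weighted average of pure profiles

estimated = estimate_fractions(signatures, mixture)
print(np.round(estimated, 3))
```

With real data the mixture contains noise and platform effects, so the recovered fractions would only approximate the true ones; a general QP solver would also allow the sum-to-one constraint to be enforced inside the optimisation rather than by rescaling afterwards.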
|
31
|
Cullum R, Alder O, Hoodless PA. The next generation: Using new sequencing technologies to analyse gene regulation. Respirology 2011; 16:210-22. [DOI: 10.1111/j.1440-1843.2010.01899.x] [Citation(s) in RCA: 38] [Impact Index Per Article: 2.7] [Indexed: 12/13/2022]
|
32
|
Erkkilä T, Lehmusvaara S, Ruusuvuori P, Visakorpi T, Shmulevich I, Lähdesmäki H. Probabilistic analysis of gene expression measurements from heterogeneous tissues. Bioinformatics 2010; 26:2571-7. [PMID: 20631160 PMCID: PMC2951082 DOI: 10.1093/bioinformatics/btq406] [Citation(s) in RCA: 54] [Impact Index Per Article: 3.6] [Indexed: 01/21/2023]
Abstract
Motivation: Tissue heterogeneity, arising from multiple cell types, is a major confounding factor in experiments that focus on studying cell types, e.g. their expression profiles, in isolation. Although sample heterogeneity can be addressed by manual microdissection prior to conducting experiments, computational treatment of heterogeneous measurements has become a reliable alternative to perform this microdissection in silico. Favoring computation over manual purification has its advantages, such as time consumption, measuring responses of multiple cell types simultaneously, keeping samples intact of external perturbations and unaltered yield of molecular content. Results: We formalize a probabilistic model, DSection, and show with simulations as well as with real microarray data that DSection attains increased modeling accuracy in terms of (i) estimating cell-type proportions of heterogeneous tissue samples, (ii) estimating replication variance and (iii) identifying differential expression across cell types under various experimental conditions. As our reference we use the corresponding linear regression model, which mirrors the performance of the majority of current non-probabilistic modeling approaches. Availability and Software: All codes are written in Matlab, and are freely available upon request as well as at the project web page http://www.cs.tut.fi/~erkkila2/. Furthermore, a web-application for DSection exists at http://informatics.systemsbiology.net/DSection. Contact: timo.p.erkkila@tut.fi; harri.lahdesmaki@tut.fi
Affiliation(s)
- Timo Erkkilä
- Department of Signal Processing, Tampere University of Technology, Finland.
|
33
|
Datta S, Datta S, Kim S, Chakraborty S, Gill RS. Statistical Analyses of Next Generation Sequence Data: A Partial Overview. J Proteomics Bioinform 2010; 3:183-190. [PMID: 21113236 PMCID: PMC2989618 DOI: 10.4172/jpb.1000138] [Citation(s) in RCA: 29] [Impact Index Per Article: 1.9] [Indexed: 12/15/2022]
Abstract
Next generation sequencing has revolutionized the status of biological research. For a long time, the gold standard of DNA sequencing was considered to be the Sanger method. However, in 2005, the commercial launch of next generation sequencing made it possible to generate massively parallel and high resolution DNA sequence data. Its usefulness in various genomic applications such as genome-wide detection of SNPs, DNA methylation profiling, mRNA expression profiling, whole-genome re-sequencing and so on is now well recognized. There are several platforms for generating next generation sequencing (NGS) data, which we briefly discuss in this mini overview. With new technologies come new challenges for the data analysts. This mini review attempts to present a collection of selected topics in the current development of statistical methods dealing with these novel data types. We believe that knowing the advances and bottlenecks of this technology will help researchers to benchmark the analytical tools dealing with these data and will pave the path for its proper application in clinical diagnostics.
Affiliation(s)
- Susmita Datta
- Department of Bioinformatics and Biostatistics, University of Louisville, Louisville, KY 40202, USA
- Somnath Datta
- Department of Bioinformatics and Biostatistics, University of Louisville, Louisville, KY 40202, USA
- Seongho Kim
- Department of Bioinformatics and Biostatistics, University of Louisville, Louisville, KY 40202, USA
- Sutirtha Chakraborty
- Department of Bioinformatics and Biostatistics, University of Louisville, Louisville, KY 40202, USA
- Ryan S. Gill
- Department of Mathematics, University of Louisville, Louisville, KY 40202, USA
|