1
Liu Z. Clustering Single-Cell RNA-Seq Data with Regularized Gaussian Graphical Model. Genes (Basel) 2021; 12:311. [PMID: 33671799 PMCID: PMC7927011 DOI: 10.3390/genes12020311] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 12/27/2020] [Revised: 02/07/2021] [Accepted: 02/15/2021] [Indexed: 11/20/2022]
Abstract
Single-cell RNA-seq (scRNA-seq) is a powerful tool to measure the expression patterns of individual cells and discover heterogeneity and functional diversity among cell populations. Due to variability, it is challenging to analyze such data efficiently. Many clustering methods have been developed using at least one free parameter. Different choices for free parameters may lead to substantially different visualizations and clusters. Tuning free parameters is also time consuming. Thus, there is a need for a simple, robust, and efficient clustering method. In this paper, we propose a new regularized Gaussian graphical clustering (RGGC) method for scRNA-seq data. RGGC is based on high-order (partial) correlations and subspace learning, and is robust over a wide range of the regularization parameter λ. Therefore, we can simply set λ=2 or λ=log(p) for AIC (Akaike information criterion) or BIC (Bayesian information criterion), respectively, without cross-validation. Cell subpopulations are discovered by the Louvain community detection algorithm, which determines the number of clusters automatically. There is no free parameter to be tuned with RGGC. When evaluated with simulated and benchmark scRNA-seq data sets against widely used methods, RGGC is computationally efficient and one of the top performers. It can detect inter-sample cell heterogeneity when applied to glioblastoma scRNA-seq data.
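The abstract does not spell out the RGGC estimator, so the following is only a minimal sketch of the general idea behind partial correlations in a Gaussian graphical model: they can be read off the inverse covariance (precision) matrix. The function name `partial_correlations`, the optional `ridge` stabilizer, and the toy chain data are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def partial_correlations(cov, ridge=0.0):
    """Partial correlations from a covariance matrix via its inverse.

    `ridge` is a hypothetical stabilizer added to the diagonal before
    inversion; it is not the regularization used by RGGC.
    """
    p = cov.shape[0]
    precision = np.linalg.inv(cov + ridge * np.eye(p))
    d = np.sqrt(np.diag(precision))
    # rho_ij = -theta_ij / sqrt(theta_ii * theta_jj)
    pcor = -precision / np.outer(d, d)
    np.fill_diagonal(pcor, 1.0)
    return pcor

# Toy chain x0 -> x1 -> x2: x0 and x2 are related only through x1.
rng = np.random.default_rng(0)
x0 = rng.normal(size=2000)
x1 = x0 + 0.5 * rng.normal(size=2000)
x2 = x1 + 0.5 * rng.normal(size=2000)
data = np.column_stack([x0, x1, x2])

marginal = np.corrcoef(data, rowvar=False)
pcor = partial_correlations(np.cov(data, rowvar=False))
```

In this toy example the marginal correlation between `x0` and `x2` is large, while their partial correlation given `x1` is close to zero, which is the distinction a graphical model exploits.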
Affiliation(s)
- Zhenqiu Liu: Department of Public Health Sciences, Pennsylvania State University College of Medicine, 500 University Drive, Hershey, PA 17033, USA
2
Liu Z, Lin S. Sparse Treatment-Effect Model for Taxon Identification with High-Dimensional Metagenomic Data. Methods Mol Biol 2018; 1849:309-318. [PMID: 30298262 DOI: 10.1007/978-1-4939-8728-3_19] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 06/08/2023]
Abstract
Identifying disease-associated taxa is an important task in metagenomics. To date, many methods have been proposed for feature selection and prediction. However, those methods either use univariate (generalized) regression approaches to obtain P-values without considering the interactions among taxa, or use lasso or L0 type sparse modeling approaches to identify the taxa with the best predictions without providing P-values. To the best of our knowledge, there are no available methods that consider taxon interactions and also generate P-values. In this paper, we propose a treatment-effect model for identifying taxa (STEMIT) and performing statistical inference with high-dimensional metagenomic data. STEMIT will provide a P-value for a taxon through a two-step treatment-effect maximization. It will provide causal inference if the study is a clinical trial. We first identify taxa associated with the treatment-effect variable and the targeting feature with sparse modeling, and then estimate the P-value of the targeting gene with ordinary least squares (OLS) regression. We demonstrate that the proposed method is efficient and can identify biologically important taxa with a real metagenomic data set. The software for L0 sparse modeling can be downloaded at https://cran.r-project.org/web/packages/l0ara/ .
Affiliation(s)
- Zhenqiu Liu: Samuel Oschin Comprehensive Cancer Institute, Cedars-Sinai Medical Center, Los Angeles, CA, USA
- Shili Lin: Department of Statistics, The Ohio State University, Columbus, OH, USA
3
Kind T, Cho E, Park TD, Deng N, Liu Z, Lee T, Fiehn O, Kim J. Interstitial Cystitis-Associated Urinary Metabolites Identified by Mass-Spectrometry Based Metabolomics Analysis. Sci Rep 2016; 6:39227. [PMID: 27976711 PMCID: PMC5156939 DOI: 10.1038/srep39227] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.0] [Received: 08/22/2016] [Accepted: 11/18/2016] [Indexed: 11/09/2022]
Abstract
This study on interstitial cystitis (IC) aims to identify a unique urine metabolomic profile associated with IC, which can be defined as an unpleasant sensation including pain and discomfort related to the urinary bladder, without infection or other identifiable causes. Although the burden of IC on the American public is immense in both human and financial terms, there is no clear diagnostic test for IC; rather, it is a disease of exclusion. Very little is known about clinically useful urinary biomarkers of IC, which are desperately needed. Untargeted comprehensive metabolomic profiling was performed using gas-chromatography/mass-spectrometry to compare urine specimens of IC patients and healthy donors. The study profiled 200 known and 290 unknown metabolites. The majority of the thirty significantly changed metabolites before false discovery rate correction were unknown compounds. Partial least squares discriminant analysis clearly separated IC patients from controls. The high number of unknown compounds hinders useful biological interpretation of such predictive models. Given that urine analyses have great potential to be adapted in clinical practice, research has to be focused on the identification of unknown compounds to uncover important clues about underlying disease mechanisms.
Affiliation(s)
- Tobias Kind: West Coast Metabolomics Center, University of California, Davis, Davis, CA, USA
- Eunho Cho: University of California Los Angeles, CA, USA
- Nan Deng: Samuel Oschin Comprehensive Cancer Institute, Cedars-Sinai Medical Center, Los Angeles, CA, USA
- Zhenqiu Liu: Samuel Oschin Comprehensive Cancer Institute, Cedars-Sinai Medical Center, Los Angeles, CA, USA
- Tack Lee: Department of Urology, Inha University College of Medicine, Incheon, South Korea
- Oliver Fiehn: West Coast Metabolomics Center, University of California, Davis, Davis, CA, USA; King Abdulaziz University, Jeddah, Saudi Arabia
- Jayoung Kim: University of California Los Angeles, CA, USA; Samuel Oschin Comprehensive Cancer Institute, Cedars-Sinai Medical Center, Los Angeles, CA, USA; Departments of Surgery and Biomedical Sciences, Cedars-Sinai Medical Center, Los Angeles, CA, USA; Department of Medicine, University of California Los Angeles, Los Angeles, CA, USA
4
Efficient Regularized Regression with L0 Penalty for Variable Selection and Network Construction. Computational and Mathematical Methods in Medicine 2016; 2016:3456153. [PMID: 27843486 PMCID: PMC5098106 DOI: 10.1155/2016/3456153] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.8] [Received: 04/06/2016] [Revised: 08/29/2016] [Accepted: 09/20/2016] [Indexed: 12/22/2022]
Abstract
Variable selection for regression with high-dimensional big data has found many applications in bioinformatics and computational biology. One appealing approach is L0 regularized regression, which penalizes the number of nonzero features in the model directly. However, it is well known that L0 optimization is NP-hard and computationally challenging. In this paper, we propose efficient EM (L0EM) and dual L0EM (DL0EM) algorithms that directly approximate the L0 optimization problem. While L0EM is efficient with large sample size, DL0EM is efficient with high-dimensional (n ≪ m) data. They also provide a natural solution to all Lp, p ∈ [0,2], problems, including lasso with p = 1 and elastic net with p ∈ [1,2]. The regularization parameter λ can be determined through cross validation or AIC and BIC. We demonstrate our methods through simulation and high-dimensional genomic data. The results indicate that L0 has better performance than lasso, SCAD, and MC+, and that L0 with AIC or BIC performs similarly to computationally intensive cross validation. The proposed algorithms are efficient in identifying the nonzero variables with less bias and constructing biologically important networks with high-dimensional big data.
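For orientation, an L0-penalized least-squares problem of the kind described above is commonly written as follows. The symbols (response y, design matrix X with n rows and m columns, coefficients β) and the scaling of the penalty are assumptions chosen for illustration; the paper's own formulation may differ.

```latex
\hat{\beta}
  = \arg\min_{\beta \in \mathbb{R}^{m}}
    \frac{1}{2}\,\lVert y - X\beta \rVert_2^2
    + \frac{\lambda}{2}\,\lVert \beta \rVert_0 ,
\qquad
\lVert \beta \rVert_0 = \#\{\, j : \beta_j \neq 0 \,\}.

% The L_p family replaces the counting penalty with a power penalty:
\hat{\beta}_{(p)}
  = \arg\min_{\beta \in \mathbb{R}^{m}}
    \frac{1}{2}\,\lVert y - X\beta \rVert_2^2
    + \frac{\lambda}{2}\sum_{j=1}^{m} \lvert \beta_j \rvert^{p},
\qquad 0 < p \le 2 .
```

Under this notation, p = 1 gives the lasso penalty and p = 2 gives the ridge penalty, while the L0 penalty counts the nonzero coefficients directly.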
5
Kruppa J, Kramer F, Beißbarth T, Jung K. A simulation framework for correlated count data of features subsets in high-throughput sequencing or proteomics experiments. Stat Appl Genet Mol Biol 2016; 15:401-414. [PMID: 27655448 DOI: 10.1515/sagmb-2015-0082] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 11/15/2022]
Abstract
As part of the data processing of high-throughput sequencing experiments, count data are produced that represent the number of reads that map to specific genomic regions. Count data also arise in mass spectrometric experiments for the detection of protein-protein interactions. For evaluating new computational methods for the analysis of sequencing count data or spectral count data from proteomics experiments, artificial count data are thus required. Although some methods for the generation of artificial sequencing count data have been proposed, all of them either simulate single sequencing runs, thus omitting the correlation structure between the individual genomic features, or are limited to specific structures. We propose to draw correlated data from the multivariate normal distribution and round these continuous data in order to obtain discrete counts. In our approach, the required distribution parameters can either be constructed in different ways or estimated from real count data. Because rounding affects the correlation structure, we evaluate the use of shrinkage estimators that have already been used in the context of artificial expression data from DNA microarrays. Our approach turned out to be useful for the simulation of counts for defined subsets of features such as individual pathways or GO categories.
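The core idea stated above, drawing from a multivariate normal distribution and rounding to obtain counts, can be sketched in a few lines. This is a minimal illustration and not the authors' framework: the function name `simulate_correlated_counts`, the example means, standard deviations, and correlation matrix are all made up, and clipping negative values to zero is an assumption of this sketch. The shrinkage step mentioned in the abstract is not shown.

```python
import numpy as np

def simulate_correlated_counts(mean, cov, n_samples, seed=0):
    """Draw multivariate normal samples and round them to non-negative integers."""
    rng = np.random.default_rng(seed)
    continuous = rng.multivariate_normal(mean, cov, size=n_samples)
    # Round to the nearest integer; clip at zero because counts cannot be
    # negative (an assumption of this sketch, not stated in the abstract).
    return np.clip(np.rint(continuous), 0, None).astype(int)

# Hypothetical parameters for three features.
mean = np.array([50.0, 80.0, 20.0])
sd = np.array([10.0, 15.0, 5.0])
corr = np.array([
    [1.0, 0.7, 0.2],
    [0.7, 1.0, 0.1],
    [0.2, 0.1, 1.0],
])
cov = corr * np.outer(sd, sd)

counts = simulate_correlated_counts(mean, cov, n_samples=5000)
empirical_corr = np.corrcoef(counts, rowvar=False)
```

Comparing `empirical_corr` with `corr` gives a quick check of how much rounding has disturbed the target correlation structure for a given choice of means and variances.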
6
Guo W, Liu Z, Ma S. Nonparametric Regularized Regression for Phenotype-Associated Taxa Selection and Network Construction with Metagenomic Count Data. J Comput Biol 2016; 23:877-890. [PMID: 27427793 DOI: 10.1089/cmb.2016.0023] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/12/2022]
Abstract
We use a metagenomic approach and network analysis to investigate the relationships between phenotypes across taxa under different environmental conditions. The network structure of taxa can be affected by disease-associated environmental conditions. In addition, taxa abundance differs across conditions. Therefore, knowing how the correlation or relative abundance changes with these factors would be of great interest to researchers. We develop a nonparametric regularized regression method to construct taxa association networks under different clinical conditions. We let the coefficients be unknown functions of the environmental variable. The varying coefficients are estimated by using regression splines. The proposed method is regularized with concave penalties, and an efficient group descent algorithm is developed for computation. We also apply the varying coefficient model to estimate taxa abundance to see how it changes across different environmental conditions. Moreover, for conducting inference, we propose a bootstrap method to construct simultaneous confidence bands for the corresponding coefficients. We use different simulated designs and a real data set to demonstrate that our method can identify the network structures successfully under different environmental conditions. As such, the proposed method has potential applications for researchers to construct differential networks and identify taxa.
Affiliation(s)
- Wenchuan Guo: Department of Statistics, University of California Riverside, Riverside, California
- Zhenqiu Liu: Samuel Oschin Comprehensive Cancer Institute, Cedars-Sinai Medical Center, Los Angeles, California
- Shujie Ma: Department of Statistics, University of California Riverside, Riverside, California
7
Liu Z, Lin S, Deng N, McGovern DP, Piantadosi S. Sparse Inverse Covariance Estimation with L0 Penalty for Network Construction with Omics Data. J Comput Biol 2016; 23:192-202. [PMID: 26828463 DOI: 10.1089/cmb.2015.0102] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.0] [Indexed: 12/26/2022]
Affiliation(s)
- Zhenqiu Liu: Samuel Oschin Comprehensive Cancer Institute, Cedars-Sinai Medical Center, Los Angeles, California
- Shili Lin: Department of Statistics, The Ohio State University, Columbus, Ohio
- Nan Deng: Samuel Oschin Comprehensive Cancer Institute, Cedars-Sinai Medical Center, Los Angeles, California
- Dermot P.B. McGovern: Widjaja Foundation Inflammatory Bowel & Immunobiology Research Institute, Cedars-Sinai Medical Center, Los Angeles, California
- Steven Piantadosi: Samuel Oschin Comprehensive Cancer Institute, Cedars-Sinai Medical Center, Los Angeles, California
8
Peng X, Li G, Liu Z. Zero-Inflated Beta Regression for Differential Abundance Analysis with Metagenomics Data. J Comput Biol 2015; 23:102-110. [PMID: 26675626 DOI: 10.1089/cmb.2015.0157] [Citation(s) in RCA: 39] [Impact Index Per Article: 3.9] [Indexed: 01/04/2023]
Abstract
Metagenomics data have been growing rapidly due to advances in NGS technologies. One goal of human microbial studies is to detect abundance differences across clinical conditions. Besides small sample size and high dimension, metagenomics data are usually represented as compositions (proportions) with a large number of zeros and a skewed distribution. Efficient tools for handling such compositional data need to be developed. We propose a zero-inflated beta regression approach (ZIBSeq) for identifying differentially abundant features between multiple clinical conditions. The proposed method takes the sparse nature of metagenomics data into account and handles the compositional data efficiently. Compared with other available methods, the proposed approach demonstrates better performance, with large AUC values for most simulation studies. When applied to a human metagenomics data set, it also identifies biologically important taxa reported in previous studies. The software in R is available upon request from the first author.
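As background for the zero-inflated beta model named above, one standard way to write a zero-inflated beta density for a proportion y in [0, 1) is shown here. The mean-precision parametrization (π for the probability of a zero, μ for the beta mean, φ for the precision) is an assumption for illustration; the parametrization and link functions used in ZIBSeq may differ.

```latex
f(y;\, \pi, \mu, \phi) =
\begin{cases}
  \pi, & y = 0, \\[4pt]
  (1-\pi)\,
  \dfrac{\Gamma(\phi)}{\Gamma(\mu\phi)\,\Gamma\bigl((1-\mu)\phi\bigr)}\,
  y^{\mu\phi - 1}\,(1-y)^{(1-\mu)\phi - 1}, & 0 < y < 1,
\end{cases}
\qquad
\mathbb{E}[y] = (1-\pi)\,\mu .
```

The point mass at zero accounts for the many zero proportions, and the beta component models the skewed nonzero proportions.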
Affiliation(s)
- Xiaoling Peng: Division of Science and Technology, Beijing Normal University - Hong Kong Baptist University United International College, Zhuhai, China
- Gang Li: Department of Biostatistics, School of Public Health, University of California at Los Angeles, Los Angeles, California
- Zhenqiu Liu: Samuel Oschin Comprehensive Cancer Institute, Cedars-Sinai Medical Center, Los Angeles, California
9
Liu Z, Lin S, Piantadosi S. Network construction and structure detection with metagenomic count data. BioData Min 2015; 8:40. [PMID: 26692900 PMCID: PMC4676895 DOI: 10.1186/s13040-015-0072-2] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Received: 06/05/2015] [Accepted: 11/18/2015] [Indexed: 11/16/2022]
Abstract
Background: The human microbiome plays a critical role in human health. Massive amounts of metagenomic data have been generated with advances in next-generation sequencing technologies that characterize microbial communities via direct isolation and sequencing. How to extract, analyze, and transform these vast amounts of data into useful knowledge is a great challenge to bioinformaticians. Microbial biodiversity research has focused primarily on taxa composition and abundance and less on the co-occurrences among different taxa. However, taxa co-occurrences and their relationships to environmental and clinical conditions are important because network structure may help to understand how microbial taxa function together. Results: We propose a systematic robust approach for bacteria network construction and structure detection using metagenomic count data. Pairwise similarity/distance measures between taxa are proposed by adapting distance measures for samples in ecology. We also extend the sparse inverse covariance approach to a sparse inverse of a similarity matrix from count data for network construction. Our approach is efficient for large metagenomic count data with thousands of bacterial taxa. We evaluate our method with real and simulated data. Our method identifies true and biologically significant network structures efficiently. Conclusions: Network analysis is crucial for detecting subnetwork structures with metagenomic count data. We developed a software tool in MATLAB for network construction and biologically significant module detection. Software MetaNet can be downloaded from http://biostatistics.csmc.edu/MetaNet/.
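The abstract does not say which ecological distance measures were adapted. As one hedged illustration of the general idea, the sketch below applies the Bray-Curtis dissimilarity, a measure usually computed between samples, to pairs of taxa instead. The function name `bray_curtis_between_taxa` and the toy count table are hypothetical, and this is not the MetaNet implementation.

```python
import numpy as np

def bray_curtis_between_taxa(counts):
    """Bray-Curtis dissimilarity between taxa (columns) of a samples x taxa table."""
    counts = np.asarray(counts, dtype=float)
    # Sum over samples of |x_si - x_sj| and (x_si + x_sj) for every taxa pair.
    diff = np.abs(counts[:, :, None] - counts[:, None, :]).sum(axis=0)
    total = (counts[:, :, None] + counts[:, None, :]).sum(axis=0)
    # Pairs of taxa that are absent from every sample get dissimilarity 0.
    with np.errstate(invalid="ignore", divide="ignore"):
        return np.where(total > 0, diff / total, 0.0)

# Toy table: 3 samples (rows) x 3 taxa (columns).
counts = np.array([
    [10, 20, 0],
    [5, 10, 0],
    [0, 0, 8],
])
dist = bray_curtis_between_taxa(counts)
similarity = 1.0 - dist
```

Here taxa 0 and 1 co-occur in the same samples and get a small dissimilarity (1/3), while taxon 2 never co-occurs with either and gets the maximum dissimilarity of 1. A similarity matrix of this kind is the sort of input the abstract describes passing to a sparse inverse estimation step.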
Affiliation(s)
- Zhenqiu Liu: Samuel Oschin Comprehensive Cancer Institute, Cedars-Sinai Medical Center, Los Angeles, CA 90048, USA
- Shili Lin: Department of Statistics, The Ohio State University, Columbus, OH 43210, USA
- Steven Piantadosi: Samuel Oschin Comprehensive Cancer Institute, Cedars-Sinai Medical Center, Los Angeles, CA 90048, USA
10
Liu Z, Beach JA, Agadjanian H, Jia D, Aspuria PJ, Karlan BY, Orsulic S. Suboptimal cytoreduction in ovarian carcinoma is associated with molecular pathways characteristic of increased stromal activation. Gynecol Oncol 2015; 139:394-400. [PMID: 26348314 DOI: 10.1016/j.ygyno.2015.08.026] [Citation(s) in RCA: 46] [Impact Index Per Article: 4.6] [Received: 07/12/2015] [Revised: 08/26/2015] [Accepted: 08/30/2015] [Indexed: 02/06/2023]
Abstract
OBJECTIVE: Suboptimal cytoreductive surgery in advanced epithelial ovarian cancer (EOC) is associated with poor survival, but it is unknown if poor outcome is due to the intrinsic biology of unresectable tumors or insufficient surgical effort resulting in residual tumor-sustaining clones. Our objective was to identify the potential molecular pathway(s) and cell type(s) that may be responsible for suboptimal surgical resection. METHODS: By comparing gene expression in optimally and suboptimally cytoreduced patients, we identified a gene network associated with suboptimal cytoreduction and explored the biological processes and cell types associated with this gene network. RESULTS: We show that primary tumors from suboptimally cytoreduced patients express molecular signatures that are typically present in a distinct molecular subtype of EOC characterized by increased stromal activation and lymphovascular invasion. Similar molecular pathways are present in EOC metastases, suggesting that primary tumors in suboptimally cytoreduced patients are biologically similar to metastatic tumors. We demonstrate that the suboptimal cytoreduction network genes are enriched in reactive tumor stroma cells rather than malignant tumor cells. CONCLUSION: Our data suggest that the success of cytoreductive surgery is dictated by tumor biology, such as extensive stromal reaction and increased invasiveness, which may hinder surgical resection and ultimately lead to poor survival.
Affiliation(s)
- Zhenqiu Liu: Biostatistics and Bioinformatics Research Center, Cedars-Sinai Medical Center, Los Angeles, CA, USA
- Jessica A Beach: Graduate Program in Biomedical Science and Translational Medicine, Cedars-Sinai Medical Center, Los Angeles, CA, USA; Women's Cancer Program, Samuel Oschin Comprehensive Cancer Institute, Cedars-Sinai Medical Center, Los Angeles, CA, USA
- Hasmik Agadjanian: Women's Cancer Program, Samuel Oschin Comprehensive Cancer Institute, Cedars-Sinai Medical Center, Los Angeles, CA, USA
- Dongyu Jia: Women's Cancer Program, Samuel Oschin Comprehensive Cancer Institute, Cedars-Sinai Medical Center, Los Angeles, CA, USA
- Paul-Joseph Aspuria: Women's Cancer Program, Samuel Oschin Comprehensive Cancer Institute, Cedars-Sinai Medical Center, Los Angeles, CA, USA
- Beth Y Karlan: Women's Cancer Program, Samuel Oschin Comprehensive Cancer Institute, Cedars-Sinai Medical Center, Los Angeles, CA, USA; Department of Obstetrics and Gynecology, David Geffen School of Medicine, University of California at Los Angeles, Los Angeles, CA, USA
- Sandra Orsulic: Women's Cancer Program, Samuel Oschin Comprehensive Cancer Institute, Cedars-Sinai Medical Center, Los Angeles, CA, USA; Department of Obstetrics and Gynecology, David Geffen School of Medicine, University of California at Los Angeles, Los Angeles, CA, USA