1
Lee I, Deng S, Ning Y. Optimal variable clustering for high-dimensional matrix valued data. Information and Inference: A Journal of the IMA 2025; 14:iaaf001. [PMID: 40084241 PMCID: PMC11899537 DOI: 10.1093/imaiai/iaaf001] [Received: 12/06/2023] [Revised: 10/07/2024] [Accepted: 01/24/2025] [Indexed: 03/16/2025]
Abstract
Matrix valued data has become increasingly prevalent in many applications. Most of the existing clustering methods for this type of data are tailored to the mean model and do not account for the dependence structure of the features, which can be very informative, especially in high-dimensional settings or when mean information is not available. To extract the information from the dependence structure for clustering, we propose a new latent variable model for the features arranged in matrix form, with some unknown membership matrices representing the clusters for the rows and columns. Under this model, we further propose a class of hierarchical clustering algorithms using the difference of a weighted covariance matrix as the dissimilarity measure. Theoretically, we show that under mild conditions, our algorithm attains clustering consistency in the high-dimensional setting. While this consistency result holds for our algorithm with a broad class of weighted covariance matrices, the conditions for this result depend on the choice of the weight. To investigate how the weight affects the theoretical performance of our algorithm, we establish the minimax lower bound for clustering under our latent variable model in terms of some cluster separation metric. Given these results, we identify the optimal weight in the sense that using this weight guarantees our algorithm to be minimax rate-optimal. The practical implementation of our algorithm with the optimal weight is also discussed. Simulation studies show that our algorithm performs better than existing methods in terms of the adjusted Rand index (ARI). The method is applied to a genomic dataset and yields meaningful interpretations.
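The row-clustering idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the identity weight `W`, the max-type covariance-difference dissimilarity, the average linkage, and all function names are assumptions chosen for illustration, and the paper's optimal weight is not reproduced here.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def weighted_row_covariance(X, W=None):
    """Return (1/n) * sum_i Xc_i @ W @ Xc_i.T for centered samples Xc_i of shape (p, q)."""
    n, p, q = X.shape
    W = np.eye(q) if W is None else W  # assumption: identity weight by default
    Xc = X - X.mean(axis=0)  # center each entry across the n samples
    return np.einsum("nij,jk,nlk->il", Xc, W, Xc) / n


def covariance_difference(S):
    """Assumed dissimilarity: d(a, b) = max over c not in {a, b} of |S[a, c] - S[b, c]|."""
    p = S.shape[0]
    D = np.zeros((p, p))
    for a in range(p):
        for b in range(a + 1, p):
            keep = np.ones(p, dtype=bool)
            keep[[a, b]] = False  # exclude the two rows being compared
            D[a, b] = D[b, a] = np.max(np.abs(S[a, keep] - S[b, keep]))
    return D


def cluster_rows(X, n_clusters, W=None):
    """Hierarchically cluster the p row variables of X (shape n x p x q)."""
    D = covariance_difference(weighted_row_covariance(X, W))
    Z = linkage(squareform(D), method="average")  # assumption: average linkage
    return fcluster(Z, t=n_clusters, criterion="maxclust")
```

Calling `cluster_rows(X, n_clusters)` on an array of shape `(n, p, q)` returns one integer label per row variable; columns could be clustered the same way after swapping the last two axes with `X.transpose(0, 2, 1)`.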
Affiliation(s)
- Inbeom Lee
  - Booth School of Business, University of Chicago, 5807 S. Woodlawn Ave., Chicago, IL 60637, USA
- Siyi Deng
  - Amazon, 425 106th Ave NE, Bellevue, WA 98004, USA
- Yang Ning
  - Department of Statistics and Data Science, Cornell University, 1198 Comstock Hall, 129 Garden Ave., Ithaca, NY 14853, USA
2
Qiao X, Zhang X, Chen W, Xu X, Chen YW, Liu ZP. tensorGSEA: Detecting Differential Pathways in Type 2 Diabetes via Tensor-Based Data Reconstruction. Interdiscip Sci 2022; 14:520-531. [PMID: 35195883 DOI: 10.1007/s12539-022-00506-2] [Received: 05/26/2021] [Revised: 01/24/2022] [Accepted: 02/07/2022] [Indexed: 06/14/2023]
Abstract
Detecting significant signaling pathways in disease progression highlights the dysfunctions and pathogenic mechanisms of complex disease development. Since tensor decomposition has been proven effective for multi-dimensional data representation and reconstruction, differences between original and tensor-processed data are expected to extract crucial information and differential indication. This paper provides a tensor-based gene set enrichment analysis, called tensorGSEA, based on a data reconstruction method to identify relevant significant pathways during disease development. As a proof-of-concept study, we identify the differential pathways of diabetes in rats. Specifically, we first arrange gene expression profiles of each documented pathway as tensors with three dimensions: genes, samples, and periods. Then we compress tensors into core tensors with lower ranks. The pathways with lower reconstruction rates are obtained after reconstructing gene expression profiles in another state via these cores. Thus, differences underlying pathways are extracted by cross-state data reconstruction between controls and diseases. The experiments reveal several critical pathways with diabetes-specific functions which otherwise cannot be identified by alternative methods. Our proposed tensorGSEA is efficient in evaluating pathways by achieving their empirical statistical significance, respectively. The classification experiments demonstrate that the selected pathways can be implemented as biomarkers to identify the diabetic state. The code of tensorGSEA is available at https://github.com/zhxr37/tensorGSEA .
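The cross-state reconstruction idea can be sketched as follows. This is a hedged illustration, not the tensorGSEA code (which is at the URL above): it assumes a truncated-SVD basis per tensor mode, projects only the gene and period modes (samples differ between states), and defines a hypothetical reconstruction rate as one minus the relative reconstruction error. The function names and the `modes` default are assumptions.

```python
import numpy as np


def unfold(T, mode):
    """Matricize tensor T along `mode` (rows index that mode)."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)


def mode_basis(T, mode, rank):
    """Leading `rank` left singular vectors of the mode unfolding."""
    U, _, _ = np.linalg.svd(unfold(T, mode), full_matrices=False)
    return U[:, :rank]


def cross_state_rate(T_ref, T_other, ranks, modes=(0, 2)):
    """Reconstruct T_other with subspaces learned from T_ref.

    Tensors are genes x samples x periods. Returns
    1 - ||T_other - T_hat|| / ||T_other||, an assumed definition of the rate.
    """
    T_hat = T_other
    for mode, rank in zip(modes, ranks):
        U = mode_basis(T_ref, mode, rank)
        P = U @ U.T  # orthogonal projector onto the reference subspace
        # Apply P along `mode`, then move the resulting axis back into place.
        T_hat = np.moveaxis(np.tensordot(P, T_hat, axes=(1, mode)), 0, mode)
    err = np.linalg.norm(T_other - T_hat) / np.linalg.norm(T_other)
    return 1.0 - err
```

Under these assumptions, a pathway whose disease-state tensor is poorly reconstructed from control-state subspaces gets a low rate, which is the kind of signal the abstract describes as a differential pathway.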
Affiliation(s)
- Xu Qiao
  - Department of Biomedical Engineering, School of Control Science and Engineering, Shandong University, Jinan, 250061, Shandong, China
- Xianru Zhang
  - Department of Biomedical Engineering, School of Control Science and Engineering, Shandong University, Jinan, 250061, Shandong, China
- Wei Chen
  - Shandong Provincial Key Laboratory of Oral Tissue Regeneration, School of Stomatology, Cheeloo College of Medicine, Shandong University, Jinan, 250012, Shandong, China
- Xin Xu
  - Shandong Provincial Key Laboratory of Oral Tissue Regeneration, School of Stomatology, Cheeloo College of Medicine, Shandong University, Jinan, 250012, Shandong, China
- Yen-Wei Chen
  - Graduate School of Information Science and Engineering, Ritsumeikan University, Shiga, 525-8577, Japan
- Zhi-Ping Liu
  - Department of Biomedical Engineering, School of Control Science and Engineering, Shandong University, Jinan, 250061, Shandong, China
3
Qian S, Liu H, Yuan X, Wei W, Chen S, Yan H. Row and Column Structure-Based Biclustering for Gene Expression Data. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2022; 19:1117-1129. [PMID: 32894722 DOI: 10.1109/tcbb.2020.3022085] [Indexed: 06/11/2023]
Abstract
Due to the development of high-throughput technologies for gene analysis, the biclustering method has attracted much attention. However, existing methods have problems with high time and space complexity. This paper proposes a biclustering method, called Row and Column Structure-based Biclustering (RCSBC), with low time and space complexity to find checkerboard patterns within microarray data. First, the paper describes the structure of bicluster by using the structure of rows and columns. Second, the paper chooses the representative rows and columns with two algorithms. Finally, the gene expression data are biclustered on the space spanned by representative rows and columns. To the best of our knowledge, this paper is the first to exploit the relationship between the row/column structure of a gene expression matrix and the structure of biclusters. Both the synthetic datasets and the real-life gene expression datasets are used to validate the effectiveness of our method. It can be seen from the experiment results that the RCSBC outperforms the state-of-the-art algorithms both on clustering accuracy and time/space complexity. This study offers new insights into biclustering the large-scale gene expression data without loading the whole data into memory.
4
Abstract
Biclustering is an important exploratory analysis tool that simultaneously clusters rows (e.g., samples) and columns (e.g., variables) of a data matrix. Checkerboard-like biclusters reveal intrinsic associations between rows and columns. However, most existing methods rely on Gaussian assumptions and only apply to matrix data. In practice, non-Gaussian and/or multi-way tensor data are frequently encountered. A new CO-clustering method via Regularized Alternating Least Squares (CORALS) is proposed, which generalizes biclustering to non-Gaussian data and multi-way tensor arrays. Non-Gaussian data are modeled with single-parameter exponential family distributions and co-clusters are identified in the natural parameter space via sparse CANDECOMP/PARAFAC tensor decomposition. A regularized alternating (iteratively reweighted) least squares algorithm is devised for model fitting and a deflation procedure is exploited to automatically determine the number of co-clusters. Comprehensive simulation studies and three real data examples demonstrate the efficacy of the proposed method. The data and code are publicly available.
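As a rough illustration of regularized alternating least squares for co-clustering, the sketch below covers only the simplest special case: Gaussian matrix data, a single rank-one layer, and an L1 penalty applied by soft-thresholding. It is not the CORALS algorithm itself; the exponential-family reweighting, the tensor (CANDECOMP/PARAFAC) extension, and the deflation step are omitted, and the function names and the penalty level `lam` are assumptions.

```python
import numpy as np


def soft_threshold(x, lam):
    """Elementwise soft-thresholding, the proximal map of the L1 penalty."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)


def _unit(x):
    """Scale x to unit length; leave an all-zero vector unchanged."""
    norm = np.linalg.norm(x)
    return x / norm if norm > 0 else x


def sparse_rank_one(X, lam, n_iter=50):
    """Alternate penalized least-squares updates for a sparse rank-one layer u v^T."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    v = Vt[0]  # initialize from the leading right singular vector
    u = np.zeros(X.shape[0])
    for _ in range(n_iter):
        u = _unit(soft_threshold(X @ v, lam))
        v = _unit(soft_threshold(X.T @ u, lam))
    return u, v


def co_cluster_support(X, lam):
    """Rows and columns with nonzero loadings form the estimated co-cluster."""
    u, v = sparse_rank_one(X, lam)
    return np.flatnonzero(u), np.flatnonzero(v)
```

In this simplified setting, the nonzero entries of `u` and `v` mark the rows and columns of one co-cluster; a larger `lam` yields a smaller, sparser block.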
Affiliation(s)
- Gen Li
  - Department of Biostatistics, Columbia University, New York, NY 10032
5
Chi EC, Gaines BR, Sun WW, Zhou H, Yang J. Provable Convex Co-clustering of Tensors. Journal of Machine Learning Research: JMLR 2020; 21:214. [PMID: 33312074 PMCID: PMC7731944] [Indexed: 06/12/2023]
Abstract
Cluster analysis is a fundamental tool for pattern discovery of complex heterogeneous data. Prevalent clustering methods mainly focus on vector or matrix-variate data and are not applicable to general-order tensors, which arise frequently in modern scientific and business applications. Moreover, there is a gap between statistical guarantees and computational efficiency for existing tensor clustering solutions due to the nature of their non-convex formulations. In this work, we bridge this gap by developing a provable convex formulation of tensor co-clustering. Our convex co-clustering (CoCo) estimator enjoys stability guarantees and its computational and storage costs are polynomial in the size of the data. We further establish a non-asymptotic error bound for the CoCo estimator, which reveals a surprising "blessing of dimensionality" phenomenon that does not exist in vector or matrix-variate cluster analysis. Our theoretical findings are supported by extensive simulated studies. Finally, we apply the CoCo estimator to the cluster analysis of advertisement click tensor data from a major online company. Our clustering results provide meaningful business insights to improve advertising effectiveness.
Affiliation(s)
- Eric C Chi
  - Department of Statistics, North Carolina State University, Raleigh, NC 27695, USA
- Brian R Gaines
  - Advanced Analytics R&D, SAS Institute Inc., Cary, NC 27513, USA
- Will Wei Sun
  - Krannert School of Management, Purdue University, West Lafayette, IN 47907, USA
- Hua Zhou
  - Department of Biostatistics, University of California, Los Angeles, CA 90095, USA
- Jian Yang
  - Advertising Sciences, Yahoo Research, Sunnyvale, CA 94089, USA