1
Wang J, Lu CH, Kong XZ, Dai LY, Yuan S, Zhang X. Multi-view manifold regularized compact low-rank representation for cancer samples clustering on multi-omics data. BMC Bioinformatics 2022; 22:334. [PMID: 35057729 PMCID: PMC8772048 DOI: 10.1186/s12859-021-04220-6] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 05/25/2021] [Accepted: 05/27/2021] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: Identifying cancer types is of great significance for the early diagnosis and clinical treatment of cancer. Clustering cancer samples is an important means of identifying cancer types and has received much attention in bioinformatics. The purpose of cancer clustering is to find the expression patterns of different cancer types, so that samples with similar expression patterns can be grouped into the same type. To improve the accuracy and reliability of cancer clustering, many clustering methods have begun to focus on the integrative analysis of cancer multi-omics data. Methods based on multi-omics data have clear advantages over those using single-omics data. However, the high heterogeneity and noise of cancer multi-omics data pose a great challenge to multi-omics analysis methods. RESULTS: In this study, to extract more complementary information from cancer multi-omics data for cancer clustering, we propose a low-rank subspace clustering method called multi-view manifold regularized compact low-rank representation (MmCLRR). In MmCLRR, each omics data type is regarded as a view, and the method learns a consistent subspace representation by imposing a consistency constraint on the low-rank affinity matrix of each view to balance the agreement between different views. Moreover, manifold regularization and concept factorization are introduced into our method. Relying on concept factorization, the dictionary can be updated during learning, which greatly improves the subspace learning ability of low-rank representation. We adopt the linearized alternating direction method with adaptive penalty to solve the optimization problem of MmCLRR. CONCLUSIONS: Finally, we apply MmCLRR to the clustering of cancer samples based on multi-omics data, and the clustering results show that our method outperforms the existing multi-view methods.
Affiliation(s)
- Juan Wang: School of Computer Science, Qufu Normal University, Rizhao, 276826 China
- Cong-Hai Lu: School of Computer Science, Qufu Normal University, Rizhao, 276826 China
- Xiang-Zhen Kong: School of Computer Science, Qufu Normal University, Rizhao, 276826 China
- Ling-Yun Dai: School of Computer Science, Qufu Normal University, Rizhao, 276826 China
- Shasha Yuan: School of Computer Science, Qufu Normal University, Rizhao, 276826 China
- Xiaofeng Zhang: School of Information and Electrical Engineering, Ludong University, Yantai, 264025 China
2
Liu Q. A truncated nuclear norm and graph-Laplacian regularized low-rank representation method for tumor clustering and gene selection. BMC Bioinformatics 2022; 22:436. [PMID: 35057728 PMCID: PMC8772046 DOI: 10.1186/s12859-021-04333-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/10/2021] [Accepted: 08/23/2021] [Indexed: 12/24/2022] Open
Abstract
Background: Clustering and feature selection play major roles in many research communities. As a matrix factorization method, Low-Rank Representation (LRR) has attracted much attention in clustering and feature selection, but its performance sometimes suffers when the data samples are insufficient or contain a lot of noise. Results: To address this drawback, a novel LRR model named TGLRR is proposed by integrating the truncated nuclear norm with graph-Laplacian regularization. Unlike the nuclear norm, which minimizes all singular values, the truncated nuclear norm minimizes only the smallest singular values, which avoids harmful shrinkage of the leading singular values. Finally, an efficient algorithm based on the Linearized Alternating Direction Method with Adaptive Penalty is applied to solve the optimization problem. Conclusions: The results show that the TGLRR method outperforms existing state-of-the-art methods in tumor clustering and gene selection on integrated gene expression data.
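The difference between the two norms in this abstract can be illustrated with a small numerical sketch. This is not the TGLRR algorithm; the function names and the parameter `r` (the number of leading singular values left out of the sum) are illustrative assumptions, and only NumPy's standard `np.linalg.svd` is used.

```python
import numpy as np

def nuclear_norm(X):
    """Sum of all singular values of X."""
    return float(np.linalg.svd(X, compute_uv=False).sum())

def truncated_nuclear_norm(X, r):
    """Sum of the singular values of X after skipping the r largest ones.

    `r` is a hypothetical parameter name chosen for this sketch.
    """
    s = np.linalg.svd(X, compute_uv=False)  # singular values, descending order
    return float(s[r:].sum())

# Toy diagonal matrix whose singular values are 5, 3 and 1.
X = np.diag([5.0, 3.0, 1.0])
full = nuclear_norm(X)                   # counts 5 + 3 + 1
trunc = truncated_nuclear_norm(X, r=2)   # counts only the smallest value
```

Minimizing the truncated quantity leaves the two leading singular values unpenalized, which is the property the abstract credits for avoiding shrinkage of the dominant structure.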
3
Yu N, Wu MJ, Liu JX, Zheng CH, Xu Y. Correntropy-Based Hypergraph Regularized NMF for Clustering and Feature Selection on Multi-Cancer Integrated Data. IEEE Transactions on Cybernetics 2021; 51:3952-3963. [PMID: 32603306 DOI: 10.1109/tcyb.2020.3000799] [Citation(s) in RCA: 21] [Impact Index Per Article: 5.3] [Indexed: 06/11/2023]
Abstract
Non-negative matrix factorization (NMF) has become one of the most powerful methods for clustering and feature selection. However, the performance of the traditional NMF method degrades severely when the data contain noise and outliers or when the manifold structure of the data is not taken into account. In this article, a novel method called correntropy-based hypergraph regularized NMF (CHNMF) is proposed to solve these problems. Specifically, we use correntropy instead of the Euclidean norm in the loss term of CHNMF, which improves the robustness of the algorithm. A hypergraph regularization term is also added to the objective function, which can explore high-order geometric information among multiple sample points. Then, the half-quadratic (HQ) optimization technique is adopted to solve the complex optimization problem of CHNMF. Finally, extensive experimental results on multi-cancer integrated data indicate that the proposed CHNMF method is superior to other state-of-the-art methods for clustering and feature selection.
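Why a correntropy-style loss is more robust than a squared Euclidean loss can be seen in a short sketch. The abstract does not give the exact objective, so the form below (a Gaussian-kernel loss, `1 - exp(-e^2 / (2 sigma^2))` per residual) and the names `correntropy_loss` and `sigma` are assumptions made for illustration only.

```python
import math

def squared_loss(residuals):
    """Sum of squared residuals; a single outlier can dominate it."""
    return sum(e * e for e in residuals)

def correntropy_loss(residuals, sigma=1.0):
    """Sum of 1 - exp(-e^2 / (2 sigma^2)); each term is bounded by 1.

    Assumed Gaussian-kernel form, not taken from the CHNMF paper.
    """
    return sum(1.0 - math.exp(-(e * e) / (2.0 * sigma ** 2)) for e in residuals)

clean = [0.1, -0.2, 0.1]
with_outlier = clean + [50.0]  # one grossly corrupted entry

sq_increase = squared_loss(with_outlier) - squared_loss(clean)
ce_increase = correntropy_loss(with_outlier) - correntropy_loss(clean)
```

The outlier adds 2500 to the squared loss but at most 1 to the bounded loss, so it cannot dominate the fit.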
4
Zhao YY, Jiao CN, Wang ML, Liu JX, Wang J, Zheng CH. HTRPCA: Hypergraph Regularized Tensor Robust Principal Component Analysis for Sample Clustering in Tumor Omics Data. Interdiscip Sci 2021; 14:22-33. [PMID: 34115312 DOI: 10.1007/s12539-021-00441-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/12/2021] [Revised: 04/27/2021] [Accepted: 05/15/2021] [Indexed: 12/24/2022]
Abstract
In recent years, clustering analysis of cancer genomics data has gained widespread attention. However, limited by the dimensions of the matrix, traditional methods cannot fully mine the underlying geometric structure information in the data. In addition, noise and outliers inevitably exist in the data. To solve these two problems, we propose a new method that uses a tensor to represent cancer omics data and applies a hypergraph to preserve the geometric structure information of the original data. This model is called hypergraph regularized tensor robust principal component analysis (HTRPCA). HTRPCA splits the data into two parts: a low-rank component that contains the clean underlying structure information between samples, and a sparse component of interference points. The low-rank component can therefore be used for clustering. Thanks to the hypergraph regularization, the model can retain complex geometric information among more sample points. Clustering experiments on TCGA datasets demonstrate the effectiveness of HTRPCA and show that it outperforms other advanced methods. This paper proposes a new method of using tensors to represent cancer omics data and introduces hypergraph terms to preserve the geometric structure information of the original data. At the same time, the model decomposes the original tensor into a low-rank tensor and a sparse tensor. The low-rank tensor was used to cluster cancer samples to verify the effectiveness of the method.
Affiliation(s)
- Yu-Ying Zhao: School of Computer Science, Qufu Normal University, Rizhao, China
- Cui-Na Jiao: School of Computer Science, Qufu Normal University, Rizhao, China
- Mao-Li Wang: School of Computer Science, Qufu Normal University, Rizhao, China
- Jin-Xing Liu: School of Computer Science, Qufu Normal University, Rizhao, China; Rizhao Huilian Zhongchuang Institute of Intelligent Technology, Rizhao, 276826, China
- Juan Wang: School of Computer Science, Qufu Normal University, Rizhao, China
- Chun-Hou Zheng: School of Computer Science, Qufu Normal University, Rizhao, China
5
Yu N, Gao YL, Liu JX, Wang J, Shang J. Robust hypergraph regularized non-negative matrix factorization for sample clustering and feature selection in multi-view gene expression data. Hum Genomics 2019; 13:46. [PMID: 31639067 PMCID: PMC6805321 DOI: 10.1186/s40246-019-0222-6] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Indexed: 12/02/2022] Open
Abstract
Background: As one of the most popular data representation methods, non-negative matrix factorization (NMF) has received wide attention in clustering and feature selection tasks. However, most previously proposed NMF-based methods do not adequately explore the hidden geometrical structure in the data. At the same time, noise and outliers are inevitably present in the data. Results: To alleviate these problems, we present a novel NMF framework named robust hypergraph regularized non-negative matrix factorization (RHNMF). In particular, hypergraph Laplacian regularization is imposed to capture the geometric information of the original data. Unlike graph Laplacian regularization, which captures the relationship between pairs of sample points, hypergraph Laplacian regularization captures the high-order relationship among more sample points. Moreover, the robustness of RHNMF is enhanced by using the L2,1-norm constraint when estimating the residual, because the L2,1-norm is insensitive to noise and outliers. Conclusions: Clustering and common abnormal expression gene (com-abnormal expression gene) selection are conducted to test the validity of the RHNMF model. Extensive experimental results on multi-view datasets reveal that our proposed model outperforms other state-of-the-art methods.
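The L2,1-norm named in this abstract can be computed in a few lines. The sketch below assumes the row-wise convention (the sum of the Euclidean norms of the rows); some papers group by columns instead, and the abstract does not say which is used, so the grouping axis and the function names are assumptions.

```python
import math

def l21_norm(M):
    """Sum over rows of the Euclidean norm of each row (assumed convention)."""
    return sum(math.sqrt(sum(v * v for v in row)) for row in M)

def frobenius_sq(M):
    """Squared Frobenius norm, the usual quadratic residual, for comparison."""
    return sum(v * v for row in M for v in row)

# Toy residual matrix: each row is the error of one hypothetical sample.
R = [[3.0, 4.0],   # row norm 5
     [0.0, 0.0],   # row norm 0
     [6.0, 8.0]]   # row norm 10

l21 = l21_norm(R)       # 5 + 0 + 10
fro = frobenius_sq(R)   # 25 + 0 + 100
```

Because each row contributes its norm rather than its squared norm, a heavily corrupted row grows the L2,1 penalty linearly instead of quadratically, which is the usual explanation for its lower sensitivity to outliers.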
Affiliation(s)
- Na Yu: School of Information Science and Engineering, Qufu Normal University, Rizhao, 276826, China
- Ying-Lian Gao: Library of Qufu Normal University, Qufu Normal University, Rizhao, 276826, China
- Jin-Xing Liu: School of Information Science and Engineering, Qufu Normal University, Rizhao, 276826, China
- Juan Wang: School of Information Science and Engineering, Qufu Normal University, Rizhao, 276826, China
- Junliang Shang: School of Information Science and Engineering, Qufu Normal University, Rizhao, 276826, China
6
A Random Walk Based Cluster Ensemble Approach for Data Integration and Cancer Subtyping. Genes (Basel) 2019; 10:genes10010066. [PMID: 30669418 PMCID: PMC6356971 DOI: 10.3390/genes10010066] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 11/27/2018] [Revised: 01/11/2019] [Accepted: 01/14/2019] [Indexed: 12/17/2022] Open
Abstract
The availability of diverse types of high-throughput data increases the opportunities for researchers to develop computational methods that provide a more comprehensive view of the mechanisms and therapy of cancer. One fundamental goal of oncology is to divide patients into subtypes with clinical and biological significance. Cluster ensemble methods fit this task well: they can improve the performance and robustness of clustering results by combining multiple basic clustering results. However, many existing cluster ensemble methods use a co-association matrix to summarize the co-occurrence statistics of the instance-cluster relationship, so the relationship in the integration is only encapsulated at a rough level. Moreover, the relationship among clusters is completely ignored. Finding these missing associations could greatly expand the ability of cluster ensemble methods for cancer subtyping. In this paper, we propose RWCE (Random Walk based Cluster Ensemble) to consider similarity among clusters. We first obtain a refined similarity between clusters by using a random walk and a scaled exponential similarity kernel. Then, after being modeled as a bipartite graph, a more informative instance-cluster association matrix filled with this cluster similarity is fed into a spectral clustering algorithm to get the final clustering result. We applied our method to six cancer types from The Cancer Genome Atlas (TCGA) and to breast cancer data from the Molecular Taxonomy of Breast Cancer International Consortium (METABRIC). Experimental results show that our method is competitive with existing methods. A further case study demonstrates that our method has the potential to find subtypes with clinical and biological significance.
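The co-association matrix that this abstract describes as the conventional summary can be sketched as follows. This illustrates the baseline being criticized, not RWCE itself; the function name and the toy label vectors are made up for the example.

```python
def co_association(base_clusterings):
    """Fraction of base clusterings in which each pair of samples shares a cluster.

    `base_clusterings` is a list of label lists, one per base clustering,
    all covering the same samples in the same order.
    """
    n = len(base_clusterings[0])
    m = len(base_clusterings)
    C = [[0.0] * n for _ in range(n)]
    for labels in base_clusterings:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    C[i][j] += 1.0 / m
    return C

# Two hypothetical base clusterings of four samples.
clusterings = [
    [0, 0, 1, 1],
    [0, 0, 0, 1],
]
C = co_association(clusterings)
```

Each entry records only whether two samples landed in the same cluster. How similar two different clusters are never enters the matrix, which is the gap the abstract says RWCE addresses.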
7
Li H, Li SJ, Shang J, Liu JX, Zheng CH. A Dynamic Scale-Free Network Particle Swarm Optimization for Extracting Features on Multi-Omics Data. J Comput Biol 2018; 26:769-781. [PMID: 30495971 DOI: 10.1089/cmb.2018.0185] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 12/17/2022] Open
Abstract
Mining a meaningful and comprehensive molecular characterization of cancers from The Cancer Genome Atlas (TCGA) data has become a bioinformatics bottleneck. Meanwhile, recent progress in cancer analysis shows that multi-omics data can effectively and systematically detect cancer-related genes at all levels. In this study, we propose an improved particle swarm optimization with a dynamic scale-free network, named DSFPSO, to extract features from multi-omics data. The highlights of DSFPSO are that it takes the dynamic scale-free network as its population structure and uses diverse velocity updating strategies to fully consider the heterogeneity of particles and their neighbors. DSFPSO is evaluated and compared with several state-of-the-art feature extraction approaches on two public data sets from TCGA. Results show that DSFPSO can effectively extract genes associated with cancers.
Affiliation(s)
- Huiyu Li: School of Information Science and Engineering, Qufu Normal University, Rizhao, China
- Sheng-Jun Li: School of Information Science and Engineering, Qufu Normal University, Rizhao, China
- Junliang Shang: School of Information Science and Engineering, Qufu Normal University, Rizhao, China; School of Statistics, Qufu Normal University, Qufu, China
- Jin-Xing Liu: School of Information Science and Engineering, Qufu Normal University, Rizhao, China
- Chun-Hou Zheng: School of Computer Science and Technology, Anhui University, Hefei, China
8
Yu N, Gao YL, Liu JX, Shang J, Zhu R, Dai LY. Co-differential Gene Selection and Clustering Based on Graph Regularized Multi-View NMF in Cancer Genomic Data. Genes (Basel) 2018; 9:E586. [PMID: 30487464 PMCID: PMC6315625 DOI: 10.3390/genes9120586] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.1] [Received: 10/25/2018] [Revised: 11/13/2018] [Accepted: 11/26/2018] [Indexed: 12/19/2022] Open
Abstract
Cancer genomic data contain views from different sources that provide complementary information about genetic activity, which opens a new avenue for cancer research. Feature selection and multi-view clustering are active topics in bioinformatics, and both can exploit this complementary information to improve their results. In this paper, a novel integrated model called Multi-view Non-negative Matrix Factorization (MvNMF) is proposed for the selection of common differential genes (co-differential genes) and multi-view clustering. To encode the geometric information in multi-view genomic data, graph regularized MvNMF (GMvNMF) is further proposed by applying a graph regularization constraint in the objective function. GMvNMF can not only obtain the potential shared feature structure and shared cluster group structure, but also capture the manifold structure of multi-view data. The validity of the proposed GMvNMF method was tested on four multi-view genomic datasets. Experimental results showed that the GMvNMF method performs better than other representative methods.
Affiliation(s)
- Na Yu: School of Information Science and Engineering, Qufu Normal University, Rizhao 276826, China
- Ying-Lian Gao: Library of Qufu Normal University, Qufu Normal University, Rizhao 276826, China
- Jin-Xing Liu: School of Information Science and Engineering, Qufu Normal University, Rizhao 276826, China
- Junliang Shang: School of Information Science and Engineering, Qufu Normal University, Rizhao 276826, China
- Rong Zhu: School of Information Science and Engineering, Qufu Normal University, Rizhao 276826, China
- Ling-Yun Dai: School of Information Science and Engineering, Qufu Normal University, Rizhao 276826, China
9
Yang C, Ge SG, Zheng CH. ndmaSNF: cancer subtype discovery based on integrative framework assisted by network diffusion model. Oncotarget 2017; 8:89021-89032. [PMID: 29179495 PMCID: PMC5687665 DOI: 10.18632/oncotarget.21643] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.6] [Received: 07/03/2017] [Accepted: 08/27/2017] [Indexed: 12/25/2022] Open
Abstract
With the rapid progress of high-throughput sequencing technology, diverse genomic data have become easy to obtain. To exploit the value of those data effectively, integrative methods are urgently needed. In this paper, based on SNF (Similarity Network Fusion) [1], we propose a new integrative method named ndmaSNF (network diffusion model assisted SNF), which can be used for cancer subtype discovery and has the advantage of making use of somatic mutation data and other discrete data. First, we apply a network diffusion model to the mutation data to make it smoothed and adaptive. Then, the mutation data along with other data types are used in the SNF framework by constructing patient-by-patient similarity networks for each data type. Finally, a fused patient network containing the information from all input data types is obtained by using a nonlinear iterative method. The fused network can be used for cancer subtype discovery through a clustering algorithm. Experimental results on four cancer datasets showed that our ndmaSNF method can find subtypes with significant differences in survival profile and other clinical features.
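The smoothing step described in this abstract can be illustrated with a generic network propagation sketch. The abstract does not specify the diffusion model, so the update rule below (a random walk with restart, `f <- alpha * f W + (1 - alpha) * f0`), the parameter `alpha`, and the three-gene toy network are all assumptions rather than the ndmaSNF procedure.

```python
def normalize_rows(A):
    """Divide each row of A by its sum so that every non-empty row sums to 1."""
    out = []
    for row in A:
        s = sum(row)
        out.append([v / s if s else 0.0 for v in row])
    return out

def diffuse(f0, A, alpha=0.5, iters=100):
    """Iterate f <- alpha * f W + (1 - alpha) * f0, with W the row-normalized A.

    `alpha` (how far signal spreads) is a hypothetical tuning parameter.
    """
    W = normalize_rows(A)
    n = len(f0)
    f = list(f0)
    for _ in range(iters):
        f = [alpha * sum(f[i] * W[i][j] for i in range(n)) + (1 - alpha) * f0[j]
             for j in range(n)]
    return f

# Toy gene network: gene0 - gene1 - gene2 (a path), with only gene0 mutated.
A = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
mutations = [1.0, 0.0, 0.0]
smoothed = diffuse(mutations, A)
```

After diffusion the binary mutation profile becomes a continuous one in which neighbors of the mutated gene carry nonzero scores, so two patients with mutations in adjacent genes look more alike than they do in the raw 0/1 data.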
Affiliation(s)
- Chao Yang: College of Computer Science and Technology, Anhui University, Hefei, Anhui 230601, China
- Shu-Guang Ge: College of Electrical Engineering and Automation, Anhui University, Hefei, Anhui 230601, China
- Chun-Hou Zheng: College of Computer Science and Technology, Anhui University, Hefei, Anhui 230601, China
10
Wang YX, Gao YL, Liu JX, Kong XZ, Li HJ. Robust Principal Component Analysis Regularized by Truncated Nuclear Norm for Identifying Differentially Expressed Genes. IEEE Trans Nanobioscience 2017; 16:447-454. [PMID: 28692983 DOI: 10.1109/tnb.2017.2723439] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Indexed: 11/09/2022]
Abstract
Identifying differentially expressed genes among thousands of genes is a challenging task. Robust principal component analysis (RPCA) is an efficient method for identifying differentially expressed genes. The RPCA method uses the nuclear norm to approximate the rank function. However, theoretical studies have shown that the nuclear norm minimizes all singular values, so it may not be the best approximation of the rank function. The truncated nuclear norm is defined as the sum of the smaller singular values only, which may approximate the rank function better than the nuclear norm. In this paper, a novel method is proposed by replacing the nuclear norm of RPCA with the truncated nuclear norm; it is named robust principal component analysis regularized by truncated nuclear norm (TRPCA). The method decomposes the observation matrix of genomic data into a low-rank matrix and a sparse matrix. Because the significant genes can be considered sparse signals, the differentially expressed genes are viewed as sparse perturbation signals. Thus, the differentially expressed genes can be identified from the sparse matrix. The experimental results on The Cancer Genome Atlas data illustrate that the TRPCA method outperforms other state-of-the-art methods in the identification of differentially expressed genes.
11
Joint L1/2-Norm Constraint and Graph-Laplacian PCA Method for Feature Extraction. BioMed Research International 2017; 2017:5073427. [PMID: 28470011 PMCID: PMC5392409 DOI: 10.1155/2017/5073427] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Received: 12/30/2016] [Revised: 02/12/2017] [Accepted: 03/01/2017] [Indexed: 01/05/2023]
Abstract
Principal Component Analysis (PCA) is widely used in many areas as a tool for dimensionality reduction. In bioinformatics, each variable involved corresponds to a specific gene. To improve the robustness of PCA-based methods, this paper proposes a novel graph-Laplacian PCA algorithm that adopts an L1/2 constraint on the error function (L1/2 gLPCA) for feature (gene) extraction. The error function based on the L1/2-norm helps to reduce the influence of outliers and noise. The Augmented Lagrange Multipliers (ALM) method is applied to solve the subproblem. The method obtains better feature extraction results than other state-of-the-art PCA-based methods. Extensive experimental results on simulation data and gene expression data sets demonstrate that our method achieves higher identification accuracy than the others.
12
Feng CM, Gao YL, Liu JX, Zheng CH, Yu J. PCA Based on Graph Laplacian Regularization and P-Norm for Gene Selection and Clustering. IEEE Trans Nanobioscience 2017; 16:257-265. [PMID: 28371780 DOI: 10.1109/tnb.2017.2690365] [Citation(s) in RCA: 26] [Impact Index Per Article: 3.3] [Indexed: 11/09/2022]
Abstract
In modern molecular biology, identifying characteristic genes from gene expression data is both a research hotspot and a difficult problem. Principal component analysis (PCA), a traditional matrix decomposition method based on minimizing reconstruction error, uses a quadratic error function, which is known to be sensitive to outliers and noise. Hence, a PCA method that remains reliable when outliers and noise exist is needed. In this paper, we develop a novel PCA method, called PgLPCA, that enforces a P-norm on the error function and adds a graph-Laplacian regularization term to the matrix decomposition problem. The core of the method, designed to reduce the influence of outliers and noise, is a new error function based on the non-convex proximal P-norm. In addition, the Laplacian regularization term is used to find the internal geometric structure in the data representation. To solve the minimization problem, we develop an efficient optimization algorithm based on the augmented Lagrange multiplier method. The method is used to select characteristic genes and cluster samples from rapidly growing biological data, and it achieves higher accuracy than the compared methods.
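The graph-Laplacian regularization term mentioned in this abstract rests on a standard identity: for a symmetric weight matrix W with Laplacian L = D - W, the trace term tr(Z^T L Z) equals half the weighted sum of squared distances between embedded samples. The sketch below checks that identity on toy numbers; the weights, the one-dimensional embedding `Z`, and the function names are illustrative assumptions, not values or code from PgLPCA.

```python
def laplacian(W):
    """Graph Laplacian L = D - W, where D holds the row sums of W on its diagonal."""
    n = len(W)
    return [[(sum(W[i]) if i == j else 0.0) - W[i][j] for j in range(n)]
            for i in range(n)]

def smoothness_trace(Z, L):
    """tr(Z^T L Z) for Z given as a list of n sample rows with k coordinates each."""
    n, k = len(Z), len(Z[0])
    return sum(Z[i][c] * L[i][j] * Z[j][c]
               for i in range(n) for j in range(n) for c in range(k))

def smoothness_pairwise(Z, W):
    """0.5 * sum_ij W_ij * ||z_i - z_j||^2, the pairwise form of the same penalty."""
    n = len(Z)
    return 0.5 * sum(W[i][j] * sum((a - b) ** 2 for a, b in zip(Z[i], Z[j]))
                     for i in range(n) for j in range(n))

# Toy symmetric similarity graph over three samples and a 1-D embedding.
W = [[0.0, 1.0, 0.0],
     [1.0, 0.0, 2.0],
     [0.0, 2.0, 0.0]]
Z = [[0.0], [1.0], [3.0]]

trace_value = smoothness_trace(Z, laplacian(W))
pairwise_value = smoothness_pairwise(Z, W)
```

Both forms give the same number, so minimizing the trace term pulls samples with large similarity weights close together in the low-dimensional representation.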