1
Cai J, Hao J, Yang H, Zhao X, Yang Y. A Review on Semi-supervised Clustering. Inf Sci (N Y) 2023. [DOI: 10.1016/j.ins.2023.02.088]
2
He K, Massena DG. Examining unsupervised ensemble learning using spectroscopy data of organic compounds. J Comput Aided Mol Des 2023; 37:17-37. [PMID: 36404382] [DOI: 10.1007/s10822-022-00488-9]
Abstract
One solution to the challenge of choosing an appropriate clustering algorithm is to combine different clusterings into a single consensus clustering result, known as a cluster ensemble (CE). This ensemble learning strategy can provide more robust and stable solutions across different domains and datasets. Unfortunately, not all clusterings in the ensemble contribute to the final data partition. Cluster ensemble selection (CES) aims at selecting a subset from a large library of clustering solutions to form a smaller cluster ensemble that performs as well as, or better than, the set of all available clustering solutions. In this paper, we investigate four CES methods for the categorization of structurally distinct organic compounds using high-dimensional IR and Raman spectroscopy data. The single quality selection (SQI) method is used with various quality indices to form subsets from the highest-quality ensemble members. The Bagging method, usually applied in supervised learning, ranks ensemble members by calculating the normalized mutual information (NMI) between ensemble members and consensus solutions generated from randomly sampled subsets of the full ensemble. The hierarchical cluster and select method (HCAS-SQI) uses the diversity matrix of ensemble members to select a diverse set of high-quality ensemble members. Furthermore, a combining strategy can be used to merge subsets selected using multiple quality indices (HCAS-MQI) for the refinement of clustering solutions in the ensemble. The IR + Raman hybrid ensemble library is created by merging two complementary "views" of the organic compounds. This inherently more diverse library gives the best full-ensemble consensus results. Overall, the Bagging method is recommended because it provides the most robust results, which are better than or comparable to the full-ensemble consensus solutions.
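The bagging-style ranking described above can be sketched in a few lines. The snippet below is a simplified illustration, not the authors' implementation: NMI is computed from a contingency table, and each bag's consensus is approximated by its medoid partition (the member with the highest average NMI within the bag), an assumption made here for brevity.

```python
import numpy as np

def nmi(a, b):
    """Normalized mutual information between two hard partitions."""
    a, b = np.asarray(a), np.asarray(b)
    n = a.size
    _, ia = np.unique(a, return_inverse=True)
    _, ib = np.unique(b, return_inverse=True)
    c = np.zeros((ia.max() + 1, ib.max() + 1))
    np.add.at(c, (ia, ib), 1.0)              # contingency table
    p = c / n
    pa, pb = p.sum(1, keepdims=True), p.sum(0, keepdims=True)
    nz = p > 0
    mi = (p[nz] * np.log(p[nz] / (pa @ pb)[nz])).sum()
    ha = -(pa[pa > 0] * np.log(pa[pa > 0])).sum()
    hb = -(pb[pb > 0] * np.log(pb[pb > 0])).sum()
    return mi / np.sqrt(ha * hb)

def bagging_rank(ensemble, rounds=10, frac=0.7, seed=0):
    """Score each member by its average NMI to consensus solutions
    built from randomly sampled bags of the ensemble."""
    rng = np.random.default_rng(seed)
    m = len(ensemble)
    scores = np.zeros(m)
    for _ in range(rounds):
        idx = rng.choice(m, size=max(2, int(frac * m)), replace=False)
        bag = [ensemble[i] for i in idx]
        # medoid partition of the bag stands in for its consensus
        avg = [np.mean([nmi(p, q) for q in bag]) for p in bag]
        consensus = bag[int(np.argmax(avg))]
        scores += [nmi(p, consensus) for p in ensemble]
    return scores / rounds

# five copies of a clean partition plus one uninformative member
base, noisy = [0] * 10 + [1] * 10, [0, 1] * 10
scores = bagging_rank([base] * 5 + [noisy])
```

Members that agree with the bagged consensus solutions accumulate high scores, while the uninformative member is ranked last and would be dropped by the selection step.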
Affiliation(s)
- Kedan He
- Department of Physical Sciences, School of Arts and Sciences, Eastern Connecticut State University, Willimantic, CT, 06226, USA.
- Djenerly G Massena
- Department of Physical Sciences, School of Arts and Sciences, Eastern Connecticut State University, Willimantic, CT, 06226, USA
3
Yu Z, Wang D, Meng XB, Chen CLP. Clustering Ensemble Based on Hybrid Multiview Clustering. IEEE Trans Cybern 2022; 52:6518-6530. [PMID: 33284761] [DOI: 10.1109/tcyb.2020.3034157]
Abstract
As an effective method for clustering applications, the clustering ensemble algorithm integrates different clustering solutions into a final one, thus improving the clustering efficiency. The key to designing the clustering ensemble algorithm is to improve the diversities of base learners and optimize the ensemble strategies. To address these problems, we propose a clustering ensemble framework that consists of three parts. First, three view transformation methods, including random principal component analysis, random nearest neighbor, and modified fuzzy extension model, are used as base learners to learn different clustering views. A random transformation and hybrid multiview learning-based clustering ensemble method (RTHMC) is then designed to synthesize the multiview clustering results. Second, a new random subspace transformation is integrated into RTHMC to enhance its performance. Finally, a view-based self-evolutionary strategy is developed to further improve the proposed method by optimizing random subspace sets. Experiments and comparisons demonstrate the effectiveness and superiority of the proposed method for clustering different kinds of data.
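A minimal sketch of the "multiple random views, one base clustering per view" idea follows. It assumes a plain random feature subspace plus a random linear projection as the view transformation; the paper's actual transformations (randomized PCA, random nearest neighbor, and a modified fuzzy extension model) are richer.

```python
import numpy as np

def tiny_kmeans(X, k, iters=50, seed=0):
    """Minimal Lloyd's k-means, used here as the base learner per view."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), k, replace=False)].copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        labels = ((X[:, None, :] - C[None]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if (labels == j).any():
                C[j] = X[labels == j].mean(0)
    return labels

def random_views(X, n_views=3, out_dim=2, seed=0):
    """One 'view' per base learner: a random feature subspace followed
    by a random linear projection (a stand-in for the paper's views)."""
    rng = np.random.default_rng(seed)
    views = []
    for _ in range(n_views):
        feats = rng.choice(X.shape[1], size=max(2, X.shape[1] // 2),
                           replace=False)
        P = rng.normal(size=(feats.size, out_dim))
        views.append(X[:, feats] @ P)
    return views

# two well-separated 4-D blobs; each view is clustered independently
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (20, 4)), rng.normal(3, 0.3, (20, 4))])
partitions = [tiny_kmeans(V, k=2, seed=i)
              for i, V in enumerate(random_views(X))]
```

The resulting per-view partitions are the raw material that a consensus function would then synthesize into a single clustering.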
4
A multi-level consensus function clustering ensemble. Soft Comput 2021. [DOI: 10.1007/s00500-021-06092-7]
5
Wang Z, Parvin H, Qasem SN, Tuan BA, Pho KH. Cluster ensemble selection using balanced normalized mutual information. J Intell Fuzzy Syst 2020. [DOI: 10.3233/jifs-191531]
Abstract
The main idea of cluster ensemble selection is to remove bad partitions from the final ensemble. Even a bad partition, however, is likely to contain some reliable clusters, so it may be reasonable to apply the selection phase at the cluster level. Doing so requires a cluster evaluation metric. Several such metrics have been introduced recently, each with its own limitations. This paper addresses the weak points of each of those methods and then introduces a new metric for cluster assessment, named the Balanced Normalized Mutual Information (BNMI) criterion, which balances the deficiency of the traditional NMI-based criteria. Additionally, an innovative cluster ensemble approach is proposed. To create the consensus partition from the elected clusters, a set of different aggregation functions (also called consensus functions) is utilized: those based upon the co-association matrix (CAM), those based on hypergraph partitioning algorithms, and those based upon an intermediate space. The experimental study indicates that the proposed cluster ensemble approach outperforms state-of-the-art cluster ensemble methods.
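Of the consensus functions listed, the co-association matrix (CAM) route is the easiest to sketch. The snippet below is a minimal evidence-accumulation illustration, assuming a simple threshold-and-connected-components consensus rather than the paper's richer aggregation functions.

```python
import numpy as np

def coassociation(partitions):
    """CAM[i, j] = fraction of partitions placing i and j together."""
    P = np.asarray(partitions)          # shape (n_partitions, n_samples)
    cam = np.zeros((P.shape[1], P.shape[1]))
    for lab in P:
        cam += (lab[:, None] == lab[None, :])
    return cam / P.shape[0]

def consensus_from_cam(cam, threshold=0.5):
    """Connected components of the graph {(i, j): cam > threshold}."""
    n = cam.shape[0]
    labels = -np.ones(n, dtype=int)
    cur = 0
    for i in range(n):
        if labels[i] >= 0:
            continue
        stack, labels[i] = [i], cur
        while stack:
            u = stack.pop()
            for v in np.nonzero((cam[u] > threshold) & (labels < 0))[0]:
                labels[v] = cur
                stack.append(v)
        cur += 1
    return labels

parts = [[0, 0, 1, 1], [0, 0, 1, 1], [1, 1, 0, 0]]
cam = coassociation(parts)
labels = consensus_from_cam(cam)
```

Note that the third partition uses different label names but induces the same grouping; pair-counting via the CAM makes the consensus invariant to such relabelings.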
Affiliation(s)
- Zecong Wang
- School of Computer Science and Cyberspace Security, Hainan University, China
- Hamid Parvin
- Institute of Research and Development, Duy Tan University, Da Nang, Vietnam
- Faculty of Information Technology, Duy Tan University, Da Nang, Vietnam
- Department of Computer Science, Nourabad Mamasani Branch, Islamic Azad University, Mamasani, Iran
- Sultan Noman Qasem
- Computer Science Department, College of Computer and Information Sciences, Al Imam Mohammad Ibn Saud Islamic University (IMSIU), Riyadh, Saudi Arabia
- Computer Science Department, Faculty of Applied Science, Taiz University, Taiz, Yemen
- Bui Anh Tuan
- Department of Mathematics Education, Teachers College, Can Tho University, Can Tho City, Vietnam
- Kim-Hung Pho
- Fractional Calculus, Optimization and Algebra Research Group, Faculty of Mathematics and Statistics, Ton Duc Thang University, Ho Chi Minh City, Vietnam
6
Abdulla M, Khasawneh MT. G-Forest: An ensemble method for cost-sensitive feature selection in gene expression microarrays. Artif Intell Med 2020; 108:101941. [DOI: 10.1016/j.artmed.2020.101941]
7
Mahmoudi MR, Akbarzadeh H, Parvin H, Nejatian S, Rezaie V, Alinejad-Rokny H. Consensus function based on cluster-wise two level clustering. Artif Intell Rev 2020. [DOI: 10.1007/s10462-020-09862-1]
8
Li G, Mahmoudi MR, Qasem SN, Tuan BA, Pho KH. Cluster ensemble of valid small clusters. J Intell Fuzzy Syst 2020. [DOI: 10.3233/jifs-191530]
Affiliation(s)
- Guang Li
- Institute of Data Science, City University of Macau, Macau
- Mohammad Reza Mahmoudi
- Institute of Research and Development, Duy Tan University, Da Nang, Vietnam
- Department of Statistics, Faculty of Science, Fasa University, Fasa, Iran
- Sultan Noman Qasem
- Department of Computer Science, College of Computer and Information Sciences, Al Imam Mohammad Ibn Saud Islamic University (IMSIU), Riyadh, Saudi Arabia
- Department of Computer Science, Faculty of Applied Science, Taiz University, Taiz, Yemen
- Bui Anh Tuan
- Department of Mathematics Education, Teachers College, Can Tho University, Can Tho City, Vietnam
- Kim-Hung Pho
- Fractional Calculus, Optimization and Algebra Research Group, Faculty of Mathematics and Statistics, Ton Duc Thang University, Ho Chi Minh City, Vietnam
9
Shi Y, Yu Z, Chen CLP, You J, Wong HS, Wang Y, Zhang J. Transfer Clustering Ensemble Selection. IEEE Trans Cybern 2020; 50:2872-2885. [PMID: 30596592] [DOI: 10.1109/tcyb.2018.2885585]
Abstract
Clustering ensemble (CE) takes multiple clustering solutions into consideration in order to effectively improve the accuracy and robustness of the final result. To reduce redundancy as well as noise, a CE selection (CES) step is added to further enhance performance. Quality and diversity are two important metrics of CES. However, most CES strategies adopt heuristic selection methods or a threshold parameter setting to achieve a tradeoff between quality and diversity. In this paper, we propose a transfer CES (TCES) algorithm which makes use of the relationship between quality and diversity in a source dataset and transfers it to a target dataset based on three objective functions. Furthermore, a multiobjective self-evolutionary process is designed to optimize these three objective functions. Finally, we construct a transfer CE framework (TCE-TCES) based on TCES to obtain better clustering results. The experimental results on 12 transfer clustering tasks obtained from the 20newsgroups dataset show that TCE-TCES can find a better tradeoff between quality and diversity and obtain more desirable clustering results.
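The quality and diversity metrics at the heart of CES can be illustrated with a small sketch. Here agreement is measured with the pair-counting Rand index as a stand-in for the NMI-style measures usually used in this literature; member quality is its mean agreement with the rest of the ensemble, and diversity is one minus the mean pairwise agreement.

```python
import numpy as np

def rand_index(a, b):
    """Fraction of sample pairs on which two partitions agree."""
    a, b = np.asarray(a), np.asarray(b)
    sa = a[:, None] == a[None, :]
    sb = b[:, None] == b[None, :]
    iu = np.triu_indices(a.size, 1)
    return (sa[iu] == sb[iu]).mean()

def quality_and_diversity(ensemble):
    """Quality of each member = mean agreement with the other members;
    ensemble diversity = 1 - mean pairwise agreement."""
    m = len(ensemble)
    agree = np.array([[rand_index(p, q) for q in ensemble]
                      for p in ensemble])
    quality = (agree.sum(1) - 1.0) / (m - 1)      # exclude self-agreement
    iu = np.triu_indices(m, 1)
    diversity = 1.0 - agree[iu].mean()
    return quality, diversity

q_same, d_same = quality_and_diversity([[0, 0, 1, 1]] * 3)
q_mix, d_mix = quality_and_diversity([[0, 0, 1, 1],
                                      [0, 0, 1, 1],
                                      [0, 1, 0, 1]])
```

An ensemble of identical members has maximal quality and zero diversity; CES methods such as TCES search for subsets that balance the two rather than maximizing either alone.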
10
Abstract
Clustering ensemble refers to an approach in which a number of (usually weak) base clusterings are performed and their consensus clustering is used as the final clustering. Given that democratic decisions are better than dictatorial ones, it may seem clear that ensemble (here, clustering ensemble) decisions are better than simple model (here, clustering) decisions; however, it is not guaranteed that every ensemble is better than a simple model. An ensemble is better when its members are valid or high-quality and when they participate in constructing the consensus clustering according to their qualities. In this paper, we propose a clustering ensemble framework that uses a simple clustering algorithm based on the k-medoids clustering algorithm. Our simple clustering algorithm guarantees that the discovered clusters are valid. It is also guaranteed that our clustering ensemble framework uses a mechanism that weights each discovered cluster according to its quality. To implement this mechanism, an auxiliary ensemble named the reference set is created by running several k-means clustering algorithms.
11
Yu Z, Zhang Y, Chen CLP, You J, Wong HS, Dai D, Wu S, Zhang J. Multiobjective Semisupervised Classifier Ensemble. IEEE Trans Cybern 2019; 49:2280-2293. [PMID: 29993923] [DOI: 10.1109/tcyb.2018.2824299]
Abstract
Classification of high-dimensional data with very limited labels is a challenging task in the field of data mining and machine learning. In this paper, we propose the multiobjective semisupervised classifier ensemble (MOSSCE) approach to address this challenge. Specifically, a multiobjective subspace selection process (MOSSP) in MOSSCE is first designed to generate the optimal combination of feature subspaces. Three objective functions are then proposed for MOSSP, which include the relevance of features, the redundancy between features, and the data reconstruction error. Then, MOSSCE generates an auxiliary training set based on the sample confidence to improve the performance of the classifier ensemble. Finally, the training set, combined with the auxiliary training set, is used to select the optimal combination of basic classifiers in the ensemble, train the classifier ensemble, and generate the final result. In addition, diversity analysis of the ensemble learning process is applied, and a set of nonparametric statistical tests is adopted for the comparison of semisupervised classification approaches on multiple datasets. The experiments on 12 gene expression datasets and two large image datasets show that MOSSCE has a better performance than other state-of-the-art semisupervised classifiers on high-dimensional data.
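The three MOSSP objectives can be sketched with simple surrogates, assuming absolute Pearson correlation for relevance and redundancy and a least-squares reconstruction for the third objective; the paper's exact measures differ.

```python
import numpy as np

def subspace_objectives(X, y, feats):
    """Three scores for a candidate feature subspace: relevance
    (mean |corr| with the labels), redundancy (mean |corr| between
    distinct subspace features), and relative reconstruction error
    of a least-squares reconstruction of X from the subspace."""
    S = X[:, feats]
    rel = np.mean([abs(np.corrcoef(S[:, j], y)[0, 1])
                   for j in range(S.shape[1])])
    C = np.abs(np.corrcoef(S, rowvar=False))
    red = (C.sum() - len(feats)) / (len(feats) * (len(feats) - 1))
    W, *_ = np.linalg.lstsq(S, X, rcond=None)
    err = np.linalg.norm(X - S @ W) / np.linalg.norm(X)
    return rel, red, err

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 60).astype(float)
X = np.column_stack([y + 0.05 * rng.normal(size=60),   # relevant
                     y + 0.05 * rng.normal(size=60),   # relevant, redundant
                     rng.normal(size=60)])             # irrelevant
rel_a, red_a, _ = subspace_objectives(X, y, [0, 1])
rel_b, red_b, _ = subspace_objectives(X, y, [0, 2])
```

A multiobjective search such as MOSSP would trade these scores off against each other rather than collapsing them into a single criterion.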
12
Yu Z, Wang D, Zhao Z, Chen CLP, You J, Wong HS, Zhang J. Hybrid Incremental Ensemble Learning for Noisy Real-World Data Classification. IEEE Trans Cybern 2019; 49:403-416. [PMID: 29990215] [DOI: 10.1109/tcyb.2017.2774266]
Abstract
Traditional ensemble learning approaches explore the feature space and the sample space separately, which prevents them from constructing more powerful learning models for noisy real-world dataset classification: the random subspace method only searches over selections of features, while the bagging approach only searches over selections of samples. To overcome these limitations, we propose the hybrid incremental ensemble learning (HIEL) approach, which considers the feature space and the sample space simultaneously to handle noisy datasets. Specifically, HIEL first adopts the bagging technique and linear discriminant analysis to remove noisy attributes, and generates a set of bootstraps and the corresponding ensemble members in the subspaces. Then, the classifiers are selected incrementally based on a classifier-specific criterion function and an ensemble criterion function, and the corresponding weights for the classifiers are assigned during the same process. Finally, the label is obtained by a weighted voting scheme, which serves as the final result of the classification. We also explore various classifier-specific criterion functions based on different newly proposed similarity measures, which alleviate the effect of noisy samples on the distance functions. In addition, the computational cost of HIEL is analyzed theoretically. A set of nonparametric tests is adopted to compare HIEL and other algorithms over several datasets. The experimental results show that HIEL performs well on noisy datasets, outperforming most of the compared classifier ensemble methods on 14 out of 24 noisy real-world UCI and KEEL datasets.
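The bootstrap-plus-subspace core of such a hybrid ensemble can be sketched as follows. A nearest-centroid base classifier and bootstrap accuracy as the member weight are both stand-ins for the classifier-specific criterion functions and weighting scheme described in the paper.

```python
import numpy as np

def fit_member(X, y, rng):
    """One member: bootstrap of the samples, random subspace of the
    features, and a nearest-centroid base classifier."""
    rows = rng.choice(len(X), size=len(X), replace=True)
    feats = rng.choice(X.shape[1], size=max(1, X.shape[1] // 2),
                       replace=False)
    Xb, yb = X[np.ix_(rows, feats)], y[rows]
    classes = np.unique(yb)
    cents = np.array([Xb[yb == c].mean(0) for c in classes])
    # weight the member by its accuracy on its own bootstrap
    pred = classes[((Xb[:, None] - cents[None]) ** 2).sum(-1).argmin(1)]
    return feats, classes, cents, (pred == yb).mean()

def predict_ensemble(members, X):
    """Weighted vote over the ensemble members."""
    votes = {}
    for feats, classes, cents, w in members:
        d = ((X[:, feats][:, None] - cents[None]) ** 2).sum(-1)
        pred = classes[d.argmin(1)]
        for c in classes:
            votes[c] = votes.get(c, np.zeros(len(X))) + w * (pred == c)
    keys = sorted(votes)
    stacked = np.vstack([votes[c] for c in keys])
    return np.array(keys)[stacked.argmax(0)]

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.4, (25, 6)), rng.normal(2, 0.4, (25, 6))])
y = np.array([0] * 25 + [1] * 25)
members = [fit_member(X, y, rng) for _ in range(7)]
acc = (predict_ensemble(members, X) == y).mean()
```

Each member sees a different resampling of both rows and columns, which is the "hybrid" exploration of sample space and feature space that the abstract contrasts with plain bagging and plain random subspaces.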
13
Yu Z, Zhang Y, You J, Chen CLP, Wong HS, Han G, Zhang J. Adaptive Semi-Supervised Classifier Ensemble for High Dimensional Data Classification. IEEE Trans Cybern 2019; 49:366-379. [PMID: 29989979] [DOI: 10.1109/tcyb.2017.2761908]
Abstract
High dimensional data classification with very limited labeled training data is a challenging task in the area of data mining. In order to tackle this task, we first propose a feature selection-based semi-supervised classifier ensemble framework (FSCE) to perform high dimensional data classification. Then, we design an adaptive semi-supervised classifier ensemble framework (ASCE) to improve the performance of FSCE. When compared with FSCE, ASCE is characterized by an adaptive feature selection process, an adaptive weighting process (AWP), and an auxiliary training set generation process (ATSGP). The adaptive feature selection process generates a set of compact subspaces based on the selected attributes obtained by the feature selection algorithms, while the AWP associates each basic semi-supervised classifier in the ensemble with a weight value. The ATSGP enlarges the training set with unlabeled samples. In addition, a set of nonparametric tests is adopted to compare multiple semi-supervised classifier ensemble (SSCE) approaches over different datasets. The experiments on 20 high dimensional real-world datasets show that: 1) the two adaptive processes in ASCE are useful for improving the performance of the SSCE approach and 2) ASCE works well on high dimensional datasets with very limited labeled training data, and outperforms most state-of-the-art SSCE approaches.
14
Laplacian regularized low-rank representation for cancer samples clustering. Comput Biol Chem 2018; 78:504-509. [PMID: 30528509] [DOI: 10.1016/j.compbiolchem.2018.11.003]
Abstract
Cancer sample clustering based on biomolecular data has become an important tool for cancer classification, and the recognition of cancer types is of great importance for cancer treatment. In this paper, in order to improve the accuracy of cancer recognition, we propose to use Laplacian regularized Low-Rank Representation (LLRR) to cluster cancer samples based on genomic data. In the LLRR method, the high-dimensional genomic data are approximately treated as samples drawn from a combination of several low-rank subspaces, and the goal is to seek the lowest-rank representation matrix based on a dictionary. Because a manifold-based Laplacian regularization is introduced into LLRR, the method can capture not only the global geometric structure, as the Low-Rank Representation (LRR) method does, but also the intrinsic local structure of high-dimensional observation data. Moreover, in LLRR the original data themselves are selected as the dictionary, so the lowest-rank representation is actually a similarity expression between the samples. Therefore, corresponding to the low-rank representation matrix, samples with high similarity are considered to come from the same subspace and are grouped into one class. The experimental results on real genomic data illustrate that the LLRR method, compared with LRR and MLLRR, is more robust to noise, has a better ability to learn the inherent subspace structure of the data, and achieves remarkable performance in the clustering of cancer samples.
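In the commonly stated form of graph-regularized low-rank representation (a sketch consistent with the abstract; the paper's exact weighting and notation may differ), the data matrix X serves as its own dictionary and the problem reads:

```latex
\min_{Z,\,E}\ \|Z\|_{*} \;+\; \lambda \|E\|_{2,1} \;+\; \beta\,\mathrm{tr}\!\left(Z L Z^{\top}\right)
\qquad \text{s.t.}\quad X = XZ + E
```

Here the nuclear norm encourages a low-rank representation Z, the l2,1 norm on E models sample-specific corruption, and L is the graph Laplacian of a nearest-neighbor graph over the samples, which pulls the representations of nearby samples together. The learned Z is then used as a sample-similarity matrix for the final clustering step.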
15
16
Zhang T. Optimized Fuzzy Clustering Algorithms for Brain MRI Image Segmentation Based on Local Gaussian Probability and Anisotropic Weight Models. Int J Pattern Recogn 2018. [DOI: 10.1142/s0218001418570057]
Abstract
Brain Magnetic Resonance Imaging (MRI) image segmentation is one of the critical technologies of clinical medicine and the basis of three-dimensional reconstruction and downstream analysis of normal and diseased tissues. However, brain MRI images suffer from various limitations, such as gray-level irregularities, noise, and low contrast, which reduce segmentation accuracy. In this paper, we propose two optimized fuzzy clustering algorithms, based on a local Gaussian probability fuzzy C-means (LGP-FCM) model and an anisotropic weight fuzzy C-means (AW-FCM) model, and apply them to brain MRI image segmentation. In the AW-FCM algorithm, a new neighborhood weight calculation method gives each point an anisotropic weight, which effectively overcomes the influence of noise on image segmentation. In addition, the LGP model is introduced into the objective function of fuzzy clustering, yielding a clustering segmentation algorithm based on an adaptive-scale fuzzy LGP model. The neighborhood scale corresponding to each pixel in the image is estimated automatically, which improves the robustness of the model and achieves precise segmentation. Extensive experimental results demonstrate that the proposed LGP-FCM algorithm outperforms comparison algorithms in terms of sensitivity, specificity, and accuracy, and can effectively segment the target regions from brain MRI images.
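Both LGP-FCM and AW-FCM extend the plain fuzzy C-means baseline, which is compact enough to sketch in full. The snippet below is standard FCM with fuzzifier m; none of the paper's local Gaussian or anisotropic-weight terms are included.

```python
import numpy as np

def fcm(X, k, m=2.0, iters=100, seed=0):
    """Plain fuzzy C-means: alternate membership and center updates."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), k))
    U /= U.sum(1, keepdims=True)
    for _ in range(iters):
        Um = U ** m
        C = (Um.T @ X) / Um.sum(0)[:, None]                 # centers
        d = np.maximum(((X[:, None] - C[None]) ** 2).sum(-1), 1e-12)
        inv = d ** (-1.0 / (m - 1.0))
        U = inv / inv.sum(1, keepdims=True)                 # memberships
    return U, C

# two well-separated 2-D blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.2, (20, 2)), rng.normal(4, 0.2, (20, 2))])
U, C = fcm(X, k=2)
labels = U.argmax(1)
```

The extensions in the paper replace the squared Euclidean term d with neighborhood-aware quantities (a local Gaussian probability term, or anisotropic neighbor weights) so that a pixel's membership also depends on its spatial context.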
Affiliation(s)
- Ting Zhang
- Software College, Fujian University of Technology, Fuzhou, Fujian 350003, P. R. China
17
Yu Z, Lu Y, Zhang J, You J, Wong HS, Wang Y, Han G. Progressive Semisupervised Learning of Multiple Classifiers. IEEE Trans Cybern 2018; 48:689-702. [PMID: 28113355] [DOI: 10.1109/tcyb.2017.2651114]
Abstract
Semisupervised learning methods are often adopted to handle datasets with a very small number of labeled samples. However, conventional semisupervised ensemble learning approaches have two limitations: 1) most of them cannot obtain satisfactory results on high dimensional datasets with limited labels and 2) they usually do not consider how to use an optimization process to enlarge the training set. In this paper, we propose the progressive semisupervised ensemble learning approach (PSEMISEL) to address these limitations. When compared with traditional semisupervised ensemble learning approaches, PSEMISEL is characterized by two properties: 1) it adopts the random subspace technique to investigate the structure of the dataset in the subspaces and 2) a progressive training set generation process and a self-evolutionary sample selection process are proposed to enlarge the training set. We also use a set of nonparametric tests to compare different semisupervised ensemble learning methods over multiple datasets. The experimental results on 18 real-world datasets from the University of California, Irvine machine learning repository show that PSEMISEL works well on most of the real-world datasets and outperforms other state-of-the-art approaches on 10 out of 18 datasets.
18
Yu Z, Wang Z, You J, Zhang J, Liu J, Wong HS, Han G. A New Kind of Nonparametric Test for Statistical Comparison of Multiple Classifiers Over Multiple Datasets. IEEE Trans Cybern 2017; 47:4418-4431. [PMID: 28113414] [DOI: 10.1109/tcyb.2016.2611020]
Abstract
Nonparametric statistical analysis, such as the Friedman test (FT), is gaining more and more attention due to its useful applications in many experimental studies. However, the traditional FT for the comparison of multiple learning algorithms on different datasets adopts a naive ranking approach: the ranking is based on the average accuracy values obtained by the set of learning algorithms on the datasets, which neither considers the differences of the results obtained by the learning algorithms on each dataset nor takes into account the performance of the learning algorithms in each run. In this paper, we first propose three ranking approaches: the weighted ranking approach, the global ranking approach (GRA), and the weighted GRA. Then, a theoretical analysis is performed to explore the properties of the proposed ranking approaches. Next, a set of modified FTs based on the proposed ranking approaches is designed for the comparison of the learning algorithms. Finally, the modified FTs are evaluated through six classifier ensemble approaches on 34 real-world datasets. The experiments show the effectiveness of the modified FTs.
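The naive-ranking FT that the paper takes as its starting point is easy to sketch: rank the algorithms by accuracy within each dataset (ties averaged), then compute average ranks and the standard Friedman chi-square statistic.

```python
import numpy as np

def friedman_naive(acc):
    """acc[i, j]: accuracy of algorithm j on dataset i.
    Returns average ranks (1 = best) and the Friedman statistic."""
    n, k = acc.shape
    ranks = np.zeros_like(acc, dtype=float)
    for i in range(n):
        order = (-acc[i]).argsort()
        r = np.empty(k)
        r[order] = np.arange(1, k + 1)
        for v in np.unique(acc[i]):          # average tied groups
            tie = acc[i] == v
            r[tie] = r[tie].mean()
        ranks[i] = r
    R = ranks.mean(0)
    chi2 = 12.0 * n / (k * (k + 1)) * ((R - (k + 1) / 2.0) ** 2).sum()
    return R, chi2

# four datasets, three algorithms; algorithm 0 always wins
acc = np.array([[0.90, 0.80, 0.70],
                [0.85, 0.75, 0.65],
                [0.95, 0.95, 0.60],   # tie between algorithms 0 and 1
                [0.88, 0.78, 0.68]])
R, chi2 = friedman_naive(acc)
```

The paper's criticism is visible in the sketch: only within-dataset rank order enters the statistic, so the magnitudes of the accuracy differences and the per-run variability are discarded, which is what the proposed weighted and global ranking variants address.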
19
Yu Z, Zhu X, Wong HS, You J, Zhang J, Han G. Distribution-Based Cluster Structure Selection. IEEE Trans Cybern 2017; 47:3554-3567. [PMID: 27254876] [DOI: 10.1109/tcyb.2016.2569529]
Abstract
The objective of cluster structure ensemble is to find a unified cluster structure from multiple cluster structures obtained from different datasets. Unfortunately, not all the cluster structures contribute to the unified cluster structure. This paper investigates how to select the suitable cluster structures in the ensemble, which are then summarized into a more representative cluster structure. Specifically, each cluster structure is first represented by a mixture of Gaussian distributions, the parameters of which are estimated using the expectation-maximization algorithm. Then, several distribution-based distance functions are designed to evaluate the similarity between two cluster structures. Based on the similarity comparison results, we propose a new approach, referred to as the distribution-based cluster structure ensemble (DCSE) framework, to find the most representative unified cluster structure. We then design a new technique, the distribution-based cluster structure selection strategy (DCSSS), to select a subset of cluster structures. Finally, we propose using a distribution-based normalized hypergraph cut algorithm to generate the final result. In our experiments, a nonparametric test is adopted to evaluate the difference between DCSE and its competitors. We adopt 20 real-world datasets obtained from the University of California, Irvine and knowledge extraction based on evolutionary learning repositories, and a number of cancer gene expression profiles, to evaluate the performance of the proposed methods. The experimental results show that: 1) DCSE works well on the real-world datasets and 2) DCSE based on DCSSS can further improve the performance of the algorithm.
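One concrete choice for a distribution-based distance between Gaussian components is the closed-form KL divergence, sketched below. The paper designs several such distance functions, so this is an illustration rather than its exact measure.

```python
import numpy as np

def kl_gauss(m0, S0, m1, S1):
    """Closed-form KL( N(m0, S0) || N(m1, S1) )."""
    k = len(m0)
    iS1 = np.linalg.inv(S1)
    d = np.asarray(m1) - np.asarray(m0)
    return 0.5 * (np.trace(iS1 @ S0) + d @ iS1 @ d - k
                  + np.log(np.linalg.det(S1) / np.linalg.det(S0)))

def sym_kl(m0, S0, m1, S1):
    """Symmetrized KL, usable as a distance between components."""
    return kl_gauss(m0, S0, m1, S1) + kl_gauss(m1, S1, m0, S0)

m, S = np.zeros(2), np.eye(2)
zero = kl_gauss(m, S, m, S)
near = sym_kl(m, S, np.array([0.5, 0.0]), np.eye(2))
far = sym_kl(m, S, np.array([3.0, 0.0]), np.eye(2))
```

Comparing mixtures component-by-component with such a distance gives a similarity between whole cluster structures, which is the quantity the DCSE selection step ranks.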
20
Lu H, Chen J, Yan K, Jin Q, Xue Y, Gao Z. A hybrid feature selection algorithm for gene expression data classification. Neurocomputing 2017. [DOI: 10.1016/j.neucom.2016.07.080]
21
Jeong YS, Snasel V, Arabnia HR, Hung JC. Fuzzy neuro theory and technologies for cloud computing. Neurocomputing 2017. [DOI: 10.1016/j.neucom.2017.02.074]
22
23
Lu H, Yang L, Yan K, Xue Y, Gao Z. A cost-sensitive rotation forest algorithm for gene expression data classification. Neurocomputing 2017. [DOI: 10.1016/j.neucom.2016.09.077]
24
Yu X, Yu G, Wang J. Clustering cancer gene expression data by projective clustering ensemble. PLoS One 2017; 12:e0171429. [PMID: 28234920] [PMCID: PMC5325197] [DOI: 10.1371/journal.pone.0171429]
Abstract
Gene expression data analysis has paramount implications for gene treatments, cancer diagnosis, and other domains, and clustering is an important and promising tool for analyzing such data. Gene expression data are often characterized by a large number of genes but a limited number of samples, so various projective clustering techniques and ensemble techniques have been suggested to combat these challenges. However, it is challenging to combine these two kinds of techniques so as to avoid the curse of dimensionality and boost the performance of gene expression data clustering. In this paper, we employ a projective clustering ensemble (PCE) to integrate the advantages of projective clustering and ensemble clustering, and to avoid the dilemma of combining multiple projective clusterings. Our experimental results on publicly available cancer gene expression data show that PCE improves the quality of clustering by at least 4.5% (on average) over other related techniques, including dimensionality reduction based single clustering and ensemble approaches. The empirical study demonstrates that, to further boost the performance of clustering cancer gene expression data, it is necessary and promising to combine projective clustering with ensemble clustering. PCE can serve as an effective alternative technique for clustering gene expression data.
Affiliation(s)
- Xianxue Yu
- College of Computer and Information Science, Southwest University, Beibei, Chongqing, China
- Guoxian Yu
- College of Computer and Information Science, Southwest University, Beibei, Chongqing, China
- Jun Wang
- College of Computer and Information Science, Southwest University, Beibei, Chongqing, China
25
Applying Cost-Sensitive Extreme Learning Machine and Dissimilarity Integration to Gene Expression Data Classification. Comput Intell Neurosci 2016; 2016:8056253. [PMID: 27642292] [PMCID: PMC5011754] [DOI: 10.1155/2016/8056253]
Abstract
Embedding cost-sensitive factors into a classifier increases classification stability and reduces classification costs when classifying large-scale, redundant, and imbalanced datasets, such as gene expression data. In this study, we extend our previous work, the Dissimilar ELM (D-ELM), by introducing misclassification costs into the classifier; we name the proposed algorithm cost-sensitive D-ELM (CS-D-ELM). Furthermore, we embed a rejection cost into CS-D-ELM to increase the classification stability of the proposed algorithm. Experimental results show that the rejection cost embedded CS-D-ELM algorithm effectively reduces the average and overall cost of the classification process, while the classification accuracy remains competitive. The proposed method can be extended to classification problems on other redundant and imbalanced data.
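A cost-sensitive ELM can be sketched as a random hidden layer followed by a cost-weighted ridge solve for the output weights. This is a generic illustration of the idea, not the authors' CS-D-ELM, which additionally integrates dissimilarity measures and rejection costs.

```python
import numpy as np

def cs_elm_fit(X, y, costs, hidden=50, reg=1e-2, seed=0):
    """ELM: fixed random hidden layer, then a cost-weighted ridge
    solve for the output weights (higher-cost samples pull harder)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(X.shape[1], hidden))
    b = rng.normal(size=hidden)
    H = np.tanh(X @ W + b)
    T = np.eye(int(np.max(y)) + 1)[y]            # one-hot targets
    Cw = np.diag(costs)                          # per-sample cost weights
    beta = np.linalg.solve(H.T @ Cw @ H + reg * np.eye(hidden),
                           H.T @ Cw @ T)
    return W, b, beta

def cs_elm_predict(model, X):
    W, b, beta = model
    return (np.tanh(X @ W + b) @ beta).argmax(1)

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(3, 0.3, (30, 2))])
y = np.array([0] * 30 + [1] * 30)
model = cs_elm_fit(X, y, costs=np.ones(60))
acc = (cs_elm_predict(model, X) == y).mean()
```

For an imbalanced problem, the costs vector would up-weight the minority or expensive-to-misclassify class instead of being uniform as in this toy example.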
26
|
Saha S, Alok AK, Ekbal A. Use of Semisupervised Clustering and Feature-Selection Techniques for Identification of Co-expressed Genes. IEEE J Biomed Health Inform 2015. [PMID: 26208367 DOI: 10.1109/jbhi.2015.2451735] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/07/2022]
Abstract
Studying the patterns hidden in gene-expression data helps in understanding the functionality of genes. In general, clustering techniques are widely used to identify natural partitionings of gene expression data. To constrain dimensionality, feature selection is a key issue, because not all features are important from the clustering point of view. Moreover, a limited amount of supervised information can help fine-tune the obtained clustering solution. In this paper, the problem of simultaneous feature selection and semisupervised clustering is formulated as a multiobjective optimization (MOO) task. A modern simulated-annealing-based MOO technique, AMOSA, is utilized as the underlying optimization methodology. Here, features and cluster centers are encoded as a string, and genes are assigned to clusters using a point-symmetry-based distance. Six optimization criteria based on several internal and external cluster validity indices are utilized. To generate the supervised information, a popular clustering technique, Fuzzy C-means, is utilized. An appropriate subset of features, the proper number of clusters, and the proper partitioning are determined using the search capability of AMOSA. The effectiveness of the proposed semisupervised clustering technique, Semi-FeaClustMOO, is demonstrated on five publicly available benchmark gene-expression datasets. Comparison with existing techniques for gene-expression data clustering again reveals the superiority of the proposed technique. Statistical and biological significance tests have also been carried out.
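The point-symmetry-based distance used for cluster assignment above follows a simple idea: reflect a point through a candidate center and check whether other data points lie near the reflection; if so, the point is "symmetric" about that center and the distance shrinks. This is a minimal sketch of that idea (averaging the `knear` nearest neighbors of the reflection, then scaling by the Euclidean distance), not the exact formulation or Kd-tree-accelerated version used in the paper.

```python
import math

def point_symmetry_distance(x, center, data, knear=2):
    """Symmetry-weighted distance of point x to a candidate cluster center.

    Reflect x through the center; average the distances from the reflection
    to its knear nearest data points, then scale by the Euclidean distance
    from x to the center. Small values mean x has a symmetric counterpart.
    """
    reflected = tuple(2 * c - xi for xi, c in zip(x, center))
    dists = sorted(math.dist(reflected, p) for p in data)
    symmetry = sum(dists[:knear]) / knear
    return symmetry * math.dist(x, center)
```

For a dataset symmetric about the origin, the distance from `(1, 0)` to the true center `(0, 0)` comes out much smaller than to an off-center candidate, which is what steers assignments toward symmetric clusters.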
27
|
Yu Z, Chen H, You J, Liu J, Wong HS, Han G, Li L. Adaptive Fuzzy Consensus Clustering Framework for Clustering Analysis of Cancer Data. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2015; 12:887-901. [PMID: 26357330 DOI: 10.1109/tcbb.2014.2359433] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/05/2023]
Abstract
Clustering analysis is one of the important research topics in cancer discovery using gene expression profiles, and it is crucial in facilitating the successful diagnosis and treatment of cancer. While a number of research works perform tumor clustering, few of them consider how to incorporate fuzzy theory together with an optimization process into a consensus clustering framework to improve the performance of clustering analysis. In this paper, we first propose a random double clustering based cluster ensemble framework (RDCCE) to perform tumor clustering based on gene expression data. Specifically, RDCCE generates a set of representative features using a randomly selected clustering algorithm in the ensemble, and then assigns samples to their corresponding clusters based on the grouping results. In addition, we introduce the random double clustering based fuzzy cluster ensemble framework (RDCFCE), which is designed to improve the performance of RDCCE by integrating the newly proposed fuzzy extension model into the ensemble framework. RDCFCE adopts the normalized cut algorithm as the consensus function to summarize the fuzzy matrices generated by the fuzzy extension models, partition the consensus matrix, and obtain the final result. Finally, adaptive RDCFCE (A-RDCFCE) is proposed to optimize RDCFCE further by adopting a self-evolutionary process (SEPP) for the parameter set. Experiments on real cancer gene expression profiles indicate that RDCFCE and A-RDCFCE work well on these datasets and outperform most state-of-the-art tumor clustering algorithms.
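The step where fuzzy matrices are summarized into a consensus matrix can be sketched generically: each fuzzy partition implies a sample-sample similarity (the inner product of membership rows), and averaging these over the ensemble yields the consensus matrix that a graph-partitioning consensus function such as normalized cut would then split. This shows only the accumulation step under that assumption, not RDCFCE's fuzzy extension model or its normalized-cut implementation.

```python
def fuzzy_coassociation(memberships):
    """Average sample-sample similarity implied by fuzzy partitions.

    memberships: list of fuzzy partitions; each is an n-samples x k-clusters
    matrix U with U[i][k] the membership of sample i in cluster k.
    Returns S with S[i][j] = mean over the ensemble of sum_k U[i][k]*U[j][k].
    """
    n = len(memberships[0])
    S = [[0.0] * n for _ in range(n)]
    for U in memberships:
        for i in range(n):
            for j in range(n):
                S[i][j] += sum(ui * uj for ui, uj in zip(U[i], U[j]))
    m = len(memberships)
    return [[v / m for v in row] for row in S]
```

For a crisp (0/1) partition this reduces to the ordinary co-association matrix: samples in the same cluster get similarity 1, others 0, and soft memberships interpolate between the two.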
28
|