1
Huo Q, Yin Y, Liu F, Ma Y, Wang L, Qin G. Cell type identification from single-cell transcriptomes in melanoma. BMC Med Genomics 2021; 14:263. [PMID: 34784909 PMCID: PMC8596920 DOI: 10.1186/s12920-021-01118-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/11/2021] [Accepted: 10/14/2021] [Indexed: 11/11/2022]
Abstract
BACKGROUND: Single-cell sequencing approaches allow gene expression to be measured at the single-cell level, providing opportunities and challenges to study the aetiology of complex diseases, including cancer.
METHODS: Based on single-cell gene and lncRNA expression levels, we proposed a computational framework for cell type identification that fully considers cell dropout characteristics. First, we defined the dropout features of the cells and identified the dropout clusters. Second, we constructed a differential co-expression network and identified differential modules. Finally, we identified cell types based on the differential modules.
RESULTS: The method was applied to single-cell melanoma data, and eight cell types were identified. Enrichment analysis of the candidate cell marker genes for the two key cell types showed that both key cell types were closely related to the physiological activities of the major histocompatibility complex (MHC); one key cell type was associated with mitosis-related activities, and the other with pathways related to ten diseases.
CONCLUSIONS: Through identification and analysis of key melanoma-related cell types, we explored the molecular mechanism of melanoma, providing insight into melanoma research. Moreover, the candidate cell markers for the two key cell types are potential therapeutic targets for melanoma.
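The abstract does not spell out how the dropout features are defined. As a minimal sketch, assuming a dropout feature is simply the fraction of genes with a zero count in each cell (a common proxy, not necessarily the authors' definition), the per-cell values could be computed as follows. The function name `dropout_rates` and the toy count matrix are hypothetical.

```python
def dropout_rates(counts):
    """Return, for each cell (row), the fraction of genes with a zero count."""
    rates = []
    for cell in counts:
        zeros = sum(1 for value in cell if value == 0)
        rates.append(zeros / len(cell))
    return rates


# Toy matrix: 3 cells x 4 genes (hypothetical values, not melanoma data).
toy_counts = [
    [0, 5, 0, 2],
    [1, 0, 0, 0],
    [3, 4, 1, 2],
]
rates = dropout_rates(toy_counts)
print(rates)
```

Cells with similar rates could then be grouped by any clustering method; the choice of method is left open here because the abstract does not name one.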
Affiliation(s)
- Qiuyan Huo
- School of Computer Science and Technology, Xidian University, Xi’an, 710071 China
- Yu Yin
- School of Computer Science and Technology, Xidian University, Xi’an, 710071 China
- Fangfang Liu
- School of Computer Science and Technology, Xidian University, Xi’an, 710071 China
- Yuying Ma
- School of Computer Science and Technology, Xidian University, Xi’an, 710071 China
- Liming Wang
- School of Computer Science and Technology, Xidian University, Xi’an, 710071 China
- Guimin Qin
- School of Computer Science and Technology, Xidian University, Xi’an, 710071 China
2
Damgacioglu H, Celik E, Celik N. Intra-Cluster Distance Minimization in DNA Methylation Analysis Using an Advanced Tabu-Based Iterative k-Medoids Clustering Algorithm (T-CLUST). IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2020; 17:1241-1252. [PMID: 30530337 DOI: 10.1109/tcbb.2018.2886006] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Indexed: 06/09/2023]
Abstract
Recent advances in DNA methylation profiling have paved the way for understanding the underlying epigenetic mechanisms of various diseases such as cancer. While conventional distance-based clustering algorithms (e.g., hierarchical and k-means clustering) have been heavily used in such profiling owing to their speed in high-throughput analysis, these methods commonly converge to suboptimal solutions and/or trivial clusters due to their greedy search nature. Hence, methodologies are needed to improve the quality of the clusters formed by these algorithms without sacrificing their speed. In this study, we introduce three related algorithms for a complete high-throughput methylation analysis: a variance-based dimension reduction algorithm to handle the high dimensionality of the data, an outlier detection algorithm to identify outliers in the data, and an advanced Tabu-based iterative k-medoids clustering algorithm (T-CLUST) to reduce the impact of initial solutions on the performance of the conventional k-medoids algorithm. The performance of the proposed algorithms is demonstrated on nine different real DNA methylation datasets obtained from the Gene Expression Omnibus DataSets database. The accuracy of the cluster identification obtained by our proposed algorithms is higher than that of hierarchical and k-means clustering, as well as the conventional methods. The algorithms are implemented in MATLAB and are available at: http://www.coe.miami.edu/simlab/tclust.html.
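For orientation, the conventional k-medoids baseline that the abstract contrasts with T-CLUST can be sketched as below. This is a plain alternating k-medoids in Python, not the authors' MATLAB implementation, and it contains no Tabu search; the function names, the toy points, and the distance function are illustrative assumptions. The initial medoids are passed in explicitly, which is the dependency on initial solutions that the abstract refers to.

```python
def total_cost(points, medoids, dist):
    """Sum of distances from every point to its nearest medoid."""
    return sum(min(dist(p, points[m]) for m in medoids) for p in points)


def k_medoids(points, initial_medoids, dist, max_iter=100):
    """Alternate between assigning points and re-picking medoids."""
    medoids = list(initial_medoids)
    for _ in range(max_iter):
        # Assignment step: attach each point index to its nearest medoid.
        clusters = {m: [] for m in medoids}
        for i, p in enumerate(points):
            nearest = min(medoids, key=lambda m: dist(p, points[m]))
            clusters[nearest].append(i)
        # Update step: within each cluster, pick the member with the
        # smallest summed distance to all members of that cluster.
        new_medoids = []
        for m, members in clusters.items():
            if not members:  # keep the medoid of an empty cluster
                new_medoids.append(m)
                continue
            best = min(
                members,
                key=lambda c: sum(dist(points[c], points[j]) for j in members),
            )
            new_medoids.append(best)
        if sorted(new_medoids) == sorted(medoids):
            break  # medoids no longer change
        medoids = new_medoids
    return sorted(medoids), total_cost(points, medoids, dist)


# Toy 1-D data with two obvious groups (hypothetical values).
points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
medoids, cost = k_medoids(points, [0, 1], lambda a, b: abs(a - b))
print(medoids, cost)
```

On this toy input the procedure recovers the two groups even from a poor start, but in general the result can depend on `initial_medoids`, since each step only improves on the current solution.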
3
Yu Z, Chen H, You J, Liu J, Wong HS, Han G, Li L. Adaptive Fuzzy Consensus Clustering Framework for Clustering Analysis of Cancer Data. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2015; 12:887-901. [PMID: 26357330 DOI: 10.1109/tcbb.2014.2359433] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.6] [Indexed: 06/05/2023]
Abstract
Clustering analysis is an important research topic in cancer discovery using gene expression profiles, and it is crucial in facilitating the successful diagnosis and treatment of cancer. While quite a number of research works perform tumor clustering, few of them consider how to incorporate fuzzy theory together with an optimization process into a consensus clustering framework to improve the performance of clustering analysis. In this paper, we first propose a random double clustering based cluster ensemble framework (RDCCE) to perform tumor clustering based on gene expression data. Specifically, RDCCE generates a set of representative features using a randomly selected clustering algorithm in the ensemble, and then assigns samples to their corresponding clusters based on the grouping results. In addition, we introduce the random double clustering based fuzzy cluster ensemble framework (RDCFCE), which is designed to improve the performance of RDCCE by integrating the newly proposed fuzzy extension model into the ensemble framework. RDCFCE adopts the normalized cut algorithm as the consensus function to summarize the fuzzy matrices generated by the fuzzy extension models, partition the consensus matrix, and obtain the final result. Finally, adaptive RDCFCE (A-RDCFCE) is proposed to optimize RDCFCE and further improve its performance by adopting a self-evolutionary process (SEPP) for the parameter set. Experiments on real cancer gene expression profiles indicate that RDCFCE and A-RDCFCE work well on these data sets and outperform most of the state-of-the-art tumor clustering algorithms.
4
Yu Z, Chen H, You J, Wong HS, Liu J, Li L, Han G. Double Selection Based Semi-Supervised Clustering Ensemble for Tumor Clustering from Gene Expression Profiles. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2014; 11:727-740. [PMID: 26356343 DOI: 10.1109/tcbb.2014.2315996] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.8] [Indexed: 06/05/2023]
Abstract
Tumor clustering is one of the important techniques for tumor discovery from cancer gene expression profiles, which is useful for the diagnosis and treatment of cancer. While different algorithms have been proposed for tumor clustering, few make use of expert knowledge to improve the performance of tumor discovery. In this paper, we first view expert knowledge as constraints in the process of clustering, and propose a feature selection based semi-supervised cluster ensemble framework (FS-SSCE) for tumor clustering from bio-molecular data. Compared with traditional tumor clustering approaches, the proposed framework FS-SSCE has two distinguishing properties: (1) the adoption of feature selection techniques to dispel the effect of noisy genes, and (2) the employment of the binate constraint based K-means algorithm to take into account the effect of expert knowledge. Then, we propose a double selection based semi-supervised cluster ensemble framework (DS-SSCE), which not only applies the feature selection technique to perform gene selection on the gene dimension, but also selects an optimal subset of representative clustering solutions in the ensemble and improves the performance of tumor clustering using the normalized cut algorithm. DS-SSCE also introduces a confidence factor into the process of constructing the consensus matrix by considering the prior knowledge of the data set. Finally, we design a modified double selection based semi-supervised cluster ensemble framework (MDS-SSCE), which adopts multiple clustering solution selection strategies and an aggregated solution selection function to choose an optimal subset of clustering solutions. The experimental results on cancer gene expression profiles show that (i) FS-SSCE, DS-SSCE and MDS-SSCE are suitable for performing tumor clustering from bio-molecular data, and (ii) MDS-SSCE outperforms a number of state-of-the-art tumor clustering approaches on most of the data sets.
5
Yu Z, Chen H, You J, Han G, Li L. Hybrid fuzzy cluster ensemble framework for tumor clustering from biomolecular data. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2013; 10:657-670. [PMID: 24091399 DOI: 10.1109/tcbb.2013.59] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.6] [Indexed: 06/02/2023]
Abstract
Cancer class discovery using biomolecular data is one of the most important tasks for cancer diagnosis and treatment. Tumor clustering from gene expression data provides a new way to perform cancer class discovery. Most existing research works adopt single clustering algorithms to perform tumor clustering from biomolecular data, and these algorithms lack robustness, stability, and accuracy. To further improve the performance of tumor clustering from biomolecular data, we introduce fuzzy theory into the cluster ensemble framework, and propose four kinds of hybrid fuzzy cluster ensemble frameworks (HFCEF), named HFCEF-I, HFCEF-II, HFCEF-III, and HFCEF-IV, respectively, to identify samples that belong to different types of cancers. The difference between HFCEF-I and HFCEF-II is that they adopt different ensemble generator approaches to generate a set of fuzzy matrices in the ensemble. Specifically, HFCEF-I applies the affinity propagation algorithm (AP) to perform clustering on the sample dimension and generates a set of fuzzy matrices in the ensemble based on the fuzzy membership function and base samples selected by AP. HFCEF-II adopts AP to perform clustering on the attribute dimension, generates a set of subspaces, and obtains a set of fuzzy matrices in the ensemble by performing fuzzy c-means on subspaces. HFCEF-III and HFCEF-IV both build on the characteristics of HFCEF-I and HFCEF-II: HFCEF-III combines HFCEF-I and HFCEF-II in a serial way, while HFCEF-IV integrates them in a concurrent way. HFCEFs adopt suitable consensus functions, such as the fuzzy c-means algorithm or the normalized cut algorithm (Ncut), to summarize the generated fuzzy matrices and obtain the final results. The experiments on real data sets from the UCI machine learning repository and on cancer gene expression profiles illustrate that 1) the proposed hybrid fuzzy cluster ensemble frameworks work well on real data sets, especially biomolecular data, and 2) the proposed approaches are able to provide more robust, stable, and accurate results when compared with the state-of-the-art single clustering algorithms and traditional cluster ensemble approaches.
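The standard fuzzy c-means membership rule gives a feel for what one row of a fuzzy matrix looks like. The sketch below assumes one-dimensional samples, fixed cluster centres, and the usual fuzzifier m = 2; it is a generic illustration rather than the HFCEF code, and the function name `fcm_memberships` and the toy values are hypothetical.

```python
def fcm_memberships(sample, centers, m=2.0):
    """Fuzzy c-means membership of one sample in each cluster.

    u_k = 1 / sum_j (d_k / d_j) ** (2 / (m - 1)), with d_k = |sample - center_k|.
    """
    distances = [abs(sample - c) for c in centers]
    # A sample that coincides with a centre belongs fully to that centre
    # (shared equally if several centres coincide with it).
    zero_hits = [d == 0 for d in distances]
    if any(zero_hits):
        share = 1.0 / sum(zero_hits)
        return [share if hit else 0.0 for hit in zero_hits]
    exponent = 2.0 / (m - 1.0)
    return [
        1.0 / sum((d_k / d_j) ** exponent for d_j in distances)
        for d_k in distances
    ]


# One sample at 1.0 and two centres at 0.0 and 4.0 (hypothetical values).
memberships = fcm_memberships(1.0, [0.0, 4.0])
print(memberships)
```

The memberships of a sample always sum to one, and the closer centre receives the larger share. Stacking such rows for all samples yields a fuzzy matrix of the kind an ensemble could combine.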
Affiliation(s)
- Zhiwen Yu
- South China University of Technology, Guangzhou and Hong Kong Polytechnic University, Hong Kong
6
Zhiwen Y, Le L, Jane Y, Hau-San W, Guoqiang H. SC(3): Triple spectral clustering-based consensus clustering framework for class discovery from cancer gene expression profiles. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2012; 9:1751-1765. [PMID: 22868680 DOI: 10.1109/tcbb.2012.108] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.2] [Indexed: 06/01/2023]
Abstract
In order to perform successful diagnosis and treatment of cancer, discovering and classifying cancer types correctly is essential. One of the challenging properties of class discovery from cancer data sets is that cancer gene expression profiles not only include a large number of genes, but also contain a lot of noisy genes. In order to reduce the effect of noisy genes in cancer gene expression profiles, we propose two new consensus clustering frameworks in this paper, named triple spectral clustering-based consensus clustering (SC3) and double spectral clustering-based consensus clustering (SC2Ncut), for cancer discovery from gene expression profiles. SC3 integrates the spectral clustering (SC) algorithm multiple times into the ensemble framework to process gene expression profiles. Specifically, spectral clustering is applied to perform clustering on the gene dimension and the cancer sample dimension, and is also used as the consensus function to partition the consensus matrix constructed from multiple clustering solutions. Compared with SC3, SC2Ncut adopts the normalized cut algorithm, instead of spectral clustering, as the consensus function. Experiments on both synthetic data sets and real cancer gene expression profiles illustrate that the proposed approaches not only achieve good performance on gene expression profiles, but also outperform most of the existing approaches in the process of class discovery from these profiles.
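The abstract does not say how the consensus matrix is constructed. A common choice, assumed here purely for illustration, is the co-association matrix: entry (i, j) is the fraction of clustering solutions that place samples i and j in the same cluster. The function name `consensus_matrix` and the toy label vectors below are hypothetical.

```python
def consensus_matrix(partitions):
    """Co-association matrix from several label vectors over the same samples."""
    n = len(partitions[0])
    matrix = [[0.0] * n for _ in range(n)]
    for labels in partitions:
        for i in range(n):
            for j in range(n):
                # Only co-membership matters, so label names may differ
                # freely between clustering solutions.
                if labels[i] == labels[j]:
                    matrix[i][j] += 1.0
    count = len(partitions)
    return [[value / count for value in row] for row in matrix]


# Four clustering solutions over four samples (hypothetical labels).
partitions = [
    [0, 0, 1, 1],
    [1, 1, 0, 0],  # same grouping as the first, with labels swapped
    [0, 0, 0, 1],
    [0, 1, 1, 1],
]
consensus = consensus_matrix(partitions)
for row in consensus:
    print(row)
```

A consensus function such as spectral clustering or normalized cut would then treat this matrix as a similarity graph and partition it; that final step is not shown.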
Affiliation(s)
- Yu Zhiwen
- School of Computer Science and Engineering, South China University of Technology, B3 Building, Higher Education Megacenter, Panyu, Guangzhou City, China 510006.
7
Mimaroglu S, Aksehirli E. DICLENS: divisive clustering ensemble with automatic cluster number. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2011; 9:408-420. [PMID: 21968960 DOI: 10.1109/tcbb.2011.129] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Indexed: 05/31/2023]
Abstract
Clustering has a long and rich history in a variety of scientific fields. Finding natural groupings of a data set is a hard task, as attested by the hundreds of clustering algorithms in the literature. Each clustering technique makes some assumptions about the underlying data set. If the assumptions hold, good clusterings can be expected. It is hard, in some cases impossible, to satisfy all the assumptions. Therefore, it is beneficial to apply different clustering methods to the same data set, or the same method with varying input parameters, or both. We propose a novel method, DICLENS, which combines a set of clusterings into a final clustering having better overall quality. Our method produces the final clustering automatically and does not take any input parameters, a feature missing in many existing algorithms. Extensive experimental studies on real, artificial, and gene expression data sets demonstrate that DICLENS produces very good quality clusterings in a short amount of time. The DICLENS implementation is scalable and consumes very little memory and CPU, so it runs on standard personal computers.
8
Zhu D, Acharya L, Zhang H. A generalized multivariate approach to pattern discovery from replicated and incomplete genome-wide measurements. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2011; 8:1153-1169. [PMID: 21778521 DOI: 10.1109/tcbb.2010.102] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/31/2023]
Abstract
Estimation of pairwise correlation from incomplete and replicated molecular profiling data is a ubiquitous problem in pattern discovery analysis, such as clustering and networking. However, existing methods solve this problem by ad hoc data imputation, followed by averaging correlation coefficient type approaches, which might annihilate important patterns present in the molecular profiling data. Moreover, these approaches do not consider and exploit the underlying experimental design information that specifies the replication mechanisms. We develop an Expectation-Maximization (EM) type algorithm to estimate the correlation structure using incomplete and replicated molecular profiling data with an a priori known replication mechanism. The approach is sufficiently general to be applicable to any known replication mechanism. In the case of an unknown replication mechanism, it reduces to the parsimonious model introduced previously. The efficacy of our approach was first evaluated by comprehensively comparing various bivariate and multivariate imputation approaches using simulation studies. Results from real-world data analysis further confirmed the superior performance of the proposed approach over the commonly used approaches, where we assessed the robustness of the method using data sets with up to 30 percent missing values.
Affiliation(s)
- Dongxiao Zhu
- Department of Computer Science, University of New Orleans, New Orleans, Children's Hospital, New Orleans, LA, USA.