1
Rudolph JE, Lau B, Genberg BL, Sun J, Kirk GD, Mehta SH. Characterizing multimorbidity in ALIVE: Comparing single and ensemble clustering methods. Am J Epidemiol 2024:kwae031. [PMID: 38576181 DOI: 10.1093/aje/kwae031] [Received: 10/10/2022] [Revised: 02/06/2024] [Accepted: 03/29/2024] [Indexed: 04/06/2024]
Abstract
Multimorbidity, defined as having 2 or more chronic conditions, is a growing public health concern, but research in this area is complicated by the fact that multimorbidity is a highly heterogeneous outcome. Individuals in a sample may have a differing number and varied combinations of conditions. Clustering methods, such as unsupervised machine learning algorithms, may allow us to tease out the unique multimorbidity phenotypes. However, many clustering methods exist and choosing which to use is challenging because we do not know the true underlying clusters. Here, we demonstrate the use of 3 individual algorithms (partition around medoids, hierarchical clustering, and probabilistic clustering) and a clustering ensemble approach (which pools different clustering approaches) to identify multimorbidity clusters in the AIDS Linked to the Intravenous Experience cohort study. We show how the clusters can be compared based on cluster quality, interpretability, and predictive ability. In practice, it is critical to compare the clustering results from multiple algorithms and to choose the approach that performs best in the domain(s) that align with plans to use the clusters in future analyses.
Affiliation(s)
- Jacqueline E Rudolph
  - Department of Epidemiology, Bloomberg School of Public Health, Johns Hopkins University, Baltimore, MD, USA
- Bryan Lau
  - Department of Epidemiology, Bloomberg School of Public Health, Johns Hopkins University, Baltimore, MD, USA
- Becky L Genberg
  - Department of Epidemiology, Bloomberg School of Public Health, Johns Hopkins University, Baltimore, MD, USA
- Jing Sun
  - Department of Epidemiology, Bloomberg School of Public Health, Johns Hopkins University, Baltimore, MD, USA
- Gregory D Kirk
  - Department of Epidemiology, Bloomberg School of Public Health, Johns Hopkins University, Baltimore, MD, USA
  - Division of Infectious Diseases, Johns Hopkins School of Medicine, Baltimore, MD, USA
- Shruti H Mehta
  - Department of Epidemiology, Bloomberg School of Public Health, Johns Hopkins University, Baltimore, MD, USA
2
Wu H, Zhou H, Zhou B, Wang M. SCMcluster: a high-precision cell clustering algorithm integrating marker gene set with single-cell RNA sequencing data. Brief Funct Genomics 2023:7058188. [PMID: 36848584 DOI: 10.1093/bfgp/elad004] [Received: 09/16/2022] [Revised: 01/12/2023] [Accepted: 01/23/2023] [Indexed: 03/01/2023]
Abstract
Single-cell clustering is the most significant part of single-cell RNA sequencing (scRNA-seq) data analysis. One main issue facing scRNA-seq data is noise and sparsity, which pose a great challenge for the advance of high-precision clustering algorithms. This study adopts cellular markers to identify differences between cells, which contributes to feature extraction of single cells. In this work, we propose a high-precision single-cell clustering algorithm, SCMcluster (single-cell cluster using marker genes). This algorithm integrates two cell marker databases (the CellMarker database and the PanglaoDB database) with scRNA-seq data for feature extraction and constructs an ensemble clustering model based on the consensus matrix. We test the efficiency of this algorithm and compare it with eight other popular clustering algorithms on two scRNA-seq datasets derived from human and mouse tissues, respectively. The experimental results show that SCMcluster outperforms the existing methods in both feature extraction and clustering performance. The source code of SCMcluster is available for free at https://github.com/HaoWuLab-Bioinformatics/SCMcluster.
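The consensus-matrix idea named in this abstract can be illustrated with a minimal Python sketch. It is not taken from SCMcluster: the function names, the toy labelings, and the final step (thresholding the matrix and taking connected components) are all assumptions chosen for brevity, and the published method may combine base clusterings differently.

```python
from itertools import combinations


def co_association(labelings):
    """Return an n x n matrix whose (i, j) entry is the fraction of
    base clusterings that put items i and j in the same cluster."""
    n = len(labelings[0])
    counts = [[0] * n for _ in range(n)]
    for labels in labelings:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / len(labelings) for c in row] for row in counts]


def consensus_clusters(matrix, threshold=0.5):
    """Hypothetical final step: link items whose co-association exceeds
    `threshold` and label the resulting connected components."""
    n = len(matrix)
    parent = list(range(n))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(n), 2):
        if matrix[i][j] > threshold:
            parent[find(i)] = find(j)

    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]


# Three toy base clusterings of six items (illustrative data only).
base_labelings = [
    [0, 0, 0, 1, 1, 1],
    [1, 1, 1, 0, 0, 0],  # same partition as the first, relabeled
    [0, 0, 1, 1, 2, 2],
]
matrix = co_association(base_labelings)
consensus = consensus_clusters(matrix)
print(consensus)
```

Because the matrix only records whether two items share a cluster, the arbitrary label names used by each base clustering do not need to be aligned before pooling.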
Affiliation(s)
- Hao Wu
  - College of Information Engineering, Northwest A&F University, Yangling, 712100, Shaanxi, China
  - School of Software, Shandong University, Jinan, 250101, Shandong, China
- Haoru Zhou
  - College of Information Engineering, Northwest A&F University, Yangling, 712100, Shaanxi, China
- Bing Zhou
  - College of Information Engineering, Northwest A&F University, Yangling, 712100, Shaanxi, China
- Meili Wang
  - College of Information Engineering, Northwest A&F University, Yangling, 712100, Shaanxi, China
3
Li H, Wang Y, Lai Y, Zeng F, Yang F. ProgClust: A progressive clustering method to identify cell populations. Front Genet 2023; 14:1183099. [PMID: 37091787 PMCID: PMC10115987 DOI: 10.3389/fgene.2023.1183099] [Received: 03/09/2023] [Accepted: 03/24/2023] [Indexed: 04/25/2023]
Abstract
Identifying different types of cells in scRNA-seq data is a critical task in single-cell data analysis. In this paper, we propose a method called ProgClust for the decomposition of cell populations and detection of rare cells. ProgClust represents the single-cell data with clustering trees, where a progressive searching method is designed to select cell population-specific genes and cluster cells. The obtained trees reveal the structure of both abundant cell populations and rare cell populations. Additionally, it can automatically determine the number of clusters. Experimental results show that ProgClust outperforms the baseline method and is capable of accurately identifying both common and rare cells. Moreover, when applied to real unlabeled data, it reveals potential cell subpopulations, which provides clues for further exploration. In summary, ProgClust shows potential in identifying subpopulations of complex single-cell data.
Affiliation(s)
- Han Li
  - Department of Automation, Xiamen University, Xiamen, China
- Ying Wang
  - Department of Automation, Xiamen University, Xiamen, China
  - National Institute for Data Science in Health and Medicine, Xiamen University, Xiamen, China
  - Xiamen Key Lab Big Data Intelligent Anal and Decis, Xiamen, China
- Yongxuan Lai
  - School of Informatics, Xiamen University, Xiamen, China
- Feng Zeng
  - Department of Automation, Xiamen University, Xiamen, China
  - National Institute for Data Science in Health and Medicine, Xiamen University, Xiamen, China
  - Xiamen Key Lab Big Data Intelligent Anal and Decis, Xiamen, China
- Fan Yang
  - Department of Automation, Xiamen University, Xiamen, China
  - National Institute for Data Science in Health and Medicine, Xiamen University, Xiamen, China
  - Xiamen Key Lab Big Data Intelligent Anal and Decis, Xiamen, China
*Correspondence: Feng Zeng; Fan Yang
4
Xu J, Wu J, Li T, Nan Y. Divergence-Based Locally Weighted Ensemble Clustering with Dictionary Learning and L2,1-Norm. Entropy (Basel) 2022; 24:1324. [PMID: 37420344 DOI: 10.3390/e24101324] [Received: 08/17/2022] [Revised: 09/11/2022] [Accepted: 09/19/2022] [Indexed: 07/09/2023]
Abstract
Accurate clustering is a challenging task with unlabeled data. Ensemble clustering aims to combine sets of base clusterings to obtain a better and more stable clustering and has shown its ability to improve clustering accuracy. Dense representation ensemble clustering (DREC) and entropy-based locally weighted ensemble clustering (ELWEC) are two typical methods for ensemble clustering. However, DREC treats each microcluster equally and hence, ignores the differences between each microcluster, while ELWEC conducts clustering on clusters rather than microclusters and ignores the sample-cluster relationship. To address these issues, a divergence-based locally weighted ensemble clustering with dictionary learning (DLWECDL) is proposed in this paper. Specifically, the DLWECDL consists of four phases. First, the clusters from the base clustering are used to generate microclusters. Second, a Kullback-Leibler divergence-based ensemble-driven cluster index is used to measure the weight of each microcluster. With these weights, an ensemble clustering algorithm with dictionary learning and the L2,1-norm is employed in the third phase. Meanwhile, the objective function is resolved by optimizing four subproblems and a similarity matrix is learned. Finally, a normalized cut (Ncut) is used to partition the similarity matrix and the ensemble clustering results are obtained. In this study, the proposed DLWECDL was validated on 20 widely used datasets and compared to some other state-of-the-art ensemble clustering methods. The experimental results demonstrated that the proposed DLWECDL is a very promising method for ensemble clustering.
Affiliation(s)
- Jiaxuan Xu
  - School of Computing and Artificial Intelligence, Southwestern University of Finance and Economics, Chengdu 611130, China
- Jiang Wu
  - School of Computing and Artificial Intelligence, Southwestern University of Finance and Economics, Chengdu 611130, China
- Taiyong Li
  - School of Computing and Artificial Intelligence, Southwestern University of Finance and Economics, Chengdu 611130, China
- Yang Nan
  - Department of Computer Science, Harbin Finance University, Harbin 150030, China
5
Taus P, Pospisilova S, Plevova K. Identification of Clinically Relevant Subgroups of Chronic Lymphocytic Leukemia Through Discovery of Abnormal Molecular Pathways. Front Genet 2021; 12:627964. [PMID: 34262590 PMCID: PMC8273263 DOI: 10.3389/fgene.2021.627964] [Received: 11/10/2020] [Accepted: 05/04/2021] [Indexed: 11/13/2022]
Abstract
Chronic lymphocytic leukemia (CLL) is the most common form of adult leukemia in the Western world with a highly variable clinical course. Its striking genetic heterogeneity is not yet fully understood. Although the CLL genetic landscape has been well-described, patient stratification based on mutation profiles remains elusive mainly due to the heterogeneity of data. Here we attempted to decrease the heterogeneity of somatic mutation data by mapping mutated genes in the respective biological processes. From the sequencing data gathered by the International Cancer Genome Consortium for 506 CLL patients, we generated pathway mutation scores, applied ensemble clustering on them, and extracted abnormal molecular pathways with a machine learning approach. We identified four clusters differing in pathway mutational profiles and time to first treatment. Interestingly, common CLL drivers such as ATM or TP53 were associated with particular subtypes, while others like NOTCH1 or SF3B1 were not. This study provides an important step in understanding mutational patterns in CLL.
Affiliation(s)
- Petr Taus
  - Central European Institute of Technology, Masaryk University, Brno, Czechia
- Sarka Pospisilova
  - Central European Institute of Technology, Masaryk University, Brno, Czechia
  - Department of Internal Medicine - Hematology and Oncology, University Hospital Brno, Brno, Czechia
  - Faculty of Medicine, Masaryk University, Brno, Czechia
- Karla Plevova
  - Central European Institute of Technology, Masaryk University, Brno, Czechia
  - Department of Internal Medicine - Hematology and Oncology, University Hospital Brno, Brno, Czechia
  - Faculty of Medicine, Masaryk University, Brno, Czechia
6
Dobrovolska O, Strømland Ø, Handegård ØS, Jakubec M, Govasli ML, Skjevik ÅA, Frøystein NÅ, Teigen K, Halskau Ø. Investigating the Disordered and Membrane-Active Peptide A-Cage-C Using Conformational Ensembles. Molecules 2021; 26:3607. [PMID: 34204651 DOI: 10.3390/molecules26123607] [Received: 04/16/2021] [Revised: 05/28/2021] [Accepted: 06/01/2021] [Indexed: 11/16/2022]
Abstract
The driving forces and conformational pathways leading to amphitropic protein-membrane binding, and in some cases also to protein misfolding and aggregation, are the subject of intensive research. In this study, a chimeric polypeptide, A-Cage-C, derived from α-Lactalbumin is investigated with the aim of elucidating conformational changes promoting interaction with bilayers. From previous studies, it is known that A-Cage-C causes membrane leakages associated with the sporadic formation of amorphous aggregates on solid-supported bilayers. Here we express and purify double-labelled A-Cage-C and prepare partially deuterated bicelles as a membrane mimicking system. We investigate A-Cage-C in the presence and absence of these bicelles at non-binding (pH 7.0) and binding (pH 4.5) conditions. Using in silico analyses, NMR, conformational clustering, and Molecular Dynamics, we provide tentative insights into the conformations of bound and unbound A-Cage-C. The conformation of each state is dynamic and samples a large amount of overlapping conformational space. We identify one of the clusters as likely representing the binding conformation and conclude tentatively that the unfolding around the central W23 segment and its reorientation may be necessary for full intercalation at binding conditions (pH 4.5). We also see evidence for an overall elongation of A-Cage-C in the presence of model bilayers.
7
Zhu Y, Zhang DX, Zhang XF, Yi M, Ou-Yang L, Wu M. EC-PGMGR: Ensemble Clustering Based on Probability Graphical Model With Graph Regularization for Single-Cell RNA-seq Data. Front Genet 2020; 11:572242. [PMID: 33329710 PMCID: PMC7673820 DOI: 10.3389/fgene.2020.572242] [Received: 06/13/2020] [Accepted: 09/30/2020] [Indexed: 11/21/2022]
Abstract
Advances in technology have made it convenient to obtain large amounts of single-cell RNA sequencing (scRNA-seq) data. Since clustering is a very important step in identifying or defining cellular phenotypes, many clustering approaches have recently been developed for these applications. The general methods can be roughly divided into normal clustering methods and integrated (ensemble) clustering methods, which combine more than two normal clustering methods with the aim of obtaining more informative results. To contrast with the integrated clustering algorithm, the normal clustering method is often called an individual or base clustering method. Note that many individual clustering methods are developed to capture one aspect of the data, and their results depend on the initial parameter settings, such as cluster number and distance metric. Compared with individual clustering, an integrative clustering method may achieve more accurate performance, but its results depend on the base clustering results, and integrated systems are often not self-regulating. Therefore, how to design a robust unsupervised clustering method is still a challenge. To tackle the above limitations, we propose a novel Ensemble Clustering algorithm based on Probability Graphical Model with Graph Regularization, called EC-PGMGR for short. On the one hand, we use parameter control in the Probability Graphical Model (PGM) to automatically determine the cluster number without prior knowledge. On the other hand, we add a regularization term to reduce the effect of weak base clustering results. In particular, the integrative results collected from base clustering methods can be assembled as a combination with self-regulating weights through a pre-learning process, which can efficiently enhance the effect of active clustering methods while weakening the effect of inactive clustering methods.
Experiments are carried out on 7 data sets generated by different platforms, with the number of single cells ranging from 822 to 5,132. Results show that EC-PGMGR performs better than 4 alternative individual clustering methods and 2 ensemble methods in terms of accuracy, including Adjusted Rand Index (ARI) and Normalized Mutual Information (NMI), as well as robustness and effectiveness. EC-PGMGR also provides an effective way to integrate different clustering results for more accurate and reliable results in further biological analysis. It may provide some new insights for other applications of clustering.
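For readers unfamiliar with the two accuracy measures named in this abstract, the following Python sketch computes ARI and NMI from two label vectors using only the standard library. The function names and toy labels are illustrative assumptions, not code from EC-PGMGR. NMI has several normalizations in common use; this sketch divides by the geometric mean of the two entropies, and the abstract does not say which variant the authors used.

```python
from collections import Counter
from math import comb, log, sqrt


def adjusted_rand_index(a, b):
    """Adjusted Rand Index between two labelings of the same items."""
    n = len(a)
    pairs_joint = sum(comb(c, 2) for c in Counter(zip(a, b)).values())
    pairs_a = sum(comb(c, 2) for c in Counter(a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(b).values())
    expected = pairs_a * pairs_b / comb(n, 2)
    max_index = (pairs_a + pairs_b) / 2
    if max_index == expected:
        # Degenerate case, e.g. both labelings put everything in one cluster.
        return 1.0
    return (pairs_joint - expected) / (max_index - expected)


def entropy(labels):
    """Shannon entropy (natural log) of the cluster-size distribution."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def normalized_mutual_information(a, b):
    """Mutual information divided by sqrt(H(a) * H(b)) (assumed variant)."""
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    mi = sum(
        (c / n) * log(c * n / (count_a[x] * count_b[y]))
        for (x, y), c in Counter(zip(a, b)).items()
    )
    h_a, h_b = entropy(a), entropy(b)
    if h_a == 0 or h_b == 0:
        return 1.0 if h_a == h_b else 0.0
    return mi / sqrt(h_a * h_b)


truth = [0, 0, 1, 1]
relabeled = [1, 1, 0, 0]   # same partition, different label names
crossed = [0, 1, 0, 1]     # splits every true cluster in half

ari_same = adjusted_rand_index(truth, relabeled)
nmi_same = normalized_mutual_information(truth, relabeled)
ari_crossed = adjusted_rand_index(truth, crossed)
nmi_crossed = normalized_mutual_information(truth, crossed)
```

Both measures depend only on how items are grouped, so a relabeled copy of the reference partition scores as a perfect match, while ARI can drop below zero when agreement is worse than expected by chance.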
Affiliation(s)
- Yuan Zhu
  - School of Automation, China University of Geosciences, Wuhan, China
  - Hubei Key Laboratory of Advanced Control and Intelligent Automation for Complex Systems, Wuhan, China
- De-Xin Zhang
  - School of Automation, China University of Geosciences, Wuhan, China
  - Hubei Key Laboratory of Advanced Control and Intelligent Automation for Complex Systems, Wuhan, China
- Xiao-Fei Zhang
  - Department of Statistics, School of Mathematics and Statistics, Central China Normal University, Wuhan, China
- Ming Yi
  - School of Mathematics and Physics, China University of Geosciences, Wuhan, China
- Le Ou-Yang
  - Guangdong Key Laboratory of Intelligent Information Processing and Shenzhen Key Laboratory of Media Security, Shenzhen University, Shenzhen, China
- Mengyun Wu
  - School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai, China
8
Kurmukov A, Mussabaeva A, Denisova Y, Moyer D, Jahanshad N, Thompson PM, Gutman BA. Optimizing Connectivity-Driven Brain Parcellation Using Ensemble Clustering. Brain Connect 2020; 10:183-194. [PMID: 32264696 PMCID: PMC7247040 DOI: 10.1089/brain.2019.0722] [Indexed: 11/12/2022]
Abstract
This work addresses the problem of constructing a unified, topologically optimal connectivity-based brain atlas. The proposed approach aggregates an ensemble partition from individual parcellations without label agreement, providing a balance between sufficiently flexible individual parcellations and intuitive representation of the average topological structure of the connectome. The methods exploit a previously proposed dense connectivity representation, first performing graph-based hierarchical parcellation of individual brains, and subsequently aggregating the individual parcellations into a consensus parcellation. The search for consensus—based on the hard ensemble (HE) algorithm—approximately minimizes the sum of cluster membership distances, effectively estimating a pseudo-Karcher mean of individual parcellations. Computational stability, graph structure preservation, and biological relevance of the simplified representation resulting from the proposed parcellation are assessed on the Human Connectome Project data set. These aspects are assessed using (1) edge weight distribution divergence with respect to the dense connectome representation, (2) interhemispheric symmetry, (3) network characteristics' stability and agreement with respect to individually and anatomically parcellated networks, and (4) performance of the simplified connectome in a biological sex classification task. Ensemble parcellation was found to be highly stable with respect to subject sampling, outperforming anatomical atlases and other connectome-based parcellations in classification as well as preserving global connectome properties. The HE-based parcellation also showed a degree of symmetry comparable with anatomical atlases and a high degree of spatial contiguity without using explicit priors.
Affiliation(s)
- Anvar Kurmukov
  - Institute for Information Transmission Problems, Russian Academy of Sciences, Moscow, Russia
  - Higher School of Economics, Moscow, Russia
  - Department of Biomedical Engineering, Medical Imaging Research Center, Illinois Institute of Technology, Chicago, Illinois, USA
- Ayagoz Mussabaeva
  - Institute for Information Transmission Problems, Russian Academy of Sciences, Moscow, Russia
- Yulia Denisova
  - Institute for Information Transmission Problems, Russian Academy of Sciences, Moscow, Russia
- Daniel Moyer
  - Computer Science and Artificial Intelligence Laboratory (CSAIL), Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
- Neda Jahanshad
  - Imaging Genetics Center, Stevens Neuroimaging and Informatics Institute, Keck School of Medicine of USC, University of Southern California, Marina del Rey, California, USA
- Paul M Thompson
  - Imaging Genetics Center, Stevens Neuroimaging and Informatics Institute, Keck School of Medicine of USC, University of Southern California, Marina del Rey, California, USA
- Boris A Gutman
  - Institute for Information Transmission Problems, Russian Academy of Sciences, Moscow, Russia
  - Department of Biomedical Engineering, Medical Imaging Research Center, Illinois Institute of Technology, Chicago, Illinois, USA
9
Jang JY, Oh HS, Lim Y, Cheung YK. Ensemble clustering for step data via binning. Biometrics 2020; 77:293-304. [PMID: 32150282 DOI: 10.1111/biom.13258] [Received: 08/04/2019] [Revised: 02/26/2020] [Accepted: 02/27/2020] [Indexed: 11/29/2022]
Abstract
This paper considers the clustering problem of physical step count data recorded on wearable devices. Clustering step data give an insight into an individual's activity status and further provide the groundwork for health-related policies. However, classical methods, such as K-means clustering and hierarchical clustering, are not suitable for step count data that are typically high-dimensional and zero-inflated. This paper presents a new clustering method for step data based on a novel combination of ensemble clustering and binning. We first construct multiple sets of binned data by changing the size and starting position of the bin, and then merge the clustering results from the binned data using a voting method. The advantage of binning, as a critical component, is that it substantially reduces the dimension of the original data while preserving the essential characteristics of the data. As a result, combining clustering results from multiple binned data can provide an improved clustering result that reflects both local and global structures of the data. Simulation studies and real data analysis were carried out to evaluate the empirical performance of the proposed method and demonstrate its general utility.
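The binning step described in this abstract (varying both the size and the starting position of the bin) can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' procedure: the function names are hypothetical, values inside a bin are summed, and a non-zero start is handled by giving the leading samples their own partial bin. The per-view clustering and the voting step that merges the results are not shown.

```python
def bin_series(counts, width, start=0):
    """Sum step counts inside consecutive bins of `width` samples.

    Assumption: when `start` > 0, the first `start` samples form a
    leading partial bin so that no data are discarded.
    """
    if width <= 0 or not 0 <= start < width:
        raise ValueError("need width > 0 and 0 <= start < width")
    bins = []
    if start:
        bins.append(sum(counts[:start]))
    for i in range(start, len(counts), width):
        bins.append(sum(counts[i:i + width]))
    return bins


def binned_views(counts, widths):
    """Build one binned representation per (width, start) combination."""
    return {
        (width, start): bin_series(counts, width, start)
        for width in widths
        for start in range(width)
    }


# Toy step-count record with many zeros (illustrative data only).
steps = [0, 0, 120, 300, 0, 0, 80, 40]
views = binned_views(steps, widths=[2, 4])
for key in sorted(views):
    print(key, views[key])
```

Each view has far fewer dimensions than the raw record, and shifting the start position changes which neighbouring samples are pooled, which is what gives the ensemble its diversity.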
Affiliation(s)
- Ja-Yoon Jang
  - Department of Statistics, Stanford University, Stanford, California
- Hee-Seok Oh
  - Department of Statistics, Seoul National University, Seoul, Korea
- Yaeji Lim
  - Department of Applied Statistics, Chung-Ang University, Seoul, Korea
10
Brereton AE, Karplus PA. Ensemblator v3: Robust atom-level comparative analyses and classification of protein structure ensembles. Protein Sci 2017; 27:41-50. [PMID: 28762605 DOI: 10.1002/pro.3249] [Received: 06/12/2017] [Revised: 07/24/2017] [Accepted: 07/27/2017] [Indexed: 12/19/2022]
Abstract
Ensembles of protein structures are increasingly used to represent the conformational variation of a protein as determined by experiment and/or by molecular simulations, as well as uncertainties that may be associated with structure determinations or predictions. Making the best use of such information requires the ability to quantitatively compare entire ensembles. For this reason, we recently introduced the Ensemblator (Clark et al., Protein Sci 2015; 24:1528), a novel approach to compare user-defined groups of models, in residue level detail. Here we describe Ensemblator v3, an open-source program that employs the same basic ensemble comparison strategy but includes major advances that make it more robust, powerful, and user-friendly. Ensemblator v3 carries out multiple sequence alignments to facilitate the generation of ensembles from non-identical input structures, automatically optimizes the key global overlay parameter, optionally performs "ensemble clustering" to classify the models into subgroups, and calculates a novel "discrimination index" that quantifies similarities and differences, at residue or atom level, between each pair of subgroups. The clustering and automatic options mean that no pre-knowledge about an ensemble is required for its analysis. After describing the novel features of Ensemblator v3, we demonstrate its utility using three case studies that illustrate the ease with which complex analyses are accomplished, and the kinds of insights derived from clustering into subgroups and from the detailed information that locates significant differences. The Ensemblator v3 enhances the structural biology toolbox by greatly expanding the kinds of problems to which this ensemble comparison strategy can be applied.
Affiliation(s)
- Andrew E Brereton
  - Department of Biochemistry and Biophysics, Oregon State University, Corvallis, Oregon, 97331
- P Andrew Karplus
  - Department of Biochemistry and Biophysics, Oregon State University, Corvallis, Oregon, 97331
11
Abstract
RNA-seq, the next-generation sequencing platform, enables researchers to explore deep into the transcriptome of organisms, for example by identifying functional non-coding RNAs (ncRNAs) and quantifying their expression in tissues. The functions of ncRNAs are mostly related to their secondary structures. Exploring clustering in terms of the structural profiles of the corresponding read segments is therefore essential, and this motivates our research. In this manuscript, we propose PR2S2Clust (Patched RNA-seq Read Segments' Structure-oriented Clustering), an analysis platform that extracts features to prepare the secondary structure profiles of RNA-seq read segments. It provides a strategy to employ the profiles to annotate the segments into ncRNA classes using several clustering strategies. The system considers seven pairwise structural distance metrics by considering short-read mappings onto each structure, which we term the "patched structure", while clustering the segments. In this regard, we show applications of both classical and ensemble clusterings of the partitional and hierarchical variations. Extensive real-world experiments over three publicly available RNA-seq datasets and a comparative analysis over four competitive systems confirm the effectiveness and superiority of the proposed system. The source code and dataset of PR2S2Clust are available at http://biomecis.uta.edu/~ashis/res/PR2S2Clust-suppl/.
Affiliation(s)
- Ashis Kumer Biswas
  - Department of Computer Science and Engineering, The University of Texas at Arlington, Texas 76019, USA
- Jean X Gao
  - Department of Computer Science and Engineering, The University of Texas at Arlington, Texas 76019, USA