1
Kumaresan A, Sinha MK, Paul N, Nag P, Ebenezer Samuel King JP, Kumar R, Datta TK. Establishment of a repertoire of fertility associated sperm proteins and their differential abundance in buffalo bulls (Bubalus bubalis) with contrasting fertility. Sci Rep 2023; 13:2272. [PMID: 36754964 PMCID: PMC9908891 DOI: 10.1038/s41598-023-29529-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/03/2022] [Accepted: 02/06/2023] [Indexed: 02/10/2023] Open
Abstract
Sperm harbour a wide range of proteins that regulate their function and fertility. In the present study, we characterized and quantified the proteome of buffalo bull spermatozoa and identified fertility-associated sperm proteins through comparative proteomics. Using a high-throughput mass spectrometry platform, we identified 1305 proteins from buffalo spermatozoa and found that these proteins were mostly enriched in glycolytic process, mitochondrial respiratory chain, tricarboxylic acid cycle, protein folding, spermatogenesis, sperm motility and sperm binding to zona pellucida (p < 7.74E-08) besides metabolic (p = 4.42E-31) and reactive oxygen species (p = 1.81E-30) pathways. Differential proteomic analysis revealed that 844 proteins were commonly expressed in spermatozoa from both groups, while 77 and 52 proteins were exclusively expressed in high- and low-fertile bulls, respectively. In low-fertile bulls, 75 proteins were significantly (p < 0.05) upregulated and 176 proteins were significantly (p < 0.05) downregulated; these proteins were highly enriched in mitochondrial respiratory chain complex I assembly (p = 2.63E-07) and flagellated sperm motility (p = 7.02E-05) processes besides the oxidative phosphorylation pathway (p = 6.61E-15). The downregulated proteins in low-fertile bulls were involved in sperm motility, metabolism, sperm-egg recognition and fertilization. These variations in the sperm proteome could be used as potential markers for the selection of buffalo bulls for fertility.
Affiliation(s)
- Arumugam Kumaresan
- Theriogenology Laboratory, Southern Regional Station of ICAR-National Dairy Research Institute, Bengaluru, Karnataka, 560030, India.
- Manish Kumar Sinha
- Theriogenology Laboratory, Southern Regional Station of ICAR-National Dairy Research Institute, Bengaluru, Karnataka, 560030, India
- Nilendu Paul
- Theriogenology Laboratory, Southern Regional Station of ICAR-National Dairy Research Institute, Bengaluru, Karnataka, 560030, India
- Pradeep Nag
- Theriogenology Laboratory, Southern Regional Station of ICAR-National Dairy Research Institute, Bengaluru, Karnataka, 560030, India
- John Peter Ebenezer Samuel King
- Theriogenology Laboratory, Southern Regional Station of ICAR-National Dairy Research Institute, Bengaluru, Karnataka, 560030, India
- Rakesh Kumar
- Animal Genomics Laboratory, ICAR-National Dairy Research Institute, Karnal, Haryana, 132 001, India
- Tirtha Kumar Datta
- ICAR-Central Institute for Research on Buffaloes, Hisar, Haryana, 125 001, India
2
Yu Z, Wang D, Meng XB, Chen CLP. Clustering Ensemble Based on Hybrid Multiview Clustering. IEEE TRANSACTIONS ON CYBERNETICS 2022; 52:6518-6530. [PMID: 33284761 DOI: 10.1109/tcyb.2020.3034157] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/12/2023]
Abstract
As an effective method for clustering applications, the clustering ensemble algorithm integrates different clustering solutions into a final one, thus improving the clustering efficiency. The key to designing the clustering ensemble algorithm is to improve the diversities of base learners and optimize the ensemble strategies. To address these problems, we propose a clustering ensemble framework that consists of three parts. First, three view transformation methods, including random principal component analysis, random nearest neighbor, and modified fuzzy extension model, are used as base learners to learn different clustering views. A random transformation and hybrid multiview learning-based clustering ensemble method (RTHMC) is then designed to synthesize the multiview clustering results. Second, a new random subspace transformation is integrated into RTHMC to enhance its performance. Finally, a view-based self-evolutionary strategy is developed to further improve the proposed method by optimizing random subspace sets. Experiments and comparisons demonstrate the effectiveness and superiority of the proposed method for clustering different kinds of data.
3
He K. Pharmacological affinity fingerprints derived from bioactivity data for the identification of designer drugs. J Cheminform 2022; 14:35. [PMID: 35672835 PMCID: PMC9171973 DOI: 10.1186/s13321-022-00607-6] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 02/02/2022] [Accepted: 05/05/2022] [Indexed: 12/15/2022] Open
Abstract
Facing the continuous emergence of new psychoactive substances (NPS) and their threat to public health, more effective methods for NPS prediction and identification are critical. In this study, the pharmacological affinity fingerprints (Ph-fp) of NPS compounds were predicted by Random Forest classification models using bioactivity data from the ChEMBL database. The binary Ph-fp is a vector consisting of a compound's activity against a list of molecular targets reported to be responsible for the pharmacological effects of NPS. Their performance in similarity searching and unsupervised clustering was assessed and compared to the 2D structure fingerprints Morgan and MACCS (1024-bit ECFP4 and 166-bit SMARTS-based MACCS implementation of RDKit). The performance in retrieving compounds according to their pharmacological categorizations is influenced by the predicted active assay counts in Ph-fp and the choice of similarity metric. Overall, the comparative unsupervised clustering analysis suggests the use of a classification model with Morgan fingerprints as input for the construction of Ph-fp. This combination gives satisfactory clustering performance based on external and internal clustering validation indices.
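To illustrate what similarity searching over binary fingerprints involves, here is a minimal sketch using the Tanimoto coefficient. The abstract does not say which similarity metrics were compared, so the choice of Tanimoto is an assumption, and the compound names and six-bit vectors are made-up toy data rather than real Ph-fp values.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two equal-length binary vectors."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    both = sum(1 for x, y in zip(a, b) if x and y)    # bits set in both
    either = sum(1 for x, y in zip(a, b) if x or y)   # bits set in at least one
    return both / either if either else 0.0


def rank_by_similarity(query, library):
    """Return (name, similarity) pairs sorted from most to least similar."""
    scored = [(name, tanimoto(query, fp)) for name, fp in library.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical toy fingerprints: each bit stands for predicted activity on one target.
query = [1, 0, 1, 1, 0, 0]
library = {
    "cmpd_a": [1, 0, 1, 0, 0, 0],
    "cmpd_b": [0, 1, 0, 0, 1, 1],
    "cmpd_c": [1, 0, 1, 1, 0, 1],
}
ranking = rank_by_similarity(query, library)
```

Because the coefficient depends on how many bits are set, fingerprints with few predicted actives can score very differently under different metrics, which is consistent with the abstract's remark about active assay counts.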
Affiliation(s)
- Kedan He
- Physical Sciences, Eastern Connecticut State University, 83 Windham St, Willimantic, CT, 06226, USA.
4
Shi Y, Yu Z, Cao W, Chen CLP, Wong HS, Han G. Fast and Effective Active Clustering Ensemble Based on Density Peak. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2021; 32:3593-3607. [PMID: 32845845 DOI: 10.1109/tnnls.2020.3015795] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Indexed: 06/11/2023]
Abstract
Semisupervised clustering methods improve performance by randomly selecting pairwise constraints, which may lead to redundancy and instability. In this context, active clustering is proposed to maximize the efficacy of annotations by effectively using pairwise constraints. However, existing methods lack an overall consideration of the querying criteria and repeatedly run semisupervised clustering to update labels. In this work, we first propose an active density peak (ADP) clustering algorithm that considers both representativeness and informativeness. Representative instances are selected to capture data patterns, while informative instances are queried to reduce the uncertainty of clustering results. Meanwhile, we design a fast-update-strategy to update labels efficiently. In addition, we propose an active clustering ensemble framework that combines local and global uncertainties to query the most ambiguous instances for better separation between the clusters. A weighted voting consensus method is introduced for better integration of clustering results. We conducted experiments by comparing our methods with state-of-the-art methods on real-world data sets. Experimental results demonstrate the effectiveness of our methods.
5
Wang C, Long Y, Li W, Dai W, Xie S, Liu Y, Zhang Y, Liu M, Tian Y, Li Q, Duan Y. Exploratory study on classification of lung cancer subtypes through a combined K-nearest neighbor classifier in breathomics. Sci Rep 2020; 10:5880. [PMID: 32246031 PMCID: PMC7125212 DOI: 10.1038/s41598-020-62803-4] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Received: 11/11/2019] [Accepted: 02/05/2020] [Indexed: 11/10/2022] Open
Abstract
Accurate classification of adenocarcinoma (AC) and squamous cell carcinoma (SCC) in lung cancer is critical to physicians' clinical decision-making. Exhaled breath analysis offers great potential for non-invasive diagnosis of lung cancer but has rarely been reported for lung cancer subtype classification. In this paper, we propose a combined method, integrating the K-nearest neighbor classifier (KNN), the borderline2-synthetic minority over-sampling technique (borderline2-SMOTE), and feature reduction methods, to investigate the ability of exhaled breath to distinguish AC from SCC patients. The classification performance of the proposed method was compared with the results of four classification algorithms under different combinations of borderline2-SMOTE and feature reduction methods. The results indicated that the KNN classifier combined with borderline2-SMOTE and feature reduction methods was the most promising method for discriminating AC from SCC patients and obtained the highest mean area under the receiver operating characteristic curve (0.63) and mean geometric mean (58.50) when compared to the other classifiers. The results revealed that the combined algorithm could improve the classification performance for lung cancer subtypes in breathomics and suggested that combining non-invasive exhaled breath analysis with multivariate analysis is a promising screening method for informing treatment options and facilitating individualized treatment of patients with lung cancer subtypes.
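The geometric mean reported above is a common metric for imbalanced two-class problems. The abstract does not give its formula, so the sketch below assumes the usual definition, the square root of sensitivity times specificity; the labels are toy values and not data from the study.

```python
import math


def sensitivity_specificity(y_true, y_pred, positive):
    """Count the confusion-matrix cells and return (sensitivity, specificity)."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    sens = tp / (tp + fn) if tp + fn else 0.0
    spec = tn / (tn + fp) if tn + fp else 0.0
    return sens, spec


def g_mean(y_true, y_pred, positive):
    """Geometric mean of sensitivity and specificity (assumed definition)."""
    sens, spec = sensitivity_specificity(y_true, y_pred, positive)
    return math.sqrt(sens * spec)


# Toy labels: six AC and four SCC cases, with SCC treated as the positive class.
y_true = ["AC"] * 6 + ["SCC"] * 4
y_pred = ["AC", "AC", "AC", "AC", "SCC", "SCC", "SCC", "SCC", "SCC", "AC"]
score = g_mean(y_true, y_pred, positive="SCC")
```

Unlike plain accuracy, this score drops to zero if either class is never predicted correctly, which is why it is often paired with over-sampling methods such as SMOTE variants.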
Affiliation(s)
- Chunyan Wang
- Research Center of Analytical Instrumentation, Key Laboratory of Bio-source and Eco-environment, Ministry of Education, College of Life Sciences, Sichuan University, Chengdu, 610064, P.R. China
- Yijing Long
- Research Center of Analytical Instrumentation, Key Laboratory of Bio-source and Eco-environment, Ministry of Education, College of Life Sciences, Sichuan University, Chengdu, 610064, P.R. China
- Wenwen Li
- West China School of Public Health and West China Fourth Hospital, Sichuan University, Chengdu, 610041, P.R. China
- Wei Dai
- Department of Thoracic Surgery, Sichuan Cancer Hospital & Institute, Sichuan Cancer Center, School of Medicine, University of Electronic Science and Technology of China, Chengdu, Sichuan, China
- Shaohua Xie
- Department of Thoracic Surgery, Sichuan Cancer Hospital & Institute, Sichuan Cancer Center, School of Medicine, University of Electronic Science and Technology of China, Chengdu, Sichuan, China; Graduate School, Chengdu Medical College, Chengdu, Sichuan, China
- Yuanling Liu
- Research Center of Analytical Instrumentation, Key Laboratory of Bio-source and Eco-environment, Ministry of Education, College of Life Sciences, Sichuan University, Chengdu, 610064, P.R. China
- Yinchenxi Zhang
- Research Center of Analytical Instrumentation, Key Laboratory of Bio-source and Eco-environment, Ministry of Education, College of Life Sciences, Sichuan University, Chengdu, 610064, P.R. China
- Mingxin Liu
- Department of Thoracic Surgery, Sichuan Cancer Hospital & Institute, Sichuan Cancer Center, School of Medicine, University of Electronic Science and Technology of China, Chengdu, Sichuan, China
- Yonghui Tian
- College of Chemistry and Material Science, Northwest University Department of Chemistry and Material Science, Xi'an, 710127, Shanxi Province, P.R. China.
- Qiang Li
- Department of Thoracic Surgery, Sichuan Cancer Hospital & Institute, Sichuan Cancer Center, School of Medicine, University of Electronic Science and Technology of China, Chengdu, Sichuan, China.
- Yixiang Duan
- Research Center of Analytical Instrumentation, Key Laboratory of Bio-source and Eco-environment, Ministry of Education, College of Life Sciences, Sichuan University, Chengdu, 610064, P.R. China.
6
A Hierarchical Clustering algorithm based on Silhouette Index for cancer subtype discovery from genomic data. Neural Comput Appl 2019. [DOI: 10.1007/s00521-019-04636-5] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Indexed: 01/04/2023]
7
Yu Z, Zhang Y, Chen CLP, You J, Wong HS, Dai D, Wu S, Zhang J. Multiobjective Semisupervised Classifier Ensemble. IEEE TRANSACTIONS ON CYBERNETICS 2019; 49:2280-2293. [PMID: 29993923 DOI: 10.1109/tcyb.2018.2824299] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Indexed: 06/08/2023]
Abstract
Classification of high-dimensional data with very limited labels is a challenging task in the field of data mining and machine learning. In this paper, we propose the multiobjective semisupervised classifier ensemble (MOSSCE) approach to address this challenge. Specifically, a multiobjective subspace selection process (MOSSP) in MOSSCE is first designed to generate the optimal combination of feature subspaces. Three objective functions are then proposed for MOSSP, which include the relevance of features, the redundancy between features, and the data reconstruction error. Then, MOSSCE generates an auxiliary training set based on the sample confidence to improve the performance of the classifier ensemble. Finally, the training set, combined with the auxiliary training set, is used to select the optimal combination of basic classifiers in the ensemble, train the classifier ensemble, and generate the final result. In addition, diversity analysis of the ensemble learning process is applied, and a set of nonparametric statistical tests is adopted for the comparison of semisupervised classification approaches on multiple datasets. The experiments on 12 gene expression datasets and two large image datasets show that MOSSCE has a better performance than other state-of-the-art semisupervised classifiers on high-dimensional data.
8
Yu Z, Wang D, Zhao Z, Chen CLP, You J, Wong HS, Zhang J. Hybrid Incremental Ensemble Learning for Noisy Real-World Data Classification. IEEE TRANSACTIONS ON CYBERNETICS 2019; 49:403-416. [PMID: 29990215 DOI: 10.1109/tcyb.2017.2774266] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Indexed: 06/08/2023]
Abstract
Traditional ensemble learning approaches explore the feature space and the sample space separately, which prevents them from constructing more powerful learning models for noisy real-world dataset classification. The random subspace method only searches over the selection of features, while the bagging approach only searches over the selection of samples. To overcome these limitations, we propose the hybrid incremental ensemble learning (HIEL) approach, which takes the feature space and the sample space into consideration simultaneously to handle noisy datasets. Specifically, HIEL first adopts the bagging technique and linear discriminant analysis to remove noisy attributes, and generates a set of bootstraps and the corresponding ensemble members in the subspaces. Then, the classifiers are selected incrementally based on a classifier-specific criterion function and an ensemble criterion function. The corresponding weights for the classifiers are assigned during the same process. Finally, the final label is determined by a weighted voting scheme, which serves as the final result of the classification. We also explore various classifier-specific criterion functions based on different newly proposed similarity measures, which alleviate the effect of noisy samples on the distance functions. In addition, the computational cost of HIEL is analyzed theoretically. A set of nonparametric tests is adopted to compare HIEL and other algorithms over several datasets. The experimental results show that HIEL performs well on the noisy datasets. HIEL outperforms most of the compared classifier ensemble methods on 14 out of 24 noisy real-world UCI and KEEL datasets.
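The final step described above, combining ensemble members by weighted voting, can be sketched generically as follows. This is a plain weighted-vote illustration under assumed inputs; how HIEL actually assigns the weights is not given in the abstract, so the weights here are hypothetical.

```python
from collections import defaultdict


def weighted_vote(predictions, weights):
    """Return the label whose supporting classifiers carry the most total weight."""
    if len(predictions) != len(weights):
        raise ValueError("one weight is needed per prediction")
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    # On an exact tie, the label that was seen first wins.
    return max(totals, key=totals.get)


# Three hypothetical ensemble members vote on one sample.
member_labels = ["A", "B", "B"]
member_weights = [0.6, 0.25, 0.25]
final_label = weighted_vote(member_labels, member_weights)
```

With equal weights the same call reduces to simple majority voting, so the weights are what let a single reliable member override two weaker ones.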
9
Yu Z, Zhang Y, You J, Chen CLP, Wong HS, Han G, Zhang J. Adaptive Semi-Supervised Classifier Ensemble for High Dimensional Data Classification. IEEE TRANSACTIONS ON CYBERNETICS 2019; 49:366-379. [PMID: 29989979 DOI: 10.1109/tcyb.2017.2761908] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.0] [Indexed: 06/08/2023]
Abstract
High dimensional data classification with very limited labeled training data is a challenging task in the area of data mining. In order to tackle this task, we first propose a feature selection-based semi-supervised classifier ensemble framework (FSCE) to perform high dimensional data classification. Then, we design an adaptive semi-supervised classifier ensemble framework (ASCE) to improve the performance of FSCE. When compared with FSCE, ASCE is characterized by an adaptive feature selection process, an adaptive weighting process (AWP), and an auxiliary training set generation process (ATSGP). The adaptive feature selection process generates a set of compact subspaces based on the selected attributes obtained by the feature selection algorithms, while the AWP associates each basic semi-supervised classifier in the ensemble with a weight value. The ATSGP enlarges the training set with unlabeled samples. In addition, a set of nonparametric tests is adopted to compare multiple semi-supervised classifier ensemble (SSCE) approaches over different datasets. The experiments on 20 high dimensional real-world datasets show that: 1) the two adaptive processes in ASCE are useful for improving the performance of the SSCE approach and 2) ASCE works well on high dimensional datasets with very limited labeled training data, and outperforms most state-of-the-art SSCE approaches.
10
Ye X, Sakurai T. Spectral clustering with adaptive similarity measure in Kernel space. INTELL DATA ANAL 2018. [DOI: 10.3233/ida-173436] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.9] [Indexed: 11/15/2022]
11
Yu Z, Lu Y, Zhang J, You J, Wong HS, Wang Y, Han G. Progressive Semisupervised Learning of Multiple Classifiers. IEEE TRANSACTIONS ON CYBERNETICS 2018; 48:689-702. [PMID: 28113355 DOI: 10.1109/tcyb.2017.2651114] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Indexed: 06/06/2023]
Abstract
Semisupervised learning methods are often adopted to handle datasets with a very small number of labeled samples. However, conventional semisupervised ensemble learning approaches have two limitations: 1) most of them cannot obtain satisfactory results on high dimensional datasets with limited labels and 2) they usually do not consider how to use an optimization process to enlarge the training set. In this paper, we propose the progressive semisupervised ensemble learning approach (PSEMISEL) to address the above limitations and handle datasets with a very small number of labeled samples. When compared with traditional semisupervised ensemble learning approaches, PSEMISEL is characterized by two properties: 1) it adopts the random subspace technique to investigate the structure of the dataset in the subspaces and 2) a progressive training set generation process and a self-evolutionary sample selection process are proposed to enlarge the training set. We also use a set of nonparametric tests to compare different semisupervised ensemble learning methods over multiple datasets. The experimental results on 18 real-world datasets from the University of California, Irvine machine learning repository show that PSEMISEL works well on most of the real-world datasets, and outperforms other state-of-the-art approaches on 10 out of 18 datasets.
12
Yu Z, Wang Z, You J, Zhang J, Liu J, Wong HS, Han G. A New Kind of Nonparametric Test for Statistical Comparison of Multiple Classifiers Over Multiple Datasets. IEEE TRANSACTIONS ON CYBERNETICS 2017; 47:4418-4431. [PMID: 28113414 DOI: 10.1109/tcyb.2016.2611020] [Citation(s) in RCA: 24] [Impact Index Per Article: 3.0] [Indexed: 06/06/2023]
Abstract
Nonparametric statistical analysis, such as the Friedman test (FT), is gaining more and more attention due to its useful applications in a lot of experimental studies. However, traditional FT for the comparison of multiple learning algorithms on different datasets adopts the naive ranking approach. The ranking is based on the average accuracy values obtained by the set of learning algorithms on the datasets, which neither considers the differences of the results obtained by the learning algorithms on each dataset nor takes into account the performance of the learning algorithms in each run. In this paper, we will first propose three kinds of ranking approaches, which are the weighted ranking approach, the global ranking approach (GRA), and the weighted GRA. Then, a theoretical analysis is performed to explore the properties of the proposed ranking approaches. Next, a set of the modified FTs based on the proposed ranking approaches are designed for the comparison of the learning algorithms. Finally, the modified FTs are evaluated through six classifier ensemble approaches on 34 real-world datasets. The experiments show the effectiveness of the modified FTs.
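For context, the traditional Friedman test that this work modifies can be sketched as below: each algorithm is ranked per dataset, ranks are averaged, and a chi-square-type statistic is formed. This is the textbook form only; the weighted and global ranking variants proposed in the paper are not reproduced, and the accuracy table is invented toy data.

```python
def average_ranks_row(scores):
    """Rank one dataset's scores: highest score gets rank 1, ties share the mean rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the group while the next score ties with the current one.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of the 1-based positions in the tie group
        for i in order[pos:end + 1]:
            ranks[i] = shared
        pos = end + 1
    return ranks


def friedman_statistic(table):
    """Return (mean rank per algorithm, Friedman chi-square statistic).

    `table[d][a]` is the accuracy of algorithm `a` on dataset `d`.
    """
    n = len(table)        # number of datasets
    k = len(table[0])     # number of algorithms
    rank_rows = [average_ranks_row(row) for row in table]
    mean_ranks = [sum(row[j] for row in rank_rows) / n for j in range(k)]
    chi2 = 12 * n / (k * (k + 1)) * (
        sum(r * r for r in mean_ranks) - k * (k + 1) ** 2 / 4
    )
    return mean_ranks, chi2


# Toy accuracies: 4 datasets (rows) by 3 algorithms (columns).
accuracy = [
    [0.90, 0.85, 0.80],
    [0.88, 0.86, 0.70],
    [0.75, 0.80, 0.60],
    [0.92, 0.91, 0.85],
]
mean_ranks, chi2 = friedman_statistic(accuracy)
```

Note that only the ordering within each row matters here: a win by 0.01 and a win by 0.20 earn the same rank, which is the loss of information the abstract's alternative ranking schemes aim to address.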
13
Yu Z, Zhu X, Wong HS, You J, Zhang J, Han G. Distribution-Based Cluster Structure Selection. IEEE TRANSACTIONS ON CYBERNETICS 2017; 47:3554-3567. [PMID: 27254876 DOI: 10.1109/tcyb.2016.2569529] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.6] [Indexed: 06/05/2023]
Abstract
The objective of cluster structure ensemble is to find a unified cluster structure from multiple cluster structures obtained from different datasets. Unfortunately, not all the cluster structures contribute to the unified cluster structure. This paper investigates the problem of how to select the suitable cluster structures in the ensemble which will be summarized to a more representative cluster structure. Specifically, the cluster structure is first represented by a mixture of Gaussian distributions, the parameters of which are estimated using the expectation-maximization algorithm. Then, several distribution-based distance functions are designed to evaluate the similarity between two cluster structures. Based on the similarity comparison results, we propose a new approach, which is referred to as the distribution-based cluster structure ensemble (DCSE) framework, to find the most representative unified cluster structure. We then design a new technique, the distribution-based cluster structure selection strategy (DCSSS), to select a subset of cluster structures. Finally, we propose using a distribution-based normalized hypergraph cut algorithm to generate the final result. In our experiments, a nonparametric test is adopted to evaluate the difference between DCSE and its competitors. We adopt 20 real-world datasets obtained from the University of California, Irvine and knowledge extraction based on evolutionary learning repositories, and a number of cancer gene expression profiles to evaluate the performance of the proposed methods. The experimental results show that: 1) DCSE works well on the real-world datasets and 2) DCSE based on DCSSS can further improve the performance of the algorithm.
14
Lu H, Chen J, Yan K, Jin Q, Xue Y, Gao Z. A hybrid feature selection algorithm for gene expression data classification. Neurocomputing 2017. [DOI: 10.1016/j.neucom.2016.07.080] [Citation(s) in RCA: 177] [Impact Index Per Article: 22.1] [Indexed: 12/31/2022]
15
Zhongxin W, Gang S, Jing Z, Jia Z. Feature Selection Algorithm Based on Mutual Information and Lasso for Microarray Data. ACTA ACUST UNITED AC 2016. [DOI: 10.2174/1874070701610010278] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.8] [Indexed: 11/22/2022]
Abstract
With the development of microarray technology, massive amounts of microarray data are produced by gene expression experiments, providing a new approach for the study of human disease. Because microarray data are high-dimensional, noisy and redundant, it is difficult to mine knowledge from them thoroughly and accurately, and selecting informative genes is also very difficult. Therefore, a new feature selection algorithm for high dimensional microarray data is proposed in this paper, which mainly involves two steps. In the first step, the mutual information method is used to score all genes; according to the mutual information values, informative genes are selected as a candidate gene subset and irrelevant genes are filtered out. In the second step, an improved method based on Lasso is used to select informative genes from the candidate gene subset, with the aim of removing redundant genes. Experimental results show that the proposed algorithm selects fewer genes and has better classification ability, stable performance and strong generalization ability. It is an effective gene feature selection algorithm.
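The first step described above, filtering genes by mutual information with the class label, can be sketched as follows. This assumes expression values have already been discretized and uses the plug-in estimate of mutual information; the gene names, values and the `keep` parameter are hypothetical, and the paper's improved Lasso step is not shown.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in nats for two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        # p(x, y) * log( p(x, y) / (p(x) * p(y)) ), written with raw counts
        mi += (count / n) * math.log((count * n) / (px[x] * py[y]))
    return mi


def top_genes(expression, labels, keep):
    """Return the `keep` gene names with the highest mutual information to the labels."""
    scores = {gene: mutual_information(values, labels)
              for gene, values in expression.items()}
    return sorted(scores, key=scores.get, reverse=True)[:keep]


# Toy data: four samples, binary class labels, discretized expression levels.
labels = [0, 0, 1, 1]
expression = {
    "gene_a": [0, 0, 1, 1],  # tracks the label exactly
    "gene_b": [0, 1, 0, 1],  # independent of the label
    "gene_c": [0, 0, 0, 1],  # partially informative
}
candidates = top_genes(expression, labels, keep=2)
```

A filter like this scores each gene independently, so two highly correlated genes can both survive; that is the redundancy a second-stage sparse method such as Lasso is meant to remove.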
16
17
Applying Cost-Sensitive Extreme Learning Machine and Dissimilarity Integration to Gene Expression Data Classification. COMPUTATIONAL INTELLIGENCE AND NEUROSCIENCE 2016; 2016:8056253. [PMID: 27642292 PMCID: PMC5011754 DOI: 10.1155/2016/8056253] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.4] [Received: 04/01/2016] [Revised: 07/15/2016] [Accepted: 07/26/2016] [Indexed: 11/17/2022]
Abstract
Embedding cost-sensitive factors into the classifiers increases the classification stability and reduces the classification costs for classifying high-scale, redundant, and imbalanced datasets, such as the gene expression data. In this study, we extend our previous work, that is, Dissimilar ELM (D-ELM), by introducing misclassification costs into the classifier. We name the proposed algorithm as the cost-sensitive D-ELM (CS-D-ELM). Furthermore, we embed rejection cost into the CS-D-ELM to increase the classification stability of the proposed algorithm. Experimental results show that the rejection cost embedded CS-D-ELM algorithm effectively reduces the average and overall cost of the classification process, while the classification accuracy still remains competitive. The proposed method can be extended to classification problems of other redundant and imbalanced data.
18
Jung S. In-silico interaction-resolution pathway activity quantification and application to identifying cancer subtypes. BMC Med Inform Decis Mak 2016; 16 Suppl 1:55. [PMID: 27455040 PMCID: PMC4959392 DOI: 10.1186/s12911-016-0295-2] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/20/2022] Open
Abstract
Background: Identifying subtypes of complex diseases such as cancer is the very first step toward developing highly customized therapeutics for such diseases, as their origins vary significantly even with similar physiological characteristics. There have been many studies to recognize subtypes of various cancers based on genomic signatures, and most of them rely on approaches based on the signatures or features developed from individual genes. However, the idea of network-driven activities of biological functions has gained considerable interest, as more evidence is found that biological systems can show highly diverse activity patterns because genes can interact differentially across specific molecular contexts. Methods: In this study, we proposed an in-silico method to quantify pathway activities with a resolution of genetic interactions for individual samples, and developed a method to compute the discrepancy between samples based on the quantified pathway activities. Results: By using the proposed discrepancy measure between sample pathway activities in clustering melanoma gene expression data, we identified two potential subtypes of melanoma with distinct pathway activities, where the two groups of patients showed significantly different survival patterns. We also investigated selected pathways with distinct activity patterns between the two groups, and the result suggests hypotheses on the mechanisms driving the two potential subtypes. Conclusions: By using the proposed approach of modeling pathway activities with a resolution of genetic interactions, potential novel subtypes of disease were proposed with accompanying hypotheses on subtype-specific genetic interaction information.
Affiliation(s)
- Sungwon Jung
- Department of Genome Medicine and Science, Gachon University School of Medicine, Incheon, 21565, Republic of Korea.
19
Survey of Programs Used to Detect Alternative Splicing Isoforms from Deep Sequencing Data In Silico. BIOMED RESEARCH INTERNATIONAL 2015; 2015:831352. [PMID: 26421304 PMCID: PMC4573434 DOI: 10.1155/2015/831352] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Received: 11/26/2014] [Revised: 02/17/2015] [Accepted: 03/02/2015] [Indexed: 11/29/2022]
Abstract
Next-generation sequencing techniques have been rapidly emerging. However, the massive sequencing reads hide a great deal of unknown important information. Advances have enabled researchers to discover alternative splicing (AS) sites and isoforms using computational approaches instead of molecular experiments. Given the importance of AS for gene expression and protein diversity in eukaryotes, detecting alternative splicing and isoforms represents a hot topic in systems biology and epigenetics research. The computational methods applied to AS prediction have improved since the emergence of next-generation sequencing. In this study, we introduce state-of-the-art research on AS and then compare the research methods and software tools available for AS based on next-generation sequencing reads. Finally, we discuss the prospects of computational methods related to AS.
20
Yu Z, Chen H, You J, Liu J, Wong HS, Han G, Li L. Adaptive Fuzzy Consensus Clustering Framework for Clustering Analysis of Cancer Data. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2015; 12:887-901. [PMID: 26357330 DOI: 10.1109/tcbb.2014.2359433] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.6] [Indexed: 06/05/2023]
Abstract
Performing clustering analysis is one of the important research topics in cancer discovery using gene expression profiles, which is crucial in facilitating the successful diagnosis and treatment of cancer. While there are quite a number of research works which perform tumor clustering, few of them consider how to incorporate fuzzy theory together with an optimization process into a consensus clustering framework to improve the performance of clustering analysis. In this paper, we first propose a random double clustering based cluster ensemble framework (RDCCE) to perform tumor clustering based on gene expression data. Specifically, RDCCE generates a set of representative features using a randomly selected clustering algorithm in the ensemble, and then assigns samples to their corresponding clusters based on the grouping results. In addition, we also introduce the random double clustering based fuzzy cluster ensemble framework (RDCFCE), which is designed to improve the performance of RDCCE by integrating the newly proposed fuzzy extension model into the ensemble framework. RDCFCE adopts the normalized cut algorithm as the consensus function to summarize the fuzzy matrices generated by the fuzzy extension models, partition the consensus matrix, and obtain the final result. Finally, adaptive RDCFCE (A-RDCFCE) is proposed to optimize RDCFCE and further improve its performance by adopting a self-evolutionary process (SEPP) for the parameter set. Experiments on real cancer gene expression profiles indicate that RDCFCE and A-RDCFCE work well on these data sets, and outperform most of the state-of-the-art tumor clustering algorithms.
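The consensus step described in this abstract can be illustrated with a generic co-association matrix. The sketch below is not the authors' RDCFCE: it assumes crisp (non-fuzzy) cluster labels, and the function name `co_association` and the toy ensemble are hypothetical. It shows only how several base clusterings are summarized into one matrix, which a consensus function such as normalized cut would then partition.

```python
def co_association(labelings):
    """Fraction of base clusterings in which each pair of samples shares a cluster."""
    n = len(labelings[0])
    counts = [[0] * n for _ in range(n)]
    for labels in labelings:
        if len(labels) != n:
            raise ValueError("every labeling must cover the same samples")
        for i in range(n):
            for j in range(n):
                # Cluster ids are only compared within one labeling,
                # so their numbering may differ between ensemble members.
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    k = len(labelings)
    return [[c / k for c in row] for row in counts]


# Three hypothetical base clusterings of four samples.
ensemble = [
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 1, 1, 1],
]
consensus = co_association(ensemble)
```

Entries near 1 mark sample pairs that almost every ensemble member groups together. A fuzzy variant would accumulate membership products instead of the 0/1 agreement used here.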
|
21
|
Gan S, Cosgrove DA, Gardiner EJ, Gillet VJ. Investigation of the use of spectral clustering for the analysis of molecular data. J Chem Inf Model 2014; 54:3302-19. [PMID: 25379955 DOI: 10.1021/ci500480b] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Indexed: 12/13/2022]
Abstract
Spectral clustering involves placing objects into clusters based on the eigenvectors and eigenvalues of an associated matrix. The technique was first applied to molecular data by Brewer [J. Chem. Inf. Model. 2007, 47, 1727-1733] who demonstrated its use on a very small dataset of 125 COX-2 inhibitors. We have determined suitable parameters for spectral clustering using a wide variety of molecular descriptors and several datasets of a few thousand compounds and compared the results of clustering using a nonoverlapping version of Brewer's use of Sarker and Boyer's algorithm with that of Ward's and k-means clustering. We then replaced the exact eigendecomposition method with two different approximate methods and concluded that Singular Value Decomposition is the most appropriate method for clustering larger compound collections of up to 100,000 compounds. We have also used spectral clustering with the Tversky coefficient to generate two sets of clusters linked by a common set of eigenvalues and have used this novel approach to cluster sets of fragments such as those used in fragment-based drug design.
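For orientation, one common textbook form of the "associated matrix" is the normalized graph Laplacian built from a symmetric, nonnegative pairwise similarity matrix. The symbols W, D, and L below are assumptions for illustration; the variant studied in this paper may use a different matrix and a different eigenvector selection rule.

```latex
\[
D_{ii} = \sum_{j} W_{ij}, \qquad
L = D - W, \qquad
L_{\mathrm{sym}} = I - D^{-1/2} W D^{-1/2}
\]
\[
L_{\mathrm{sym}}\, v_k = \lambda_k v_k, \qquad
0 = \lambda_1 \le \lambda_2 \le \dots \le \lambda_n
\]
```

In this form, each object is represented by its coordinates in the eigenvectors belonging to the smallest eigenvalues, and a conventional method such as k-means groups the embedded points.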
Affiliation(s)
- Sonny Gan
- Information School, University of Sheffield , Regent Court, 211 Portobello Street, Sheffield S1 4DP, United Kingdom
|
22
|
Yu Z, Chen H, You J, Wong HS, Liu J, Li L, Han G. Double Selection Based Semi-Supervised Clustering Ensemble for Tumor Clustering from Gene Expression Profiles. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2014; 11:727-740. [PMID: 26356343 DOI: 10.1109/tcbb.2014.2315996] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.8] [Indexed: 06/05/2023]
Abstract
Tumor clustering is one of the important techniques for tumor discovery from cancer gene expression profiles, which is useful for the diagnosis and treatment of cancer. While different algorithms have been proposed for tumor clustering, few make use of the expert's knowledge to improve the performance of tumor discovery. In this paper, we first view the expert's knowledge as constraints in the process of clustering, and propose a feature selection based semi-supervised cluster ensemble framework (FS-SSCE) for tumor clustering from bio-molecular data. Compared with traditional tumor clustering approaches, the proposed framework FS-SSCE has two distinguishing properties: (1) the adoption of feature selection techniques to dispel the effect of noisy genes, and (2) the employment of the binate constraint based K-means algorithm to take into account the effect of experts' knowledge. Then, we propose a double selection based semi-supervised cluster ensemble framework (DS-SSCE), which not only applies the feature selection technique to perform gene selection on the gene dimension, but also selects an optimal subset of representative clustering solutions in the ensemble and improves the performance of tumor clustering using the normalized cut algorithm. DS-SSCE also introduces a confidence factor into the process of constructing the consensus matrix by considering the prior knowledge of the data set. Finally, we design a modified double selection based semi-supervised cluster ensemble framework (MDS-SSCE) which adopts multiple clustering solution selection strategies and an aggregated solution selection function to choose an optimal subset of clustering solutions. The experimental results on cancer gene expression profiles show that (i) FS-SSCE, DS-SSCE and MDS-SSCE are suitable for performing tumor clustering from bio-molecular data, and (ii) MDS-SSCE outperforms a number of state-of-the-art tumor clustering approaches on most of the data sets.
|
23
|
Booma PM, Prabhakaran S, Dhanalakshmi R. An improved Pearson's correlation proximity-based hierarchical clustering for mining biological association between genes. ScientificWorldJournal 2014; 2014:357873. [PMID: 25136661 PMCID: PMC4083291 DOI: 10.1155/2014/357873] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 04/17/2014] [Revised: 05/22/2014] [Accepted: 05/26/2014] [Indexed: 01/06/2023] Open
Abstract
Microarray gene expression datasets have attracted great attention among molecular biologists, statisticians, and computer scientists. Data mining that extracts the hidden and usual information from datasets fails to identify the most significant biological associations between genes. A heuristic search for a standard biological process measures only the gene expression level, threshold, and response time. Heuristic search identifies and mines the best biological solution, but the association process was not efficiently addressed. To monitor a higher rate of expression levels between genes, a hierarchical clustering model was proposed, where the biological association between genes is measured simultaneously using the proximity measure of improved Pearson's correlation (PCPHC). Additionally, the Seed Augment algorithm adopts average linkage methods on rows and columns in order to expand a seed PCPHC model into a maximal global PCPHC (GL-PCPHC) model and to identify associations between the clusters. Moreover, GL-PCPHC applies a pattern growing method to mine the PCPHC patterns. Compared to existing gene expression analysis, the PCPHC model achieves better performance. Experimental evaluations are conducted for the GL-PCPHC model with standard benchmark gene expression datasets extracted from the UCI repository and the GenBank database in terms of execution time, size of pattern, significance level, biological association efficiency, and pattern quality.
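The combination of a correlation-based proximity measure with average linkage can be sketched as follows. This is a minimal illustration that assumes the plain Pearson coefficient, not the paper's improved variant or its Seed Augment algorithm. The function names and the toy expression profiles are hypothetical.

```python
import math


def pearson(x, y):
    """Plain Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a constant profile carries no correlation information
    return cov / (sx * sy)


def average_linkage(profiles, n_clusters):
    """Agglomerative clustering with distance 1 - r and average linkage."""
    clusters = [[i] for i in range(len(profiles))]

    def dist(i, j):
        return 1.0 - pearson(profiles[i], profiles[j])

    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average linkage: mean pairwise distance between members.
                total = sum(dist(i, j) for i in clusters[a] for j in clusters[b])
                d = total / (len(clusters[a]) * len(clusters[b]))
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Hypothetical expression profiles: genes 0 and 1 rise together, 2 and 3 fall together.
profiles = [[1, 2, 3, 4], [2, 4, 6, 8], [4, 3, 2, 1], [8, 6, 4, 2]]
groups = average_linkage(profiles, 2)
```

Using 1 - r as the distance groups genes whose expression changes in the same direction regardless of absolute level, which is why profiles 0 and 1 end up together despite their different magnitudes.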
Affiliation(s)
- P. M. Booma
- Department of Computer and Engineering, KCG College of Technology, KCG Nagar, Rajiv Gandhi Salai, Karapakkam, Chennai, Tamil Nadu 600097, India
- S. Prabhakaran
- Department of Computer Science and Engineering, SRM University, SRM Nagar, Kattankulathur, Kanchipuram, National Highway 45, Potheri, Tamil Nadu 603203, India
- R. Dhanalakshmi
- Department of Computer and Engineering, KCG College of Technology, KCG Nagar, Rajiv Gandhi Salai, Karapakkam, Chennai, Tamil Nadu 600097, India
|
24
|
|
25
|
Mixed pattern matching-based traffic abnormal behavior recognition. ScientificWorldJournal 2014; 2014:834013. [PMID: 24605045 PMCID: PMC3926328 DOI: 10.1155/2014/834013] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/18/2013] [Accepted: 11/14/2013] [Indexed: 11/25/2022] Open
Abstract
A motion trajectory is an intuitive representation, in the time-space domain, of the micromotion behavior of a moving target. Trajectory analysis is an important approach to recognizing abnormal behaviors of moving targets. To address the complexity of vehicle trajectories, this paper first proposes a trajectory pattern learning method based on dynamic time warping (DTW) and spectral clustering. It introduces the DTW distance to measure the distances between vehicle trajectories and determines the number of clusters automatically by a spectral clustering algorithm based on the distance matrix. Then, it groups the sample data points into clusters. After the spatial patterns and direction patterns are learned from the clusters, a recognition method for detecting abnormal vehicle behaviors based on mixed pattern matching is proposed. The experimental results show that the proposed scheme can recognize the main types of abnormal traffic behaviors effectively and is robust. A real-world application verified its feasibility and validity.
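The DTW distance mentioned in this abstract can be sketched with the classic dynamic-programming recurrence. The sketch assumes trajectories are lists of 2-D points compared by Euclidean distance, with no warping-window constraint. The function names and sample tracks are hypothetical, and the paper's exact formulation may differ.

```python
import math


def dtw_distance(traj_a, traj_b):
    """Dynamic time warping cost between two trajectories of (x, y) points."""
    n, m = len(traj_a), len(traj_b)
    inf = float("inf")
    # cost[i][j] = cheapest alignment of the first i points of traj_a
    # with the first j points of traj_b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = math.dist(traj_a[i - 1], traj_b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # traj_b[j-1] is matched again
                cost[i][j - 1],      # traj_a[i-1] is matched again
                cost[i - 1][j - 1],  # both trajectories advance
            )
    return cost[n][m]


def distance_matrix(trajectories):
    """Symmetric pairwise DTW matrix, the input a clustering step would consume."""
    k = len(trajectories)
    mat = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1, k):
            mat[i][j] = mat[j][i] = dtw_distance(trajectories[i], trajectories[j])
    return mat


# Hypothetical tracks: the second follows the same path but lingers at (1, 0).
fast = [(0, 0), (1, 0), (2, 0)]
slow = [(0, 0), (1, 0), (1, 0), (2, 0)]
same_path = dtw_distance(fast, slow)
```

Because warping lets one point match several points on the other track, two vehicles following the same path at different speeds get a small distance, which is the property that makes DTW suitable for trajectories of unequal length.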
|
26
|
Yu Z, Chen H, You J, Han G, Li L. Hybrid fuzzy cluster ensemble framework for tumor clustering from biomolecular data. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2013; 10:657-670. [PMID: 24091399 DOI: 10.1109/tcbb.2013.59] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.6] [Indexed: 06/02/2023]
Abstract
Cancer class discovery using biomolecular data is one of the most important tasks for cancer diagnosis and treatment. Tumor clustering from gene expression data provides a new way to perform cancer class discovery. Most existing research works adopt single clustering algorithms to perform tumor clustering from biomolecular data, and these lack robustness, stability, and accuracy. To further improve the performance of tumor clustering from biomolecular data, we introduce fuzzy theory into the cluster ensemble framework and propose four kinds of hybrid fuzzy cluster ensemble frameworks (HFCEF), named HFCEF-I, HFCEF-II, HFCEF-III, and HFCEF-IV, respectively, to identify samples that belong to different types of cancers. The difference between HFCEF-I and HFCEF-II is that they adopt different ensemble generator approaches to generate a set of fuzzy matrices in the ensemble. Specifically, HFCEF-I applies the affinity propagation algorithm (AP) to perform clustering on the sample dimension and generates a set of fuzzy matrices in the ensemble based on the fuzzy membership function and base samples selected by AP. HFCEF-II adopts AP to perform clustering on the attribute dimension, generates a set of subspaces, and obtains a set of fuzzy matrices in the ensemble by performing fuzzy c-means on subspaces. HFCEF-III and HFCEF-IV, in turn, build on the characteristics of both HFCEF-I and HFCEF-II: HFCEF-III combines HFCEF-I and HFCEF-II in a serial way, while HFCEF-IV integrates HFCEF-I and HFCEF-II in a concurrent way. HFCEFs adopt suitable consensus functions, such as the fuzzy c-means algorithm or the normalized cut algorithm (Ncut), to summarize generated fuzzy matrices, and obtain the final results. The experiments on real data sets from the UCI machine learning repository and cancer gene expression profiles illustrate that 1) the proposed hybrid fuzzy cluster ensemble frameworks work well on real data sets, especially biomolecular data, and 2) the proposed approaches are able to provide more robust, stable, and accurate results when compared with the state-of-the-art single clustering algorithms and traditional cluster ensemble approaches.
Affiliation(s)
- Zhiwen Yu
- South China University of Technology, Guangzhou and Hong Kong Polytechnic University, Hong Kong
|