1
Sinha K, Chakraborty S, Bardhan A, Saha R, Chakraborty S, Biswas S. A New Differential Gene Expression Based Simulated Annealing for Solving Gene Selection Problem: A Case Study on Eosinophilic Esophagitis and Few Other Gastro-intestinal Diseases. Biochem Genet 2024:10.1007/s10528-024-10987-z. [PMID: 39643769 DOI: 10.1007/s10528-024-10987-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/29/2024] [Accepted: 11/25/2024] [Indexed: 12/09/2024]
Abstract
Identifying the set of genes collectively responsible for causing a disease from differential gene expression data is called the gene selection problem. Although many complex methodologies have been applied to gene selection, formulated as an optimization problem, this study introduces a new simple, efficient, and biologically plausible solution procedure that focuses on the collective power of the targeted gene set to discriminate between diseased and normal gene expression profiles. It uses Simulated Annealing to solve the underlying optimization problem and is termed here Differential Gene Expression Based Simulated Annealing (DGESA). The Ranked Variance (RV) method has been applied to prioritize genes and form a reference set for comparison with the outcome of DGESA. In a case study on Eosinophilic Esophagitis (EoE) and other gastrointestinal diseases, RV identified the top 40 high-variance genes, overlapping with disease-causing genes from DGESA. DGESA identified 40 gene pathways each for EoE, Crohn's Disease (CD), and Ulcerative Colitis (UC), with 10 genes for EoE, 8 for CD, and 7 for UC confirmed in literature. For EoE, confirmed genes include KRT79, CRISP2, IL36G, SPRR2B, SPRR2D, and SPRR2E. For CD, validated genes are NPDC1, SLC2A4RG, LGALS8, CDKN1A, XAF1, and CYBA. For UC, confirmed genes include TRAF3, BAG6, CCDC80, CDC42SE2, and HSPA9. RV and DGESA effectively elucidate molecular signatures in gastrointestinal diseases. Validating genes like SPRR2B, SPRR2D, SPRR2E, and STAT6 for EoE demonstrates DGESA's efficacy, highlighting potential targets for future research.
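The abstract does not state the objective function or the neighbourhood move used by DGESA, so the following is only a minimal sketch of simulated annealing applied to gene subset selection. The scoring function `subset_score` (distance between class centroids over the chosen genes), the swap move, and the geometric cooling schedule are all assumptions made for illustration, not the authors' method.

```python
import math
import random


def subset_score(subset, diseased, normal):
    """Hypothetical objective: distance between the diseased and normal
    class centroids, restricted to the selected genes."""
    total = 0.0
    for g in subset:
        mean_d = sum(row[g] for row in diseased) / len(diseased)
        mean_n = sum(row[g] for row in normal) / len(normal)
        total += (mean_d - mean_n) ** 2
    return math.sqrt(total)


def anneal_gene_subset(diseased, normal, k, steps=2000, t0=1.0,
                       cooling=0.995, seed=0):
    """Search for a k-gene subset that maximises subset_score."""
    rng = random.Random(seed)
    n_genes = len(diseased[0])
    current = set(rng.sample(range(n_genes), k))
    current_score = subset_score(current, diseased, normal)
    best, best_score = set(current), current_score
    temp = t0
    for _ in range(steps):
        # Neighbour move (assumed): swap one selected gene for an unselected one.
        out_gene = rng.choice(sorted(current))
        in_gene = rng.choice([g for g in range(n_genes) if g not in current])
        candidate = (current - {out_gene}) | {in_gene}
        cand_score = subset_score(candidate, diseased, normal)
        delta = cand_score - current_score
        # Accept improvements always; accept worse moves with Boltzmann probability.
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            current, current_score = candidate, cand_score
            if current_score > best_score:
                best, best_score = set(current), current_score
        temp *= cooling  # geometric cooling schedule (assumed)
    return sorted(best), best_score
```

Each row of `diseased` and `normal` is one expression profile, indexed by gene. Because the assumed score is additive over genes, this toy landscape has no local optima; a real discriminative objective would make the occasional acceptance of worse moves more important.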
Affiliation(s)
- Koushiki Sinha
  - Department of CSE, Meghnad Saha Institute of Technology, Behind Urbana Complex Near Ruby General Hospital, Anandapur Rd, Uchhepota, Kolkata, West Bengal, 700150, India
- Sanchari Chakraborty
  - Department of CSE, Meghnad Saha Institute of Technology, Behind Urbana Complex Near Ruby General Hospital, Anandapur Rd, Uchhepota, Kolkata, West Bengal, 700150, India
- Arohit Bardhan
  - Department of CSE, Meghnad Saha Institute of Technology, Behind Urbana Complex Near Ruby General Hospital, Anandapur Rd, Uchhepota, Kolkata, West Bengal, 700150, India
- Riju Saha
  - Department of CSE, Meghnad Saha Institute of Technology, Behind Urbana Complex Near Ruby General Hospital, Anandapur Rd, Uchhepota, Kolkata, West Bengal, 700150, India
- Srijan Chakraborty
  - Department of CSE, Meghnad Saha Institute of Technology, Behind Urbana Complex Near Ruby General Hospital, Anandapur Rd, Uchhepota, Kolkata, West Bengal, 700150, India
- Surama Biswas
  - Department of CSE, Meghnad Saha Institute of Technology, Behind Urbana Complex Near Ruby General Hospital, Anandapur Rd, Uchhepota, Kolkata, West Bengal, 700150, India
2
Genç M. Penalized logistic regression with prior information for microarray gene expression classification. Int J Biostat 2024; 20:107-122. [PMID: 36427223 DOI: 10.1515/ijb-2022-0025] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 02/15/2022] [Accepted: 11/07/2022] [Indexed: 02/17/2024]
Abstract
Cancer classification and gene selection are important applications in DNA microarray gene expression data analysis. Since DNA microarray data suffers from the high-dimensionality problem, automatic gene selection methods are used to enhance the classification performance of expert classifier systems. In this paper, a new penalized logistic regression method that performs simultaneous gene coefficient estimation and variable selection in DNA microarray data is discussed. The method employs prior information about the gene coefficients to improve the classification accuracy of the underlying model. The coordinate descent algorithm with screening rules is given to obtain the gene coefficient estimates of the proposed method efficiently. The performance of the method is examined on five high-dimensional cancer classification datasets using the area under the curve, the number of selected genes, misclassification rate and F-score measures. The real data analysis results indicate that the proposed method achieves a good cancer classification performance with a small misclassification rate, large area under the curve and F-score by trading off some sparsity level of the underlying model. Hence, the proposed method can be seen as a reliable penalized logistic regression method in the scope of high-dimensional cancer classification.
Affiliation(s)
- Murat Genç
  - Department of Management Information Systems, Faculty of Economics and Administrative Sciences, Tarsus University Mersin, Mersin 33400, Türkiye
3
Roffo G, Melzi S, Castellani U, Vinciarelli A, Cristani M. Infinite Feature Selection: A Graph-based Feature Filtering Approach. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 2021; 43:4396-4410. [PMID: 32750789 DOI: 10.1109/tpami.2020.3002843] [Citation(s) in RCA: 31] [Impact Index Per Article: 7.8] [Indexed: 06/11/2023]
Abstract
We propose a filtering feature selection framework that considers subsets of features as paths in a graph, where a node is a feature and an edge indicates pairwise (customizable) relations among features, dealing with relevance and redundancy principles. By two different interpretations (exploiting properties of power series of matrices and relying on Markov chains fundamentals) we can evaluate the values of paths (i.e., feature subsets) of arbitrary lengths, eventually going to infinity, from which we dub our framework Infinite Feature Selection (Inf-FS). Going to infinity allows us to constrain the computational complexity of the selection process, and to rank the features in an elegant way, that is, considering the value of any path (subset) containing a particular feature. We also propose a simple unsupervised strategy to cut the ranking, thus providing the subset of features to keep. In the experiments, we analyze diverse settings with heterogeneous features, for a total of 11 benchmarks, comparing against 18 widely-known comparative approaches. The results show that Inf-FS behaves better in almost any situation, that is, when the number of features to keep is fixed a priori, or when the decision of the subset cardinality is part of the process.
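The power-series interpretation mentioned above can be illustrated with a small sketch. It assumes a non-negative feature graph `adjacency` has already been built (how edge weights are defined is left open in the abstract) and sums the contributions of paths of every length through each feature. The function name, the `damping` parameter, and the row-sum scaling are illustrative choices, not the authors' implementation. For a scaling factor r small enough that the series converges, the sum of (rA)^l over l = 1, 2, ... applied to the all-ones vector equals ((I - rA)^{-1} - I) applied to the same vector, so the loop below approximates that closed form.

```python
def inf_fs_scores(adjacency, damping=0.9, tol=1e-12, max_iter=10_000):
    """Score each feature by the total weight of all paths that start at it.

    adjacency: square list-of-lists with non-negative pairwise weights.
    damping:   value in (0, 1); the matrix is scaled so that its largest
               row sum equals damping, which guarantees convergence.
    """
    n = len(adjacency)
    max_row_sum = max(sum(row) for row in adjacency)
    if max_row_sum <= 0:
        raise ValueError("adjacency must contain at least one positive weight")
    # For a non-negative matrix the spectral radius is at most the largest
    # row sum, so r * A has spectral radius <= damping < 1.
    r = damping / max_row_sum
    walk = [1.0] * n      # (rA)^0 applied to the all-ones vector
    scores = [0.0] * n    # running sum over path lengths 1, 2, 3, ...
    for _ in range(max_iter):
        walk = [r * sum(a * w for a, w in zip(row, walk)) for row in adjacency]
        scores = [s + w for s, w in zip(scores, walk)]
        if max(walk) < tol:
            break
    return scores
```

Sorting features by descending score gives a ranking; the unsupervised cut-off strategy described in the abstract is not reproduced here.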
4
Yang H, Han X, Hao Z. An Immune-Gene-Based Classifier Predicts Prognosis in Patients With Cervical Squamous Cell Carcinoma. Front Mol Biosci 2021; 8:679474. [PMID: 34291084 PMCID: PMC8289438 DOI: 10.3389/fmolb.2021.679474] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 03/11/2021] [Accepted: 06/21/2021] [Indexed: 01/10/2023] Open
Abstract
Objective: Immunity plays a vital role in the human papilloma virus (HPV) persistent infection, and closely associates with occurrence and development of cervical squamous cell carcinoma (CSCC). Herein, we performed an integrated bioinformatics analysis to establish an immune-gene signature and immune-associated nomogram for predicting prognosis of CSCC patients. Methods: The list of immunity-associated genes was retrieved from ImmPort database. The gene and clinical information of CSCC patients were obtained from The Cancer Genome Atlas (TCGA) website. The immune gene signature for predicting overall survival (OS) of CSCC patients was constructed using the univariate Cox-regression analysis, random survival forests, and multivariate Cox-regression analysis. This signature was externally validated in GSE44001 cohort from Gene Expression Omnibus (GEO). Then, based on the established signature and the TCGA cohort with the corresponding clinical information, a nomogram was constructed and evaluated via Cox regression analysis, concordance index (C-index), receiver operating characteristic (ROC) curves, calibration plots and decision curve analyses (DCAs). Results: A 5-immune-gene prognostic signature for CSCC was established. Low expression of ICOS, ISG20 and high expression of ANGPTL4, SBDS, LTBR were risk factors for CSCC prognosis indicating poor OS. Based on this signature, the OS was significantly worse in high-risk group than in low-risk group (p-value < 0.001), the area under curves (AUCs) for 1-, 3-, 5-years OS were, respectively, 0.784, 0.727, and 0.715. A nomogram incorporating the risk score of signature and the clinical stage was constructed. The C-index of this nomogram was 0.76. AUC values were 0.811, 0.717, and 0.712 for 1-, 3-, 5-years OS. The nomogram showed good calibration and gained more net benefits than the 5-immune-gene signature and the clinical stage. 
Conclusion: The 5-immune-gene signature may serve as a novel, independent predictor for prognosis in patients with CSCC. The nomogram incorporating the signature risk score and clinical stage improved predictive performance over the signature and clinical stage alone for predicting 1-year OS.
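The concordance index (C-index) reported above can be made concrete with a short sketch. This is a generic pairwise definition of Harrell's C for right-censored data, written for illustration; it is not taken from the paper, and the argument names are hypothetical.

```python
def concordance_index(times, events, risks):
    """Fraction of comparable patient pairs ranked correctly by risk score.

    times:  observed follow-up time per patient
    events: 1 if the event (death) was observed, 0 if censored
    risks:  model risk score per patient (higher means worse prognosis)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when patient i had an observed event
            # strictly before patient j's follow-up time.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied risk scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which gives context for the 0.76 reported for the nomogram.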
Affiliation(s)
- Huixia Yang
  - Department of Gynecology and Obstetrics, Beijing Friendship Hospital, Capital Medical University, Beijing, China
- Xiaoyan Han
  - Department of Gynecology and Obstetrics, Beijing Friendship Hospital, Capital Medical University, Beijing, China
- Zengping Hao
  - Department of Gynecology and Obstetrics, Beijing Friendship Hospital, Capital Medical University, Beijing, China
5
Mhiri I, Khalifa AB, Mahjoub MA, Rekik I. Brain graph super-resolution for boosting neurological disorder diagnosis using unsupervised multi-topology connectional brain template learning. Med Image Anal 2020; 65:101768. [DOI: 10.1016/j.media.2020.101768] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 11/10/2019] [Revised: 04/12/2020] [Accepted: 06/23/2020] [Indexed: 10/24/2022]
6
Granular Mining and Big Data Analytics: Rough Models and Challenges. PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES INDIA SECTION A-PHYSICAL SCIENCES 2020. [DOI: 10.1007/s40010-018-0578-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/27/2022]
7
Tian Q, Zou J, Fang Y, Yu Z, Tang J, Song Y, Fan S. A Hybrid Ensemble Approach for Identifying Robust Differentially Methylated Loci in Pan-Cancers. Front Genet 2019; 10:774. [PMID: 31543899 PMCID: PMC6739624 DOI: 10.3389/fgene.2019.00774] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 05/10/2019] [Accepted: 07/23/2019] [Indexed: 12/14/2022] Open
Abstract
DNA methylation is a widely investigated epigenetic mark that plays a vital role in tumorigenesis. Advancements in high-throughput assays, such as the Infinium 450K platform, provide genome-scale DNA methylation landscapes in single-CpG locus resolution, and the identification of differentially methylated loci has become an insightful approach to deepen our understanding of cancers. However, the situation with extremely unbalanced numbers of samples and loci (approximately 1:1,000) makes it rather difficult to explore differential methylation between the sick and the normal. In this article, a hybrid approach based on ensemble feature selection for identifying differentially methylated loci (HyDML) was proposed by incorporating instance perturbation and multiple function models. Experiments on data from The Cancer Genome Atlas showed that HyDML not only achieved effective DML identification, but also outperformed the single-feature selection approach in terms of classification performance and the robustness of feature selection. The intensive analysis of the DML indicated that different types of cancers have mutual patterns, and the stable DML sharing in pan-cancers is of the great potential to be biomarkers, which may strengthen the confidence of domain experts to implement biological validations.
Affiliation(s)
- Qi Tian
  - School of Automation Engineering, University of Electronic Science and Technology of China
- Jianxiao Zou
  - School of Automation Engineering, University of Electronic Science and Technology of China
- Yuan Fang
  - School of Automation Engineering, University of Electronic Science and Technology of China
- Zhongli Yu
  - School of Automation Engineering, University of Electronic Science and Technology of China
- Jianxiong Tang
  - School of Automation Engineering, University of Electronic Science and Technology of China
- Ying Song
  - School of Automation Engineering, University of Electronic Science and Technology of China
- Shicai Fan
  - School of Automation Engineering, University of Electronic Science and Technology of China
  - Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
8
A novel filter–wrapper hybrid greedy ensemble approach optimized using the genetic algorithm to reduce the dimensionality of high-dimensional biomedical datasets. Appl Soft Comput 2019. [DOI: 10.1016/j.asoc.2019.105538] [Citation(s) in RCA: 26] [Impact Index Per Article: 4.3] [Indexed: 11/24/2022]
9
Bhola A, Singh S. Visualisation and Modelling of High-Dimensional Cancerous Gene Expression Dataset. JOURNAL OF INFORMATION & KNOWLEDGE MANAGEMENT 2019. [DOI: 10.1142/s0219649219500011] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/18/2022]
Abstract
The increase in the number of dimensions of cancerous gene expression datasets causes an increase in complexity and misinterpretation, and makes a given dataset harder to visualise for further analysis. Therefore, dimensionality reduction, visualisation and modelling of these datasets become challenging. In this paper, a framework is developed which helps to understand, visualise and model high-dimensional cancerous gene expression datasets in lower dimensions, which may be helpful in revealing cancer mechanisms and in diagnosis. Initially, cancerous gene expression datasets are preprocessed to make them complete, precise and efficient; principal component analysis is then applied for dimensionality reduction and visualisation. Regression is used to model the cancerous gene expression datasets so that the type of association (linear or nonlinear) and the directions between gene profiles may be estimated. To assess the performance of the developed framework, three publicly available cancerous gene expression datasets are used, namely breast (GEO Acc. No. GDS5076), lung (GEO Acc. No. GDS5040) and prostate (GEO Acc. No. GDS5072). Cross-validation is used to validate the regression results. The results revealed that a linear approach should be used for the prostate cancer dataset and a nonlinear approach for the breast and lung cancer datasets when finding associations between gene pairs.
Affiliation(s)
- Abhishek Bhola
  - Department of Computer Science and Engineering, Punjab Engineering College (Deemed to be University), Sector 12, Chandigarh 160012, India
- Shailendra Singh
  - Department of Computer Science and Engineering, Punjab Engineering College (Deemed to be University), Sector 12, Chandigarh 160012, India
10
Yan Y, Dai T, Yang M, Du X, Zhang Y, Zhang Y. Classifying Incomplete Gene-Expression Data: Ensemble Learning with Non-Pre-Imputation Feature Filtering and Best-First Search Technique. Int J Mol Sci 2018; 19:ijms19113398. [PMID: 30380746 PMCID: PMC6274900 DOI: 10.3390/ijms19113398] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 10/11/2018] [Revised: 10/20/2018] [Accepted: 10/23/2018] [Indexed: 01/09/2023] Open
Abstract
(1) Background: Gene-expression data usually contain missing values (MVs). Numerous methods focused on how to estimate MVs have been proposed in the past few years. Recent studies show that those imputation algorithms made little difference in classification. Thus, some scholars believe that how to select the informative genes for downstream classification is more important than how to impute MVs. However, most feature-selection (FS) algorithms need beforehand imputation, and the impact of beforehand MV imputation on downstream FS performance is seldom considered. (2) Method: A modified chi-square test-based FS is introduced for gene-expression data. To deal with the challenge of a small sample size of gene-expression data, a heuristic method called recursive element aggregation is proposed in this study. Our approach can directly handle incomplete data without any imputation methods or missing-data assumptions. The most informative genes can be selected through a threshold. After that, the best-first search strategy is utilized to find optimal feature subsets for classification. (3) Results: We compare our method with several FS algorithms. Evaluation is performed on twelve original incomplete cancer gene-expression datasets. We demonstrate that MV imputation on an incomplete dataset impacts subsequent FS in terms of classification tasks. Through directly conducting FS on incomplete data, our method can avoid potential disturbances on subsequent FS procedures caused by MV imputation. An experiment on small, round blue cell tumor (SRBCT) dataset showed that our method found additional genes besides many common genes with the two compared existing methods.
Affiliation(s)
- Yuanting Yan
  - School of Computer Science and Technology, Anhui University, Hefei 230601, China.
  - Key Laboratory of Intelligent Computing and Signal Processing of Ministry of Education, Anhui University, Hefei 230601, China.
- Tao Dai
  - School of Computer Science and Technology, Anhui University, Hefei 230601, China.
- Meili Yang
  - School of Computer Science and Technology, Anhui University, Hefei 230601, China.
- Xiuquan Du
  - School of Computer Science and Technology, Anhui University, Hefei 230601, China.
  - Key Laboratory of Intelligent Computing and Signal Processing of Ministry of Education, Anhui University, Hefei 230601, China.
- Yiwen Zhang
  - School of Computer Science and Technology, Anhui University, Hefei 230601, China.
  - Key Laboratory of Intelligent Computing and Signal Processing of Ministry of Education, Anhui University, Hefei 230601, China.
- Yanping Zhang
  - School of Computer Science and Technology, Anhui University, Hefei 230601, China.
  - Key Laboratory of Intelligent Computing and Signal Processing of Ministry of Education, Anhui University, Hefei 230601, China.
11
Wang WH, Xie TY, Xie GL, Ren ZL, Li JM. An Integrated Approach for Identifying Molecular Subtypes in Human Colon Cancer Using Gene Expression Data. Genes (Basel) 2018; 9:E397. [PMID: 30072645 PMCID: PMC6115727 DOI: 10.3390/genes9080397] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Received: 05/22/2018] [Revised: 07/18/2018] [Accepted: 07/27/2018] [Indexed: 02/08/2023] Open
Abstract
Identifying molecular subtypes of colorectal cancer (CRC) may allow for more rational, patient-specific treatment. Various studies have identified molecular subtypes for CRC using gene expression data, but they are inconsistent and further research is necessary. From a methodological point of view, a progressive approach is needed to identify molecular subtypes in human colon cancer using gene expression data. We propose an approach to identify the molecular subtypes of colon cancer that integrates denoising by the Bayesian robust principal component analysis (BRPCA) algorithm, hierarchical clustering by the directed bubble hierarchical tree (DBHT) algorithm, and feature gene selection by an improved differential evolution based feature selection method (DEFSW) algorithm. In this approach, the normal samples being completely and exclusively clustered into one class is considered to be the standard of reasonable clustering subtypes, and the feature selection pays attention to imbalances of samples among subtypes. With this approach, we identified the molecular subtypes of colon cancer on the mRNA gene expression dataset of 153 colon cancer samples and 19 normal control samples of the Cancer Genome Atlas (TCGA) project. The colon cancer was clustered into 7 subtypes with 44 feature genes. Our approach could identify finer subtypes of colon cancer with fewer feature genes than the other two recent studies and exhibits a generic methodology that might be applied to identify the subtypes of other cancers.
Affiliation(s)
- Wen-Hui Wang
  - State Key Laboratory of Organ Failure Research, Division of Nephrology, Southern Medical University, Guangzhou 510515, China.
  - Department of Bioinformatics, School of Basic Medical Sciences, Southern Medical University, Guangzhou 510515, China.
  - Network Information Center, The Sixth Affiliated Hospital of Sun Yat-Sen University, Guangzhou 510655, China.
- Ting-Yan Xie
  - State Key Laboratory of Organ Failure Research, Division of Nephrology, Southern Medical University, Guangzhou 510515, China.
  - Department of Bioinformatics, School of Basic Medical Sciences, Southern Medical University, Guangzhou 510515, China.
- Guang-Lei Xie
  - State Key Laboratory of Organ Failure Research, Division of Nephrology, Southern Medical University, Guangzhou 510515, China.
  - Department of Bioinformatics, School of Basic Medical Sciences, Southern Medical University, Guangzhou 510515, China.
- Zhong-Lu Ren
  - Center for Systems Medical Genetics, Department of Obstetrics & Gynecology Nanfang Hospital, Southern Medical University, Guangzhou 510515, China.
  - Laboratory of Systems Neuroscience, Institute of Mental Health Southern Medical University, Southern Medical University, Guangzhou 510515, China.
- Jin-Ming Li
  - State Key Laboratory of Organ Failure Research, Division of Nephrology, Southern Medical University, Guangzhou 510515, China.
  - Department of Bioinformatics, School of Basic Medical Sciences, Southern Medical University, Guangzhou 510515, China.
12
An Occlusion-Robust Feature Selection Framework in Pedestrian Detection. SENSORS 2018; 18:s18072272. [PMID: 30011869 PMCID: PMC6068818 DOI: 10.3390/s18072272] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 06/07/2018] [Revised: 07/09/2018] [Accepted: 07/10/2018] [Indexed: 11/21/2022]
Abstract
Better features have been driving the progress of pedestrian detection over the past years. However, as features become richer and higher dimensional, noise and redundancy in the feature sets become bigger problems. These problems slow down learning and can even reduce the performance of the learned model. Current solutions typically exploit dimension reduction techniques. In this paper, we propose a simple but effective feature selection framework for pedestrian detection. Moreover, we introduce occluded pedestrian samples into the training process and combine it with a new feature selection criterion, which enables improved performances for occlusion handling problems. Experimental results on the Caltech Pedestrian dataset demonstrate the efficiency of our method over the state-of-art methods, especially for the occluded pedestrians.
13
Pal JK, Ray SS, Cho SB, Pal SK. Fuzzy-Rough Entropy Measure and Histogram Based Patient Selection for miRNA Ranking in Cancer. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2018; 15:659-672. [PMID: 27831888 DOI: 10.1109/tcbb.2016.2623605] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 06/06/2023]
Abstract
MicroRNAs (miRNAs) are known as an important indicator of cancers. The presence of cancer can be detected by identifying the responsible miRNAs. A fuzzy-rough entropy measure (FREM) is developed which can rank the miRNAs and thereby identify the relevant ones. FREM is used to determine the relevance of a miRNA in terms of separability between normal and cancer classes. While computing the FREM for a miRNA, fuzziness takes care of the overlapping between normal and cancer expressions, whereas rough lower approximation determines their class sizes. MiRNAs are sorted according to the highest relevance (i.e., the capability of class separation) and a percentage among them is selected from the top ranked ones. FREM is also used to determine the redundancy between two miRNAs and the redundant ones are removed from the selected set, as per the necessity. A histogram based patient selection method is also developed which can help to reduce the number of patients to be dealt during the computation of FREM, while compromising very little with the performance of the selected miRNAs for most of the data sets. The superiority of the FREM as compared to some existing methods is demonstrated extensively on six data sets in terms of sensitivity, specificity, and score. While for these data sets the score of the miRNAs selected by our method varies from 0.70 to 0.91 using SVM, those results vary from 0.37 to 0.90 for some other methods. Moreover, all the selected miRNAs corroborate with the findings of biological investigations or pathway analysis tools. The source code of FREM is available at http://www.jayanta.droppages.com/FREM.html.
14
Chlis NK, Bei ES, Zervakis M. Introducing a Stable Bootstrap Validation Framework for Reliable Genomic Signature Extraction. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2018; 15:181-190. [PMID: 27913357 DOI: 10.1109/tcbb.2016.2633267] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Indexed: 05/07/2023]
Abstract
The application of machine learning methods for the identification of candidate genes responsible for phenotypes of interest, such as cancer, is a major challenge in the field of bioinformatics. These lists of genes are often called genomic signatures and their linkage to phenotype associations may form a significant step in discovering the causation between genotypes and phenotypes. Traditional methods that produce genomic signatures from DNA Microarray data tend to extract significantly different lists under relatively small variations of the training data. That instability hinders the validity of research findings and raises skepticism about the reliability of such methods. In this study, a complete framework for the extraction of stable and reliable lists of candidate genes is presented. The proposed methodology enforces stability of results at the validation step and as a result, it is independent of the feature selection and classification methods used. Furthermore, two different statistical tests are performed in order to assess the statistical significance of the observed results. Moreover, the consistency of the signatures extracted by independent executions of the proposed method is also evaluated. The results of this study highlight the importance of stability issues in genomic signatures, beyond their prediction capabilities.
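The instability described above, where small changes in the training data yield different gene lists, can be quantified in a simple way. The sketch below is not the framework proposed in the paper, whose stability criterion is not given in the abstract. It assumes a toy selector (`top_k_genes`, ranking genes by the gap between class means) and measures stability as the mean pairwise Jaccard similarity of the signatures obtained on bootstrap resamples; both choices are illustrative.

```python
import random
from itertools import combinations


def top_k_genes(samples, labels, k):
    """Toy selector (assumption): rank genes by |mean(class 1) - mean(class 0)|."""
    n_genes = len(samples[0])
    pos = [s for s, y in zip(samples, labels) if y == 1]
    neg = [s for s, y in zip(samples, labels) if y == 0]

    def gap(g):
        mean_pos = sum(s[g] for s in pos) / len(pos)
        mean_neg = sum(s[g] for s in neg) / len(neg)
        return abs(mean_pos - mean_neg)

    return set(sorted(range(n_genes), key=gap, reverse=True)[:k])


def jaccard(a, b):
    """Overlap of two gene sets: |intersection| / |union|."""
    return len(a & b) / len(a | b)


def bootstrap_stability(samples, labels, k, n_boot=50, seed=0):
    """Mean pairwise Jaccard similarity of signatures across bootstrap resamples."""
    rng = random.Random(seed)
    n = len(samples)
    signatures = []
    while len(signatures) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        boot_labels = [labels[i] for i in idx]
        if len(set(boot_labels)) < 2:
            continue  # resample lost a class; draw again
        signatures.append(top_k_genes([samples[i] for i in idx], boot_labels, k))
    pairs = list(combinations(signatures, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```

A score near 1 means the selector returns almost the same signature on every resample; a score near 0 means the lists barely overlap.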
15
Biswas S, Dutta S, Acharyya S. Identification of Disease Critical Genes Using Collective Meta-heuristic Approaches: An Application to Preeclampsia. Interdiscip Sci 2017; 11:444-459. [PMID: 29196984 DOI: 10.1007/s12539-017-0276-x] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 01/06/2017] [Revised: 11/06/2017] [Accepted: 11/21/2017] [Indexed: 10/18/2022]
Abstract
Identifying a small subset of disease critical genes out of a large size of microarray gene expression data is a challenge in computational life sciences. This paper has applied four meta-heuristic algorithms, namely, honey bee mating optimization (HBMO), harmony search (HS), differential evolution (DE) and genetic algorithm (basic version GA) to find disease critical genes of preeclampsia which affects women during gestation. Two hybrid algorithms, namely, HBMO-kNN and HS-kNN have been newly proposed here where kNN (k nearest neighbor classifier) is used for sample classification. Performances of these new approaches have been compared with other two hybrid algorithms, namely, DE-kNN and SGA-kNN. Three datasets of different sizes have been used. In a dataset, the set of genes found common in the output of each algorithm is considered here as disease critical genes. In different datasets, the percentage of classification or classification accuracy of meta-heuristic algorithms varied between 92.46 and 100%. HBMO-kNN has the best performance (99.64-100%) in almost all data sets. DE-kNN secures the second position (99.42-100%). Disease critical genes obtained here match with clinically revealed preeclampsia genes to a large extent.
Affiliation(s)
- Surama Biswas
  - Department of Computer Science and Engineering, Maulana Abul Kalam Azad University of Technology, West Bengal (MAKAUT, WB), BF-142, Sector-I, Salt Lake, Kolkata, West Bengal, 700064, India.
- Subarna Dutta
  - Department of Computer Science and Engineering, Maulana Abul Kalam Azad University of Technology, West Bengal (MAKAUT, WB), BF-142, Sector-I, Salt Lake, Kolkata, West Bengal, 700064, India
- Sriyankar Acharyya
  - Department of Computer Science and Engineering, Maulana Abul Kalam Azad University of Technology, West Bengal (MAKAUT, WB), BF-142, Sector-I, Salt Lake, Kolkata, West Bengal, 700064, India
17
Wang W, Ackland DC, McClelland JA, Webster KE, Halgamuge S. Assessment of Gait Characteristics in Total Knee Arthroplasty Patients Using a Hierarchical Partial Least Squares Method. IEEE J Biomed Health Inform 2017; 22:205-214. [PMID: 28371786 DOI: 10.1109/jbhi.2017.2689070] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Indexed: 11/08/2022]
Abstract
Quantitative gait analysis is an important tool in objective assessment and management of total knee arthroplasty (TKA) patients. Studies evaluating gait patterns in TKA patients have tended to focus on discrete data such as spatiotemporal information, joint range of motion and peak values of kinematics and kinetics, or consider selected principal components of gait waveforms for analysis. These strategies may not have the capacity to capture small variations in gait patterns associated with each joint across an entire gait cycle, and may ultimately limit the accuracy of gait classification. The aim of this study was to develop an automatic feature extraction method to analyse patterns from high-dimensional autocorrelated gait waveforms. A general linear feature extraction framework was proposed and a hierarchical partial least squares method derived for discriminant analysis of multiple gait waveforms. The effectiveness of this strategy was verified using a dataset of joint angle and ground reaction force waveforms from 43 patients after TKA surgery and 31 healthy control subjects. Compared with principal component analysis and partial least squares methods, the hierarchical partial least squares method achieved generally better classification performance on all possible combinations of waveforms, with the highest classification accuracy . The novel hierarchical partial least squares method proposed is capable of capturing virtually all significant differences between TKA patients and the controls, and provides new insights into data visualization. The proposed framework presents a foundation for more rigorous classification of gait, and may ultimately be used to evaluate the effects of interventions such as surgery and rehabilitation.
|
18
|
Ang JC, Mirzal A, Haron H, Hamed HNA. Supervised, Unsupervised, and Semi-Supervised Feature Selection: A Review on Gene Selection. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2016; 13:971-989. [PMID: 26390495 DOI: 10.1109/tcbb.2015.2478454] [Citation(s) in RCA: 196] [Impact Index Per Article: 21.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/04/2023]
Abstract
Recently, feature selection and dimensionality reduction have become fundamental tools for many data mining tasks, especially for processing high-dimensional data such as gene expression microarray data. Gene expression microarray data comprises up to hundreds of thousands of features with a relatively small sample size. Because learning algorithms usually do not work well with this kind of data, a challenge to reduce the data dimensionality arises. A huge number of gene selection methods are applied to select a subset of relevant features for model construction and to seek better cancer classification performance. This paper presents the basic taxonomy of feature selection, and also reviews the state-of-the-art gene selection methods by grouping the literature into three categories: supervised, unsupervised, and semi-supervised. The comparison of experimental results on the top 5 representative gene expression datasets indicates that the classification accuracy of unsupervised and semi-supervised feature selection is competitive with supervised feature selection.
|
19
|
Wang L, Wang Y, Chang Q. Feature selection methods for big data bioinformatics: A survey from the search perspective. Methods 2016; 111:21-31. [PMID: 27592382 DOI: 10.1016/j.ymeth.2016.08.014] [Citation(s) in RCA: 110] [Impact Index Per Article: 12.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/15/2016] [Revised: 08/25/2016] [Accepted: 08/30/2016] [Indexed: 11/26/2022] Open
Abstract
This paper surveys the main principles of feature selection and their recent applications in big data bioinformatics. Instead of the commonly used categorization into filter, wrapper, and embedded approaches to feature selection, we formulate feature selection as a combinatorial optimization or search problem and categorize feature selection methods into exhaustive search, heuristic search, and hybrid methods, where heuristic search methods may further be categorized into those with or without data-distilled feature ranking measures.
Affiliation(s)
- Lipo Wang
  - School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore.
- Yaoli Wang
  - College of Information Engineering, Taiyuan University of Technology, Taiyuan, China.
- Qing Chang
  - College of Information Engineering, Taiyuan University of Technology, Taiyuan, China.
|
20
|
Lovato P, Bicego M, Kesa M, Jojic N, Murino V, Perina A. Traveling on discrete embeddings of gene expression. Artif Intell Med 2016; 70:1-11. [PMID: 27431033 DOI: 10.1016/j.artmed.2016.05.002] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/24/2015] [Revised: 05/20/2016] [Accepted: 05/21/2016] [Indexed: 12/24/2022]
Abstract
OBJECTIVE High-throughput technologies have generated an unprecedented amount of high-dimensional gene expression data. Algorithmic approaches could be extremely useful to distill information and derive compact interpretable representations of the statistical patterns present in the data. This paper proposes a mining approach to extract an informative representation of gene expression profiles based on a generative model called the Counting Grid (CG). METHOD Using the CG model, gene expression values are arranged on a discrete grid, learned in a way that "similar" co-expression patterns are arranged in close proximity, thus resulting in an intuitive visualization of the dataset. Moreover, the model makes it possible to identify the genes that distinguish between classes (e.g. different types of cancer). Finally, each sample can be characterized with a discriminative signature, extracted from the model, that can be effectively employed for classification. RESULTS A thorough evaluation on several gene expression datasets demonstrates the suitability of the proposed approach from a twofold perspective: numerically, we reached state-of-the-art classification accuracies on 5 datasets out of 7, and similar results when the approach is tested in a gene selection setting (with a stability always above 0.87); clinically, by confirming that many of the genes highlighted by the model as significant also play a key role in cancer biology. CONCLUSION The proposed framework can be successfully exploited to meaningfully visualize the samples, detect medically relevant genes, and properly classify samples.
Affiliation(s)
- Pietro Lovato
  - Department of Computer Science, University of Verona, Strada le Grazie 15, 37134 Verona, Italy.
- Manuele Bicego
  - Department of Computer Science, University of Verona, Strada le Grazie 15, 37134 Verona, Italy.
- Maria Kesa
  - Tallinn University of Technology, Ehitajate tee 5, 19086 Tallinn, Estonia.
- Nebojsa Jojic
  - Microsoft Research, One Microsoft Way, 98052 Redmond, WA, USA.
- Vittorio Murino
  - Pattern Analysis and Computer Vision (PAVIS), Istituto Italiano di Tecnologia (IIT), Via Morego 30, 16163 Genova, Italy.
|
21
|
Wang H, Yang F, Luo Z. An experimental study of the intrinsic stability of random forest variable importance measures. BMC Bioinformatics 2016; 17:60. [PMID: 26842629 PMCID: PMC4739337 DOI: 10.1186/s12859-016-0900-5] [Citation(s) in RCA: 93] [Impact Index Per Article: 10.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/16/2015] [Accepted: 12/15/2015] [Indexed: 12/27/2022] Open
Abstract
Background The stability of Variable Importance Measures (VIMs) based on random forest has recently received increased attention. Despite the extensive attention on traditional stability of data perturbations or parameter variations, few studies include influences coming from the intrinsic randomness in generating VIMs, i.e. bagging, randomization and permutation. To address these influences, in this paper we introduce a new concept of intrinsic stability of VIMs, which is defined as the self-consistency among feature rankings in repeated runs of VIMs without data perturbations and parameter variations. Two widely used VIMs, i.e., Mean Decrease Accuracy (MDA) and Mean Decrease Gini (MDG), are comprehensively investigated. The motivation of this study is two-fold. First, we empirically verify the prevalence of intrinsic stability of VIMs over many real-world datasets to highlight that the instability of VIMs does not originate exclusively from data perturbations or parameter variations, but also stems from the intrinsic randomness of VIMs. Second, through Spearman and Pearson tests we comprehensively investigate how different factors influence the intrinsic stability. Results The experiments are carried out on 19 benchmark datasets with diverse characteristics, including 10 high-dimensional and small-sample gene expression datasets. Experimental results demonstrate the prevalence of intrinsic stability of VIMs. Spearman and Pearson tests on the correlations between intrinsic stability and different factors show that #feature (number of features) and #sample (size of sample) have a coupling effect on the intrinsic stability. The synthetic indicator, #feature/#sample, shows both negative monotonic correlation and negative linear correlation with the intrinsic stability, while OOB accuracy has monotonic correlations with intrinsic stability. This indicates that high-dimensional, small-sample and high complexity datasets may suffer more from intrinsic instability of VIMs.
Furthermore, with respect to parameter settings of random forest, a large number of trees is preferred. No significant correlations can be seen between intrinsic stability and other factors. Finally, the magnitude of intrinsic stability is always smaller than that of traditional stability. Conclusion First, the prevalence of intrinsic stability of VIMs demonstrates that the instability of VIMs not only comes from data perturbations or parameter variations, but also stems from the intrinsic randomness of VIMs. This finding gives a better understanding of VIM stability, and may help reduce the instability of VIMs. Second, by investigating the potential factors of intrinsic stability, users would be more aware of the risks and hence more careful when using VIMs, especially on high-dimensional, small-sample and high complexity datasets.
Affiliation(s)
- Huazhen Wang
  - College of Computer Science and Technology, Huaqiao University, Jimei Avenue, Xiamen, 361021, China; Computer Learning Research Centre, Royal Holloway, University of London, Egham, Surrey, TW20 0EX, UK.
- Fan Yang
  - Automation Department, Xiamen University, Siming South Road, Xiamen, 361005, China.
- Zhiyuan Luo
  - Computer Learning Research Centre, Royal Holloway, University of London, Egham, Surrey, TW20 0EX, UK.
|
22
|
Mundra PA, Rajapakse JC. Gene and sample selection using T-score with sample selection. J Biomed Inform 2016; 59:31-41. [DOI: 10.1016/j.jbi.2015.11.003] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/10/2014] [Revised: 10/13/2015] [Accepted: 11/04/2015] [Indexed: 10/22/2022]
|
23
|
Bonilla-Huerta E, Hernández-Montiel A, Caporal RM, López MA. Hybrid Framework Using Multiple-Filters and an Embedded Approach for an Efficient Selection and Classification of Microarray Data. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2016; 13:12-26. [PMID: 26336138 DOI: 10.1109/tcbb.2015.2474384] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/05/2023]
Abstract
A hybrid framework composed of two stages for gene selection and classification of DNA microarray data is proposed. At the first stage, five traditional statistical methods are combined for preliminary gene selection (Multiple Fusion Filter). Then, different relevant gene subsets are selected by using an embedded Genetic Algorithm (GA), Tabu Search (TS), and Support Vector Machine (SVM). A gene subset, consisting of the most relevant genes, is obtained from this process by analyzing the frequency of each gene in the different gene subsets. Finally, the most frequent genes are evaluated by the embedded approach to obtain a final small subset of relevant genes with high performance. The proposed method is tested on four DNA microarray datasets. From the simulation study, it is observed that the proposed approach works better than other methods reported in the literature.
|
24
|
Nogueira S, Brown G. Measuring the Stability of Feature Selection. MACHINE LEARNING AND KNOWLEDGE DISCOVERY IN DATABASES 2016. [DOI: 10.1007/978-3-319-46227-1_28] [Citation(s) in RCA: 35] [Impact Index Per Article: 3.9] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/03/2022]
|
25
|
Lai HM, Albrecht AA, Steinhöfel KK. iRDA: a new filter towards predictive, stable, and enriched candidate genes. BMC Genomics 2015; 16:1041. [PMID: 26647162 PMCID: PMC4673793 DOI: 10.1186/s12864-015-2129-5] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/09/2015] [Accepted: 10/22/2015] [Indexed: 11/28/2022] Open
Abstract
Background Gene expression profiling using high-throughput screening (HTS) technologies allows clinical researchers to find prognosis gene signatures that could better discriminate between different phenotypes and serve as potential biological markers in disease diagnoses. In recent years, many feature selection methods have been devised for finding such discriminative genes, and more recently information theoretic filters have also been introduced for capturing feature-to-class relevance and feature-to-feature correlations in microarray-based classification. Methods In this paper, we present and fully formulate a new multivariate filter, iRDA, for the discovery of HTS gene-expression candidate genes. The filter constitutes a four-step framework and includes feature relevance, feature redundancy, and feature interdependence in the context of feature-pairs. The method is based upon approximate Markov blankets, information theory, several heuristic search strategies with forward, backward and insertion phases, and the method is aiming at higher order gene interactions. Results To show the strengths of iRDA, three performance measures, two evaluation schemes, two stability index sets, and the gene set enrichment analysis (GSEA) are all employed in our experimental studies. Its effectiveness has been validated by using seven well-known cancer gene-expression benchmarks and four other disease experiments, including a comparison to three popular information theoretic filters. In terms of classification performance, candidate genes selected by iRDA perform better than the sets discovered by the other three filters. Two stability measures indicate that iRDA is the most robust with the least variance. GSEA shows that iRDA produces more statistically enriched gene sets on five out of the six benchmark datasets. 
Conclusions Through the classification performance, the stability performance, and the enrichment analysis, iRDA is a promising filter to find predictive, stable, and enriched gene-expression candidate genes. Electronic supplementary material The online version of this article (doi:10.1186/s12864-015-2129-5) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Hung-Ming Lai
  - Algorithms and Bioinformatics Research Group, Department of Informatics, King's College London, Strand, London, WC2R 2LS, UK.
- Andreas A Albrecht
  - School of Science and Technology, Middlesex University, Burroughs, London, NW4 4BT, UK.
- Kathleen K Steinhöfel
  - Algorithms and Bioinformatics Research Group, Department of Informatics, King's College London, Strand, London, WC2R 2LS, UK.
|
26
|
Algamal ZY, Lee MH. Regularized logistic regression with adjusted adaptive elastic net for gene selection in high dimensional cancer classification. Comput Biol Med 2015; 67:136-45. [DOI: 10.1016/j.compbiomed.2015.10.008] [Citation(s) in RCA: 49] [Impact Index Per Article: 4.9] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/14/2015] [Revised: 10/07/2015] [Accepted: 10/08/2015] [Indexed: 10/22/2022]
|
27
|
Sehhati M, Mehridehnavi A, Rabbani H, Pourhossein M. Stable Gene Signature Selection for Prediction of Breast Cancer Recurrence Using Joint Mutual Information. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2015; 12:1440-1448. [PMID: 26671813 DOI: 10.1109/tcbb.2015.2407407] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/05/2023]
Abstract
In this experiment, a gene selection technique was proposed to select a robust gene signature from microarray data for prediction of breast cancer recurrence. In this regard, a hybrid scoring criterion was designed as a linear combination of the scores that were determined in the mutual information (MI) domain and the protein-protein interaction (PPI) network. The MI-based score represents the complementary information between the selected genes for outcome prediction, whereas the number of connections in the PPI network between the selected genes builds the PPI-based score. All genes were scored by using the proposed function in a hybrid forward-backward gene-set selection process to select the optimum biomarker set from the gene expression microarray data. The accuracy and stability of the finally selected biomarkers were evaluated by using five-fold cross-validation (CV) to classify available data on breast cancer patients into two cohorts of poor and good prognosis. The results showed an appealing improvement in the cross-dataset accuracy in comparison with similar studies whenever we applied a primary signature, which was selected from one dataset, to predict survival in other independent datasets. Moreover, the proposed method demonstrated 58-92 percent overlap between 50-gene signatures, which were selected from seven independent datasets individually.
|
28
|
Nguyen T, Khosravi A, Creighton D, Nahavandi S. A novel aggregate gene selection method for microarray data classification. Pattern Recognit Lett 2015. [DOI: 10.1016/j.patrec.2015.03.018] [Citation(s) in RCA: 40] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/29/2022]
|
29
|
Li Y, Si J, Zhou G, Huang S, Chen S. FREL: A Stable Feature Selection Algorithm. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2015; 26:1388-1402. [PMID: 25134091 DOI: 10.1109/tnnls.2014.2341627] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/03/2023]
Abstract
Two factors characterize a good feature selection algorithm: its accuracy and stability. This paper aims at introducing a new approach to stable feature selection algorithms. The innovation of this paper centers on a class of stable feature selection algorithms called feature weighting as regularized energy-based learning (FREL). Stability properties of FREL using L1 or L2 regularization are investigated. In addition, as a commonly adopted implementation strategy for enhanced stability, an ensemble FREL is proposed. A stability bound for the ensemble FREL is also presented. Our experiments using open source real microarray data, which pose challenging high-dimensionality, small-sample-size problems, demonstrate that our proposed ensemble FREL is not only stable but also achieves accuracy better than or comparable to some other popular stable feature weighting methods.
|
30
|
Yu Z, Chen H, You J, Wong HS, Liu J, Li L, Han G. Double Selection Based Semi-Supervised Clustering Ensemble for Tumor Clustering from Gene Expression Profiles. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2014; 11:727-740. [PMID: 26356343 DOI: 10.1109/tcbb.2014.2315996] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/05/2023]
Abstract
Tumor clustering is one of the important techniques for tumor discovery from cancer gene expression profiles, which is useful for the diagnosis and treatment of cancer. While different algorithms have been proposed for tumor clustering, few make use of the expert's knowledge to better the performance of tumor discovery. In this paper, we first view the expert's knowledge as constraints in the process of clustering, and propose a feature selection based semi-supervised cluster ensemble framework (FS-SSCE) for tumor clustering from bio-molecular data. Compared with traditional tumor clustering approaches, the proposed framework FS-SSCE is characterized by two properties: (1) the adoption of feature selection techniques to dispel the effect of noisy genes; (2) the employment of the binate constraint based K-means algorithm to take into account the effect of experts' knowledge. Then, we propose a double selection based semi-supervised cluster ensemble framework (DS-SSCE), which not only applies the feature selection technique to perform gene selection on the gene dimension, but also selects an optimal subset of representative clustering solutions in the ensemble and improves the performance of tumor clustering using the normalized cut algorithm. DS-SSCE also introduces a confidence factor into the process of constructing the consensus matrix by considering the prior knowledge of the data set. Finally, we design a modified double selection based semi-supervised cluster ensemble framework (MDS-SSCE) which adopts multiple clustering solution selection strategies and an aggregated solution selection function to choose an optimal subset of clustering solutions. The results of the experiments on cancer gene expression profiles show that (i) FS-SSCE, DS-SSCE and MDS-SSCE are suitable for performing tumor clustering from bio-molecular data, and (ii) MDS-SSCE outperforms a number of state-of-the-art tumor clustering approaches on most of the data sets.
|
31
|
|
32
|
Kurnaz MN, Seker H. A framework towards computational discovery of disease sub-types and associated (sub-)biomarkers. ANNUAL INTERNATIONAL CONFERENCE OF THE IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. ANNUAL INTERNATIONAL CONFERENCE 2013; 2013:4074-4077. [PMID: 24110627 DOI: 10.1109/embc.2013.6610440] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/02/2023]
Abstract
Biomarker related patient data is generally assessed in order to determine a relevant but generalized subset of the biomarkers. However, this fails to identify specific sub-groups of the patients or their corresponding subsets of the biomarkers. This paper therefore proposes a novel framework that is capable of discovering disease sub-groups (types) and associated subsets of biomarkers, which is expected to enable the discovery of personalized biomarker sets. The framework is based on the utilization of a histogram obtained by using the Euclidean distances between the samples in a given data set. The t-test method is used for the selection of subset(s) of the biomarkers, whereas the classification is performed by means of k-nearest neighbor, support vector machine and naive Bayes (NBayes) classifiers. For the assessment of the methods, leave-one-out cross validation is employed. As a case study, the method is applied in the analysis of male hypertension microarray data that consists of 159 patients and 22184 gene expressions. The method has helped identify specific sub-groups of the patients and their corresponding biomarker subsets. The results therefore suggest that the generalized biomarker subsets are not representative of the disease and therefore more focus should be on the sub-groups of the patients and their biomarker subsets identified through the proposed approach. It is particularly observed that the threshold values over the histogram are crucial to discover both subsets of the samples and biomarkers, and therefore can be used to determine the complexity level of the study.
|
33
|
|
34
|
Wu MY, Dai DQ, Shi Y, Yan H, Zhang XF. Biomarker identification and cancer classification based on microarray data using Laplace naive Bayes model with mean shrinkage. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2012; 9:1649-1662. [PMID: 22868679 DOI: 10.1109/tcbb.2012.105] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/01/2023]
Abstract
Biomarker identification and cancer classification are two closely related problems. In gene expression data sets, the correlation between genes can be high when they share the same biological pathway. Moreover, the gene expression data sets may contain outliers due to either chemical or electrical reasons. A good gene selection method should take group effects into account and be robust to outliers. In this paper, we propose a Laplace naive Bayes model with mean shrinkage (LNB-MS). The Laplace distribution, instead of the normal distribution, is used as the conditional distribution of the samples because it is less sensitive to outliers and has been applied in many fields. The key technique is the L1 penalty imposed on the mean of each class to achieve automatic feature selection. The objective function of the proposed model is a piecewise linear function with respect to the mean of each class, of which the optimal value can simply be evaluated at the breakpoints. An efficient algorithm is designed to estimate the parameters in the model. A new strategy that uses the number of selected features to control the regularization parameter is introduced. Experimental results on simulated data sets and 17 publicly available cancer data sets attest to the accuracy, sparsity, efficiency, and robustness of the proposed algorithm. Many biomarkers identified with our method have been verified in biochemical or biomedical research. The analysis of biological and functional correlation of the genes based on Gene Ontology (GO) terms shows that the proposed method guarantees the selection of highly correlated genes simultaneously.
Affiliation(s)
- Meng-Yun Wu
  - Center for Computer Vision and Department of Mathematics, Sun Yat-Sen University, Guangzhou 510275, China.
|
35
|
Jurman G, Riccadonna S, Visintainer R, Furlanello C. Algebraic comparison of partial lists in bioinformatics. PLoS One 2012; 7:e36540. [PMID: 22615778 PMCID: PMC3355159 DOI: 10.1371/journal.pone.0036540] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/10/2011] [Accepted: 04/06/2012] [Indexed: 12/20/2022] Open
Abstract
The outcome of a functional genomics pipeline is usually a partial list of genomic features, ranked by their relevance in modelling biological phenotype in terms of a classification or regression model. Due to resampling protocols or to a meta-analysis comparison, it is often the case that sets of alternative feature lists (possibly of different lengths) are obtained, instead of just one list. Here we introduce a method, based on permutations, for studying the variability between lists ("list stability") in the case of lists of unequal length. We provide algorithms evaluating stability for lists embedded in the full feature set or just limited to the features occurring in the partial lists. The method is demonstrated by finding and comparing gene profiles on a large prostate cancer dataset, consisting of two cohorts of patients from different countries, for a total of 455 samples.
|
36
|
Lovato P, Bicego M, Cristani M, Jojic N, Perina A. Feature Selection Using Counting Grids: Application to Microarray Data. 2012. [DOI: 10.1007/978-3-642-34166-3_69] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 03/08/2023]
|