101
|
Mundra PA, Rajapakse JC. Gene and sample selection using T-score with sample selection. J Biomed Inform 2016; 59:31-41. [DOI: 10.1016/j.jbi.2015.11.003] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Received: 11/10/2014] [Revised: 10/13/2015] [Accepted: 11/04/2015] [Indexed: 10/22/2022]
|
102
|
Mollaee M, Moattar MH. A novel feature extraction approach based on ensemble feature selection and modified discriminant independent component analysis for microarray data classification. Biocybern Biomed Eng 2016. [DOI: 10.1016/j.bbe.2016.05.001] [Citation(s) in RCA: 40] [Impact Index Per Article: 4.4] [Indexed: 11/16/2022]
|
103
|
Lai HM, Albrecht AA, Steinhöfel KK. iRDA: a new filter towards predictive, stable, and enriched candidate genes. BMC Genomics 2015; 16:1041. [PMID: 26647162 PMCID: PMC4673793 DOI: 10.1186/s12864-015-2129-5] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 04/09/2015] [Accepted: 10/22/2015] [Indexed: 11/28/2022] Open
Abstract
Background Gene expression profiling using high-throughput screening (HTS) technologies allows clinical researchers to find prognostic gene signatures that could better discriminate between different phenotypes and serve as potential biological markers in disease diagnoses. In recent years, many feature selection methods have been devised for finding such discriminative genes, and more recently information theoretic filters have also been introduced for capturing feature-to-class relevance and feature-to-feature correlations in microarray-based classification. Methods In this paper, we present and fully formulate a new multivariate filter, iRDA, for the discovery of HTS gene-expression candidate genes. The filter constitutes a four-step framework and includes feature relevance, feature redundancy, and feature interdependence in the context of feature-pairs. The method is based upon approximate Markov blankets, information theory, and several heuristic search strategies with forward, backward and insertion phases, and it targets higher-order gene interactions. Results To show the strengths of iRDA, three performance measures, two evaluation schemes, two stability index sets, and the gene set enrichment analysis (GSEA) are all employed in our experimental studies. Its effectiveness has been validated by using seven well-known cancer gene-expression benchmarks and four other disease experiments, including a comparison to three popular information theoretic filters. In terms of classification performance, candidate genes selected by iRDA perform better than the sets discovered by the other three filters. Two stability measures indicate that iRDA is the most robust with the least variance. GSEA shows that iRDA produces more statistically enriched gene sets on five out of the six benchmark datasets.
Conclusions The classification performance, the stability performance, and the enrichment analysis indicate that iRDA is a promising filter for finding predictive, stable, and enriched gene-expression candidate genes.
Affiliation(s)
- Hung-Ming Lai
- Algorithms and Bioinformatics Research Group, Department of Informatics, King's College London, Strand, London, WC2R 2LS, UK.
| | - Andreas A Albrecht
- School of Science and Technology, Middlesex University, Burroughs, London, NW4 4BT, UK.
| | - Kathleen K Steinhöfel
- Algorithms and Bioinformatics Research Group, Department of Informatics, King's College London, Strand, London, WC2R 2LS, UK.
| |
|
104
|
Liao B, Jiang Y, Liang W, Peng L, Peng L, Hanyurwimfura D, Li Z, Chen M. On Efficient Feature Ranking Methods for High-Throughput Data Analysis. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2015; 12:1374-1384. [PMID: 26684461 DOI: 10.1109/tcbb.2015.2415790] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Indexed: 06/05/2023]
Abstract
Efficient mining of high-throughput data has become one of the popular themes in the big data era. Existing biology-related feature ranking methods mainly focus on statistical and annotation information. In this study, two efficient feature ranking methods are presented. Multi-target regression and graph embedding are incorporated in an optimization framework, and feature ranking is achieved by introducing a structured sparsity norm. Unlike existing methods, the presented methods have two advantages: (1) The feature subset simultaneously accounts for global margin information as well as locality manifold information. Consequently, both global and locality information are considered. (2) Features are selected by batch rather than individually in the algorithm framework. Thus, the interactions between features are considered and the optimal feature subset can be guaranteed. In addition, this study presents a theoretical justification. Empirical experiments demonstrate the effectiveness and efficiency of the two algorithms in comparison with some state-of-the-art feature ranking methods through a set of real-world gene expression data sets.
|
105
|
Pal JK, Ray SS, Pal SK. Identifying relevant group of miRNAs in cancer using fuzzy mutual information. Med Biol Eng Comput 2015; 54:701-10. [PMID: 26264058 DOI: 10.1007/s11517-015-1360-1] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.0] [Received: 02/04/2015] [Accepted: 07/21/2015] [Indexed: 12/17/2022]
Abstract
MicroRNAs (miRNAs) act as a major biomarker of cancer. Not all miRNAs in the human body are equally important for cancer identification. We propose a methodology, called FMIMS, which automatically selects the most relevant miRNAs for a particular type of cancer. In FMIMS, miRNAs are initially grouped by using an SVM-based algorithm; then the group with the highest relevance is determined and the miRNAs in that group are finally ranked for selection according to their redundancy. Fuzzy mutual information is used in computing the relevance of a group and the redundancy of miRNAs within it. Superiority of the most relevant group to all others, in distinguishing normal from cancer samples, is demonstrated on breast, renal, colorectal, lung, melanoma and prostate data. The merit of FMIMS as compared to several existing methods is established. While 12 of the 15 miRNAs selected by FMIMS are corroborated by biological investigations, three of them, viz., "hsa-miR-519," "hsa-miR-431" and "hsa-miR-320c," are possible novel predictions for renal cancer, lung cancer and melanoma, respectively. The selected miRNAs are found to be involved in disease-specific pathways by targeting various genes. The method is also able to detect the responsible miRNAs even at the primary stage of cancer. The related code is available at http://www.jayanta.droppages.com/FMIMS.html .
Affiliation(s)
- Jayanta Kumar Pal
- Center for Soft Computing Research, Indian Statistical Institute, Kolkata, India.
| | - Shubhra Sankar Ray
- Center for Soft Computing Research, Machine Intelligence Unit, Indian Statistical Institute, Kolkata, India
| | - Sankar K Pal
- Center for Soft Computing Research, Machine Intelligence Unit, Indian Statistical Institute, Kolkata, India
| |
|
106
|
ProSim: A Method for Prioritizing Disease Genes Based on Protein Proximity and Disease Similarity. BioMed Research International 2015; 2015:213750. [PMID: 26339594 PMCID: PMC4538409 DOI: 10.1155/2015/213750] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Received: 12/15/2014] [Accepted: 01/16/2015] [Indexed: 01/19/2023]
Abstract
Predicting disease genes for a particular genetic disease is very challenging in bioinformatics. Based on current research studies, this challenge can be tackled via network-based approaches. Furthermore, it has been highlighted that it is necessary to consider disease similarity along with the protein's proximity to disease genes in a protein-protein interaction (PPI) network in order to improve the accuracy of disease gene prioritization. In this study we propose a new algorithm called proximity disease similarity algorithm (ProSim), which takes both of the aforementioned properties into consideration, to prioritize disease genes. To illustrate the proposed algorithm, we have conducted six case studies, namely, prostate cancer, Alzheimer's disease, diabetes mellitus type 2, breast cancer, colorectal cancer, and lung cancer. We employed leave-one-out cross validation, mean enrichment, tenfold cross validation, and ROC curves to evaluate our proposed method and other existing methods. The results show that our proposed method outperforms existing methods such as PRINCE, RWR, and DADA.
|
107
|
Albashish D, Sahran S, Abdullah A, Adam A, Shukor NA, Pauzi SHM. Multi-scoring feature selection method based on SVM-RFE for prostate cancer diagnosis. 2015 International Conference on Electrical Engineering and Informatics (ICEEI) 2015. [DOI: 10.1109/iceei.2015.7352585] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 09/02/2023]
|
108
|
Ganegoda GU, Li M, Wang W, Feng Q. Heterogeneous Network Model to Infer Human Disease-Long Intergenic Non-Coding RNA Associations. IEEE Trans Nanobioscience 2015; 14:175-83. [DOI: 10.1109/tnb.2015.2391133] [Citation(s) in RCA: 45] [Impact Index Per Article: 4.5] [Indexed: 11/08/2022]
|
109
|
Hou D, Koyutürk M. Comprehensive evaluation of composite gene features in cancer outcome prediction. Cancer Inform 2015; 13:93-104. [PMID: 25780335 PMCID: PMC4345828 DOI: 10.4137/cin.s14028] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 06/24/2014] [Revised: 09/29/2014] [Accepted: 10/04/2014] [Indexed: 11/24/2022] Open
Abstract
Owing to the heterogeneous and continuously evolving nature of cancers, classifiers based on the expression of individual genes usually do not result in robust prediction of cancer outcome. As an alternative, composite gene features that combine functionally related genes have been proposed. It is expected that such features can be more robust and reproducible since they can capture the alterations in relevant biological processes as a whole and may be less sensitive to fluctuations in the expression of individual genes. Various algorithms have been developed for the identification of composite features and inference of composite gene feature activity, all of which claim to improve the prediction accuracy. However, because of the limitations of test datasets incorporated by each individual study and inconsistent test procedures, the results of these studies are sometimes conflicting and irreproducible. For this reason, it is difficult to have a comprehensive understanding of the prediction performance of composite gene features, particularly across different cancers, cancer subtypes, and cohorts. In this study, we implement various algorithms for the identification of composite gene features and their utilization in cancer outcome prediction, and perform extensive comparison and evaluation using seven microarray datasets covering two cancer types and three different phenotypes. Our results show that, while some algorithms outperform others for certain classification tasks, no single algorithm consistently outperforms other algorithms and individual gene features.
Affiliation(s)
- Dezhi Hou
- Department of Electrical Engineering and Computer Science, Case Western Reserve University, Cleveland, OH, USA
| | - Mehmet Koyutürk
- Department of Electrical Engineering and Computer Science, Case Western Reserve University, Cleveland, OH, USA; Center for Proteomics and Bioinformatics, Case Western Reserve University, Cleveland, OH, USA
| |
|
110
|
Liao B, Jiang Y, Liang W, Zhu W, Cai L, Cao Z. Gene Selection Using Locality Sensitive Laplacian Score. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2014; 11:1146-1156. [PMID: 26357051 DOI: 10.1109/tcbb.2014.2328334] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.5] [Indexed: 06/05/2023]
Abstract
Gene selection based on microarray data is highly important for classifying tumors accurately. Existing gene selection schemes are mainly based on ranking statistics. From a manifold learning standpoint, local geometrical structure is more essential for characterizing features than global information. In this study, we propose a supervised gene selection method called locality sensitive Laplacian score (LSLS), which incorporates discriminative information into local geometrical structure by minimizing local within-class information and maximizing local between-class information simultaneously. In addition, variance information is considered in our algorithm framework. Finally, to find superior gene subsets, which is significant for biomarker discovery, a two-stage feature selection method that combines the LSLS and a wrapper method (sequential forward selection or sequential backward selection) is presented. Experimental results on six publicly available gene expression profile data sets demonstrate the effectiveness of the proposed approach compared with a number of state-of-the-art gene selection methods.
|
111
|
Bolón-Canedo V, Sánchez-Maroño N, Alonso-Betanzos A, Benítez J, Herrera F. A review of microarray datasets and applied feature selection methods. Inf Sci (N Y) 2014. [DOI: 10.1016/j.ins.2014.05.042] [Citation(s) in RCA: 386] [Impact Index Per Article: 35.1] [Indexed: 11/28/2022]
|
112
|
Yu Z, Chen H, You J, Wong HS, Liu J, Li L, Han G. Double Selection Based Semi-Supervised Clustering Ensemble for Tumor Clustering from Gene Expression Profiles. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2014; 11:727-740. [PMID: 26356343 DOI: 10.1109/tcbb.2014.2315996] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.8] [Indexed: 06/05/2023]
Abstract
Tumor clustering is one of the important techniques for tumor discovery from cancer gene expression profiles, which is useful for the diagnosis and treatment of cancer. While different algorithms have been proposed for tumor clustering, few make use of the expert's knowledge to improve the performance of tumor discovery. In this paper, we first view the expert's knowledge as constraints in the process of clustering, and propose a feature selection based semi-supervised cluster ensemble framework (FS-SSCE) for tumor clustering from bio-molecular data. Compared with traditional tumor clustering approaches, the proposed framework FS-SSCE is characterized by two properties: (1) The adoption of feature selection techniques to dispel the effect of noisy genes. (2) The employment of the binate constraint based K-means algorithm to take into account the effect of experts' knowledge. Then, we propose a double selection based semi-supervised cluster ensemble framework (DS-SSCE), which not only applies the feature selection technique to perform gene selection on the gene dimension, but also selects an optimal subset of representative clustering solutions in the ensemble and improves the performance of tumor clustering using the normalized cut algorithm. DS-SSCE also introduces a confidence factor into the process of constructing the consensus matrix by considering the prior knowledge of the data set. Finally, we design a modified double selection based semi-supervised cluster ensemble framework (MDS-SSCE) which adopts multiple clustering solution selection strategies and an aggregated solution selection function to choose an optimal subset of clustering solutions. The results in the experiments on cancer gene expression profiles show that (i) FS-SSCE, DS-SSCE and MDS-SSCE are suitable for performing tumor clustering from bio-molecular data. (ii) MDS-SSCE outperforms a number of state-of-the-art tumor clustering approaches on most of the data sets.
|
113
|
GaneshKumar P, Rani C, Devaraj D, Victoire TAA. Hybrid Ant Bee Algorithm for Fuzzy Expert System Based Sample Classification. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2014; 11:347-360. [PMID: 26355782 DOI: 10.1109/tcbb.2014.2307325] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.9] [Indexed: 06/05/2023]
Abstract
Accuracy maximization and complexity minimization are the two main goals of a fuzzy expert system based microarray data classification. Our previous Genetic Swarm Algorithm (GSA) approach improved the classification accuracy of the fuzzy expert system at the cost of its interpretability. The if-then rules produced by the GSA are lengthy and complex, which makes them difficult for the physician to understand. To address this interpretability-accuracy tradeoff, the rule set is represented using integer numbers and the task of rule generation is treated as a combinatorial optimization task. Ant colony optimization (ACO) with local and global pheromone updates is applied to find the fuzzy partition based on the gene expression values for generating a simpler rule set. In order to address the formless and continuous expression values of a gene, this paper employs the artificial bee colony (ABC) algorithm to evolve the points of the membership function. Mutual information is used for identification of informative genes. The performance of the proposed hybrid Ant Bee Algorithm (ABA) is evaluated using six gene expression data sets. From the simulation study, it is found that the proposed approach generated an accurate fuzzy system with highly interpretable and compact rules for all the data sets when compared with other approaches.
|
114
|
Hidalgo-Muñoz AR, Ramírez J, Górriz JM, Padilla P. Regions of interest computed by SVM wrapped method for Alzheimer's disease examination from segmented MRI. Front Aging Neurosci 2014; 6:20. [PMID: 24634656 PMCID: PMC3929832 DOI: 10.3389/fnagi.2014.00020] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.1] [Received: 10/08/2013] [Accepted: 02/02/2014] [Indexed: 01/26/2023] Open
Abstract
Accurate identification of the most relevant brain regions linked to Alzheimer's disease (AD) is crucial in order to improve diagnosis techniques and to better understand this neurodegenerative process. For this purpose, statistical classification is suitable. In this work, a novel method based on support vector machine recursive feature elimination (SVM-RFE) is proposed to be applied on segmented brain MRI for detecting the most discriminant AD regions of interest (ROIs). The analyses are performed both on gray and white matter tissues, achieving up to 100% accuracy after classification and outperforming the results obtained by the standard t-test feature selection. The present method, applied on different subject sets, permits automatically determining high-resolution areas surrounding the hippocampal area without needing to divide the brain images according to any common template.
Affiliation(s)
- Antonio R Hidalgo-Muñoz
- Department of Signal Theory, Networking and Communications, University of Granada, Granada, Spain
| | - Javier Ramírez
- Department of Signal Theory, Networking and Communications, University of Granada, Granada, Spain
| | - Juan M Górriz
- Department of Signal Theory, Networking and Communications, University of Granada, Granada, Spain
| | - Pablo Padilla
- Department of Signal Theory, Networking and Communications, University of Granada, Granada, Spain
| |
|
115
|
You W, Yang Z, Ji G. PLS-based recursive feature elimination for high-dimensional small sample. Knowl Based Syst 2014. [DOI: 10.1016/j.knosys.2013.10.004] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.9] [Indexed: 01/24/2023]
|
116
|
Li X, Yin M. Multiobjective Binary Biogeography Based Optimization for Feature Selection Using Gene Expression Data. IEEE Trans Nanobioscience 2013; 12:343-53. [DOI: 10.1109/tnb.2013.2294716] [Citation(s) in RCA: 103] [Impact Index Per Article: 8.6] [Indexed: 11/07/2022]
|
117
|
Ma C, Dong X, Li R, Liu L. A computational study identifies HIV progression-related genes using mRMR and shortest path tracing. PLoS One 2013; 8:e78057. [PMID: 24244287 PMCID: PMC3823927 DOI: 10.1371/journal.pone.0078057] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Received: 07/25/2013] [Accepted: 09/13/2013] [Indexed: 01/18/2023] Open
Abstract
Since statistical relationships between HIV load and CD4+ T cell loss have been demonstrated to be weak, searching for host factors contributing to the pathogenesis of HIV infection becomes a key point for both understanding the disease pathology and developing treatments. We applied the Maximum Relevance Minimum Redundancy (mRMR) algorithm to a set of microarray data generated from the CD4+ T cells of viremic non-progressors (VNPs) and rapid progressors (RPs) to identify host factors associated with the different responses to HIV infection. Using the mRMR algorithm, 147 genes were identified. Furthermore, we constructed a weighted molecular interaction network with the existing protein-protein interaction data from the STRING database and identified 1331 genes on the shortest paths among the genes identified with mRMR. Functional analysis shows that the functions relating to apoptosis play important roles during the pathogenesis of HIV infection. These results bring new insights into understanding HIV progression.
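The greedy selection idea behind mRMR can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes expression values have already been discretized, uses the common "relevance minus mean redundancy" form of the criterion, and the names `mutual_information`, `mrmr`, and the toy genes `g1` to `g3` are hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information (in bits) between two equal-length discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )


def mrmr(features, labels, k):
    """Greedily pick k feature names by relevance minus mean redundancy.

    features: dict mapping a feature name to its discretized values
              (an assumed input format, one value per sample).
    """
    relevance = {name: mutual_information(col, labels)
                 for name, col in features.items()}
    selected = []
    while len(selected) < min(k, len(features)):

        def score(name):
            if not selected:
                return relevance[name]
            redundancy = sum(
                mutual_information(features[name], features[s])
                for s in selected
            ) / len(selected)
            return relevance[name] - redundancy

        best = max((n for n in features if n not in selected), key=score)
        selected.append(best)
    return selected


# Toy data: g2 duplicates g1, while g3 is equally relevant but errs elsewhere.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "g1": [0, 0, 0, 1, 1, 1, 1, 1],
    "g2": [0, 0, 0, 1, 1, 1, 1, 1],
    "g3": [1, 0, 0, 0, 1, 1, 1, 1],
}
picked = mrmr(features, labels, k=2)
```

In this toy setting the redundancy term keeps the duplicated pair `g1`/`g2` from both being chosen, so `picked` contains `g3` plus one of the two duplicates.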
Affiliation(s)
- Chengcheng Ma
- Key Laboratory of Systems Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, P.R. China
- University of Chinese Academy of Sciences, Beijing, P.R. China
| | - Xiao Dong
- Key Laboratory of Systems Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, P.R. China
- University of Chinese Academy of Sciences, Beijing, P.R. China
- Shanghai Center for Bioinformation Technology, Shanghai, P.R. China
| | - Rudong Li
- Key Laboratory of Systems Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, P.R. China
- University of Chinese Academy of Sciences, Beijing, P.R. China
| | - Lei Liu
- Institutes for Biomedical Sciences, Fudan University, Shanghai, P.R. China
| |
|
118
|
Maleki M, Vasudev G, Rueda L. The role of electrostatic energy in prediction of obligate protein-protein interactions. Proteome Sci 2013; 11:S11. [PMID: 24564955 PMCID: PMC3907787 DOI: 10.1186/1477-5956-11-s1-s11] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.7] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Prediction and analysis of protein-protein interactions (PPI) and specifically types of PPIs is an important problem in life science research because of the fundamental roles of PPIs in many biological processes in living cells. In addition, electrostatic interactions are important in understanding inter-molecular interactions, since they are long-range, and because of their influence in charged molecules. This is the main motivation for using electrostatic energy for prediction of PPI types. RESULTS We propose a prediction model to analyze protein interaction types, namely obligate and non-obligate, using electrostatic energy values as properties. The prediction approach uses electrostatic energy values for pairs of atoms and amino acids present in interfaces where the interaction occurs. The main features of the complexes are found and then the prediction is performed via several state-of-the-art classification techniques, including linear dimensionality reduction (LDR), support vector machine (SVM), naive Bayes (NB) and k-nearest neighbor (k-NN). For an in-depth analysis of classification results, some other experiments were performed by varying the distance cutoffs between atom pairs of interacting chains, ranging from 5Å to 13Å. Moreover, several feature selection algorithms including gain ratio (GR), information gain (IG), chi-square (Chi2) and minimum redundancy maximum relevance (mRMR) are applied on the available datasets to obtain more discriminative pairs of atom types and amino acid types as features for prediction. CONCLUSIONS Our results on two well-known datasets of obligate and non-obligate complexes confirm that electrostatic energy is an important property to predict obligate and non-obligate protein interaction types on the basis of all the experimental results, achieving accuracies of over 98%. 
Furthermore, a comparison performed by changing the distance cutoff demonstrates that the best values for prediction of PPI types using electrostatic energy range from 9Å to 12Å, which show that electrostatic interactions are long-range and cover a broader area in the interface. In addition, the results on using feature selection before prediction confirm that (a) a few pairs of atoms and amino acids are appropriate for prediction, and (b) prediction performance can be improved by eliminating irrelevant and noisy features and selecting the most discriminative ones.
Affiliation(s)
- Mina Maleki
- School of Computer Science, University of Windsor, 401 Sunset Avenue, Windsor, Ontario, N9B 3P4, Canada
| | - Gokul Vasudev
- School of Computer Science, University of Windsor, 401 Sunset Avenue, Windsor, Ontario, N9B 3P4, Canada
| | - Luis Rueda
- School of Computer Science, University of Windsor, 401 Sunset Avenue, Windsor, Ontario, N9B 3P4, Canada
| |
|
119
|
Sun X, Liu Y, Wei D, Xu M, Chen H, Han J. Selection of interdependent genes via dynamic relevance analysis for cancer diagnosis. J Biomed Inform 2013; 46:252-8. [DOI: 10.1016/j.jbi.2012.10.004] [Citation(s) in RCA: 28] [Impact Index Per Article: 2.3] [Received: 04/04/2012] [Revised: 10/01/2012] [Accepted: 10/03/2012] [Indexed: 11/16/2022]
|
120
|
Rajapakse JC, Mundra PA. Multiclass gene selection using Pareto-fronts. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2013; 10:87-97. [PMID: 23702546 DOI: 10.1109/tcbb.2013.1] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.3] [Indexed: 06/02/2023]
Abstract
Filter methods are often used for selection of genes in multiclass sample classification by using microarray data. Such techniques usually tend to bias toward a few classes that are easily distinguishable from other classes due to imbalances of strong features and sample sizes of different classes. It could therefore lead to selection of redundant genes while missing the relevant genes, leading to poor classification of tissue samples. In this manuscript, we propose to decompose multiclass ranking statistics into class-specific statistics and then use Pareto-front analysis for selection of genes. This alleviates the bias induced by class intrinsic characteristics of dominating classes. The use of Pareto-front analysis is demonstrated on two filter criteria commonly used for gene selection: F-score and KW-score. A significant improvement in classification performance and reduction in redundancy among top-ranked genes were achieved in experiments with both synthetic and real-benchmark data sets.
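As a rough illustration of the Pareto-front idea, the sketch below keeps every gene whose vector of class-specific scores is not dominated by another gene's vector. It is an assumption-laden toy, not the published procedure: the score values are invented, higher is assumed to be better, and `dominates` and `pareto_front` are hypothetical names.

```python
def dominates(a, b):
    """True if score vector a is at least as good as b in every class
    and strictly better in at least one."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))


def pareto_front(scores):
    """Return the genes whose class-specific score vectors are non-dominated.

    scores: dict mapping a gene name to a tuple with one score per class.
    """
    return [
        gene for gene, s in scores.items()
        if not any(dominates(t, s)
                   for other, t in scores.items() if other != gene)
    ]


# Invented class-specific scores for a two-class decomposition.
scores = {
    "g1": (0.9, 0.1),  # strong for class A only
    "g2": (0.2, 0.8),  # strong for class B only
    "g3": (0.1, 0.1),  # weak everywhere, dominated by g4
    "g4": (0.5, 0.5),  # moderate for both classes
}
front = pareto_front(scores)
```

Here `g2` stays on the front even though a single aggregate ranking would be driven largely by the easily separated class A, which is the kind of bias the abstract describes.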
|
121
|
Burton M, Thomassen M, Tan Q, Kruse TA. Prediction of breast cancer metastasis by gene expression profiles: a comparison of metagenes and single genes. Cancer Inform 2012; 11:193-217. [PMID: 23304070 PMCID: PMC3529607 DOI: 10.4137/cin.s10375] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Indexed: 01/05/2023] Open
Abstract
Background In cancer research, the popularity of a large number of microarray applications has led to the development of predictive or prognostic gene expression profiles. However, the diversity of microarray platforms has made the full validation of such profiles and their related gene lists across studies difficult and, at the level of classification accuracies, rarely validated in multiple independent datasets. Frequently, while the individual genes between such lists may not match, genes with the same function are included across such gene lists. Development of such lists does not take into account the fact that genes can be grouped together as metagenes (MGs) based on common characteristics such as pathways, regulation, or genomic location. Such MGs might be used as features in building a predictive model applicable for classifying independent data. It is, therefore, demanding to systematically compare independent validation of gene lists or classifiers based on metagene or individual gene (SG) features. Methods In this study we compared the performance of either metagene- or single-gene-based feature sets and classifiers using random forest and two support vector machines for classifier building. The performance within the same dataset, feature set validation performance, and validation performance of entire classifiers in strictly independent datasets were assessed by 10 times repeated 10-fold cross validation, leave-one-out cross validation, and one-fold validation, respectively. To test the significance of the performance difference between MG- and SG-features/classifiers, we used a repeated down-sampled binomial test approach. Results MG- and SG-feature sets are transferable and perform well for training and testing prediction of metastasis outcome in strictly independent data sets, both between different and within similar microarray platforms, while classifiers had a poorer performance when validated in strictly independent datasets.
The study showed that MG- and SG-feature sets perform equally well in classifying independent data. Furthermore, SG-classifiers significantly outperformed MG-classifiers when validation was conducted between datasets using similar platforms, while no significant performance difference was found when validation was performed between different platforms. Conclusion Prediction of metastasis outcome in lymph node–negative patients by MG- and SG-classifiers showed that SG-classifiers performed significantly better than MG-classifiers when validated in independent data based on the same microarray platform as used for developing the classifier. However, the MG- and SG-classifiers had similar performance when conducting classifier validation in independent data based on a different microarray platform. The latter was also true when only validating sets of MG- and SG-features in independent datasets, both between and within similar and different platforms.
Affiliation(s)
- Mark Burton
- Institute of Clinical Research, Research Unit of Human Genetics, University of Southern Denmark, Odense, Denmark ; Department of Clinical Genetics, Odense University Hospital, Odense, Denmark
|
122
|
|
123
|
Pang H, George SL, Hui K, Tong T. Gene selection using iterative feature elimination random forests for survival outcomes. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2012; 9:1422-31. [PMID: 22547432 PMCID: PMC3495190 DOI: 10.1109/tcbb.2012.63] [Citation(s) in RCA: 28] [Impact Index Per Article: 2.2] [Indexed: 05/20/2023]
Abstract
Although many feature selection methods for classification have been developed, there is a need to identify genes in high-dimensional data with censored survival outcomes. Traditional methods for gene selection in classification problems have several drawbacks. First, the majority of the gene selection approaches for classification are single-gene based. Second, many of the gene selection procedures are not embedded within the algorithm itself. The technique of random forests has been found to perform well in high-dimensional data settings with survival outcomes. It also has an embedded feature to identify variables of importance. Therefore, it is an ideal candidate for gene selection in high-dimensional data with survival outcomes. In this paper, we develop a novel method based on random forests to identify a set of prognostic genes. We compare our method with several machine learning methods and various node split criteria using several real data sets. Our method performed well in both simulations and real data analysis. Additionally, we have shown the advantages of our approach over single-gene-based approaches. Our method incorporates multivariate correlations in microarray data for survival outcomes. The described method allows us to better utilize the information available from microarray data with survival outcomes.
Affiliation(s)
- Herbert Pang
- Biostatistics and Bioinformatics Department, Duke University School of Medicine, Durham, NC 27705.
- Stephen L. George
- Biostatistics and Bioinformatics Department, Duke University School of Medicine, Durham, NC 27705.
- Ken Hui
- School of Medicine, Yale University, New Haven, CT 06510.
- Tiejun Tong
- Mathematics Department, Hong Kong Baptist University, Hong Kong SAR, China.
|
124
|
Li X, Peng S, Chen J, Lü B, Zhang H, Lai M. SVM-T-RFE: a novel gene selection algorithm for identifying metastasis-related genes in colorectal cancer using gene expression profiles. Biochem Biophys Res Commun 2012; 419:148-53. [PMID: 22306013 DOI: 10.1016/j.bbrc.2012.01.087] [Citation(s) in RCA: 28] [Impact Index Per Article: 2.2] [Received: 01/04/2012] [Accepted: 01/18/2012] [Indexed: 11/16/2022]
Abstract
Although metastasis is the principal cause of death for colorectal cancer (CRC) patients, the molecular mechanisms underlying CRC metastasis are still not fully understood. In an attempt to identify metastasis-related genes in CRC, we obtained gene expression profiles of 55 early stage primary CRCs, 56 late stage primary CRCs, and 34 metastatic CRCs from the expression project in Oncology (http://www.intgen.org/expo/). We developed a novel gene selection algorithm (SVM-T-RFE), which extends the support vector machine recursive feature elimination (SVM-RFE) algorithm by incorporating the T-statistic. We achieved the highest classification accuracy (100%) with smaller gene subsets (10 and 6 genes, respectively) when classifying between early and late stage primary CRCs, as well as between metastatic CRCs and late stage primary CRCs. We also compared the performance of the SVM-T-RFE and SVM-RFE gene selection algorithms on another large-scale CRC dataset and five public microarray datasets. SVM-T-RFE outperformed the SVM-RFE algorithm in identifying more differentially expressed genes and achieving the highest prediction accuracy using an equal or smaller number of selected genes. A fraction of the selected genes have been reported to be associated with CRC development or metastasis.
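The T-statistic ingredient mentioned in the abstract can be sketched as a per-gene ranking score. The example below assumes the Welch (unequal variance) form of the two-sample t-statistic; the abstract does not say which form SVM-T-RFE uses, nor how the statistic is combined with the SVM weights, so that combination is deliberately not shown. The function names and toy values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch two-sample t-statistic for one gene measured in two classes."""
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


def rank_genes_by_t(class_a, class_b):
    """Return gene names ordered by decreasing absolute t-statistic.

    class_a, class_b: dicts mapping gene name -> list of expression values
    (at least two values per class, with some spread, so the variance
    terms are defined and non-zero).
    """
    scores = {g: abs(welch_t(class_a[g], class_b[g])) for g in class_a}
    return sorted(scores, key=scores.get, reverse=True)


# Toy values: g2 separates the two classes clearly, g1 does not.
early = {"g1": [1.0, 1.2, 0.8], "g2": [5.0, 5.1, 4.9]}
late = {"g1": [1.1, 0.9, 1.0], "g2": [7.0, 7.2, 6.8]}
ranking = rank_genes_by_t(early, late)
```

A filter score of this kind looks at each gene on its own, whereas SVM-RFE ranks genes by their weight in a multivariate classifier; combining the two is the extension the abstract describes.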
Affiliation(s)
- Xiaobo Li
- Department of Pathology, School of Medicine, Zhejiang University, Hangzhou 310058, People's Republic of China.
|
125
|
Yu L, Han Y, Berens ME. Stable gene selection from microarray data via sample weighting. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2012; 9:262-272. [PMID: 21383420 DOI: 10.1109/tcbb.2011.47] [Citation(s) in RCA: 35] [Impact Index Per Article: 2.7] [Indexed: 05/30/2023]
Abstract
Feature selection from gene expression microarray data is a widely used technique for selecting candidate genes in various cancer studies. Besides the predictive ability of the selected genes, an important aspect in evaluating a selection method is the stability of the selected genes. Experts instinctively have high confidence in the result of a selection method that selects similar sets of genes under some variations to the samples. However, a common problem of existing feature selection methods for gene expression data is that the genes selected by the same method often vary significantly with sample variations. In this work, we propose a general framework of sample weighting to improve the stability of feature selection methods under sample variations. The framework first weights each sample in a given training set according to its influence on the estimation of feature relevance, and then provides the weighted training set to a feature selection method. We also develop an efficient margin-based sample weighting algorithm under this framework. Experiments on a set of microarray data sets show that the proposed algorithm significantly improves the stability of representative feature selection algorithms such as SVM-RFE and ReliefF, without sacrificing their classification performance. Moreover, the proposed algorithm also leads to more stable gene signatures than the state-of-the-art ensemble method, particularly for small signature sizes.
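Stability under sample variations can be quantified in several ways. As a minimal sketch, the example below uses the average pairwise Jaccard index between the gene sets selected on different resamples of the training data. This particular measure is an assumption for illustration; the abstract does not name the stability index used in the paper, and the function names and gene labels are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two gene collections: |A and B| / |A or B|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def selection_stability(gene_sets):
    """Average pairwise Jaccard index over the gene sets selected on
    different resamples. Requires at least two gene sets."""
    pairs = list(combinations(gene_sets, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Gene sets selected by the same method on three hypothetical resamples.
runs = [["g1", "g2", "g3"], ["g1", "g2", "g4"], ["g1", "g2", "g3"]]
stability = selection_stability(runs)
```

A value of 1.0 means every resample yields the same genes, while values near 0 mean the selected sets barely overlap.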
Affiliation(s)
- Lei Yu
- Binghamton University, Binghamton
|
126
|
Tapia E, Bulacio P, Angelone L. Sparse and stable gene selection with consensus SVM-RFE. Pattern Recognit Lett 2012. [DOI: 10.1016/j.patrec.2011.09.031] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.1] [Indexed: 11/26/2022]
|
127
|
Mohamad MS, Omatu S, Deris S, Yoshioka M. A modified binary particle swarm optimization for selecting the small subset of informative genes from gene expression data. IEEE Trans Inf Technol Biomed 2011; 15:813-22. [PMID: 21914573 DOI: 10.1109/titb.2011.2167756] [Citation(s) in RCA: 68] [Impact Index Per Article: 4.9] [Indexed: 11/08/2022]
Abstract
Gene expression data are expected to be of significant help in the development of efficient cancer diagnosis and classification platforms. To select a small subset of informative genes from the data for cancer classification, many researchers have recently been analyzing gene expression data using various computational intelligence methods. However, because of the small number of samples compared with the huge number of genes (high dimension), irrelevant genes, and noisy genes, many of the computational methods have difficulty selecting the small subset. Thus, we propose an improved (modified) binary particle swarm optimization to select the small subset of informative genes that is relevant for cancer classification. In this proposed method, we introduce particles' speed to give the rate at which a particle changes its position, and we propose a rule for updating particles' positions. By performing experiments on ten different gene expression datasets, we have found that the performance of the proposed method is superior to previous related works, including the conventional version of binary particle swarm optimization (BPSO), in terms of classification accuracy and the number of selected genes. The proposed method also produces lower running times compared to BPSO.
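For orientation, the following is a minimal sketch of one update step of a conventional BPSO-style particle, where each bit marks whether a gene is selected. It is not the modified speed and position rule proposed in the paper, which the abstract does not specify. The inertia weight `w`, the acceleration constants `c1` and `c2`, the clamp `v_max`, and the function name `bpso_step` are illustrative assumptions.

```python
import math
import random


def sigmoid(v):
    """Map a velocity to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))


def bpso_step(position, velocity, personal_best, global_best, rng,
              w=0.9, c1=2.0, c2=2.0, v_max=4.0):
    """One conventional BPSO-style update for a single particle.

    position, personal_best, global_best: lists of 0/1 bits (1 = gene kept).
    velocity: list of floats, one per bit.
    Returns the new (position, velocity) pair.
    """
    new_position, new_velocity = [], []
    for x, v, pb, gb in zip(position, velocity, personal_best, global_best):
        # Pull the velocity toward the personal and global best bits.
        v = w * v + c1 * rng.random() * (pb - x) + c2 * rng.random() * (gb - x)
        # Clamp so the sigmoid never saturates completely.
        v = max(-v_max, min(v_max, v))
        new_velocity.append(v)
        # The bit is set with probability sigmoid(v).
        new_position.append(1 if rng.random() < sigmoid(v) else 0)
    return new_position, new_velocity


rng = random.Random(0)  # fixed seed so the sketch is reproducible
pos, vel = bpso_step([0, 1, 0, 1], [0.0] * 4, [1, 1, 0, 0], [1, 0, 0, 0], rng)
```

In a full gene selection run, each particle's bit vector would be scored by a classifier trained on the selected genes, and the personal and global bests would be refreshed after every step.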
Affiliation(s)
- Mohd Saberi Mohamad
- Faculty of Computer Science and Information Systems, Universiti Teknologi Malaysia, Skudai, Johore, Malaysia.
|
128
|
Mohammadi A, Saraee MH, Salehi M. Identification of disease-causing genes using microarray data mining and Gene Ontology. BMC Med Genomics 2011; 4:12. [PMID: 21269461 PMCID: PMC3037837 DOI: 10.1186/1755-8794-4-12] [Citation(s) in RCA: 31] [Impact Index Per Article: 2.2] [Received: 07/10/2010] [Accepted: 01/26/2011] [Indexed: 01/25/2023] [Open access]
Abstract
BACKGROUND One of the best and most accurate methods for identifying disease-causing genes is monitoring gene expression values in different samples using microarray technology. One of the shortcomings of microarray data is that they provide a small quantity of samples with respect to the number of genes. This problem reduces the classification accuracy of the methods, so gene selection is essential to improve the predictive accuracy and to identify potential marker genes for a disease. Among numerous existing methods for gene selection, support vector machine-based recursive feature elimination (SVMRFE) has become one of the leading methods, but its performance can be reduced because of the small sample size, noisy data and the fact that the method does not remove redundant genes. METHODS We propose a novel framework for gene selection which uses the advantageous features of conventional methods and addresses their weaknesses. In fact, we have combined the Fisher method and SVMRFE to utilize the advantages of a filtering method as well as an embedded method. Furthermore, we have added a redundancy reduction stage to address the weakness of the Fisher method and SVMRFE. In addition to gene expression values, the proposed method uses Gene Ontology which is a reliable source of information on genes. The use of Gene Ontology can compensate, in part, for the limitations of microarrays, such as having a small number of samples and erroneous measurement results. RESULTS The proposed method has been applied to colon, Diffuse Large B-Cell Lymphoma (DLBCL) and prostate cancer datasets. The empirical results show that our method has improved classification performance in terms of accuracy, sensitivity and specificity. In addition, the study of the molecular function of selected genes strengthened the hypothesis that these genes are involved in the process of cancer growth. 
CONCLUSIONS The proposed method addresses the weakness of conventional methods by adding a redundancy reduction stage and utilizing Gene Ontology information. It predicts marker genes for colon, DLBCL and prostate cancer with a high accuracy. The predictions made in this study can serve as a list of candidates for subsequent wet-lab verification and might help in the search for a cure for cancers.
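The filtering stage can be illustrated with the common two-class Fisher criterion, (mean_a - mean_b)^2 / (var_a + var_b), computed per gene. This is a minimal sketch assuming that standard form; the abstract does not give the exact score or the number of genes kept before SVMRFE, and the function names, gene labels, and values are hypothetical.

```python
from statistics import mean, pvariance


def fisher_score(a, b):
    """Two-class Fisher criterion for one gene."""
    numerator = (mean(a) - mean(b)) ** 2
    denominator = pvariance(a) + pvariance(b)
    if denominator == 0:
        # No within-class spread: perfectly separating or uninformative.
        return float("inf") if numerator > 0 else 0.0
    return numerator / denominator


def select_top_genes(class_a, class_b, k):
    """Keep the k genes with the highest Fisher score.

    class_a, class_b: dicts mapping gene name -> list of expression values.
    """
    scores = {g: fisher_score(class_a[g], class_b[g]) for g in class_a}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy values: g1 differs between classes, g2 has the same mean in both.
tumor = {"g1": [2.0, 2.0, 4.0, 4.0], "g2": [1.0, 3.0, 1.0, 3.0]}
normal = {"g1": [8.0, 8.0, 10.0, 10.0], "g2": [2.0, 2.0, 2.0, 2.0]}
top = select_top_genes(tumor, normal, 1)
```

Because the score is computed gene by gene, two highly correlated genes can both rank near the top, which is the kind of redundancy the abstract's redundancy reduction stage is meant to address.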
Affiliation(s)
- Azadeh Mohammadi
- Intelligent Databases, Data mining and Bioinformatics Laboratory, Isfahan University of Technology, Isfahan, Iran
- Mohammad H Saraee
- Intelligent Databases, Data mining and Bioinformatics Laboratory, Isfahan University of Technology, Isfahan, Iran
- Mansoor Salehi
- Dept. of Genetics, Medical School, Isfahan University of Medical Sciences, Isfahan, Iran
|
129
|
Cheng J, Veronika M, Rajapakse JC. Identifying Cells in Histopathological Images. Recognizing Patterns in Signals, Speech, Images and Videos 2010. [DOI: 10.1007/978-3-642-17711-8_25] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Indexed: 01/15/2023]
|