1. Qiu F, Heidari AA, Chen Y, Chen H, Liang G. Advancing forensic-based investigation incorporating slime mould search for gene selection of high-dimensional genetic data. Sci Rep 2024; 14:8599. [PMID: 38615048 PMCID: PMC11016116 DOI: 10.1038/s41598-024-59064-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/02/2024] [Accepted: 04/06/2024] [Indexed: 04/15/2024] [Open Access]
Abstract
Modern medicine has produced large genetic datasets of high dimensions through advanced gene sequencing technology, and processing these data is of great significance for clinical decision-making. Gene selection (GS) is an important data preprocessing technique that aims to select a subset of feature information to improve performance and reduce data dimensionality. This study proposes an improved wrapper GS method based on forensic-based investigation (FBI). The method introduces the search mechanism of the slime mould algorithm in the FBI to improve the original FBI; the newly proposed algorithm is named SMA_FBI; then GS is performed by converting the continuous optimizer to a binary version of the optimizer through a transfer function. In order to verify the superiority of SMA_FBI, experiments are first executed on the 30-function test set of CEC2017 and compared with 10 original algorithms and 10 state-of-the-art algorithms. The experimental results show that SMA_FBI is better than other algorithms in terms of finding the optimal solution, convergence speed, and robustness. In addition, BSMA_FBI (binary version of SMA_FBI) is compared with 8 binary algorithms on 18 high-dimensional genetic data from the UCI repository. The results indicate that BSMA_FBI is able to obtain high classification accuracy with fewer features selected in GS applications. Therefore, SMA_FBI is considered an optimization tool with great potential for dealing with global optimization problems, and its binary version, BSMA_FBI, can be used for GS tasks.
Affiliation(s)
- Feng Qiu: Institute of Big Data and Information Technology, Wenzhou University, Wenzhou, 325035, China
- Ali Asghar Heidari: School of Surveying and Geospatial Engineering, College of Engineering, University of Tehran, Tehran, Iran
- Yi Chen: Department of Computer Science and Artificial Intelligence, Wenzhou University, Wenzhou, 325035, China
- Huiling Chen: Institute of Big Data and Information Technology, Wenzhou University, Wenzhou, 325035, China
- Guoxi Liang: Department of Artificial Intelligence, Wenzhou Polytechnic, Wenzhou, 325035, China
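The abstract of entry 1 mentions converting a continuous optimizer into a binary one through a transfer function, without saying which function is used. The following is a minimal sketch of that general idea, assuming an S-shaped (sigmoid) transfer function and a stochastic threshold; the names `s_shaped_transfer` and `binarize` are hypothetical and not taken from the paper.

```python
import math
import random


def s_shaped_transfer(x: float) -> float:
    """Map one continuous position component to a probability in [0, 1].

    Assumption: a plain sigmoid; the paper may use a different transfer function.
    The two branches avoid overflow in math.exp for large |x|.
    """
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def binarize(position, rng):
    """Turn a continuous position vector into a 0/1 gene-selection mask.

    A gene is selected (1) when a uniform random draw falls below the
    transfer-function probability of its position component.
    """
    return [1 if rng.random() < s_shaped_transfer(x) else 0 for x in position]


rng = random.Random(0)                  # fixed seed for repeatability
position = [-4.0, -0.5, 0.0, 0.5, 4.0]  # hypothetical continuous solution
mask = binarize(position, rng)          # one 0/1 flag per gene
```

In a wrapper setting, a mask like this would select the columns passed to the classifier whose accuracy feeds the fitness value; that step is omitted here.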
2. Agarwalla P, Mukhopadhyay S. GENEmops: Supervised feature selection from high dimensional biomedical dataset. Appl Soft Comput 2022. [DOI: 10.1016/j.asoc.2022.108963] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/02/2022]
3. Alhenawi E, Al-Sayyed R, Hudaib A, Mirjalili S. Feature selection methods on gene expression microarray data for cancer classification: A systematic review. Comput Biol Med 2022; 140:105051. [PMID: 34839186 DOI: 10.1016/j.compbiomed.2021.105051] [Citation(s) in RCA: 37] [Impact Index Per Article: 12.3] [Received: 07/09/2021] [Revised: 11/01/2021] [Accepted: 11/15/2021] [Indexed: 11/29/2022]
Abstract
This systematic review provides researchers interested in feature selection (FS) for processing microarray data with comprehensive information about the main research directions for gene expression classification conducted during the recent seven years. A set of 132 researches published by three different publishers is reviewed. The studied papers are categorized into nine directions based on their objectives. The FS directions that received various levels of attention were then summarized. The review revealed that 'propose hybrid FS methods' represented the most interesting research direction with a percentage of 34.9%, while the other directions have lower percentages that ranged from 13.6% down to 3%. This guides researchers to select the most competitive research direction. Papers in each category are thoroughly reviewed based on six perspectives, mainly: method(s), classifier(s), dataset(s), dataset dimension(s) range, performance metric(s), and result(s) achieved.
Affiliation(s)
- Esra'a Alhenawi: King Abdullah II School for Information Technology, The University of Jordan, Amman, Jordan
- Rizik Al-Sayyed: King Abdullah II School for Information Technology, The University of Jordan, Amman, Jordan
- Amjad Hudaib: King Abdullah II School for Information Technology, The University of Jordan, Amman, Jordan
- Seyedali Mirjalili: Center for Artificial Intelligence Research and Optimization, Torrens University Australia, Fortitude Valley, Brisbane, 4006, QLD, Australia; Yonsei Frontier Lab, Yonsei University, Seoul, South Korea
4. Ahmed MM, Palaniswamy T. A novel TMGWO-SLBNC-based multidimensional feature subset selection and classification framework for frequent diagnosis of breast lesion abnormalities. Int J Intell Syst 2021. [DOI: 10.1002/int.22768] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 02/01/2023]
Affiliation(s)
- Marwa M. Ahmed: Department of Electrical and Computer Engineering, King Abdulaziz University, Jeddah, Saudi Arabia
- Thangam Palaniswamy: Department of Electrical and Computer Engineering, King Abdulaziz University, Jeddah, Saudi Arabia
5. Multi-category multi-state information ensemble-based classification method for precise diagnosis of three cancers. Neural Comput Appl 2021. [DOI: 10.1007/s00521-021-06211-3] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 12/12/2022]
6. Generalisation Power Analysis for finding a stable set of features using evolutionary algorithms for feature selection. Knowl Based Syst 2021. [DOI: 10.1016/j.knosys.2021.107450] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/16/2022]
7. An Ensemble Feature Selection Approach to Identify Relevant Features from EEG Signals. Appl Sci (Basel) 2021. [DOI: 10.3390/app11156983] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 11/17/2022]
Abstract
Identifying relevant data to support the automatic analysis of electroencephalograms (EEG) has become a challenge. Although there are many proposals to support the diagnosis of neurological pathologies, the current challenge is to improve the reliability of the tools to classify or detect abnormalities. In this study, we used an ensemble feature selection approach to integrate the advantages of several feature selection algorithms to improve the identification of the characteristics with high power of differentiation in the classification of normal and abnormal EEG signals. Discrimination was evaluated using several classifiers, i.e., decision tree, logistic regression, random forest, and Support Vector Machine (SVM); furthermore, performance was assessed by accuracy, specificity, and sensitivity metrics. The evaluation results showed that Ensemble Feature Selection (EFS) is a helpful tool to select relevant features from the EEGs. Thus, the stability calculated for the EFS method proposed was almost perfect in most of the cases evaluated. Moreover, the assessed classifiers evidenced that the models improved in performance when trained with the EFS approach's features. In addition, the classifier of epileptiform events built using the features selected by the EFS method achieved an accuracy, sensitivity, and specificity of 97.64%, 96.78%, and 97.95%, respectively; finally, the stability of the EFS method evidenced a reliable subset of relevant features. Moreover, the accuracy, sensitivity, and specificity of the EEG detector are equal to or greater than the values reported in the literature.
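The abstract of entry 7 reports accuracy, sensitivity, and specificity for a binary classifier. As a reminder of how these three metrics relate to confusion-matrix counts, here is a small sketch; the function name `binary_metrics` and the counts in the example are hypothetical and are not data from the paper.

```python
def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute accuracy, sensitivity, and specificity from confusion counts.

    tp/fn count positive samples (e.g. abnormal EEG) classified right/wrong,
    tn/fp count negative samples (e.g. normal EEG) classified right/wrong.
    """
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,    # share of all samples classified correctly
        "sensitivity": tp / (tp + fn),    # true-positive rate (recall)
        "specificity": tn / (tn + fp),    # true-negative rate
    }


# Hypothetical counts, chosen only to illustrate the arithmetic.
metrics = binary_metrics(tp=90, fp=20, tn=80, fn=10)
```

With these illustrative counts, accuracy is 170/200, sensitivity is 90/100, and specificity is 80/100.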
8. TAGA: Tabu Asexual Genetic Algorithm embedded in a filter/filter feature selection approach for high-dimensional data. Inf Sci (N Y) 2021. [DOI: 10.1016/j.ins.2021.01.020] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Indexed: 01/17/2023]
9. Nagarajan G, Dhinesh Babu LD. A hybrid feature selection model based on improved squirrel search algorithm and rank aggregation using fuzzy techniques for biomedical data classification. Netw Model Anal Health Inform Bioinform 2021; 10:39. [PMID: 34094808 PMCID: PMC8170065 DOI: 10.1007/s13721-021-00313-7] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 10/14/2020] [Revised: 04/30/2021] [Accepted: 05/06/2021] [Indexed: 11/29/2022]
Abstract
Feature selection has gained its importance due to the voluminous nature of the data. Owing to the computational complexity of wrapper approaches, the poor performance of filtering techniques, and the classifier dependency of embedded approaches, hybrid approaches are more commonly used in feature selection. Hybrid approaches use filtering metrics to reduce the computational complexity of wrapper algorithms and are proved to yield better feature subset. Though filtering metrics select the features based on their significance, most of them are unstable and biased towards the metric used. Moreover, the choice of filtering metrics depends largely on the distribution of data and data types. Biomedical datasets contain features with different distribution and types adding to the complexity in the choice of filtering metric. We address this problem by proposing a stable filtering method based on rank aggregation in hybrid feature selection model with Improved Squirrel search algorithm for biomedical datasets. Our proposed model is compared with other well-known and state-of-the-art methods and the results prove that our model exhibited superior performance in terms of classification accuracy and computational time. The robustness of our proposed model is proved by conducting experiments on nine biomedical datasets and with three different classifiers.
Affiliation(s)
- Gayathri Nagarajan: School of Information Technology and Engineering, VIT University, Vellore, India
- L. D. Dhinesh Babu: School of Information Technology and Engineering, VIT University, Vellore, India
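The abstract of entry 9 relies on rank aggregation to combine several filtering metrics into one stable ranking. The paper uses fuzzy techniques for this; the sketch below swaps in a much simpler mean-rank (Borda-style) aggregation purely to illustrate the general idea. The function name `mean_rank_aggregate` and the gene names are hypothetical.

```python
def mean_rank_aggregate(rankings):
    """Combine several feature rankings into one by average position.

    rankings: list of lists, each ordering the same features from best to worst.
    Returns the features sorted by mean rank; ties are broken by feature name.
    This is a plain mean-rank scheme, not the fuzzy aggregation from the paper.
    """
    features = rankings[0]
    totals = {feature: 0 for feature in features}
    for ranking in rankings:
        for position, feature in enumerate(ranking):
            totals[feature] += position
    return sorted(features, key=lambda f: (totals[f] / len(rankings), f))


# Three hypothetical filter metrics ranking the same three genes.
rankings = [
    ["g1", "g2", "g3"],
    ["g2", "g1", "g3"],
    ["g1", "g3", "g2"],
]
consensus = mean_rank_aggregate(rankings)
```

In a hybrid pipeline, the top of such a consensus ranking would typically be handed to the wrapper stage so that the search runs over fewer candidate features.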
10. Jayanthi S, Rene Robin CR. Analysis of Microarray Data by Empirical Wavelet Transform for Cancer Classification Using Block by Block Method. J Med Imaging Health Inform 2021. [DOI: 10.1166/jmihi.2021.3318] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/23/2022]
Abstract
In this study, DNA microarray data is analyzed from a signal processing perspective for cancer classification. An adaptive wavelet transform named Empirical Wavelet Transform (EWT) is analyzed using block-by-block procedure to characterize microarray data. The EWT wavelet basis depends on the input data rather than being predetermined as in conventional wavelets. Thus, EWT gives more sparse representations than wavelets. The characterization of microarray data is made by block-by-block procedure with predefined block sizes in powers of 2, from 128 to 2048. After characterization, a statistical hypothesis test is employed to select the informative EWT coefficients. Only the selected coefficients are used for Microarray Data Classification (MDC) by the Support Vector Machine (SVM). Computational experiments are employed on five microarray datasets (colon, breast, leukemia, CNS and ovarian) to test the developed cancer classification system. The obtained results demonstrate that EWT coefficients with SVM emerged as an effective approach with no misclassification for MDC system.
Affiliation(s)
- S. Jayanthi: Research Scholar, Anna University, 600025, Tamilnadu, India; Department of Computer Science and Engineering, Agni College of Technology, 600130, Tamilnadu, India
- C. R. Rene Robin: Department of Computer Science and Engineering, Jerusalem College of Engineering, 600100, Tamilnadu, India
11. A survey on single and multi omics data mining methods in cancer data classification. J Biomed Inform 2020; 107:103466. [DOI: 10.1016/j.jbi.2020.103466] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Received: 12/11/2019] [Revised: 05/01/2020] [Accepted: 05/31/2020] [Indexed: 01/09/2023]
12. Santhakumar D, Logeswari S. Efficient attribute selection technique for leukaemia prediction using microarray gene data. Soft Comput 2020. [DOI: 10.1007/s00500-020-04793-z] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Indexed: 10/24/2022]
13. García-Díaz P, Sánchez-Berriel I, Martínez-Rojas JA, Diez-Pascual AM. Unsupervised feature selection algorithm for multiclass cancer classification of gene expression RNA-Seq data. Genomics 2019; 112:1916-1925. [PMID: 31759118 DOI: 10.1016/j.ygeno.2019.11.004] [Citation(s) in RCA: 25] [Impact Index Per Article: 4.2] [Received: 07/24/2019] [Revised: 10/04/2019] [Accepted: 11/11/2019] [Indexed: 01/01/2023]
Abstract
This paper presents a Grouping Genetic Algorithm (GGA) to solve a maximally diverse grouping problem. It has been applied for the classification of an unbalanced database of 801 samples of gene expression RNA-Seq data in 5 types of cancer. The samples are composed by 20,531 genes. GGA extracts several groups of genes that achieve high accuracy in multiple classification. Accuracy has been evaluated by an Extreme Learning Machine algorithm and was found to be slightly higher in balanced databases than in unbalanced ones. The final classification decision has been made through a weighted majority vote system between the groups of features. The proposed algorithm finally selects 49 genes to classify samples with an average accuracy of 98.81% and a standard deviation of 0.0174.
Affiliation(s)
- Pilar García-Díaz: Department of Signal Theory and Communications, Polytechnic School, University of Alcalá, 28805 Alcalá de Henares, Madrid, Spain
- Isabel Sánchez-Berriel: Department of Computer and Systems Engineering, Higher School of Engineering and Technology, University of La Laguna, 38200 San Cristobal de La Laguna, S/C de Tenerife, Spain
- Juan A Martínez-Rojas: Department of Signal Theory and Communications, Polytechnic School, University of Alcalá, 28805 Alcalá de Henares, Madrid, Spain
- Ana M Diez-Pascual: Department of Analytical Chemistry, Physical Chemistry and Chemical Engineering, Faculty of Sciences, University of Alcalá, 28805 Alcalá de Henares, Madrid, Spain
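The abstract of entry 13 states that the final classification decision is made by a weighted majority vote between groups of features, without giving the weighting rule. The sketch below shows one common form of weighted voting, assuming each gene group casts one predicted label and carries one scalar weight (for example, its validation accuracy). The function name `weighted_majority_vote` and the labels are hypothetical.

```python
from collections import defaultdict


def weighted_majority_vote(predictions, weights):
    """Return the label with the largest total weight.

    predictions: one predicted label per gene group.
    weights: one non-negative weight per gene group (assumed here to reflect
    how reliable that group's classifier is).
    Ties are broken by the alphabetical order of the labels.
    """
    scores = defaultdict(float)
    for label, weight in zip(predictions, weights):
        scores[label] += weight
    return max(sorted(scores), key=lambda label: scores[label])


# Three hypothetical gene groups voting on one sample.
predictions = ["type_A", "type_B", "type_B"]
weights = [0.9, 0.4, 0.4]
decision = weighted_majority_vote(predictions, weights)
```

Here the single high-weight group outvotes the two low-weight groups (0.9 against 0.8), which is the behaviour that distinguishes weighted voting from a simple head count.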
14. Shukla AK. Identification of cancerous gene groups from microarray data by employing adaptive genetic and support vector machine technique. Comput Intell 2019. [DOI: 10.1111/coin.12245] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Indexed: 12/14/2022]
Affiliation(s)
- Alok Kumar Shukla: Department of Computer Science & Engineering, G.L. Bajaj Institute of Technology & Management, Greater Noida, India
15. Shukla AK, Tripathi D. Identification of potential biomarkers on microarray data using distributed gene selection approach. Math Biosci 2019; 315:108230. [DOI: 10.1016/j.mbs.2019.108230] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Received: 03/22/2019] [Revised: 06/04/2019] [Accepted: 07/16/2019] [Indexed: 02/09/2023]
16. Shukla AK, Singh P, Vardhan M. A hybrid framework for optimal feature subset selection. J Intell Fuzzy Syst 2019. [DOI: 10.3233/jifs-169936] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.0] [Indexed: 11/15/2022]
Affiliation(s)
- Alok Kumar Shukla: Department of Computer Science & Engineering, NIT Raipur, Chhattisgarh (C.G), India
- Pradeep Singh: Department of Computer Science & Engineering, NIT Raipur, Chhattisgarh (C.G), India
- Manu Vardhan: Department of Computer Science & Engineering, NIT Raipur, Chhattisgarh (C.G), India
17. Yan Y, Dai T, Yang M, Du X, Zhang Y, Zhang Y. Classifying Incomplete Gene-Expression Data: Ensemble Learning with Non-Pre-Imputation Feature Filtering and Best-First Search Technique. Int J Mol Sci 2018; 19:3398. [PMID: 30380746 PMCID: PMC6274900 DOI: 10.3390/ijms19113398] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 10/11/2018] [Revised: 10/20/2018] [Accepted: 10/23/2018] [Indexed: 01/09/2023] [Open Access]
Abstract
(1) Background: Gene-expression data usually contain missing values (MVs). Numerous methods focused on how to estimate MVs have been proposed in the past few years. Recent studies show that those imputation algorithms made little difference in classification. Thus, some scholars believe that how to select the informative genes for downstream classification is more important than how to impute MVs. However, most feature-selection (FS) algorithms need beforehand imputation, and the impact of beforehand MV imputation on downstream FS performance is seldom considered. (2) Method: A modified chi-square test-based FS is introduced for gene-expression data. To deal with the challenge of a small sample size of gene-expression data, a heuristic method called recursive element aggregation is proposed in this study. Our approach can directly handle incomplete data without any imputation methods or missing-data assumptions. The most informative genes can be selected through a threshold. After that, the best-first search strategy is utilized to find optimal feature subsets for classification. (3) Results: We compare our method with several FS algorithms. Evaluation is performed on twelve original incomplete cancer gene-expression datasets. We demonstrate that MV imputation on an incomplete dataset impacts subsequent FS in terms of classification tasks. Through directly conducting FS on incomplete data, our method can avoid potential disturbances on subsequent FS procedures caused by MV imputation. An experiment on small, round blue cell tumor (SRBCT) dataset showed that our method found additional genes besides many common genes with the two compared existing methods.
Affiliation(s)
- Yuanting Yan: School of Computer Science and Technology, Anhui University, Hefei 230601, China; Key Laboratory of Intelligent Computing and Signal Processing of Ministry of Education, Anhui University, Hefei 230601, China
- Tao Dai: School of Computer Science and Technology, Anhui University, Hefei 230601, China
- Meili Yang: School of Computer Science and Technology, Anhui University, Hefei 230601, China
- Xiuquan Du: School of Computer Science and Technology, Anhui University, Hefei 230601, China; Key Laboratory of Intelligent Computing and Signal Processing of Ministry of Education, Anhui University, Hefei 230601, China
- Yiwen Zhang: School of Computer Science and Technology, Anhui University, Hefei 230601, China; Key Laboratory of Intelligent Computing and Signal Processing of Ministry of Education, Anhui University, Hefei 230601, China
- Yanping Zhang: School of Computer Science and Technology, Anhui University, Hefei 230601, China; Key Laboratory of Intelligent Computing and Signal Processing of Ministry of Education, Anhui University, Hefei 230601, China
18. Kale A, Sonavane S. F-WSS++: incremental wrapper subset selection algorithm for fuzzy extreme learning machine. Int J Mach Learn Cybern 2018. [DOI: 10.1007/s13042-018-0859-9] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 12/20/2022]
19. Wang A, An N, Chen G, Liu L, Alterovitz G. Subtype dependent biomarker identification and tumor classification from gene expression profiles. Knowl Based Syst 2018. [DOI: 10.1016/j.knosys.2018.01.025] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Indexed: 10/18/2022]
20. Shukla AK, Singh P, Vardhan M. A hybrid gene selection method for microarray recognition. Biocybern Biomed Eng 2018. [DOI: 10.1016/j.bbe.2018.08.004] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.1] [Indexed: 12/12/2022]
21. Bi-stage hierarchical selection of pathway genes for cancer progression using a swarm based computational approach. Appl Soft Comput 2018. [DOI: 10.1016/j.asoc.2017.10.024] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Indexed: 02/05/2023]
22. Al-Rajab M, Lu J, Xu Q. Examining applying high performance genetic data feature selection and classification algorithms for colon cancer diagnosis. Comput Methods Programs Biomed 2017; 146:11-24. [PMID: 28688481 DOI: 10.1016/j.cmpb.2017.05.001] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.1] [Received: 04/29/2016] [Revised: 04/17/2017] [Accepted: 05/02/2017] [Indexed: 06/07/2023]
Abstract
BACKGROUND AND OBJECTIVES: This paper examines the accuracy and efficiency (time complexity) of high performance genetic data feature selection and classification algorithms for colon cancer diagnosis. The need for this research derives from the urgent and increasing need for accurate and efficient algorithms. Colon cancer is a leading cause of death worldwide, hence it is vitally important for the cancer tissues to be expertly identified and classified in a rapid and timely manner, to assure both a fast detection of the disease and to expedite the drug discovery process. METHODS: In this research, a three-phase approach was proposed and implemented: Phases One and Two examined the feature selection algorithms and classification algorithms employed separately, and Phase Three examined the performance of the combination of these. RESULTS: It was found from Phase One that the Particle Swarm Optimization (PSO) algorithm performed best with the colon dataset as a feature selection (29 genes selected) and from Phase Two that the Support Vector Machine (SVM) algorithm outperformed other classifications, with an accuracy of almost 86%. It was also found from Phase Three that the combined use of PSO and SVM surpassed other algorithms in accuracy and performance, and was faster in terms of time analysis (94%). CONCLUSIONS: It is concluded that applying feature selection algorithms prior to classification algorithms results in better accuracy than when the latter are applied alone. This conclusion is important and significant to industry and society.
Affiliation(s)
- Murad Al-Rajab: University of Huddersfield, Queensgate, Huddersfield, United Kingdom
- Joan Lu: University of Huddersfield, Queensgate, Huddersfield, United Kingdom
- Qiang Xu: University of Huddersfield, Queensgate, Huddersfield, United Kingdom
23. Wang L, Wang Y, Chang Q. Feature selection methods for big data bioinformatics: A survey from the search perspective. Methods 2016; 111:21-31. [PMID: 27592382 DOI: 10.1016/j.ymeth.2016.08.014] [Citation(s) in RCA: 110] [Impact Index Per Article: 12.2] [Received: 05/15/2016] [Revised: 08/25/2016] [Accepted: 08/30/2016] [Indexed: 11/26/2022] [Open Access]
Abstract
This paper surveys main principles of feature selection and their recent applications in big data bioinformatics. Instead of the commonly used categorization into filter, wrapper, and embedded approaches to feature selection, we formulate feature selection as a combinatorial optimization or search problem and categorize feature selection methods into exhaustive search, heuristic search, and hybrid methods, where heuristic search methods may further be categorized into those with or without data-distilled feature ranking measures.
Affiliation(s)
- Lipo Wang: School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore
- Yaoli Wang: College of Information Engineering, Taiyuan University of Technology, Taiyuan, China
- Qing Chang: College of Information Engineering, Taiyuan University of Technology, Taiyuan, China