1
Barrera-García J, Cisternas-Caneo F, Crawford B, Gómez Sánchez M, Soto R. Feature Selection Problem and Metaheuristics: A Systematic Literature Review about Its Formulation, Evaluation and Applications. Biomimetics (Basel) 2023; 9:9. [PMID: 38248583 PMCID: PMC10813816 DOI: 10.3390/biomimetics9010009] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/25/2023] [Revised: 12/16/2023] [Accepted: 12/18/2023] [Indexed: 01/23/2024] Open
Abstract
Feature selection is becoming a relevant problem within the field of machine learning. The feature selection problem focuses on the selection of the small, necessary, and sufficient subset of features that represent the general set of features, eliminating redundant and irrelevant information. Given the importance of the topic, in recent years there has been a boom in the study of the problem, generating a large number of related investigations. Given this, this work analyzes 161 articles published between 2019 and 2023 (20 April 2023), emphasizing the formulation of the problem and performance measures, and proposing classifications for the objective functions and evaluation metrics. Furthermore, an in-depth description and analysis of metaheuristics, benchmark datasets, and practical real-world applications are presented. Finally, in light of recent advances, this review paper provides future research opportunities.
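As a rough illustration of how the feature selection problem is often posed for metaheuristics, the sketch below scores a binary feature mask with a weighted sum of classification error and the fraction of features kept. This single-objective form is common in wrapper-based work, but it is an assumption here and is not taken from the abstract. The names `fitness`, `alpha`, and `toy_error` are hypothetical, and `toy_error` is a stand-in for the error of a trained classifier.

```python
from typing import Callable, Sequence


def fitness(mask: Sequence[int],
            error_fn: Callable[[Sequence[int]], float],
            alpha: float = 0.99) -> float:
    """Objective to minimise: alpha * error + (1 - alpha) * |selected| / |all|."""
    n_selected = sum(mask)
    if n_selected == 0:
        # An empty subset cannot train a classifier; assign the worst score.
        return 1.0
    ratio = n_selected / len(mask)
    return alpha * error_fn(mask) + (1.0 - alpha) * ratio


def toy_error(mask: Sequence[int]) -> float:
    """Hypothetical error rate: only features 0 and 2 are informative."""
    informative = (0, 2)
    missing = sum(1 for i in informative if mask[i] == 0)
    return 0.05 + 0.20 * missing


# Three candidate subsets, encoded as 0/1 masks over four features.
candidates = [(1, 1, 1, 1), (1, 0, 1, 0), (0, 1, 0, 1)]
best = min(candidates, key=lambda m: fitness(m, toy_error))
print(best)
```

In this toy setting the mask that keeps only the two informative features scores lowest, because it matches the error of the full set while selecting fewer features. A metaheuristic would search over such masks instead of enumerating a fixed list.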
Affiliation(s)
- José Barrera-García
  - Escuela de Ingeniería Informática, Pontificia Universidad Católica de Valparaíso, Avenida Brasil 2241, Valparaíso 2362807, Chile
- Felipe Cisternas-Caneo
  - Escuela de Ingeniería Informática, Pontificia Universidad Católica de Valparaíso, Avenida Brasil 2241, Valparaíso 2362807, Chile
- Broderick Crawford
  - Escuela de Ingeniería Informática, Pontificia Universidad Católica de Valparaíso, Avenida Brasil 2241, Valparaíso 2362807, Chile
- Mariam Gómez Sánchez
  - Departamento de Electrotecnia e Informática, Universidad Técnica Federico Santa María, Federico Santa María 6090, Viña del Mar 2520000, Chile
- Ricardo Soto
  - Escuela de Ingeniería Informática, Pontificia Universidad Católica de Valparaíso, Avenida Brasil 2241, Valparaíso 2362807, Chile
2
Zheng J, Zhang Z, Wang J, Zhao R, Liu S, Yang G, Liu Z, Deng Z. Metabolic syndrome prediction model using Bayesian optimization and XGBoost based on traditional Chinese medicine features. Heliyon 2023; 9:e22727. [PMID: 38125549 PMCID: PMC10730568 DOI: 10.1016/j.heliyon.2023.e22727] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/31/2023] [Revised: 11/16/2023] [Accepted: 11/17/2023] [Indexed: 12/23/2023] Open
Abstract
Metabolic syndrome (MetS) has a high prevalence and is prone to many complications. However, current MetS diagnostic methods require blood tests that are not conducive to self-testing, so a user-friendly and accurate method for predicting MetS is needed to facilitate early detection and treatment. In this study, a MetS prediction model based on a small number of simple Traditional Chinese Medicine (TCM) clinical indicators and biological indicators, combined with machine learning algorithms, is investigated. Electronic medical record data from 2040 patients who visited outpatient clinics at Guangdong Chinese medicine hospitals from 2020 to 2021 were used to investigate the fusion of Bayesian optimization (BO) and eXtreme gradient boosting (XGBoost) in order to create a BO-XGBoost model for screening nineteen key features in three categories: individual bio-information, TCM indicators, and TCM habits that influence MetS prediction. Subsequently, the predictive diagnostic model for MetS was developed. The experimental results revealed that the model proposed in this paper achieved values of 93.35%, 90.67%, 80.40%, and 0.920 for the F1, sensitivity, FRS, and AUC metrics, respectively. These values outperformed those of the seven other tested machine learning models. Finally, this study developed an intelligent prediction application for MetS based on the proposed model, which ordinary users can use to perform self-diagnosis through a web-based questionnaire, thereby accomplishing the objective of early detection and intervention for MetS.
Affiliation(s)
- Jianhua Zheng
  - College of Information Science and Technology, Zhongkai University of Agriculture and Engineering, Guangzhou, 510225, China
  - Guangdong Provincial Key Laboratory of Traditional Chinese Medicine Informatization, Guangzhou, 510630, China
- Zihao Zhang
  - College of Information Science and Technology, Zhongkai University of Agriculture and Engineering, Guangzhou, 510225, China
- Jinhe Wang
  - Xiyuan Hospital of China Academy of Chinese Medical Sciences, Beijing, 100091, China
- Ruolin Zhao
  - College of Information Science and Technology, Zhongkai University of Agriculture and Engineering, Guangzhou, 510225, China
- Shuangyin Liu
  - College of Information Science and Technology, Zhongkai University of Agriculture and Engineering, Guangzhou, 510225, China
  - Guangdong Provincial Key Laboratory of Traditional Chinese Medicine Informatization, Guangzhou, 510630, China
- Gaolin Yang
  - College of Information Science and Technology, Zhongkai University of Agriculture and Engineering, Guangzhou, 510225, China
- Zhengjie Liu
  - Guangdong Provincial Hospital of Chinese Medicine, Guangzhou, 510120, China
  - The Second Affiliated Hospital of Guangzhou University of Chinese Medicine, Guangzhou, 510120, China
- Zhengyuan Deng
  - College of Information Science and Technology, Zhongkai University of Agriculture and Engineering, Guangzhou, 510225, China
  - Network and Educational Technology Center, Jinan University, Guangzhou, 510630, China
3
Rabie AH, Mohamed AM, Abo-Elsoud MA, Saleh AI. A new Covid-19 diagnosis strategy using a modified KNN classifier. Neural Comput Appl 2023; 35:1-25. [PMID: 37362572 PMCID: PMC10153048 DOI: 10.1007/s00521-023-08588-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/09/2022] [Accepted: 04/05/2023] [Indexed: 06/28/2023]
Abstract
Covid-19 is a very dangerous disease because of its rapid spread, which is unprecedented among previous diseases. It has been a crisis threatening the world from its first appearance in December 2019 until the present. Because no vaccine has yet proved sufficiently effective, rapid and more accurate diagnosis of this disease is extremely necessary to enable medical staff to identify infected cases and isolate them from the rest to prevent further loss of life. In this paper, the Covid-19 diagnostic strategy (CDS) is introduced as a new classification strategy that consists of two basic phases: the feature selection phase (FSP) and the diagnosis phase (DP). During the FSP, the best set of features in laboratory test findings for Covid-19 patients is selected using enhanced gray wolf optimization (EGWO). EGWO combines both types of selection techniques, wrapper and filter. Accordingly, EGWO includes two stages called the filter stage (FS) and the wrapper stage (WS). While the FS uses many different filter methods, the WS uses a wrapper method called binary gray wolf optimization (BGWO). The second phase, the DP, aims to give a fast and more accurate diagnosis using a hybrid diagnosis methodology (HDM) based on the features selected in the FSP. The HDM consists of two phases called the weighting patient phase (WP2) and the diagnostic patient phase (DP2). WP2 aims to calculate the degree to which each patient in the testing dataset belongs to the class categories, using naïve Bayes (NB) as a weighting method. K-nearest neighbor (KNN) is then used in DP2, with the weights of the patients in the testing dataset serving as a new training dataset, to give rapid and more accurate detection. According to the experimental results, the suggested CDS outperforms other strategies in terms of accuracy, precision, recall (or sensitivity) and F-measure, which are equal to 99%, 88%, 90% and 91%, respectively.
Affiliation(s)
- Asmaa H. Rabie
  - Computers and Control Department, Faculty of Engineering, Mansoura University, Mansoura, Egypt
- Alaa M. Mohamed
  - Delta Higher Institute for Engineering and Technology, Talkha, Mansoura, Egypt
- M. A. Abo-Elsoud
  - Electronics and Communication Department, Faculty of Engineering, Mansoura University, Mansoura, Egypt
- Ahmed I. Saleh
  - Computers and Control Department, Faculty of Engineering, Mansoura University, Mansoura, Egypt
4
Feature selection based on absolute deviation factor for text classification. Inf Process Manag 2023. [DOI: 10.1016/j.ipm.2022.103251] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/29/2022]
5
A feature selection approach based on NSGA-II with ReliefF. Appl Soft Comput 2023. [DOI: 10.1016/j.asoc.2023.109987] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/09/2023]
6
Rahab H, Haouassi H, Souidi MEH, Bakhouche A, Mahdaoui R, Bekhouche M. A Modified Binary Rat Swarm Optimization Algorithm for Feature Selection in Arabic Sentiment Analysis. Arabian Journal for Science and Engineering 2022. [DOI: 10.1007/s13369-022-07466-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/14/2022]
7
Chen H, Zhou X, Shi D. A Chaotic Antlion Optimization Algorithm for Text Feature Selection. Int J Comput Int Sys 2022. [DOI: 10.1007/s44196-022-00094-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2022] Open
Abstract
Text classification is one of the important technologies in the field of text data mining. Feature selection, a key step in text classification tasks, is used to process high-dimensional feature sets and directly affects the final classification performance. At present, the most widely used text feature selection methods in academia calculate the importance of each feature for classification through an evaluation function and then select, in order of importance, the features that meet the quantitative requirements. However, because this approach ignores the correlation between features and the effect of their mutual combination, it may not guarantee the best classification effect. Therefore, this paper proposes a chaotic antlion feature selection algorithm (CAFSA) to solve this problem. The main contributions include: (1) A chaotic antlion algorithm (CAA) based on a quasi-opposition learning mechanism and a chaos strategy is proposed and compared with four other algorithms on 11 benchmark functions; the algorithm achieves a higher convergence speed and the highest optimization accuracy. (2) The performance of CAFSA, which uses CAA for feature selection, is studied with different learning models, including decision tree, Naive Bayes, and SVM classifiers. (3) The performance of CAFSA is compared with that of eight other feature selection methods on three Chinese datasets. The experimental results show that CAFSA can reduce the number of features and improve the classification accuracy of the classifier, giving a better classification effect than the other feature selection methods.
8
Mahapatra S, Sahu SS. ANOVA-particle swarm optimization-based feature selection and gradient boosting machine classifier for improved protein-protein interaction prediction. Proteins 2021; 90:443-454. [PMID: 34528291 DOI: 10.1002/prot.26236] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 12/25/2020] [Revised: 08/09/2021] [Accepted: 09/03/2021] [Indexed: 01/22/2023]
Abstract
Feature fusion and selection strategies have been applied to improve accuracy in the prediction of protein-protein interaction (PPI). In this paper, an embedded feature selection framework, termed AVPSO, is developed by integrating a cost function based on analysis of variance (ANOVA) with particle swarm optimization (PSO). Initially, the features of the protein sequences extracted using pseudo-amino acid composition (PseAAC), conjoint triad composition, and local descriptor are fused. Then, AVPSO is employed to select the optimal set of features. The light gradient boosting machine (LGBM) classifier is used to predict the PPIs using the optimal feature subset. In five-fold cross-validation, the proposed model (AVPSO-LGBM) achieved average accuracies of 97.12% and 95.09%, respectively, on the intraspecies PPI datasets Saccharomyces cerevisiae and Helicobacter pylori. On the interspecies PPI datasets Human-Bacillus and Human-Yersinia, average accuracies of 95.20% and 93.44% are achieved. Results obtained on independent test datasets and network datasets show that the prediction accuracy of AVPSO-LGBM is better than that of the existing methods, demonstrating its generalization ability. The improved prediction performance obtained by the proposed model makes it a reliable and effective PPI prediction model.
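To make the ANOVA ingredient concrete, the sketch below computes the standard one-way ANOVA F-statistic of a single feature across class labels, which is the kind of per-feature relevance score an ANOVA-based cost function could build on. The abstract does not say how AVPSO combines this score with PSO, so that step is left out. The function name `anova_f` and the toy data are hypothetical, and the code assumes at least two classes and more samples than classes.

```python
from typing import Dict, List, Sequence


def anova_f(values: Sequence[float], labels: Sequence[int]) -> float:
    """One-way ANOVA F-statistic of one feature across class labels."""
    groups: Dict[int, List[float]] = {}
    for v, y in zip(values, labels):
        groups.setdefault(y, []).append(v)
    n, k = len(values), len(groups)
    grand_mean = sum(values) / n
    # Between-class and within-class sums of squares.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                     for g in groups.values())
    ss_within = sum((v - sum(g) / len(g)) ** 2
                    for g in groups.values() for v in g)
    if ss_within == 0.0:
        # No within-class spread: perfectly separated or constant feature.
        return float("inf") if ss_between > 0.0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Toy data: feature_a separates the two classes, feature_b does not.
labels = [0, 0, 0, 1, 1, 1]
feature_a = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
feature_b = [1.0, 5.0, 9.0, 2.0, 5.0, 8.0]
scores = [anova_f(col, labels) for col in (feature_a, feature_b)]
print(scores)
```

A higher F-statistic means the class means differ more relative to the spread within each class, so `feature_a` ranks above `feature_b` here.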
Affiliation(s)
- Satyajit Mahapatra
  - Department of Electronics and Communication Engineering, Birla Institute of Technology, Ranchi, India
- Sitanshu Sekhar Sahu
  - Department of Electronics and Communication Engineering, Birla Institute of Technology, Ranchi, India