1
Silva LA, de Vasconcelos BP, Del-Moral-Hernandez E. A model to estimate the Self-Organizing Maps grid dimension for Prototype Generation. Intell Data Anal 2021. [DOI: 10.3233/ida-205123]
Abstract
Due to its high accuracy across many problems, the K-Nearest Neighbor (KNN) algorithm is one of the most important classifiers in data mining applications and is recognized in the literature as a benchmark. Despite its high accuracy, KNN has weaknesses, such as the time taken by the classification process, which is a disadvantage in many problems, particularly those involving large datasets. The literature presents approaches that reduce KNN's classification time by selecting only the most important dataset examples. One of these methods, called Prototype Generation (PG), represents the dataset examples with prototypes. The classification process then occurs in two steps: the first is based on the prototypes and the second on the examples represented by the nearest prototypes. The main problem of this approach is the lack of a definition of the ideal number of prototypes. This study proposes a model that estimates the best Self-Organizing Map grid dimension, and hence the ideal number of prototypes, using the number of dataset examples as a parameter. The approach is contrasted with other artificial-intelligence-based PG methods from the literature that automatically define the number of prototypes. The main advantage of the proposed method, tested here on eighteen public datasets, is that it achieves a better trade-off between a reduced number of prototypes and accuracy, providing a number sufficient not to degrade KNN classification performance.
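The two-step classification scheme this abstract describes can be sketched as follows. This is a minimal illustration with hypothetical names (`two_step_classify`, `members`), not the authors' SOM-based implementation:

```python
import math

def nearest_index(x, points):
    """Index of the point closest to x (Euclidean distance)."""
    return min(range(len(points)), key=lambda i: math.dist(x, points[i]))

def two_step_classify(x, prototypes, members):
    """Classify x in two steps: (1) find the nearest prototype,
    (2) run 1-NN over only the examples that prototype represents.
    `members[i]` is the list of (example, label) pairs assigned to prototype i."""
    p = nearest_index(x, prototypes)
    examples = [e for e, _ in members[p]]
    return members[p][nearest_index(x, examples)][1]
```

Only the examples under the winning prototype are scanned, which is what makes the scheme cheaper than plain KNN over the full training set.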
Affiliation(s)
- Leandro A. Silva
- Faculdade de Computação e Informática, Brasil
- Programa de Pós-Graduação em Engenharia Elétrica e Computação, Brasil
- Universidade Presbiteriana Mackenzie, Brasil
- Bruno P. de Vasconcelos
- Programa de Pós-Graduação em Engenharia Elétrica e Computação, Brasil
- Universidade Presbiteriana Mackenzie, Brasil
2
Bello M, Nápoles G, Vanhoof K, Bello R. On the generation of multi-label prototypes. Intell Data Anal 2020. [DOI: 10.3233/ida-200014]
Abstract
Data reduction techniques play a key role in instance-based classification to lower the amount of data to be processed. Prototype generation aims to obtain a reduced training set in order to obtain accurate results with less effort. This translates into a significant reduction in both algorithms’ spatial and temporal burden. This issue is particularly relevant in multi-label classification, which is a generalization of multiclass classification that allows objects to belong to several classes simultaneously. Although this field is quite active in terms of learning algorithms, there is a lack of data reduction methods. In this paper, we propose several prototype generation methods from multi-label datasets based on Granular Computing. The simulations show that these methods significantly reduce the number of examples to a set of prototypes without significantly affecting classifiers’ performance.
Affiliation(s)
- Marilyn Bello
- Computer Science Department, Universidad Central de Las Villas, Cuba
- Faculty of Business Economics, Hasselt University, Belgium
- Gonzalo Nápoles
- Faculty of Business Economics, Hasselt University, Belgium
- Department of Cognitive Science and Artificial Intelligence, Tilburg University, The Netherlands
- Koen Vanhoof
- Faculty of Business Economics, Hasselt University, Belgium
- Rafael Bello
- Computer Science Department, Universidad Central de Las Villas, Cuba
3
Evolving Spiking Neural Networks for online learning over drifting data streams. Neural Netw 2018; 108:1-19. [DOI: 10.1016/j.neunet.2018.07.014]
4
Zhu T, Hao Y, Luo W, Ning H. Learning enhanced differential evolution for tracking optimal decisions in dynamic power systems. Appl Soft Comput 2018. [DOI: 10.1016/j.asoc.2017.07.037]
6
Yu Z, Wang Z, You J, Zhang J, Liu J, Wong HS, Han G. A New Kind of Nonparametric Test for Statistical Comparison of Multiple Classifiers Over Multiple Datasets. IEEE Trans Cybern 2017; 47:4418-4431. [PMID: 28113414] [DOI: 10.1109/tcyb.2016.2611020]
Abstract
Nonparametric statistical analysis, such as the Friedman test (FT), has gained increasing attention owing to its wide applicability in experimental studies. However, the traditional FT for comparing multiple learning algorithms across datasets adopts a naive ranking approach: ranks are based on the average accuracy values obtained by the learning algorithms on the datasets, which neither considers the differences among the results obtained on each dataset nor takes into account the performance of the learning algorithms in each run. In this paper, we first propose three ranking approaches: the weighted ranking approach, the global ranking approach (GRA), and the weighted GRA. Then, a theoretical analysis explores the properties of the proposed ranking approaches. Next, a set of modified FTs based on the proposed ranking approaches is designed for comparing learning algorithms. Finally, the modified FTs are evaluated through six classifier ensemble approaches on 34 real-world datasets. The experiments show the effectiveness of the modified FTs.
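The naive average-rank FT that the paper takes as its baseline can be sketched in a few lines; this is an illustrative reimplementation (the name `friedman_statistic` is hypothetical), not the authors' code:

```python
from typing import List

def friedman_statistic(scores: List[List[float]]) -> float:
    """Friedman chi-square over N datasets (rows) and k algorithms (columns),
    using the naive average-rank approach: rank algorithms within each dataset
    (rank 1 = highest score, ties share the mean rank), average the ranks,
    and plug them into the chi-square formula."""
    n, k = len(scores), len(scores[0])
    avg_ranks = [0.0] * k
    for row in scores:
        order = sorted(range(k), key=lambda j: -row[j])
        ranks = [0.0] * k
        i = 0
        while i < k:
            j = i
            while j + 1 < k and row[order[j + 1]] == row[order[i]]:
                j += 1  # extend the tie group
            mean_rank = (i + j) / 2 + 1
            for t in range(i, j + 1):
                ranks[order[t]] = mean_rank
            i = j + 1
        for j in range(k):
            avg_ranks[j] += ranks[j] / n
    return 12 * n / (k * (k + 1)) * (sum(r * r for r in avg_ranks) - k * (k + 1) ** 2 / 4)
```

The paper's modified FTs replace only the ranking step above (weighted, global, or weighted-global ranks); the chi-square aggregation stays the same.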
7
Vluymans S, Triguero I, Cornelis C, Saeys Y. EPRENNID: An evolutionary prototype reduction based ensemble for nearest neighbor classification of imbalanced data. Neurocomputing 2016. [DOI: 10.1016/j.neucom.2016.08.026]
8
Hu W, Tan Y. Prototype Generation Using Multiobjective Particle Swarm Optimization for Nearest Neighbor Classification. IEEE Trans Cybern 2016; 46:2719-2731. [PMID: 26513820] [DOI: 10.1109/tcyb.2015.2487318]
Abstract
The nearest neighbor (NN) classifier suffers from high time complexity when classifying a test instance, since it must search the whole training set. Prototype generation is a widely used approach to reduce the classification time: it generates a small set of prototypes to classify a test instance instead of using the whole training set. In this paper, particle swarm optimization is applied to prototype generation, and two novel methods for improving the classification performance are presented: 1) a fitness function named error rank and 2) a multiobjective (MO) optimization strategy. Error rank is proposed to enhance the generalization ability of the NN classifier by taking the ranks of misclassified instances into consideration when designing the fitness function. The MO optimization strategy pursues performance on multiple subsets of data simultaneously, in order to keep the classifier from overfitting the training set. Experimental results over 31 UCI data sets and 59 additional data sets show that the proposed algorithm outperforms nearly 30 existing prototype generation algorithms.
9
Park SY, Lee JJ. Stochastic Opposition-Based Learning Using a Beta Distribution in Differential Evolution. IEEE Trans Cybern 2016; 46:2184-2194. [PMID: 26390506] [DOI: 10.1109/tcyb.2015.2469722]
Abstract
Since it first appeared, differential evolution (DE), one of the most successful evolutionary algorithms, has been studied by many researchers. Theoretical and empirical studies of its parameters and strategies have been conducted, and numerous variants have been proposed. Opposition-based DE (ODE), one such variant, combines DE with opposition-based learning (OBL) to obtain a high-quality solution with low computational effort. In this paper, we propose a novel OBL using a beta distribution, with partial dimensional change and selection switching, and combine it with DE to enhance convergence speed and search capability. Our proposed algorithm is tested on various test functions and compared with standard DE and other ODE variants. The results indicate that the proposed algorithm outperforms the comparison group, especially in terms of solution accuracy.
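The core OBL idea can be sketched as follows. The classic opposite point is standard; the beta-distributed variant shown here is only an assumption-labeled illustration of the idea (the paper's exact formulation, with partial dimensional change and selection switching, differs):

```python
import random

def opposite(x, a, b):
    """Classic opposition-based learning: the opposite of x in [a, b]."""
    return a + b - x

def stochastic_opposite(x, a, b, alpha=2.0, beta=2.0):
    """Illustrative stochastic variant (an assumption, not the paper's formula):
    instead of the deterministic opposite, return a point between the interval
    center and the opposite, positioned by a Beta(alpha, beta) sample."""
    c = (a + b) / 2
    x_opp = a + b - x
    w = random.betavariate(alpha, beta)  # weight in [0, 1]
    return c + w * (x_opp - c)
```

With `x = 3` on `[0, 10]`, `opposite` returns 7, and the stochastic variant lands somewhere between the center 5 and 7.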
10
Kamal S, Ripon SH, Dey N, Ashour AS, Santhi V. A MapReduce approach to diminish imbalance parameters for big deoxyribonucleic acid dataset. Comput Methods Programs Biomed 2016; 131:191-206. [PMID: 27265059] [DOI: 10.1016/j.cmpb.2016.04.005]
Abstract
BACKGROUND In the age of the information superhighway, big data play a significant role in information processing, extraction, retrieval, and management. In computational biology, the continuous challenge is to manage the biological data, and data mining techniques are sometimes inadequate for the new space and time requirements. It is therefore critical to process massive amounts of data to retrieve knowledge. The existing software and automated tools for handling big data sets are not sufficient, so an expandable mining technique that exploits the large storage and processing capability of distributed or parallel processing platforms is essential. METHOD In this analysis, a distributed clustering methodology for imbalanced data reduction using a k-nearest neighbor (K-NN) classification approach is introduced. The pivotal objective of this work is to represent real training data sets with a reduced number of elements or instances; these reduced data sets ensure faster classification and standard storage management with less sensitivity. However, general data reduction methods cannot manage very big data sets, so a MapReduce-oriented framework is designed using various clusters of automated contents, comprising multiple algorithmic approaches. RESULTS To test the proposed approach, a real DNA (deoxyribonucleic acid) dataset consisting of 90 million pairs was used. The proposed model reduces imbalanced data sets drawn from large-scale data sets without loss of accuracy. CONCLUSIONS The obtained results show that the MapReduce-based K-NN classifier provides accurate results for big DNA data.
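The MapReduce pattern for K-NN can be sketched in a single process; each "map" task emits its partition's local top-k candidates and the "reduce" step merges them and votes. This is an illustration of the general pattern with hypothetical names, not the paper's Hadoop pipeline:

```python
import math
from collections import Counter

def knn_mapreduce(query, partitions, k=3):
    """MapReduce-style k-NN over a partitioned training set.
    Each partition is a list of (point, label) pairs."""
    # Map phase: per-partition local top-k candidates.
    candidates = []
    for part in partitions:
        part_sorted = sorted(part, key=lambda pl: math.dist(query, pl[0]))
        candidates.extend(part_sorted[:k])
    # Reduce phase: global top-k among the candidates, then majority vote.
    top = sorted(candidates, key=lambda pl: math.dist(query, pl[0]))[:k]
    return Counter(label for _, label in top).most_common(1)[0][0]
```

The correctness of the merge rests on the fact that the global k nearest neighbors are always among the per-partition k nearest, so each mapper only needs to ship k candidates to the reducer.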
Affiliation(s)
- Sarwar Kamal
- Computer Science and Engineering, East West University, Dhaka, Bangladesh
- Nilanjan Dey
- Techno India Institute of Technology, Kolkata, India
- Amira S Ashour
- Department of Electronics and Electrical Communications Engineering, Faculty of Engineering, Tanta University, Tanta, Egypt.
- V Santhi
- School of Computing Science and Engineering, VIT University, Vellore, Tamil Nadu, India
11
Verbiest N, Vluymans S, Cornelis C, García-Pedrajas N, Saeys Y. Improving nearest neighbor classification using Ensembles of Evolutionary Generated Prototype Subsets. Appl Soft Comput 2016. [DOI: 10.1016/j.asoc.2016.03.015]
12
Rezaei M, Nezamabadi-pour H. Using gravitational search algorithm in prototype generation for nearest neighbor classification. Neurocomputing 2015. [DOI: 10.1016/j.neucom.2015.01.008]
13
Triguero I, Garcia S, Herrera F. SEG-SSC: a framework based on synthetic examples generation for self-labeled semi-supervised classification. IEEE Trans Cybern 2015; 45:622-634. [PMID: 25014988] [DOI: 10.1109/tcyb.2014.2332003]
Abstract
Self-labeled techniques are semi-supervised classification methods that address the shortage of labeled examples via a self-learning process based on supervised models. They progressively classify unlabeled data and use them to modify the hypothesis learned from labeled samples. Most relevant proposals are currently inspired by boosting schemes to iteratively enlarge the labeled set. Despite their effectiveness, these methods are constrained by the number of labeled examples and their distribution, which in many cases is sparse and scattered. The aim of this paper is to design a framework, named synthetic examples generation for self-labeled semi-supervised classification, to improve the classification performance of any given self-labeled method by using synthetic labeled data. These are generated via an oversampling technique and a positioning adjustment model that use both labeled and unlabeled examples as reference. Next, these examples are incorporated in the main stages of the self-labeling process. The principal aspects of the proposed framework are: 1) introducing diversity to the multiple classifiers used by using more (new) labeled data; 2) fulfilling labeled data distribution with the aid of unlabeled data; and 3) being applicable to any kind of self-labeled method. In our empirical studies, we have applied this scheme to four recent self-labeled methods, testing their capabilities with a large number of data sets. We show that this framework significantly improves the classification capabilities of self-labeled techniques.
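The oversampling step at the heart of this framework can be sketched as a SMOTE-style interpolation; this is a minimal illustration (the function name is hypothetical), while SEG-SSC itself adds a positioning-adjustment model that also uses unlabeled examples:

```python
import random

def synthesize(x, neighbor):
    """Create a synthetic example by linear interpolation between a labeled
    example x and one of its neighbors: x + t * (neighbor - x), t in [0, 1)."""
    t = random.random()
    return [xi + t * (ni - xi) for xi, ni in zip(x, neighbor)]
```

Each synthetic point lies on the segment between the two parents, so it inherits a plausible label from that local region, which is what lets the self-labeling process start from a denser labeled set.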
14
Triguero I, Peralta D, Bacardit J, García S, Herrera F. MRPR: A MapReduce solution for prototype reduction in big data classification. Neurocomputing 2015. [DOI: 10.1016/j.neucom.2014.04.078]
15
López V, Triguero I, Carmona CJ, García S, Herrera F. Addressing imbalanced classification with instance generation techniques: IPADE-ID. Neurocomputing 2014. [DOI: 10.1016/j.neucom.2013.01.050]
17
Bergmeir C, Triguero I, Molina D, Aznarte JL, Benitez JM. Time series modeling and forecasting using memetic algorithms for regime-switching models. IEEE Trans Neural Netw Learn Syst 2012; 23:1841-1847. [PMID: 24808077] [DOI: 10.1109/tnnls.2012.2216898]
Abstract
In this brief, we present a novel model fitting procedure for the neuro-coefficient smooth transition autoregressive model (NCSTAR), as presented by Medeiros and Veiga. The model is endowed with a statistically founded iterative building procedure and can be interpreted in terms of fuzzy rule-based systems. The interpretability of the generated models and a mathematically sound building procedure are two very important properties of forecasting models. The model fitting procedure employed by the original NCSTAR is a combination of initial parameter estimation by a grid search procedure with a traditional local search algorithm. We propose a different fitting procedure, using a memetic algorithm, in order to obtain more accurate models. An empirical evaluation of the method is performed, applying it to various real-world time series originating from three forecasting competitions. The results indicate that we can significantly enhance the accuracy of the models, making them competitive to models commonly used in the field.
18
Wen G, Jiang L, Wen J, Wei J, Yu Z. Perceptual relativity-based local hyperplane classification. Neurocomputing 2012. [DOI: 10.1016/j.neucom.2012.03.018]
19
Derrac J, Verbiest N, García S, Cornelis C, Herrera F. On the use of evolutionary feature selection for improving fuzzy rough set based prototype selection. Soft Comput 2012. [DOI: 10.1007/s00500-012-0888-3]
20
Enhancing evolutionary instance selection algorithms by means of fuzzy rough set based feature selection. Inf Sci (N Y) 2012. [DOI: 10.1016/j.ins.2011.09.027]
21
García S, Derrac J, Cano JR, Herrera F. Prototype selection for nearest neighbor classification: taxonomy and empirical study. IEEE Trans Pattern Anal Mach Intell 2012; 34:417-435. [PMID: 21768651] [DOI: 10.1109/tpami.2011.142]
Abstract
The nearest neighbor classifier is one of the most used and well-known techniques for performing recognition tasks. It has also demonstrated itself to be one of the most useful algorithms in data mining in spite of its simplicity. However, the nearest neighbor classifier suffers from several drawbacks such as high storage requirements, low efficiency in classification response, and low noise tolerance. These weaknesses have been the subject of study for many researchers and many solutions have been proposed. Among them, one of the most promising solutions consists of reducing the data used for establishing a classification rule (training data) by means of selecting relevant prototypes. Many prototype selection methods exist in the literature and the research in this area is still advancing. Different properties could be observed in the definition of them, but no formal categorization has been established yet. This paper provides a survey of the prototype selection methods proposed in the literature from a theoretical and empirical point of view. Considering a theoretical point of view, we propose a taxonomy based on the main characteristics presented in prototype selection and we analyze their advantages and drawbacks. Empirically, we conduct an experimental study involving different sizes of data sets for measuring their performance in terms of accuracy, reduction capabilities, and runtime. The results obtained by all the methods studied have been verified by nonparametric statistical tests. Several remarks, guidelines, and recommendations are made for the use of prototype selection for nearest neighbor classification.
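One of the classic prototype selection methods a survey of this kind catalogues is Hart's Condensed Nearest Neighbor rule, which can be sketched compactly (an illustrative reimplementation, not code from the paper):

```python
import math

def condensed_nn(examples, labels):
    """Hart's Condensed Nearest Neighbor rule: keep only the examples needed
    for 1-NN on the condensed set to classify the whole training set correctly.
    Returns the indices of the selected prototypes."""
    keep = [0]  # seed the condensed set with the first example
    changed = True
    while changed:
        changed = False
        for i, (x, y) in enumerate(zip(examples, labels)):
            # nearest prototype currently in the condensed set
            j = min(keep, key=lambda k: math.dist(x, examples[k]))
            if labels[j] != y:   # misclassified by the condensed set,
                keep.append(i)   # so absorb this example as a prototype
                changed = True
    return keep
```

On two well-separated clusters the rule keeps roughly one example per cluster, which illustrates the reduction-versus-accuracy trade-off the survey measures across methods.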