1
Devi Priya R, Sivaraj R, Abraham A, Pravin T, Sivasankar P, Anitha N. Multi-Objective Particle Swarm Optimization Based Preprocessing of Multi-Class Extremely Imbalanced Datasets. Int J Uncertain Fuzz 2022. [DOI: 10.1142/s0218488522500209] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/22/2022]
Abstract
Today’s datasets are usually very large, with many features, and analyzing them is a tedious task. In classification especially, selecting the attributes that are salient for the process is demanding. It is more difficult when the target class attribute has many class labels, and many researchers have therefore introduced feature selection methods for multi-class classification. The process becomes more tedious still when the attribute values are imbalanced, a problem for which researchers have also contributed many methods. However, there is insufficient research that handles extreme imbalance and feature selection together, and this paper aims to bridge that gap. Here, Particle Swarm Optimization (PSO), an efficient evolutionary algorithm, is used to handle the imbalanced dataset, and the feature selection process is enhanced with the required functionality. First, Multi-objective Particle Swarm Optimization is used to transform the imbalanced dataset into a balanced one, and then another version of Multi-objective Particle Swarm Optimization is used to select the significant features. The proposed methodology is applied to eight multi-class extremely imbalanced datasets, and the experimental results are better than those of existing methods in terms of classification accuracy, G-mean, and F-measure. The results, validated using the Friedman test, also confirm that the proposed methodology effectively balances the dataset with fewer features than other methods.
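For readers unfamiliar with the G-mean metric named in the abstract above, the following is a minimal sketch of one common multi-class definition: the geometric mean of the per-class recalls. The abstract does not state which definition the paper uses, so this is an assumption, and the function name `g_mean` is hypothetical.

```python
import math
from collections import Counter

def g_mean(y_true, y_pred):
    """Geometric mean of per-class recall (one common multi-class G-mean).

    Assumption: every class of interest appears at least once in y_true.
    """
    total = Counter(y_true)  # number of true instances per class
    # number of correctly predicted instances per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    recalls = [correct[c] / total[c] for c in total]
    return math.prod(recalls) ** (1.0 / len(recalls))
```

With `y_true = [0, 0, 0, 0, 1, 1, 2]` and `y_pred = [0, 0, 0, 1, 1, 1, 0]`, five of seven predictions are correct, yet the G-mean is zero because the single instance of class 2 is never recovered. This is why the metric is often preferred over plain accuracy on imbalanced data.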
Affiliation(s)
- R. Devi Priya: Department of Computer Science and Engineering, Centre for IoT and Artificial Intelligence, KPR Institute of Engineering and Technology, Coimbatore, Tamil Nadu, India
- R. Sivaraj: Department of Computer Science and Engineering, Nandha Engineering College, Erode, Tamil Nadu, India
- Ajith Abraham: Center for Artificial Intelligence, Innopolis University, Innopolis, Russia; Machine Intelligence Research Labs (MIR Labs), Auburn, Washington 98071, USA
- T. Pravin: Department of Mechanical Engineering, SNS College of Engineering, Coimbatore, India
- P. Sivasankar: Department of Petroleum Engineering & Earth Sciences, Indian Institute of Petroleum and Energy, Visakhapatnam, India
- N. Anitha: Department of Information Technology, Kongu Engineering College, Erode, Tamil Nadu, India
2
Pineda-Jaramillo J, Barrera-Jiménez H, Mesa-Arango R. Unveiling the relevance of traffic enforcement cameras on the severity of vehicle-pedestrian collisions in an urban environment with machine learning models. Journal of Safety Research 2022; 81:225-238. [PMID: 35589294 DOI: 10.1016/j.jsr.2022.02.014] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/16/2021] [Revised: 10/27/2021] [Accepted: 02/23/2022] [Indexed: 06/15/2023]
Abstract
PURPOSE: Road traffic collisions are one of the leading causes of violent fatalities around the world, and pedestrians are among the most vulnerable road users in such incidents. Since walking is highly promoted in urban areas to alleviate motor-vehicle externalities, it is paramount to understand the causes associated with vehicle-pedestrian collisions and their severity in order to provide safe environments. Although traffic enforcement cameras can address vehicle-vehicle collisions, little is known about their effectiveness with respect to vehicle-pedestrian incidents. METHODOLOGY: In this study, we trained a set of machine learning models to predict whether a vehicle-pedestrian collision results in an injury or a fatality, and the most suitable model was used to investigate the contributing features associated with such events, with emphasis on the impact of traffic enforcement cameras. In addition to traffic enforcement camera proximity, features associated with the collision, weather, vehicle, victim, and infrastructure are included in the model to reduce unobserved heterogeneity. RESULTS: A Linear Discriminant Analysis model surpasses the other machine learning models on the evaluation metrics. The results reveal that the age and gender of the victim, the involvement of larger vehicles in the collision, and the quality of the illumination are the causes associated with pedestrian fatalities. On the other hand, the involvement of motorcycles and collisions that occurred in densely populated locations are the causes associated with pedestrian injuries. CONCLUSIONS: This investigation demonstrates how to integrate machine learning into a vehicle-pedestrian crash analysis to understand the direction and magnitude of covariates in the corresponding severity outcome. Furthermore, it highlights the remarkable effect that traffic enforcement cameras and other features have on vehicle-pedestrian crash severity. These results provide actionable guidance for educational campaigns, enhanced traffic engineering, and infrastructure improvements that could be implemented in the analyzed region to provide safer transportation.
Affiliation(s)
- Rodrigo Mesa-Arango: Department of Civil Engineering and Construction Management, Florida Institute of Technology, USA
3
Dai W, Ning C, Nan J, Wang D. Stochastic configuration networks for imbalanced data classification. Int J Mach Learn Cyb 2022. [DOI: 10.1007/s13042-022-01565-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/27/2022]
4
Li DC, Shi QS, Lin YS, Lin LS. A Boundary-Information-Based Oversampling Approach to Improve Learning Performance for Imbalanced Datasets. Entropy 2022; 24:e24030322. [PMID: 35327833 PMCID: PMC8947752 DOI: 10.3390/e24030322] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/04/2022] [Revised: 02/19/2022] [Accepted: 02/21/2022] [Indexed: 11/16/2022]
Abstract
Oversampling is the most popular data preprocessing technique. It makes traditional classifiers available for learning from imbalanced data. Through an overall review of oversampling techniques (oversamplers), we find that some of them can be regarded as danger-information-based oversamplers (DIBOs), which create samples near danger areas so that these positive examples can be correctly classified, and others as safe-information-based oversamplers (SIBOs), which create samples near safe areas to increase the correct rate of predicted positive values. However, DIBOs cause too many negative examples in the overlapped areas to be misclassified, and SIBOs cause too many borderline positive examples to be misclassified. Based on their advantages and disadvantages, a boundary-information-based oversampler (BIBO) is proposed. First, a concept of boundary information that considers safe information and dangerous information at the same time is proposed, so that samples are created near decision boundaries. The experimental results show that DIBOs and BIBO perform better than SIBOs on the basic metrics of recall and negative class precision; SIBOs and BIBO perform better than DIBOs on the basic metrics of specificity and positive class precision; and BIBO is better than both DIBOs and SIBOs in terms of integrated metrics.
Affiliation(s)
- Der-Chiang Li: Department of Industrial and Information Management, National Cheng Kung University, University Road, Tainan 70101, Taiwan
- Qi-Shi Shi: Department of Industrial and Information Management, National Cheng Kung University, University Road, Tainan 70101, Taiwan
- Yao-San Lin: Singapore Centre for Chinese Language, Nanyang Technological University, Ghim Moh Road, Singapore 279623, Singapore
- Liang-Sian Lin: Department of Information Management, National Taipei University of Nursing and Health Sciences, Ming-te Road, Taipei 112303, Taiwan (correspondence; Tel.: +886-2822-7101, ext. 1234)
5
Chennuru VK, Timmappareddy SR. Simulated annealing based undersampling (SAUS): a hybrid multi-objective optimization method to tackle class imbalance. Appl Intell 2021. [DOI: 10.1007/s10489-021-02369-4] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Indexed: 11/29/2022]
6
Pascual-Triana JD, Charte D, Andrés Arroyo M, Fernández A, Herrera F. Revisiting data complexity metrics based on morphology for overlap and imbalance: snapshot, new overlap number of balls metrics and singular problems prospect. Knowl Inf Syst 2021. [DOI: 10.1007/s10115-021-01577-1] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Indexed: 10/21/2022]
7
Shahee SA, Ananthakumar U. An overlap sensitive neural network for class imbalanced data. Data Min Knowl Discov 2021. [DOI: 10.1007/s10618-021-00766-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/21/2022]
8
Wu X, Yang Y, Ren L. Entropy difference and kernel-based oversampling technique for imbalanced data learning. Intell Data Anal 2020. [DOI: 10.3233/ida-194761] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Indexed: 11/15/2022]
Abstract
Class imbalance is a common problem in real-world datasets, where one class contains few instances and the other contains many. It is notably difficult to develop an effective model with traditional data mining and machine learning algorithms unless data preprocessing techniques are used to balance the dataset. Oversampling is often used as a preprocessing method for imbalanced datasets. Specifically, synthetic oversampling techniques balance the number of training instances between the majority class and the minority class by generating extra artificial minority class instances. However, current oversampling techniques consider only the imbalance in quantity and pay no attention to whether the distribution is balanced. Therefore, this paper proposes an entropy difference and kernel-based SMOTE (EDKS), which measures the degree of imbalance of a dataset's distribution by entropy difference and overcomes the limitation of SMOTE on nonlinear problems by oversampling in the feature space of a support vector machine classifier. First, the EDKS method maps the input data into a feature space to increase the separability of the data. Then EDKS calculates the entropy difference in kernel space, determines the majority class and minority class, and finds the sparse regions in the minority class. Moreover, the proposed method balances the data distribution by synthesizing new instances and evaluating their retention capability. The algorithm can effectively distinguish datasets with the same imbalance ratio but different distributions. The experimental study compares the performance of the method against state-of-the-art algorithms and demonstrates that the proposed approach is competitive with them on multiple benchmark imbalanced datasets.
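The synthetic oversampling step that the abstract above builds on can be illustrated with a minimal sketch of plain SMOTE-style interpolation in input space. This is not the EDKS method: it performs no kernel mapping and computes no entropy difference. The function name `smote_like` and its parameters are hypothetical and chosen only for illustration.

```python
import math
import random

def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (plain SMOTE idea)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point, excluding itself
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        # the new point lies on the segment between base and neighbour
        synthetic.append([b + gap * (q - b) for b, q in zip(base, neighbour)])
    return synthetic
```

Because every synthetic point lies on a segment between two existing minority samples, the sketch only equalizes class counts. It does not consider how the minority class is distributed, which is the limitation the abstract says EDKS is designed to address.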
9
IA-SUWO: An Improving Adaptive semi-unsupervised weighted oversampling for imbalanced classification problems. Knowl Based Syst 2020. [DOI: 10.1016/j.knosys.2020.106116] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Indexed: 11/21/2022]
10
Yuan BW, Luo XG, Zhang ZL, Yu Y, Huo HW, Johannes T, Zou XD. A novel density-based adaptive k nearest neighbor method for dealing with overlapping problem in imbalanced datasets. Neural Comput Appl 2020. [DOI: 10.1007/s00521-020-05256-0] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Indexed: 11/29/2022]
11
Zhang Y, Qian X, Wang J, Gendeel M. Fuzzy rule-based classification system using multi-population quantum evolutionary algorithm with contradictory rule reconstruction. Appl Intell 2019. [DOI: 10.1007/s10489-019-01478-5] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 10/26/2022]
12
Liu G, Yang Y, Li B. Fuzzy rule-based oversampling technique for imbalanced and incomplete data learning. Knowl Based Syst 2018. [DOI: 10.1016/j.knosys.2018.05.044] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.9] [Indexed: 10/14/2022]
13
Tan SC, Wang S, Watada J. A self-adaptive class-imbalance TSK neural network with applications to semiconductor defects detection. Inf Sci (N Y) 2018. [DOI: 10.1016/j.ins.2017.10.040] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Indexed: 10/18/2022]
14
Li F, Zhang X, Zhang X, Du C, Xu Y, Tian YC. Cost-sensitive and hybrid-attribute measure multi-decision tree over imbalanced data sets. Inf Sci (N Y) 2018. [DOI: 10.1016/j.ins.2017.09.013] [Citation(s) in RCA: 44] [Impact Index Per Article: 6.3] [Indexed: 10/18/2022]
15
Dynamic affinity-based classification of multi-class imbalanced data with one-versus-one decomposition: a fuzzy rough set approach. Knowl Inf Syst 2017. [DOI: 10.1007/s10115-017-1126-1] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.6] [Indexed: 10/18/2022]
16
Fernández A, Carmona CJ, José del Jesus M, Herrera F. A Pareto-based Ensemble with Feature and Instance Selection for Learning from Multi-Class Imbalanced Datasets. Int J Neural Syst 2017. [DOI: 10.1142/s0129065717500289] [Citation(s) in RCA: 36] [Impact Index Per Article: 4.5] [Indexed: 11/18/2022]
Abstract
Imbalanced classification concerns problems that have an uneven distribution among classes. In addition, when instances are located in overlapped areas, correct modeling of the problem becomes harder. Current solutions for both issues often focus on the binary case, as multi-class datasets require additional effort to address. In this research, we overcome these problems by combining feature selection and instance selection. Feature selection simplifies the overlapping areas, easing the generation of rules to distinguish among the classes. Selecting instances from all classes addresses the imbalance itself by finding the most appropriate class distribution for the learning task, and may also remove noise and difficult borderline examples. To obtain an optimal joint set of features and instances, we embedded the search for both in a Multi-Objective Evolutionary Algorithm, using the C4.5 decision tree as the baseline classifier in this wrapper approach. The multi-objective scheme offers a double advantage: the search space becomes broader, and it provides a set of different solutions from which to build an ensemble of classifiers. This proposal has been compared against several state-of-the-art solutions for imbalanced classification, showing excellent results in both binary and multi-class problems.
Affiliation(s)
- Alberto Fernández: Department of Computer Science and Artificial Intelligence, University of Granada, Granada 18071, Spain
- Cristobal José Carmona: Department of Civil Engineering, University of Burgos, Burgos 09006, Spain; Leicester School of Pharmacy, De Montfort University, Leicester, LE1 9BH, UK
- Francisco Herrera: Department of Computer Science and Artificial Intelligence, University of Granada, Granada 18071, Spain; Faculty of Computing and Information Technology, North Jeddah, King Abdulaziz University (KAU), Jeddah 80200, Saudi Arabia
17
18
Jamalabadi H, Nasrollahi H, Alizadeh S, Nadjar Araabi B, Nili Ahamadabadi M. Competitive interaction reasoning: A bio-inspired reasoning method for fuzzy rule based classification systems. Inf Sci (N Y) 2016. [DOI: 10.1016/j.ins.2016.02.052] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 10/22/2022]
19
Fernández A, Elkano M, Galar M, Sanz JA, Alshomrani S, Bustince H, Herrera F. Enhancing evolutionary fuzzy systems for multi-class problems: Distance-based relative competence weighting with truncated confidences (DRCW-TC). Int J Approx Reason 2016. [DOI: 10.1016/j.ijar.2016.02.005] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 12/15/2022]
20
Xu Y, Yang Z, Zhang Y, Pan X, Wang L. A maximum margin and minimum volume hyper-spheres machine with pinball loss for imbalanced data classification. Knowl Based Syst 2016. [DOI: 10.1016/j.knosys.2015.12.005] [Citation(s) in RCA: 32] [Impact Index Per Article: 3.6] [Indexed: 10/22/2022]
21
Rana ZA, Mian MA, Shamail S. Improving Recall of software defect prediction models using association mining. Knowl Based Syst 2015. [DOI: 10.1016/j.knosys.2015.10.009] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.2] [Indexed: 11/24/2022]
22
Zhang Z, Gao G, Tian Y. Multi-kernel multi-criteria optimization classifier with fuzzification and penalty factors for predicting biological activity. Knowl Based Syst 2015. [DOI: 10.1016/j.knosys.2015.07.011] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.7] [Indexed: 11/16/2022]
23
Fernández A, López V, del Jesus MJ, Herrera F. Revisiting Evolutionary Fuzzy Systems: Taxonomy, applications, new trends and challenges. Knowl Based Syst 2015. [DOI: 10.1016/j.knosys.2015.01.013] [Citation(s) in RCA: 54] [Impact Index Per Article: 5.4] [Indexed: 10/24/2022]