1
|
Li DC, Shi QS, Lin YS, Lin LS. A Boundary-Information-Based Oversampling Approach to Improve Learning Performance for Imbalanced Datasets. ENTROPY 2022; 24:e24030322. [PMID: 35327833 PMCID: PMC8947752 DOI: 10.3390/e24030322] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/04/2022] [Revised: 02/19/2022] [Accepted: 02/21/2022] [Indexed: 11/16/2022]
Abstract
Oversampling is the most popular data preprocessing technique. It makes traditional classifiers available for learning from imbalanced data. Through an overall review of oversampling techniques (oversamplers), we find that some of them can be regarded as danger-information-based oversamplers (DIBOs) that create samples near danger areas to make it possible for these positive examples to be correctly classified, and others are safe-information-based oversamplers (SIBOs) that create samples near safe areas to increase the correct rate of predicted positive values. However, DIBOs cause misclassification of too many negative examples in the overlapped areas, and SIBOs cause incorrect classification of too many borderline positive examples. Based on their advantages and disadvantages, a boundary-information-based oversampler (BIBO) is proposed. First, a concept of boundary information that considers safe information and dangerous information at the same time is proposed, so that samples are created near decision boundaries. The experimental results show that DIBOs and BIBO perform better than SIBOs on the basic metrics of recall and negative class precision; SIBOs and BIBO perform better than DIBOs on the basic metrics of specificity and positive class precision; and BIBO is better than both DIBOs and SIBOs in terms of integrated metrics.
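The four basic metrics named in this abstract can be illustrated with a short sketch. It assumes the usual confusion-matrix definitions (recall = TP/(TP+FN), specificity = TN/(TN+FP), positive class precision = TP/(TP+FP), negative class precision = TN/(TN+FN)); the paper itself may define them differently. The names `Confusion` and `basic_metrics` are hypothetical, and the counts in the example are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    """Binary confusion-matrix counts; the minority class is 'positive'."""
    tp: int  # positive examples predicted positive
    fn: int  # positive examples predicted negative
    fp: int  # negative examples predicted positive
    tn: int  # negative examples predicted negative


def _ratio(num: int, den: int) -> float:
    # Return 0.0 for an empty denominator instead of raising.
    return num / den if den else 0.0


def basic_metrics(c: Confusion) -> dict:
    """Compute the four basic metrics under the assumed standard definitions."""
    return {
        "recall": _ratio(c.tp, c.tp + c.fn),
        "specificity": _ratio(c.tn, c.tn + c.fp),
        "positive_precision": _ratio(c.tp, c.tp + c.fp),
        "negative_precision": _ratio(c.tn, c.tn + c.fn),
    }


# Illustrative counts only: 10 positive and 90 negative examples.
example = Confusion(tp=8, fn=2, fp=10, tn=80)
metrics = basic_metrics(example)
```

With these counts, recall is high while positive class precision is low, which is the kind of trade-off the abstract attributes to oversamplers that place samples near danger areas.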
Affiliation(s)
- Der-Chiang Li
- Department of Industrial and Information Management, National Cheng Kung University, University Road, Tainan 70101, Taiwan
- Qi-Shi Shi
- Department of Industrial and Information Management, National Cheng Kung University, University Road, Tainan 70101, Taiwan
- Yao-San Lin
- Singapore Centre for Chinese Language, Nanyang Technological University, Ghim Moh Road, Singapore 279623, Singapore
- Liang-Sian Lin
- Department of Information Management, National Taipei University of Nursing and Health Sciences, Ming-te Road, Taipei 112303, Taiwan
- Correspondence: Tel.: +886-2822-7101 (ext. 1234)
|
2
|
|
3
|
A Review of Fuzzy and Pattern-Based Approaches for Class Imbalance Problems. APPLIED SCIENCES-BASEL 2021. [DOI: 10.3390/app11146310] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 11/17/2022]
Abstract
Imbalanced databases are a recurrent problem in real-world domains such as medical diagnosis, fraud detection, and pattern recognition. In class imbalance problems, classifiers are commonly biased toward the class with more objects (majority class) and ignore the class with fewer objects (minority class). There are different ways to solve the class imbalance problem, and there has been a trend towards the usage of patterns and fuzzy approaches due to their favorable results. In this paper, we provide an in-depth review of popular methods for imbalanced databases related to patterns and fuzzy approaches. The reviewed papers include classifiers, data preprocessing, and evaluation metrics. We identify different application domains and describe how the methods are used. Finally, we suggest further research directions according to the analysis of the reviewed papers and the trend of the state of the art.
|
4
|
Pei W, Xue B, Shang L, Zhang M. Genetic programming for development of cost-sensitive classifiers for binary high-dimensional unbalanced classification. Appl Soft Comput 2021. [DOI: 10.1016/j.asoc.2020.106989] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/22/2022]
|
5
|
Jing XY, Zhang X, Zhu X, Wu F, You X, Gao Y, Shan S, Yang JY. Multiset Feature Learning for Highly Imbalanced Data Classification. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 2021; 43:139-156. [PMID: 31331881 DOI: 10.1109/tpami.2019.2929166] [Citation(s) in RCA: 27] [Impact Index Per Article: 6.8] [Indexed: 06/10/2023]
Abstract
With the expansion of data, imbalanced data has become increasingly common. When the imbalance ratio (IR) of data is high, the classification performance of most existing imbalanced learning methods declines seriously. In this paper, we systematically investigate the highly imbalanced data classification problem, and propose an uncorrelated cost-sensitive multiset learning (UCML) approach for it. Specifically, UCML first constructs multiple balanced subsets through random partition, and then employs multiset feature learning (MFL) to learn discriminant features from the constructed multiset. To enhance the usability of each subset and deal with the non-linearity issue that exists in each subset, we further propose a deep metric based UCML (DM-UCML) approach. DM-UCML introduces the generative adversarial network technique into the multiset constructing process, such that each subset can have a distribution similar to that of the original dataset. To cope with the non-linearity issue, DM-UCML integrates deep metric learning with MFL, such that more favorable performance can be achieved. In addition, DM-UCML designs a new discriminant term to enhance the discriminability of learned metrics. Experiments on eight traditional highly class-imbalanced datasets and two large-scale datasets indicate that the proposed approaches outperform state-of-the-art highly imbalanced learning methods and are more robust to high IR.
|
6
|
de la Cal EA, Villar JR, Vergara PM, Herrero Á, Sedano J. Design issues in Time Series dataset balancing algorithms. Neural Comput Appl 2020. [DOI: 10.1007/s00521-019-04011-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/28/2022]
|
7
|
Sharif M, Alesheikh AA, Tashayo B. CaFIRST: A context-aware hybrid fuzzy inference system for the similarity measure of multivariate trajectories. JOURNAL OF INTELLIGENT & FUZZY SYSTEMS 2019. [DOI: 10.3233/jifs-181252] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Indexed: 11/15/2022]
Affiliation(s)
- Mohammad Sharif
- Department of Geography, Faculty of Literature and Human Science, University of Hormozgan, Bandar Abbas, Iran
- Ali Asghar Alesheikh
- Department of Geospatial Information Systems, Faculty of Geodesy and Geomatics Engineering, K. N. Toosi University of Technology, Tehran, Iran
- Behnam Tashayo
- Department of Surveying Engineering, Faculty of Civil Engineering and Transportation, University of Isfahan, Isfahan, Iran
|
8
|
Huang Z, Yang C, Chen X, Huang K, Xie Y. Adaptive over-sampling method for classification with application to imbalanced datasets in aluminum electrolysis. Neural Comput Appl 2019. [DOI: 10.1007/s00521-019-04208-7] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Indexed: 10/27/2022]
|
9
|
Sayed GI, Tharwat A, Hassanien AE. Chaotic dragonfly algorithm: an improved metaheuristic algorithm for feature selection. APPL INTELL 2018. [DOI: 10.1007/s10489-018-1261-8] [Citation(s) in RCA: 84] [Impact Index Per Article: 12.0] [Indexed: 10/28/2022]
|
10
|
|
11
|
A bi-objective hybrid algorithm for the classification of imbalanced noisy and borderline data sets. Pattern Anal Appl 2018. [DOI: 10.1007/s10044-018-0693-4] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Indexed: 10/17/2022]
|
12
|
Tan SC, Wang S, Watada J. A self-adaptive class-imbalance TSK neural network with applications to semiconductor defects detection. Inf Sci (N Y) 2018. [DOI: 10.1016/j.ins.2017.10.040] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Indexed: 10/18/2022]
|
13
|
|
14
|
Dynamic affinity-based classification of multi-class imbalanced data with one-versus-one decomposition: a fuzzy rough set approach. Knowl Inf Syst 2017. [DOI: 10.1007/s10115-017-1126-1] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.6] [Indexed: 10/18/2022]
|
15
|
Hassanien AE, Tharwat A, Own HS. Computational model for vitamin D deficiency using hair mineral analysis. Comput Biol Chem 2017; 70:198-210. [PMID: 28923545 DOI: 10.1016/j.compbiolchem.2017.08.015] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 03/15/2017] [Revised: 08/09/2017] [Accepted: 08/22/2017] [Indexed: 01/03/2023]
Abstract
Vitamin D deficiency is prevalent in the Arabian Gulf region, especially among women. Recent studies show that vitamin D deficiency is associated with the mineral status of a patient. Therefore, it is important to assess the mineral status of the patient to reveal the hidden mineral imbalance associated with vitamin D deficiency. A well-known test such as the red blood cell test is fairly expensive, invasive, and less informative. On the other hand, a hair mineral analysis can be considered an accurate, excellent, highly informative tool to measure mineral imbalance associated with vitamin D deficiency. In this study, 118 apparently healthy Kuwaiti women were assessed for their mineral levels and vitamin D status by a hair mineral analysis (HMA). This information was used to build a computerized model that would predict vitamin D deficiency based on its association with the levels and ratios of minerals. The first phase of the proposed model introduces a novel hybrid optimization algorithm, which can be considered an improvement of the Bat Algorithm (BA), to select the most discriminative features. The improvement includes using the mutation process of the Genetic Algorithm (GA) to update the positions of bats with the aim of speeding up convergence, thus making the algorithm more feasible for wider ranges of real-world applications. Due to the imbalanced class distribution in our dataset, in the second phase, different sampling methods such as Random Under-Sampling, Random Over-Sampling, and Synthetic Minority Oversampling Technique are used to solve the problem of imbalanced datasets. In the third phase, an AdaBoost ensemble classifier is used to predict vitamin D deficiency. The results showed that the proposed model achieved good results in detecting vitamin D deficiency.
Affiliation(s)
- Aboul Ella Hassanien
- Faculty of Computers and Information, Cairo University, Egypt; Scientific Research Group in Egypt (SRGE), Egypt.
- Alaa Tharwat
- Faculty of Engineering, Suez Canal University, Egypt; Faculty of Computer Science and Engineering, Frankfurt University of Applied Sciences, 60318 Frankfurt am Main, Germany; Scientific Research Group in Egypt (SRGE), Egypt.
- Hala S Own
- Department of Solar and Space Research, National Research Institute of Astronomy and Geophysics, El-Marsad Street, P.O. Box 11421 Helwan, Egypt.
|
16
|
Chandrasekar R, Khare N. BSFS: Design and Development of Exponential Brain Storm Fuzzy System for Data Classification. INT J UNCERTAIN FUZZ 2017. [DOI: 10.1142/s0218488517500106] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/18/2022]
Abstract
The inductive learning of fuzzy rule classifiers suffers in rule generation and rule optimization when the search space or the number of variables becomes large. This creates the new idea of making the fuzzy system with precise rules leading to less scalability and improved accuracy. Accordingly, different approaches have been presented in the literature for finding optimal fuzzy rules using optimization algorithms. Here, we make use of the brain storm optimization (BSO) algorithm for rule optimization. In this paper, a new fuzzy system called the exponential brain storm fuzzy system (BSFS) is developed by modifying the rule definition process of the traditional fuzzy system. For rule derivation, we present an algorithm called EBSO, which modifies the BSO algorithm with an exponential model. Also, the membership function is designed using a simple uniform distribution-based approach. Finally, data classification is performed with the new BSFS system using three medical databases: PID, Cleveland and DRD. The experiments showed that the proposed BSFS clearly outperformed on all three datasets by reaching the maximum accuracy.
Affiliation(s)
- R. Chandrasekar
- School of Information Technology and Engineering, VIT University, Vellore, Tamil Nadu 632014, India
- Neelu Khare
- School of Information Technology and Engineering, VIT University, Vellore, Tamil Nadu 632014, India
|
17
|
Tharwat A, Moemen YS, Hassanien AE. Classification of toxicity effects of biotransformed hepatic drugs using whale optimized support vector machines. J Biomed Inform 2017; 68:132-149. [DOI: 10.1016/j.jbi.2017.03.002] [Citation(s) in RCA: 46] [Impact Index Per Article: 5.8] [Received: 09/07/2016] [Revised: 02/09/2017] [Accepted: 03/05/2017] [Indexed: 10/20/2022]
|
18
|
Tharwat A, Moemen YS, Hassanien AE. A Predictive Model for Toxicity Effects Assessment of Biotransformed Hepatic Drugs Using Iterative Sampling Method. Sci Rep 2016; 6:38660. [PMID: 27934950 PMCID: PMC5146749 DOI: 10.1038/srep38660] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.8] [Received: 07/06/2016] [Accepted: 11/11/2016] [Indexed: 12/03/2022]
Abstract
Measuring toxicity is one of the main steps in drug development. Hence, there is a high demand for computational models to predict the toxicity effects of potential drugs. In this study, we used a dataset that consists of four toxicity effects: mutagenic, tumorigenic, irritant and reproductive effects. The proposed model consists of three phases. In the first phase, rough set-based methods are used to select the most discriminative features for reducing the classification time and improving the classification performance. Due to the imbalanced class distribution, in the second phase, different sampling methods such as Random Under-Sampling, Random Over-Sampling and Synthetic Minority Oversampling Technique are used to solve the problem of imbalanced datasets. The ITerative Sampling (ITS) method is proposed to avoid the limitations of those methods. The ITS method has two steps. The first step (sampling step) iteratively modifies the prior distribution of the minority and majority classes. In the second step, a data cleaning method is used to remove the overlapping that is produced by the first step. In the third phase, a Bagging classifier is used to classify an unknown drug as toxic or non-toxic. The experimental results showed that the proposed model performed well in classifying the unknown samples according to all toxic effects in the imbalanced datasets.
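As a rough illustration of the two simplest baseline samplers named in this abstract, the sketch below balances a two-class dataset by random over-sampling (duplicating minority examples) and random under-sampling (discarding majority examples). It is not the ITS method and not SMOTE; the function names, the list-based data representation, and the `seed` parameter are assumptions made for the example.

```python
import random


def random_oversample(minority, majority, seed=0):
    """Duplicate randomly chosen minority examples until both classes match in size."""
    rng = random.Random(seed)
    deficit = max(0, len(majority) - len(minority))
    extra = [rng.choice(minority) for _ in range(deficit)]
    return list(minority) + extra, list(majority)


def random_undersample(minority, majority, seed=0):
    """Keep a random subset of the majority class the same size as the minority class."""
    rng = random.Random(seed)
    keep = min(len(minority), len(majority))
    return list(minority), rng.sample(list(majority), keep)


# Toy data: 3 minority examples and 9 majority examples (values are arbitrary).
minority = [("pos", i) for i in range(3)]
majority = [("neg", i) for i in range(9)]

over_min, over_maj = random_oversample(minority, majority)
under_min, under_maj = random_undersample(minority, majority)
```

Over-sampling keeps every original example but repeats minority points, while under-sampling throws majority information away; these are common reasons for seeking alternatives to the baseline samplers.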
Affiliation(s)
- Alaa Tharwat
- Faculty of Engineering, Suez Canal University, Egypt; Scientific Research Group in Egypt (SRGE), Cairo, Egypt
- Yasmine S Moemen
- Scientific Research Group in Egypt (SRGE), Cairo, Egypt; Clinical Pathology Department, National Liver Institute, Menoufia University, Egypt
- Aboul Ella Hassanien
- Scientific Research Group in Egypt (SRGE), Cairo, Egypt; Faculty of Computers and Information, Cairo University, Egypt
|
19
|
Tashayo B, Alimohammadi A. Modeling urban air pollution with optimized hierarchical fuzzy inference system. ENVIRONMENTAL SCIENCE AND POLLUTION RESEARCH INTERNATIONAL 2016; 23:19417-19431. [PMID: 27378222 DOI: 10.1007/s11356-016-7059-5] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.8] [Received: 03/25/2016] [Accepted: 06/07/2016] [Indexed: 06/06/2023]
Abstract
Environmental exposure assessments (EEA) and epidemiological studies require urban air pollution models with appropriate spatial and temporal resolutions. Uncertain available data and inflexible models can limit air pollution modeling techniques, particularly in developing countries. This paper develops a hierarchical fuzzy inference system (HFIS) to model air pollution under different land use, transportation, and meteorological conditions. To improve performance, the system treats the issue as a large-scale and high-dimensional problem and develops the proposed model using a three-step approach. In the first step, a geospatial information system (GIS) and probabilistic methods are used to preprocess the data. In the second step, a hierarchical structure is generated based on the problem. In the third step, the accuracy and complexity of the model are simultaneously optimized with a multiple objective particle swarm optimization (MOPSO) algorithm. We examine the capabilities of the proposed model for predicting daily and annual mean PM2.5 and NO2 and compare the accuracy of the results with representative models from existing literature. The benefits provided by the model features, including probabilistic preprocessing, multi-objective optimization, and hierarchical structure, are precisely evaluated by comparing five different consecutive models in terms of accuracy and complexity criteria. Fivefold cross validation is used to assess the performance of the generated models. The respective average RMSEs and coefficients of determination (R²) for the test datasets using the proposed model are as follows: daily PM2.5 = (8.13, 0.78), annual mean PM2.5 = (4.96, 0.80), daily NO2 = (5.63, 0.79), and annual mean NO2 = (2.89, 0.83). The obtained results demonstrate that the developed hierarchical fuzzy inference system can be utilized for modeling air pollution in EEA and epidemiological studies.
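The two accuracy criteria reported in this abstract, RMSE and the coefficient of determination, can be sketched as follows. The code assumes the common definitions RMSE = sqrt(mean((y - ŷ)²)) and R² = 1 - SS_res / SS_tot; the abstract does not state which variant of R² the authors used, and the function names and sample values are hypothetical.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def r_squared(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot (assumed definition)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up observations and two made-up sets of predictions.
observed = [1.0, 2.0, 3.0]
perfect = [1.0, 2.0, 3.0]      # reproduces the observations exactly
mean_only = [2.0, 2.0, 2.0]    # always predicts the mean of the observations

perfect_scores = (rmse(observed, perfect), r_squared(observed, perfect))
mean_scores = (rmse(observed, mean_only), r_squared(observed, mean_only))
```

A model that always predicts the mean scores R² = 0 under this definition, so the values of about 0.8 quoted in the abstract indicate that most of the variance in the test data is explained.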
Affiliation(s)
- Behnam Tashayo
- Department of Geospatial Information Systems, Faculty of Geodesy and Geomatics Engineering, Khajeh Nasir Toosi University of Technology, Vali-Asr Street, Mirdamad Cross, Tehran, Iran.
- Abbas Alimohammadi
- Department of Geospatial Information Systems, Faculty of Geodesy and Geomatics Engineering, Khajeh Nasir Toosi University of Technology, Vali-Asr Street, Mirdamad Cross, Tehran, Iran
- Center of Excellence in Geospatial Information Technology, Faculty of Geodesy and Geomatics Engineering, Khajeh Nasir Toosi University of Technology, Tehran, Iran
|
20
|
Galar M, Fernández A, Barrenechea E, Bustince H, Herrera F. Ordering-based pruning for improving the performance of ensembles of classifiers in the framework of imbalanced datasets. Inf Sci (N Y) 2016. [DOI: 10.1016/j.ins.2016.02.056] [Citation(s) in RCA: 50] [Impact Index Per Article: 5.6] [Indexed: 10/22/2022]
|
21
|
Sensitivity analysis of fuzzy rule-based classification systems by means of the Lipschitz condition. Soft comput 2016. [DOI: 10.1007/s00500-015-1744-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/23/2022]
|
22
|
GPFIS-CLASS: A Genetic Fuzzy System based on Genetic Programming for classification problems. Appl Soft Comput 2015. [DOI: 10.1016/j.asoc.2015.08.055] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.7] [Indexed: 11/19/2022]
|
23
|
A discussion on interpretability of linguistic rule based systems and its application to solve regression problems. Knowl Based Syst 2015. [DOI: 10.1016/j.knosys.2015.08.002] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.5] [Indexed: 11/18/2022]
|
24
|
Fernández A, López V, del Jesus MJ, Herrera F. Revisiting Evolutionary Fuzzy Systems: Taxonomy, applications, new trends and challenges. Knowl Based Syst 2015. [DOI: 10.1016/j.knosys.2015.01.013] [Citation(s) in RCA: 54] [Impact Index Per Article: 5.4] [Indexed: 10/24/2022]
|
25
|
Alshomrani S, Bawakid A, Shim SO, Fernández A, Herrera F. A proposal for evolutionary fuzzy systems using feature weighting: Dealing with overlapping in imbalanced datasets. Knowl Based Syst 2015. [DOI: 10.1016/j.knosys.2014.09.002] [Citation(s) in RCA: 40] [Impact Index Per Article: 4.0] [Indexed: 11/26/2022]
|
26
|
Antonelli M, Ducange P, Marcelloni F. An experimental study on evolutionary fuzzy classifiers designed for managing imbalanced datasets. Neurocomputing 2014. [DOI: 10.1016/j.neucom.2014.04.070] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.7] [Indexed: 11/29/2022]
|
27
|
del Río S, López V, Benítez JM, Herrera F. On the use of MapReduce for imbalanced big data using Random Forest. Inf Sci (N Y) 2014. [DOI: 10.1016/j.ins.2014.03.043] [Citation(s) in RCA: 195] [Impact Index Per Article: 17.7] [Indexed: 11/24/2022]
|
28
|
Modeling of a semantics core of linguistic terms based on an extension of hedge algebra semantics and its application. Knowl Based Syst 2014. [DOI: 10.1016/j.knosys.2014.04.047] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.5] [Indexed: 11/21/2022]
|
29
|
Multiobjective genetic classifier selection for random oracles fuzzy rule-based classifier ensembles: How beneficial is the additional diversity? Knowl Based Syst 2013. [DOI: 10.1016/j.knosys.2013.08.006] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.6] [Indexed: 11/21/2022]
|
30
|
López V, Fernández A, García S, Palade V, Herrera F. An insight into classification with imbalanced data: Empirical results and current trends on using data intrinsic characteristics. Inf Sci (N Y) 2013. [DOI: 10.1016/j.ins.2013.07.007] [Citation(s) in RCA: 932] [Impact Index Per Article: 77.7] [Indexed: 11/15/2022]
|