101
|
Błaszczyński J, Stefanowski J. Local Data Characteristics in Learning Classifiers from Imbalanced Data. ADVANCES IN DATA ANALYSIS WITH COMPUTATIONAL INTELLIGENCE METHODS 2018. [DOI: 10.1007/978-3-319-67946-4_2] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.9] [Indexed: 02/03/2023]
|
102
|
Abstract
In the era of big data, transformation of biomedical big data into valuable knowledge has been one of the most important challenges in bioinformatics. Deep learning has advanced rapidly since the early 2000s and now demonstrates state-of-the-art performance in various fields. Accordingly, application of deep learning in bioinformatics to gain insight from data has been emphasized in both academia and industry. Here, we review deep learning in bioinformatics, presenting examples of current research. To provide a useful and comprehensive perspective, we categorize research both by the bioinformatics domain (i.e. omics, biomedical imaging, biomedical signal processing) and deep learning architecture (i.e. deep neural networks, convolutional neural networks, recurrent neural networks, emergent architectures) and present brief descriptions of each study. Additionally, we discuss theoretical and practical issues of deep learning in bioinformatics and suggest future research directions. We believe that this review will provide valuable insights and serve as a starting point for researchers to apply deep learning approaches in their bioinformatics studies.
|
103
|
Broaden the minority class space for decision tree induction using antigen-derived detectors. Knowl Based Syst 2017. [DOI: 10.1016/j.knosys.2017.09.029] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Indexed: 11/18/2022]
|
104
|
|
105
|
Yom-Tov E. Predicting Drug Recalls From Internet Search Engine Queries. IEEE JOURNAL OF TRANSLATIONAL ENGINEERING IN HEALTH AND MEDICINE-JTEHM 2017; 5:4400106. [PMID: 28845371 PMCID: PMC5568020 DOI: 10.1109/jtehm.2017.2732945] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.1] [Received: 11/30/2016] [Revised: 05/21/2017] [Accepted: 07/23/2017] [Indexed: 01/01/2023]
Abstract
Batches of pharmaceuticals are sometimes recalled from the market when a safety issue or a defect is detected in specific production runs of a drug. Such problems are usually detected when patients or healthcare providers report abnormalities to medical authorities. Here, we test the hypothesis that defective production lots can be detected earlier by monitoring queries to Internet search engines. We extracted queries from the USA to the Bing search engine that mentioned one of 5195 pharmaceutical drugs during 2015, together with all recall notifications issued by the Food and Drug Administration (FDA) during that year. By using attributes that quantify the change in query volume at the state level, we attempted to predict whether a recall of a specific drug would be ordered by the FDA within a time horizon ranging from 1 to 40 days in the future. Our results show that future drug recalls can indeed be identified with an AUC of 0.791 and a lift at 5% of approximately 6 when predicting a recall occurring one day ahead. This performance degrades as predictions are made for longer periods ahead. The most indicative attributes for prediction are sudden spikes in query volume about a specific medicine in each state. Recalls of prescription drugs and those estimated to be of medium risk are more likely to be identified using search query data. These findings suggest that aggregated Internet search engine data can be used to facilitate early warning of faulty batches of medicines.
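The two figures of merit quoted in this abstract, AUC and lift at 5%, can be computed from any list of prediction scores and binary outcomes. The following is a minimal sketch of the standard definitions, not the study's own evaluation code; the function names and the toy data are hypothetical.

```python
def auc(scores, labels):
    """ROC AUC: probability that a random positive outranks a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # Ties between a positive and a negative count as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def lift_at(scores, labels, fraction=0.05):
    """Positive rate in the top-scored fraction divided by the overall positive rate."""
    k = max(1, int(round(len(scores) * fraction)))
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    top_rate = sum(y for _, y in ranked[:k]) / k
    base_rate = sum(labels) / len(labels)
    return top_rate / base_rate


# Hypothetical toy data: 20 scored cases, the 2 highest-scored ones are positives.
toy_scores = [i / 20 for i in range(20)]
toy_labels = [0] * 18 + [1, 1]
print(auc(toy_scores, toy_labels), lift_at(toy_scores, toy_labels))
```

A lift of about 6 at 5% would mean that the top 5% of scored cases contain roughly six times as many true recalls as a random 5% sample would.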
|
106
|
Babu S, Ananthanarayanan N. EMOTE: Enhanced Minority Oversampling TEchnique. JOURNAL OF INTELLIGENT & FUZZY SYSTEMS 2017. [DOI: 10.3233/jifs-161114] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Indexed: 11/15/2022]
Affiliation(s)
- S. Babu: Department of Computer Science and Applications, SCSVMV University, Enathur, Kancheepuram, Tamilnadu, India
- N.R. Ananthanarayanan: Department of Computer Science and Applications, SCSVMV University, Enathur, Kancheepuram, Tamilnadu, India
|
107
|
|
108
|
Wojciechowski S, Wilk S. Difficulty Factors and Preprocessing in Imbalanced Data Sets: An Experimental Study on Artificial Data. FOUNDATIONS OF COMPUTING AND DECISION SCIENCES 2017. [DOI: 10.1515/fcds-2017-0007] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.5] [Indexed: 11/15/2022] Open
Abstract
In this paper we describe results of an experimental study where we checked the impact of various difficulty factors in imbalanced data sets on the performance of selected classifiers applied alone or combined with several preprocessing methods. In the study we used artificial data sets in order to systematically check factors such as dimensionality, class imbalance ratio or distribution of specific types of examples (safe, borderline, rare and outliers) in the minority class. The results revealed that the latter factor was the most critical one and that it exacerbated other factors (in particular class imbalance). The best classification performance was demonstrated by non-symbolic classifiers, in particular by k-NN classifiers (with 1 or 3 neighbors, 1NN and 3NN, respectively) and by SVM. Moreover, they benefited from different preprocessing methods: SVM and 1NN worked best with undersampling, while oversampling was more beneficial for 3NN.
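The four example types named in this abstract (safe, borderline, rare, outlier) are commonly assigned by inspecting each minority example's nearest neighbours. The sketch below assumes a neighbourhood of k = 5 and the thresholds 4-5 / 2-3 / 1 / 0 same-class neighbours; the abstract itself does not state these values, and the function name is hypothetical.

```python
import math


def label_minority_examples(points, labels, minority=1, k=5):
    """Tag each minority example by how many of its k nearest neighbours share its class."""
    tags = {}
    for i, (p, y) in enumerate(zip(points, labels)):
        if y != minority:
            continue
        # Distances to every other example, nearest first.
        others = sorted(
            (math.dist(p, q), labels[j]) for j, q in enumerate(points) if j != i
        )
        same = sum(1 for _, lab in others[:k] if lab == minority)
        # Assumed thresholds for k = 5.
        if same >= 4:
            tags[i] = "safe"
        elif same >= 2:
            tags[i] = "borderline"
        elif same == 1:
            tags[i] = "rare"
        else:
            tags[i] = "outlier"
    return tags


# Hypothetical toy data: a tight minority cluster, a majority column,
# and one minority example sitting inside the majority region.
cluster = [(0, 0), (0, 0.1), (0.1, 0), (0.1, 0.1), (0, 0.2), (0.2, 0)]
majority = [(10, i) for i in range(10)]
toy_points = cluster + majority + [(10, 4.5)]
toy_labels = [1] * 6 + [0] * 10 + [1]
print(label_minority_examples(toy_points, toy_labels))
```

Counting the tags gives the per-type distribution of the minority class, which is the factor the study reports as most critical.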
Affiliation(s)
- Szymon Wojciechowski: Institute of Computing Science, Poznan University of Technology, Piotrowo 2, 60-965 Poznan, Poland
- Szymon Wilk: Institute of Computing Science, Poznan University of Technology, Piotrowo 2, 60-965 Poznan, Poland
|
109
|
Abstract
Understanding epigenetic processes holds immense promise for medical applications. Advances in Machine Learning (ML) are critical to realize this promise. Previous studies used epigenetic data sets associated with the germline transmission of epigenetic transgenerational inheritance of disease and novel ML approaches to predict genome-wide locations of critical epimutations. A combination of Active Learning (ACL) and Imbalanced Class Learning (ICL) was used to address past problems with ML to develop a more efficient feature selection process and address the imbalance problem in all genomic data sets. The power of this novel ML approach and our ability to predict epigenetic phenomena and associated disease is suggested. The current approach requires extensive computation of features over the genome. A promising new approach is to introduce Deep Learning (DL) for the generation and simultaneous computation of novel genomic features tuned to the classification task. This approach can be used with any genomic or biological data set applied to medicine. The application of molecular epigenetic data in advanced machine learning analysis to medicine is the focus of this review.
Affiliation(s)
- Lawrence B Holder: School of Electrical Engineering and Computer Science, Washington State University, Pullman, WA, USA
- M Muksitul Haque: School of Electrical Engineering and Computer Science, Washington State University, Pullman, WA, USA; Center for Reproductive Biology, School of Biological Sciences, Washington State University, Pullman, WA, USA
- Michael K Skinner: Center for Reproductive Biology, School of Biological Sciences, Washington State University, Pullman, WA, USA
|
110
|
Cost-Effective Class-Imbalance Aware CNN for Vehicle Localization and Categorization in High Resolution Aerial Images. REMOTE SENSING 2017. [DOI: 10.3390/rs9050494] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Indexed: 11/16/2022]
|
111
|
Mathews LM, Seetha H. On Improving the Classification of Imbalanced Data. CYBERNETICS AND INFORMATION TECHNOLOGIES 2017. [DOI: 10.1515/cait-2017-0004] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Indexed: 11/15/2022]
Abstract
Mining of imbalanced data is a challenging task due to its complex inherent characteristics. Conventional classifiers such as the nearest neighbor are severely biased towards the majority class, as minority class data are under-represented and outnumbered. This paper focuses on building an improved Nearest Neighbor Classifier for two-class imbalanced data. Three oversampling techniques are presented for generating artificial instances of the minority class in order to balance the distribution among the classes. Experimental results showed that the proposed methods outperformed the conventional classifier.
Affiliation(s)
- Lincy Meera Mathews: School of Information Technology and Engineering, VIT University, Vellore, Tamil Nadu, India
- Hari Seetha: School of Computing Science & Engineering, VIT University, Vellore, Tamil Nadu, India
|
112
|
|
113
|
Bach M, Werner A, Żywiec J, Pluskiewicz W. The study of under- and over-sampling methods’ utility in analysis of highly imbalanced data on osteoporosis. Inf Sci (N Y) 2017. [DOI: 10.1016/j.ins.2016.09.038] [Citation(s) in RCA: 43] [Impact Index Per Article: 5.4] [Indexed: 10/21/2022]
|
114
|
|
115
|
|
116
|
Moniz N, Branco P, Torgo L. Resampling strategies for imbalanced time series forecasting. INTERNATIONAL JOURNAL OF DATA SCIENCE AND ANALYTICS 2017. [DOI: 10.1007/s41060-017-0044-3] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.1] [Indexed: 11/29/2022]
|
117
|
Multi-class and feature selection extensions of Roughly Balanced Bagging for imbalanced data. J Intell Inf Syst 2017. [DOI: 10.1007/s10844-017-0446-7] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.9] [Indexed: 10/20/2022]
|
118
|
A Novel Ensemble Method for Imbalanced Data Learning: Bagging of Extrapolation-SMOTE SVM. COMPUTATIONAL INTELLIGENCE AND NEUROSCIENCE 2017; 2017:1827016. [PMID: 28250765 PMCID: PMC5304315 DOI: 10.1155/2017/1827016] [Citation(s) in RCA: 68] [Impact Index Per Article: 8.5] [Received: 12/04/2016] [Revised: 12/23/2016] [Accepted: 12/28/2016] [Indexed: 11/17/2022]
Abstract
Class imbalance is ubiquitous in real life and has attracted much interest from various domains. Learning directly from an imbalanced dataset may yield unsatisfying results, overfocusing on the accuracy of identification and deriving a suboptimal model. Various methodologies have been developed to tackle this problem, including sampling, cost-sensitive, and hybrid ones. However, the samples near the decision boundary, which contain more discriminative information, should be valued, and the skew of the boundary can be corrected by constructing synthetic samples. Inspired by the truth and sense of geometry, we designed a new synthetic minority oversampling technique that incorporates the borderline information. Moreover, an ensemble model tends to capture a more complicated and robust decision boundary in practice. Taking these factors into consideration, a novel ensemble method, called Bagging of Extrapolation Borderline-SMOTE SVM (BEBS), is proposed for dealing with imbalanced data learning (IDL) problems. Experiments on open access datasets showed significantly superior performance using our model, and a persuasive and intuitive explanation behind the method was illustrated. As far as we know, this is the first model combining an ensemble of SVMs with borderline information for solving such problems.
|
119
|
Sotiropoulos DN, Tsihrintzis GA. Artificial Immune System-Based Classification in Extremely Imbalanced Classification Problems. INT J ARTIF INTELL T 2017. [DOI: 10.1142/s0218213017500099] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 11/18/2022]
Abstract
This paper focuses on a special category of machine learning problems arising in cases where the set of available training instances is significantly biased towards a particular class of patterns. Our work addresses the so-called Class Imbalance Problem through the utilization of an Artificial Immune System (AIS)-based classification algorithm which encodes the inherent ability of the Adaptive Immune System to mediate the exceptionally imbalanced “self” / “non-self” discrimination process. From a computational point of view, this process constitutes an extremely imbalanced pattern classification task since the vast majority of molecular patterns pertain to the “non-self” space. Our work focuses on investigating the effect of the class imbalance problem on the AIS-based classification algorithm by assessing its relative ability to deal with extremely skewed datasets when compared against two state-of-the-art machine learning paradigms, namely Support Vector Machines (SVMs) and Multi-Layer Perceptrons (MLPs). To this end, we conducted a series of experiments on a music-related dataset where a small fraction of positive samples was to be recognized against the vast volume of negative samples. The results obtained indicate that the utilized bio-inspired classifier outperforms SVMs in detecting patterns from the minority class, while its performance on the same task is close to that exhibited by MLPs. Our findings suggest that the AIS-based classifier relies on its intrinsic resampling and class-balancing functionality in order to address the class imbalance problem.
|
120
|
Brzezinski D, Stefanowski J. Prequential AUC: properties of the area under the ROC curve for data streams with concept drift. Knowl Inf Syst 2017. [DOI: 10.1007/s10115-017-1022-8] [Citation(s) in RCA: 58] [Impact Index Per Article: 7.3] [Indexed: 11/28/2022]
|
121
|
Li J, Fong S, Sung Y, Cho K, Wong R, Wong KKL. Adaptive swarm cluster-based dynamic multi-objective synthetic minority oversampling technique algorithm for tackling binary imbalanced datasets in biomedical data classification. BioData Min 2016; 9:37. [PMID: 27980678 PMCID: PMC5131504 DOI: 10.1186/s13040-016-0117-1] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.1] [Received: 05/13/2016] [Accepted: 11/21/2016] [Indexed: 11/27/2022] Open
Abstract
Background: An imbalanced dataset is defined as a training dataset that has imbalanced proportions of data in both interesting and uninteresting classes. Often in biomedical applications, samples from the stimulating class are rare in a population, such as medical anomalies, positive clinical tests, and particular diseases. Because the target samples in the original dataset are few in number, the induction of a classification model over such training data leads to poor prediction performance due to insufficient training from the minority class. Results: In this paper, we use a novel class-balancing method named adaptive swarm cluster-based dynamic multi-objective synthetic minority oversampling technique (ASCB_DmSMOTE) to solve this imbalanced dataset problem, which is common in biomedical applications. The proposed method combines under-sampling and over-sampling into a swarm optimisation algorithm. It adaptively selects suitable parameters for the rebalancing algorithm to find the best solution. Compared with the other versions of the SMOTE algorithm, significant improvements, which include higher accuracy and credibility, are observed with ASCB_DmSMOTE. Conclusions: Our proposed method combines two rebalancing techniques. It reasonably re-allocates the majority class in the details and dynamically optimises the two parameters of SMOTE to synthesise a reasonable scale of minority class for each clustered sub-imbalanced dataset. The proposed method ultimately outperforms other conventional methods and attains higher credibility with even greater accuracy of the classification model.
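For orientation, plain SMOTE creates each synthetic minority example by interpolating between a minority example and one of its nearest minority neighbours. The sketch below shows only that base step, not ASCB_DmSMOTE itself. It assumes the two tunable parameters are the number of synthetic examples (`n_new`) and the neighbourhood size (`k`), which the abstract does not name; the function name is hypothetical.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points on segments between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen base example.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> mate
        synthetic.append(tuple(b + gap * (m - b) for b, m in zip(base, mate)))
    return synthetic


# Hypothetical toy minority class: the four corners of the unit square.
toy_minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote(toy_minority, n_new=10, k=2)
print(new_points)
```

A parameter search such as the swarm optimisation described in the abstract would wrap a routine like this and score each (`n_new`, `k`) setting by the resulting classifier's performance.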
Affiliation(s)
- Jinyan Li: Department of Computer and Information Science, University of Macau, Taipa, Macau, S.A.R. China
- Simon Fong: Department of Computer and Information Science, University of Macau, Taipa, Macau, S.A.R. China
- Yunsick Sung: Computer Engineering Division, Keimyung University, Daegu, South Korea
- Kyungeun Cho: Department of Multimedia Engineering, College of Engineering, Dongguk University, Dongdaeipgu, South Korea
- Raymond Wong: School of Computer Science and Engineering, University of New South Wales, Sydney, NSW 2000, Australia
- Kelvin K L Wong: Centre for Biomedical Engineering, School of Electrical & Electronic Engineering, University of Adelaide, Adelaide, Australia; School of Medicine, Western Sydney University, Campbelltown, Sydney, Australia
|
122
|
Lim YW, Park YB, Park YJ. A longitudinal study of iris parameters and their relationships with temperament characteristics. Eur J Integr Med 2016. [DOI: 10.1016/j.eujim.2016.09.006] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 10/20/2022]
|
123
|
Nekooeimehr I, Lai-Yuen SK. Cluster-based Weighted Oversampling for Ordinal Regression (CWOS-Ord). Neurocomputing 2016. [DOI: 10.1016/j.neucom.2016.08.071] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.9] [Indexed: 10/21/2022]
|
124
|
Guo H, Liu H, Wu C, Zhi W, Xiao Y, She W. Logistic discrimination based on G-mean and F-measure for imbalanced problem. JOURNAL OF INTELLIGENT & FUZZY SYSTEMS 2016. [DOI: 10.3233/ifs-162150] [Citation(s) in RCA: 38] [Impact Index Per Article: 4.2] [Indexed: 11/15/2022]
Affiliation(s)
- Huaping Guo: School of Computer and Information Technology, Xinyang Normal University, Xingyang, China
- Hongbing Liu: School of Computer and Information Technology, Xinyang Normal University, Xingyang, China
- Changan Wu: School of Computer and Information Technology, Xinyang Normal University, Xingyang, China
- Weimei Zhi: School of Information Engineering, Zhengzhou University, Zhengzhou, China
- Yan Xiao: School of Information Engineering, Zhengzhou University, Zhengzhou, China
- Wei She: Software Technology School, Zhengzhou University, Zhengzhou, China
|
125
|
Wang Y, Li X, Ding X. Probabilistic framework of visual anomaly detection for unbalanced data. Neurocomputing 2016. [DOI: 10.1016/j.neucom.2016.03.038] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.8] [Indexed: 10/21/2022]
|
126
|
A Selective Dynamic Sampling Back-Propagation Approach for Handling the Two-Class Imbalance Problem. APPLIED SCIENCES-BASEL 2016. [DOI: 10.3390/app6070200] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.0] [Indexed: 12/17/2022]
|
127
|
An empirical study of a hybrid imbalanced-class DT-RST classification procedure to elucidate therapeutic effects in uremia patients. Med Biol Eng Comput 2016; 54:983-1001. [DOI: 10.1007/s11517-016-1482-0] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Received: 02/25/2014] [Accepted: 03/04/2016] [Indexed: 12/13/2022]
|
128
|
Imbalanced Learning Based on Logistic Discrimination. COMPUTATIONAL INTELLIGENCE AND NEUROSCIENCE 2016; 2016:5423204. [PMID: 26880877 PMCID: PMC4736373 DOI: 10.1155/2016/5423204] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 07/10/2015] [Revised: 10/23/2015] [Accepted: 10/26/2015] [Indexed: 11/29/2022]
Abstract
In recent years, the imbalanced learning problem has attracted increasing attention from both academia and industry; the problem concerns the performance of learning algorithms in the presence of data with severe class distribution skews. In this paper, we apply the well-known statistical model logistic discrimination to this problem and propose a novel method to improve its performance. To fully consider the class imbalance, we design a new cost function that takes into account the accuracies of both the positive class and the negative class as well as the precision of the positive class. Unlike traditional logistic discrimination, the proposed method learns its parameters by maximizing the proposed cost function. Experimental results show that, compared with other state-of-the-art methods, the proposed one shows significantly better performance on measures of recall, g-mean, f-measure, AUC, and accuracy.
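The quantities this abstract refers to (per-class accuracy, precision of the positive class, g-mean, f-measure) all follow from the four cells of a binary confusion matrix. The sketch below computes the standard definitions; it is not the paper's cost function, whose exact form the abstract does not give, and the function name is hypothetical.

```python
import math


def imbalance_metrics(tp, fn, fp, tn):
    """Standard confusion-matrix metrics used for imbalanced classification."""
    recall = tp / (tp + fn)            # accuracy on the positive class
    specificity = tn / (tn + fp)       # accuracy on the negative class
    precision = tp / (tp + fp) if tp + fp else 0.0
    g_mean = math.sqrt(recall * specificity)
    f_measure = (
        2 * precision * recall / (precision + recall) if precision + recall else 0.0
    )
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    return {
        "recall": recall,
        "specificity": specificity,
        "precision": precision,
        "g_mean": g_mean,
        "f_measure": f_measure,
        "accuracy": accuracy,
    }


# Hypothetical classifier that labels everything negative on a 10:90 dataset.
print(imbalance_metrics(tp=0, fn=10, fp=0, tn=90))
```

In the toy call, accuracy is 0.9 while g-mean and f-measure are both 0, which illustrates why plain accuracy is a poor objective under class skew.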
|
129
|
Nath A, Subbiah K. Unsupervised learning assisted robust prediction of bioluminescent proteins. Comput Biol Med 2016; 68:27-36. [DOI: 10.1016/j.compbiomed.2015.10.013] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.1] [Received: 04/13/2015] [Revised: 09/27/2015] [Accepted: 10/28/2015] [Indexed: 10/22/2022]
|
130
|
Fernández A, Carmona CJ, del Jesus MJ, Herrera F. A View on Fuzzy Systems for Big Data: Progress and Opportunities. INT J COMPUT INT SYS 2016. [DOI: 10.1080/18756891.2016.1180820] [Citation(s) in RCA: 28] [Impact Index Per Article: 3.1] [Indexed: 10/21/2022] Open
|
131
|
Dealing with Data Difficulty Factors While Learning from Imbalanced Data. STUDIES IN COMPUTATIONAL INTELLIGENCE 2016. [DOI: 10.1007/978-3-319-18781-5_17] [Citation(s) in RCA: 47] [Impact Index Per Article: 5.2] [Indexed: 12/12/2022]
|
132
|
Exploring different strategies for imbalanced ADME data problem: case study on Caco-2 permeability modeling. Mol Divers 2015; 20:93-109. [DOI: 10.1007/s11030-015-9649-4] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.8] [Received: 05/23/2015] [Accepted: 11/13/2015] [Indexed: 10/22/2022]
|
133
|
Díez-Pastor JF, Rodríguez JJ, García-Osorio CI, Kuncheva LI. Diversity techniques improve the performance of the best imbalance learning ensembles. Inf Sci (N Y) 2015. [DOI: 10.1016/j.ins.2015.07.025] [Citation(s) in RCA: 120] [Impact Index Per Article: 12.0] [Indexed: 11/17/2022]
|
134
|
|
135
|
Nath A, Subbiah K. Maximizing lipocalin prediction through balanced and diversified training set and decision fusion. Comput Biol Chem 2015; 59 Pt A:101-10. [PMID: 26433483 DOI: 10.1016/j.compbiolchem.2015.09.011] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.2] [Received: 12/02/2014] [Revised: 09/08/2015] [Accepted: 09/23/2015] [Indexed: 01/17/2023]
Abstract
Lipocalins are short in sequence length and perform several important biological functions. These proteins have less than 20% sequence similarity among paralogs. Experimentally identifying them is an expensive and time-consuming process. Computational methods based on sequence similarity for allocating putative members to this family also remain elusive because of the low sequence similarity among the members of this family. Consequently, machine learning methods become a viable alternative for their prediction, using the underlying sequence/structure-derived features as input. Ideally, any machine learning based prediction method must be trained with all possible variations in the input feature vector (all the sub-class input patterns) to achieve perfect learning. Near-perfect learning can be achieved by training the model with diverse types of input instances belonging to different regions of the entire input space. Furthermore, prediction performance can be improved by balancing the training set, as imbalanced data sets tend to bias prediction towards the majority class and its sub-classes. This paper aims to achieve (i) high generalization ability without any classification bias through diversified and balanced training sets and (ii) enhanced prediction accuracy by combining the results of individual classifiers with an appropriate fusion scheme. Instead of creating the training set randomly, we first used the unsupervised Kmeans clustering algorithm to create diversified clusters of input patterns and then created the diversified and balanced training set by selecting an equal number of patterns from each of these clusters. Finally, a probability-based classifier fusion scheme was applied to the boosted random forest algorithm (which produced greater sensitivity) and the K nearest neighbour algorithm (which produced greater specificity) to achieve better predictive performance than that of the individual base classifiers. The performance of the models trained on the Kmeans-preprocessed training set is far better than that of models trained on randomly generated training sets. The proposed method achieved a sensitivity of 90.6%, specificity of 91.4% and accuracy of 91.0% on the first test set and a sensitivity of 92.9%, specificity of 96.2% and accuracy of 94.7% on the second blind test set. These results establish that diversifying the training set improves the performance of predictive models through superior generalization ability and that balancing the training set improves prediction accuracy. For smaller data sets, unsupervised Kmeans-based sampling can be a more effective technique for increasing generalization than the usual random splitting method.
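The training-set construction described in this abstract, clustering the input patterns with K-means and then drawing an equal number of patterns from each cluster, can be sketched as follows. This is a minimal illustration with a plain Lloyd's-algorithm K-means; the function names, the number of clusters and the per-cluster quota are assumptions, not values from the paper.

```python
import math
import random


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's algorithm; returns one cluster index per point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid for every point.
        assign = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return assign


def balanced_sample(points, assign, per_cluster, seed=0):
    """Draw the same number of points from every cluster (fewer if a cluster is small)."""
    rng = random.Random(seed)
    chosen = []
    for c in sorted(set(assign)):
        members = [p for p, a in zip(points, assign) if a == c]
        chosen.extend(rng.sample(members, min(per_cluster, len(members))))
    return chosen


# Hypothetical toy data: a large group near the origin and a small distant group.
toy = [(0, 0), (0.1, 0), (0, 0.1), (0.1, 0.1), (0.2, 0), (0, 0.2),
       (10, 10), (10.1, 10), (10, 10.1)]
toy_assign = kmeans(toy, k=2)
training_set = balanced_sample(toy, toy_assign, per_cluster=2)
print(training_set)
```

Random splitting of the toy data would favour the larger group two to one, whereas the cluster-wise draw represents both regions of the input space equally.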
Affiliation(s)
- Abhigyan Nath: Department of Computer Science, Banaras Hindu University, Varanasi 221005, India
- Karthikeyan Subbiah: Department of Computer Science, Banaras Hindu University, Varanasi 221005, India
|
136
|
Jacques J, Taillard J, Delerue D, Dhaenens C, Jourdan L. Conception of a dominance-based multi-objective local search in the context of classification rule mining in large and imbalanced data sets. Appl Soft Comput 2015. [DOI: 10.1016/j.asoc.2015.06.002] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.4] [Indexed: 11/26/2022]
|
137
|
Díez-Pastor JF, Rodríguez JJ, García-Osorio C, Kuncheva LI. Random Balance: Ensembles of variable priors classifiers for imbalanced data. Knowl Based Syst 2015. [DOI: 10.1016/j.knosys.2015.04.022] [Citation(s) in RCA: 155] [Impact Index Per Article: 15.5] [Indexed: 11/27/2022]
|
138
|
Napierala K, Stefanowski J. Types of minority class examples and their influence on learning classifiers from imbalanced data. J Intell Inf Syst 2015. [DOI: 10.1007/s10844-015-0368-1] [Citation(s) in RCA: 69] [Impact Index Per Article: 6.9] [Indexed: 10/23/2022]
|
139
|
|
140
|
Tan SC, Watada J, Ibrahim Z, Khalid M. Evolutionary fuzzy ARTMAP neural networks for classification of semiconductor defects. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2015; 26:933-950. [PMID: 25014967 DOI: 10.1109/tnnls.2014.2329097] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.0] [Indexed: 06/03/2023]
Abstract
Wafer defect detection using an intelligent system is an approach to quality improvement in semiconductor manufacturing that aims to enhance process stability, increase production capacity, and improve yields. Occasionally, only a few records that indicate defective units are available, and they are classified as a minority group in a large database. Such a situation leads to an imbalanced data set problem, which poses a great challenge for machine-learning techniques in obtaining an effective solution. In addition, the database may comprise overlapping samples of different classes. This paper introduces two models of evolutionary fuzzy ARTMAP (FAM) neural networks to deal with imbalanced data set problems in semiconductor manufacturing operations. In particular, both the FAM models and hybrid genetic algorithms are integrated in the proposed evolutionary artificial neural networks (EANNs) to classify an imbalanced data set. In addition, one of the proposed EANNs incorporates a facility to learn overlapping samples of different classes from the imbalanced data environment. The classification results of the proposed evolutionary FAM neural networks are presented, compared, and analyzed using several classification metrics. The outcomes positively indicate the effectiveness of the proposed networks in handling classification problems with imbalanced data sets.
|
141
|
Immune centroids oversampling method for binary classification. COMPUTATIONAL INTELLIGENCE AND NEUROSCIENCE 2015; 2015:109806. [PMID: 25834570 PMCID: PMC4365371 DOI: 10.1155/2015/109806] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 11/11/2014] [Accepted: 02/14/2015] [Indexed: 11/17/2022]
Abstract
To improve the classification performance of imbalanced learning, a novel oversampling method, the immune centroids oversampling technique (ICOTE), based on an immune network, is proposed. ICOTE generates a set of immune centroids to broaden the decision regions of the minority class space. The representative immune centroids are regarded as synthetic examples in order to resolve the imbalance problem. We utilize an artificial immune network to generate synthetic examples on clusters with high data densities, which addresses a weakness of the synthetic minority oversampling technique (SMOTE), namely that it does not reflect groups of training examples. We further improve the performance of ICOTE by integrating ENN with ICOTE, that is, ICOTE + ENN. ENN removes the majority class examples that invade the minority class space, so ICOTE + ENN favors the separation of both classes. Our comprehensive experimental results show that the two proposed oversampling methods can achieve better performance than the renowned resampling methods.
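The cleaning step attributed to ENN in this abstract can be illustrated with the usual edited-nearest-neighbours rule: drop an example when the majority vote of its k nearest neighbours disagrees with its own label. The sketch below applies the rule to majority-class examples only and assumes k = 3; both choices, and the function name, are assumptions rather than details from the paper.

```python
import math
from collections import Counter


def edited_nearest_neighbours(points, labels, majority=0, k=3):
    """Return indices to keep after removing majority examples outvoted by their neighbours."""
    keep = []
    for i, (p, y) in enumerate(zip(points, labels)):
        if y == majority:
            nearest = sorted(
                (j for j in range(len(points)) if j != i),
                key=lambda j: math.dist(p, points[j]),
            )[:k]
            votes = Counter(labels[j] for j in nearest)
            if votes.most_common(1)[0][0] != y:
                continue  # majority example sits inside minority territory
        keep.append(i)
    return keep


# Hypothetical toy data: a minority square, one majority example inside it,
# and a separate majority square far away.
toy_pts = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
           (10, 10), (10, 11), (11, 10), (11, 11)]
toy_lbl = [1, 1, 1, 1, 0, 0, 0, 0, 0]
print(edited_nearest_neighbours(toy_pts, toy_lbl))
```

Only the majority example at (0.5, 0.5) is removed in the toy call, which matches the abstract's description of discarding majority examples that invade the minority class space.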
|
142
|
|
143
|
|
144
|
SMOTE–IPF: Addressing the noisy and borderline examples problem in imbalanced classification by a re-sampling method with filtering. Inf Sci (N Y) 2015. [DOI: 10.1016/j.ins.2014.08.051] [Citation(s) in RCA: 305] [Impact Index Per Article: 30.5] [Indexed: 11/18/2022]
|
145
|
García V, Sánchez JS, Ochoa Domínguez HJ, Cleofas-Sánchez L. Dissimilarity-Based Learning from Imbalanced Data with Small Disjuncts and Noise. PATTERN RECOGNITION AND IMAGE ANALYSIS 2015. [DOI: 10.1007/978-3-319-19390-8_42] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/23/2022]
|
146
|
Das B, Krishnan NC, Cook DJ. RACOG and wRACOG: Two Probabilistic Oversampling Techniques. IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING 2015; 27:222-234. [PMID: 27041974 PMCID: PMC4814938 DOI: 10.1109/tkde.2014.2324567] [Citation(s) in RCA: 34] [Impact Index Per Article: 3.4] [Indexed: 06/05/2023]
Abstract
As machine learning techniques mature and are used to tackle complex scientific problems, challenges arise such as the imbalanced class distribution problem, where one of the target class labels is under-represented in comparison with the other classes. Existing oversampling approaches for addressing this problem typically do not consider the probability distribution of the minority class while synthetically generating new samples. As a result, the minority class is not well represented, which leads to high misclassification error. We introduce two Gibbs sampling-based oversampling approaches, namely RACOG and wRACOG, to synthetically generate and strategically select new minority class samples. The Gibbs sampler uses the joint probability distribution of the attributes of the data to generate new minority class samples in the form of a Markov chain. While RACOG selects samples from the Markov chain based on a predefined lag, wRACOG selects those samples that have the highest probability of being misclassified by the existing learning model. We validate our approach using five UCI datasets that were carefully modified to exhibit class imbalance and one new application domain dataset with inherent extreme class imbalance. In addition, we compare the classification performance of the proposed methods with three other existing resampling techniques.
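The lag-based selection described for RACOG can be sketched roughly as follows. This is a simplified illustration, not the authors' algorithm: it assumes integer-coded discrete attributes and draws each attribute from a smoothed empirical conditional, which stands in for whatever approximation of the joint distribution the paper uses. The names `gibbs_oversample`, `n_values`, `burn_in`, `lag`, and `alpha` are hypothetical.

```python
import numpy as np

def gibbs_oversample(X_min, n_values, n_new, burn_in=50, lag=5, alpha=0.1, seed=0):
    """Generate n_new synthetic minority rows with a Gibbs sampler.

    X_min    : minority-class rows, integer-coded attributes (assumption)
    n_values : number of distinct values per attribute
    burn_in  : sweeps discarded before any sample is kept
    lag      : keep one state every `lag` sweeps after burn-in
    alpha    : additive smoothing so unseen values keep non-zero probability
    """
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=int)
    n_attrs = X_min.shape[1]
    # Start the chain from a real minority example.
    state = X_min[rng.integers(len(X_min))].copy()
    samples = []
    sweep = 0
    while len(samples) < n_new:
        for i in range(n_attrs):
            others = np.delete(np.arange(n_attrs), i)
            # Rows that agree with the current state on every other attribute.
            match = np.all(X_min[:, others] == state[others], axis=1)
            counts = np.bincount(X_min[match, i], minlength=n_values[i]) + alpha
            # Resample attribute i from its smoothed empirical conditional.
            state[i] = rng.choice(n_values[i], p=counts / counts.sum())
        sweep += 1
        if sweep > burn_in and (sweep - burn_in) % lag == 0:
            samples.append(state.copy())
    return np.array(samples)

# Hypothetical minority set with two binary attributes.
X_minority = [[0, 1], [0, 1], [1, 0], [0, 0]]
synthetic = gibbs_oversample(X_minority, n_values=[2, 2], n_new=10)
```

Keeping only every `lag`-th state reduces the correlation between consecutive samples of the chain; the wRACOG selection rule, which depends on a trained classifier, is not shown.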
|
147
|
Tomar D, Agarwal S. An effective Weighted Multi-class Least Squares Twin Support Vector Machine for Imbalanced data classification. INT J COMPUT INT SYS 2015. [DOI: 10.1080/18756891.2015.1061395] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.1] [Indexed: 10/23/2022]
|
148
|
Zhu F, Wang X, Zhu D, Liu Y. A Supervised Requirement-oriented Patent Classification Scheme Based on the Combination of Metadata and Citation Information. INT J COMPUT INT SYS 2015. [DOI: 10.1080/18756891.2015.1023588] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 10/23/2022]
|
149
|
Cao H, Tan VYF, Pang JZF. A parsimonious mixture of Gaussian trees model for oversampling in imbalanced and multimodal time-series classification. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2014; 25:2226-2239. [PMID: 25420245 DOI: 10.1109/tnnls.2014.2308321] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Indexed: 06/04/2023]
Abstract
We propose a novel framework that uses a parsimonious statistical model, known as a mixture of Gaussian trees, for modeling the possibly multimodal minority class to solve the problem of imbalanced time-series classification. By exploiting the fact that close-by time points are highly correlated due to the smoothness of the time-series, our model significantly reduces the number of covariance parameters to be estimated from $O(d^2)$ to $O(Ld)$, where $L$ is the number of mixture components and $d$ is the dimensionality. Thus, our model is particularly effective for modeling high-dimensional time-series with a limited number of instances in the minority positive class. In addition, the computational complexity for learning the model is only of the order $O(L n_+ d^2)$, where $n_+$ is the number of positively labeled samples. We conduct extensive classification experiments based on several well-known time-series data sets (both single- and multimodal) by first randomly generating synthetic instances from our learned mixture model to correct the imbalance. We then compare our results with several state-of-the-art oversampling techniques, and the results demonstrate that when our proposed model is used in oversampling, the same support vector machine classifier achieves much better classification accuracy across the range of data sets. In fact, the proposed method achieves the best average performance on 30 out of 36 multimodal data sets according to the F-value metric. Our results are also highly competitive compared with non-oversampling-based classifiers for dealing with imbalanced time-series data sets.
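The reduction in covariance parameters stated in the abstract can be made concrete with a small count. The sketch below assumes that each tree-structured Gaussian component keeps d variances plus d - 1 edge covariances (one per tree edge); this exact count is an assumption for illustration, and the paper's own accounting may differ in constants. Both function names are hypothetical.

```python
def full_covariance_params(d):
    """Free parameters of one unrestricted d x d covariance matrix."""
    return d * (d + 1) // 2

def tree_mixture_covariance_params(d, n_components):
    """Assumed count for a mixture of Gaussian trees:
    d variances plus d - 1 edge covariances per component."""
    return n_components * (2 * d - 1)

# Hypothetical setting: time series of length 100, three mixture components.
d, L = 100, 3
dense_count = full_covariance_params(d)            # grows quadratically in d
tree_count = tree_mixture_covariance_params(d, L)  # grows linearly in d
```

Under these assumptions the tree mixture needs far fewer covariance parameters than a single full covariance matrix once d is large relative to L, which is the regime of a small minority class with long series.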
|
150
|
|