1. Wang H, Nakajima T, Shikano K, Nomura Y, Nakaguchi T. Diagnosis of Lung Cancer Using Endobronchial Ultrasonography Image Based on Multi-Scale Image and Multi-Feature Fusion Framework. Tomography 2025; 11:24. [PMID: 40137564] [PMCID: PMC11945964] [DOI: 10.3390/tomography11030024]
Abstract
Lung cancer is the leading cause of cancer-related deaths globally and ranks among the most common cancer types. Given its low overall five-year survival rate, early diagnosis and timely treatment are essential to improving patient outcomes. In recent years, advances in computer technology have enabled artificial intelligence to make groundbreaking progress in imaging-based lung cancer diagnosis. The primary aim of this study is to develop a computer-aided diagnosis (CAD) system for lung cancer using endobronchial ultrasonography (EBUS) images and deep learning algorithms to facilitate early detection and improve patient survival rates. We propose M3-Net, which is a multi-branch framework that integrates multiple features through an attention-based mechanism, enhancing diagnostic performance by providing more comprehensive information for lung cancer assessment. The framework was validated on a dataset of 95 patient cases, including 13 benign and 82 malignant cases. The dataset comprises 1140 EBUS images, with 540 images used for training, and 300 images each for the validation and test sets. The evaluation yielded the following results: accuracy of 0.76, F1-score of 0.75, AUC of 0.83, PPV of 0.80, NPV of 0.75, sensitivity of 0.72, and specificity of 0.80. These findings indicate that the proposed attention-based multi-feature fusion framework holds significant potential in assisting with lung cancer diagnosis.
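A purely illustrative, hypothetical sketch of the attention-based fusion idea (this is not the authors' M3-Net; the module name, layer sizes and the random branch features below are invented):

import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Fuse per-branch feature vectors with learned attention weights."""
    def __init__(self, feat_dim: int, n_classes: int = 2):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)           # one relevance score per branch
        self.classifier = nn.Linear(feat_dim, n_classes)

    def forward(self, branch_feats):                  # (batch, n_branches, feat_dim)
        weights = torch.softmax(self.score(branch_feats), dim=1)
        fused = (weights * branch_feats).sum(dim=1)   # attention-weighted sum of branches
        return self.classifier(fused)

# Example: three feature branches (e.g., multi-scale backbones), 256-d each.
feats = torch.randn(8, 3, 256)
print(AttentionFusion(256)(feats).shape)              # torch.Size([8, 2])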
Affiliation(s)
- Huitao Wang, Department of Medical Engineering, Graduate School of Science and Engineering, Chiba University, Chiba 263-8522, Japan
- Takahiro Nakajima, Department of General Thoracic Surgery, Dokkyo Medical University, Mibu 321-0293, Japan
- Kohei Shikano, Department of Respirology, Graduate School of Medicine, Chiba University, Chiba 260-8670, Japan
- Yukihiro Nomura, Center for Frontier Medical Engineering, Chiba University, Chiba 263-8522, Japan
- Toshiya Nakaguchi, Center for Frontier Medical Engineering, Chiba University, Chiba 263-8522, Japan
2. Thölke P, Mantilla-Ramos YJ, Abdelhedi H, Maschke C, Dehgan A, Harel Y, Kemtur A, Mekki Berrada L, Sahraoui M, Young T, Bellemare Pépin A, El Khantour C, Landry M, Pascarella A, Hadid V, Combrisson E, O'Byrne J, Jerbi K. Class imbalance should not throw you off balance: Choosing the right classifiers and performance metrics for brain decoding with imbalanced data. Neuroimage 2023:120253. [PMID: 37385392] [DOI: 10.1016/j.neuroimage.2023.120253]
Abstract
Machine learning (ML) is increasingly used in cognitive, computational and clinical neuroscience. The reliable and efficient application of ML requires a sound understanding of its subtleties and limitations. Training ML models on datasets with imbalanced classes is a particularly common problem, and it can have severe consequences if not adequately addressed. With the neuroscience ML user in mind, this paper provides a didactic assessment of the class imbalance problem and illustrates its impact through systematic manipulation of data imbalance ratios in (i) simulated data and (ii) brain data recorded with electroencephalography (EEG), magnetoencephalography (MEG) and functional magnetic resonance imaging (fMRI). Our results illustrate how the widely used Accuracy (Acc) metric, which measures the overall proportion of successful predictions, yields misleadingly high performances as class imbalance increases. Because Acc weights the per-class ratios of correct predictions proportionally to class size, it largely disregards the performance on the minority class. A binary classification model that learns to systematically vote for the majority class will yield an artificially high decoding accuracy that directly reflects the imbalance between the two classes, rather than any genuine generalizable ability to discriminate between them. We show that other evaluation metrics, such as the Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC) and the less common Balanced Accuracy (BAcc) metric, defined as the arithmetic mean of sensitivity and specificity, provide more reliable performance evaluations for imbalanced data. Our findings also highlight the robustness of Random Forest (RF), and the benefits of using stratified cross-validation and hyperparameter optimization to tackle data imbalance. Critically, for neuroscience ML applications that seek to minimize overall classification error, we recommend the routine use of BAcc, which in the specific case of balanced data is equivalent to standard Acc, and which readily extends to multi-class settings. Importantly, we present a list of recommendations for dealing with imbalanced data, as well as open-source code to allow the neuroscience community to replicate and extend our observations and explore alternative approaches to coping with imbalanced data.
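A quick, self-contained illustration of the accuracy-versus-balanced-accuracy point (random placeholder labels, not the paper's released code): a classifier that always votes for the majority class on a roughly 90/10 split looks strong under Acc but not under BAcc or AUC.

import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, roc_auc_score

rng = np.random.default_rng(0)
y_true = (rng.random(1000) < 0.1).astype(int)    # ~10% minority class
y_pred = np.zeros_like(y_true)                   # always predict the majority class
scores = np.full(len(y_true), 0.5)               # uninformative decision scores

print(accuracy_score(y_true, y_pred))            # ~0.9, inflated by the imbalance
print(balanced_accuracy_score(y_true, y_pred))   # 0.5, i.e. chance-level decoding
print(roc_auc_score(y_true, scores))             # 0.5 for uninformative scores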
Affiliation(s)
- Philipp Thölke, Cognitive and Computational Neuroscience Laboratory (CoCo Lab), University of Montreal, 2900, boul. Edouard-Montpetit, Montreal, H3T 1J4, Quebec, Canada; Institute of Cognitive Science, Osnabrück University, Neuer Graben 29/Schloss, Osnabrück, 49074, Lower Saxony, Germany
- Yorguin-Jose Mantilla-Ramos, Cognitive and Computational Neuroscience Laboratory (CoCo Lab), University of Montreal, 2900, boul. Edouard-Montpetit, Montreal, H3T 1J4, Quebec, Canada; Neuropsychology and Behavior Group (GRUNECO), Faculty of Medicine, Universidad de Antioquia, 53-108, Medellin, Aranjuez, Medellin, 050010, Colombia
- Hamza Abdelhedi, Cognitive and Computational Neuroscience Laboratory (CoCo Lab), University of Montreal, 2900, boul. Edouard-Montpetit, Montreal, H3T 1J4, Quebec, Canada
- Charlotte Maschke, Cognitive and Computational Neuroscience Laboratory (CoCo Lab), University of Montreal, 2900, boul. Edouard-Montpetit, Montreal, H3T 1J4, Quebec, Canada; Integrated Program in Neuroscience, McGill University, 1033 Pine Ave, Montreal, H3A 0G4, Canada
- Arthur Dehgan, Cognitive and Computational Neuroscience Laboratory (CoCo Lab), University of Montreal, 2900, boul. Edouard-Montpetit, Montreal, H3T 1J4, Quebec, Canada; Institut de Neurosciences de la Timone (INT), CNRS, Aix Marseille University, Marseille, 13005, France
- Yann Harel, Cognitive and Computational Neuroscience Laboratory (CoCo Lab), University of Montreal, 2900, boul. Edouard-Montpetit, Montreal, H3T 1J4, Quebec, Canada
- Anirudha Kemtur, Cognitive and Computational Neuroscience Laboratory (CoCo Lab), University of Montreal, 2900, boul. Edouard-Montpetit, Montreal, H3T 1J4, Quebec, Canada
- Loubna Mekki Berrada, Cognitive and Computational Neuroscience Laboratory (CoCo Lab), University of Montreal, 2900, boul. Edouard-Montpetit, Montreal, H3T 1J4, Quebec, Canada
- Myriam Sahraoui, Cognitive and Computational Neuroscience Laboratory (CoCo Lab), University of Montreal, 2900, boul. Edouard-Montpetit, Montreal, H3T 1J4, Quebec, Canada
- Tammy Young, Cognitive and Computational Neuroscience Laboratory (CoCo Lab), University of Montreal, 2900, boul. Edouard-Montpetit, Montreal, H3T 1J4, Quebec, Canada; Department of Computing Science, University of Alberta, 116 St & 85 Ave, Edmonton, T6G 2R3, AB, Canada
- Antoine Bellemare Pépin, Cognitive and Computational Neuroscience Laboratory (CoCo Lab), University of Montreal, 2900, boul. Edouard-Montpetit, Montreal, H3T 1J4, Quebec, Canada; Department of Music, Concordia University, 1550 De Maisonneuve Blvd. W., Montreal, H3H 1G8, QC, Canada
- Clara El Khantour, Cognitive and Computational Neuroscience Laboratory (CoCo Lab), University of Montreal, 2900, boul. Edouard-Montpetit, Montreal, H3T 1J4, Quebec, Canada
- Mathieu Landry, Cognitive and Computational Neuroscience Laboratory (CoCo Lab), University of Montreal, 2900, boul. Edouard-Montpetit, Montreal, H3T 1J4, Quebec, Canada
- Annalisa Pascarella, Institute for Applied Mathematics Mauro Picone, National Research Council, Roma, Italy
- Vanessa Hadid, Cognitive and Computational Neuroscience Laboratory (CoCo Lab), University of Montreal, 2900, boul. Edouard-Montpetit, Montreal, H3T 1J4, Quebec, Canada
- Etienne Combrisson, Institut de Neurosciences de la Timone (INT), CNRS, Aix Marseille University, Marseille, 13005, France
- Jordan O'Byrne, Cognitive and Computational Neuroscience Laboratory (CoCo Lab), University of Montreal, 2900, boul. Edouard-Montpetit, Montreal, H3T 1J4, Quebec, Canada
- Karim Jerbi, Cognitive and Computational Neuroscience Laboratory (CoCo Lab), University of Montreal, 2900, boul. Edouard-Montpetit, Montreal, H3T 1J4, Quebec, Canada; Mila (Quebec Machine Learning Institute), 6666 Rue Saint-Urbain, Montreal, H2S 3H1, QC, Canada; UNIQUE Centre (Quebec Neuro-AI Research Centre), 3744 rue Jean-Brillant, Montreal, H3T 1P1, QC, Canada
3. A cross-validation framework to find a better state than the balanced one for oversampling in imbalanced classification. INT J MACH LEARN CYB 2023. [DOI: 10.1007/s13042-023-01804-x]
4. Reshma IA, Franchet C, Gaspard M, Ionescu RT, Mothe J, Cussat-Blanc S, Luga H, Brousset P. Finding a Suitable Class Distribution for Building Histological Images Datasets Used in Deep Model Training-The Case of Cancer Detection. J Digit Imaging 2022; 35:1326-1349. [PMID: 35445341] [PMCID: PMC9582112] [DOI: 10.1007/s10278-022-00618-7]
Abstract
The class distribution of a training dataset is an important factor which influences the performance of a deep learning-based system. Understanding the optimal class distribution is therefore crucial when building a new training set, which may be costly to annotate. This is the case for histological images used in cancer diagnosis, where image annotation requires domain experts. In this paper, we tackle the problem of finding the optimal class distribution of a training set to be able to train an optimal model that detects cancer in histological images. We formulate several hypotheses which are then tested in scores of experiments with hundreds of trials. The experiments have been designed to account for both segmentation and classification frameworks with various class distributions in the training set, such as natural, balanced, over-represented cancer, and over-represented non-cancer. In the case of cancer detection, the experiments show several important results: (a) the natural class distribution produces more accurate results than the artificially generated balanced distribution; (b) the over-representation of non-cancer/negative classes (healthy tissue and/or background classes) compared to cancer/positive classes reduces the number of samples which are falsely predicted as cancer (false positives); (c) the less-expensive-to-annotate non-ROI (non-region-of-interest) data can be useful in compensating for the performance loss in the system due to a shortage of the expensive-to-annotate ROI data; (d) multi-label examples are more useful than single-label ones for training a segmentation model; and (e) when the classification model is tuned with a balanced validation set, it is less affected than the segmentation model by the class distribution of the training set.
Affiliation(s)
- Camille Franchet, Department of Pathology, University Cancer Institute of Toulouse-Oncopole, Toulouse, France
- Margot Gaspard, Department of Pathology, University Cancer Institute of Toulouse-Oncopole, Toulouse, France
- Josiane Mothe, IRIT, UMR5505 CNRS, Université de Toulouse, Toulouse, France
- Sylvain Cussat-Blanc, IRIT, UMR5505 CNRS, Université de Toulouse, Toulouse, France; Artificial and Natural Intelligence Toulouse Institute, Toulouse, France
- Hervé Luga, IRIT, UMR5505 CNRS, Université de Toulouse, Toulouse, France
- Pierre Brousset, Department of Pathology, University Cancer Institute of Toulouse-Oncopole, Toulouse, France; INSERM UMR 1037 Cancer Research Centre of Toulouse (CRCT), Université Toulouse III Paul-Sabatier, CNRS ERL 5294, Toulouse, France; Laboratoire d’Excellence TOUCAN, Toulouse, France
5. Cai W, Zhu J, Zhang M, Yang Y. A Parallel Classification Model for Marine Mammal Sounds Based on Multi-Dimensional Feature Extraction and Data Augmentation. SENSORS (BASEL, SWITZERLAND) 2022; 22:7443. [PMID: 36236544] [PMCID: PMC9572586] [DOI: 10.3390/s22197443]
Abstract
Due to the poor visibility of the deep-sea environment, acoustic signals are often collected and analyzed to explore the behavior of marine species. With the progress of underwater signal-acquisition technology, the amount of acoustic data obtained from the ocean has exceeded the limit that humans can process manually, so designing efficient marine-mammal classification algorithms has become a research hotspot. In this paper, we design a classification model based on a multi-channel parallel structure, which can process multi-dimensional acoustic features extracted from audio samples and fuse the prediction results of different channels through a trainable fully connected layer. It uses transfer learning to obtain faster convergence and introduces data augmentation to improve classification accuracy. The k-fold cross-validation method was used to split the dataset and comprehensively evaluate the prediction accuracy and robustness of the model. The evaluation results showed that the model can achieve a mean accuracy of 95.21% while maintaining a standard deviation of 0.65%, indicating excellent consistency in performance over multiple tests.
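A minimal sketch of the k-fold evaluation step only, reporting the mean and standard deviation of accuracy over stratified folds (the data and classifier are placeholders, not the acoustic features or parallel model described above):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=500, n_features=40, n_informative=10,
                           n_classes=4, n_clusters_per_class=1, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=cv)
print(f"mean accuracy {scores.mean():.4f} +/- {scores.std():.4f}")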
Affiliation(s)
- Wenyu Cai, College of Electronics and Information, Hangzhou Dianzi University, Hangzhou 310018, China
- Jifeng Zhu, College of Electronics and Information, Hangzhou Dianzi University, Hangzhou 310018, China
- Meiyan Zhang, College of Electrical Engineering, Zhejiang University of Water Resources and Electric Power, Hangzhou 310018, China
- Yong Yang, College of Electronics and Information, Hangzhou Dianzi University, Hangzhou 310018, China
6. PF-SMOTE: A novel parameter-free SMOTE for imbalanced datasets. Neurocomputing 2022. [DOI: 10.1016/j.neucom.2022.05.017]
7. Wei G, Mu W, Song Y, Dou J. An improved and random synthetic minority oversampling technique for imbalanced data. Knowl Based Syst 2022. [DOI: 10.1016/j.knosys.2022.108839]
8. Kamran H, Aleman DM, McIntosh C, Purdie TG. SuPART: supervised projective adapted resonance theory for automatic quality assurance approval of radiotherapy treatment plans. Phys Med Biol 2022; 67. [DOI: 10.1088/1361-6560/ac568f]
Abstract
Radiotherapy is a common modality for the treatment of cancer, where treatments must be carefully designed to deliver appropriate dose to targets while avoiding healthy organs. The comprehensive multi-disciplinary quality assurance (QA) process in radiotherapy is designed to ensure safe and effective treatment plans are delivered to patients. However, the plan QA process is expensive, often time-intensive, and requires review of large quantities of complex data, potentially leading to human error in QA assessment. We therefore develop an automated machine learning algorithm to identify ‘acceptable’ plans (plans that are similar to historically approved plans) and ‘unacceptable’ plans (plans that are dissimilar to historically approved plans). This algorithm is a supervised extension of projective adaptive resonance theory, called SuPART, that learns a set of distinctive features and considers deviations from them to be indications of unacceptable plans. We test SuPART on breast and prostate radiotherapy datasets from our institution, and find that SuPART outperforms common classification algorithms in several measures of accuracy. When no falsely approved plans are allowed, SuPART can correctly auto-approve 34% of the acceptable breast and 32% of the acceptable prostate plans, and can also correctly reject 53% of the unacceptable breast and 56% of the unacceptable prostate plans. Thus, usage of SuPART to aid in QA could potentially yield significant time savings.
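A toy deviation check in the spirit of "flag plans that differ from historically approved ones" (this is not SuPART, which is based on adapted resonance theory; the z-score threshold and random features below are invented):

import numpy as np

def fit_reference(approved_feats):
    """Learn per-feature mean and spread from approved plans."""
    return approved_feats.mean(axis=0), approved_feats.std(axis=0) + 1e-9

def is_acceptable(plan_feats, mean, std, z_max=3.0):
    """Accept only if no feature deviates more than z_max standard deviations."""
    return bool(np.all(np.abs((plan_feats - mean) / std) <= z_max))

rng = np.random.default_rng(1)
approved = rng.normal(size=(200, 12))          # features of historically approved plans
mean, std = fit_reference(approved)
print(is_acceptable(rng.normal(size=12), mean, std))        # typical plan -> True
print(is_acceptable(rng.normal(size=12) + 6.0, mean, std))  # strongly shifted plan -> False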
9. Maurya R, Pathak VK, Burget R, Dutta MK. Automated detection of bioimages using novel deep feature fusion algorithm and effective high-dimensional feature selection approach. Comput Biol Med 2021; 137:104862. [PMID: 34534793] [DOI: 10.1016/j.compbiomed.2021.104862]
Abstract
The classification of bioimages plays an important role in several biological studies, such as subcellular localisation, phenotype identification and other types of histopathological examinations. The objective of the present study was to develop a computer-aided bioimage classification method for the classification of bioimages across nine diverse benchmark datasets. A novel algorithm was developed, which systematically fused the features extracted from nine different convolutional neural network architectures. A systematic fusion of features boosts the performance of a classifier, but at the cost of a high-dimensional fused feature set. Therefore, non-discriminatory and redundant features need to be removed from the high-dimensional fused feature set to improve the classification performance and reduce the time complexity. To achieve this aim, a method based on analysis of variance and evolutionary feature selection was developed to select an optimal set of discriminatory features from the fused feature set. The proposed method was evaluated on nine different benchmark datasets. The experimental results showed that the proposed method achieved superior performance, with a significant reduction in the dimensionality of the fused feature set for most bioimage datasets. The performance of the proposed feature selection method was better than that of some of the most recent and classical feature selection methods. Thus, the proposed method was desirable because of its superior performance and high compression ratio, which significantly reduced the computational complexity.
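A brief sketch of the ANOVA-based filtering stage only (the evolutionary selection stage is not reproduced; the feature matrix is a random stand-in for the fused CNN features):

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=300, n_features=2048, n_informative=50,
                           random_state=0)              # stand-in for fused deep features
selector = SelectKBest(score_func=f_classif, k=256)     # keep the 256 most discriminative
X_reduced = selector.fit_transform(X, y)
print(X.shape, "->", X_reduced.shape)                   # (300, 2048) -> (300, 256)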
Affiliation(s)
- Ritesh Maurya, Centre for Advanced Studies, Dr A.P.J. Abdul Kalam Technical University, Lucknow, India
- Radim Burget, Department of Telecommunications, Faculty of Electrical Engineering and Communication, BRNO University of Technology, Czech Republic
- Malay Kishore Dutta, Centre for Advanced Studies, Dr A.P.J. Abdul Kalam Technical University, Lucknow, India
10. Diallo M, Xiong S, Emiru ED, Fesseha A, Abdulsalami AO, Elaziz MA. A Hybrid MultiLayer Perceptron Under-Sampling with Bagging Dealing with a Real-Life Imbalanced Rice Dataset. INFORMATION 2021; 12:291. [DOI: 10.3390/info12080291]
Abstract
Classification algorithms have shown exceptional prediction results in the supervised learning area, but they are not always efficient when it comes to real-life datasets because of their class distributions: datasets for real-life applications are generally imbalanced. Several methods have been proposed to solve the problem of class imbalance. In this paper, we propose a hybrid method combining preprocessing techniques with ensemble learning. The original training set is undersampled by evaluating the samples by stochastic measurement (SM) and then training on the samples selected by the Multilayer Perceptron to return a balanced training set. The balanced training set produced by MLPUS (multilayer perceptron undersampling) is aggregated using the bagging ensemble method. We applied our method to the real-life Niger_Rice dataset and forty-four other imbalanced datasets from the KEEL repository. We also compared our method with six other existing methods from the literature: the MLP classifier on the original imbalanced dataset, MLPUS, UnderBagging (combining random under-sampling and bagging), RUSBoost, SMOTEBagging (Synthetic Minority Oversampling Technique and bagging), and SMOTEBoost. The results show that our method is competitive compared to the other methods. On the real-life Niger_Rice dataset, the proposed method achieves 75.6, 0.73, 0.76, and 0.86 for accuracy, F-measure, G-mean, and ROC, respectively, whereas the MLP classifier on the original imbalanced Niger_Rice dataset gives 72.44, 0.82, 0.59, and 0.76.
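A rough sketch of the generic under-sampling-plus-bagging idea with MLP base learners, comparable to the UnderBagging baseline mentioned above rather than the authors' stochastic-measurement MLPUS (it assumes label 1 is the minority class):

import numpy as np
from sklearn.neural_network import MLPClassifier

def underbag_fit(X, y, n_estimators=10, seed=0):
    """Train MLPs on balanced subsets obtained by under-sampling the majority class."""
    rng = np.random.default_rng(seed)
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    models = []
    for _ in range(n_estimators):
        sampled = rng.choice(majority, size=minority.size, replace=False)
        idx = np.concatenate([minority, sampled])        # balanced training subset
        m = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
        models.append(m.fit(X[idx], y[idx]))
    return models

def underbag_predict(models, X):
    """Majority vote of the ensemble."""
    votes = np.mean([m.predict(X) for m in models], axis=0)
    return (votes >= 0.5).astype(int)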
11. Changes in Computer-Analyzed Facial Expressions with Age. SENSORS 2021; 21:s21144858. [PMID: 34300600] [PMCID: PMC8309819] [DOI: 10.3390/s21144858]
Abstract
Facial expressions are well known to change with age, but the quantitative properties of facial aging remain unclear. In the present study, we investigated the differences in the intensity of facial expressions between older (n = 56) and younger adults (n = 113). In laboratory experiments, the posed facial expressions of the participants were obtained based on six basic emotions and neutral facial expression stimuli, and the intensities of their expressions were analyzed using a computer vision tool, the OpenFace software. Our results showed that the older adults expressed strong expressions for some negative emotions and neutral faces. Furthermore, when making facial expressions, older adults used more facial muscles than younger adults across the emotions. These results may help to understand the characteristics of facial expressions in aging and can provide empirical evidence for other fields regarding facial recognition.
12. A novel hybrid predictive maintenance model based on clustering, smote and multi-layer perceptron neural network optimised with grey wolf algorithm. SN APPLIED SCIENCES 2021. [DOI: 10.1007/s42452-021-04598-1]
Abstract
Considering the complexities and challenges in the classification of multiclass and imbalanced fault conditions, this study explores the systematic combination of unsupervised and supervised learning by hybridising clustering (CLUST) and a multi-layer perceptron neural network optimised with the grey wolf algorithm (GWO-MLP). The hybrid technique was meticulously examined on a historical hydraulic system dataset by first extracting and selecting the most significant statistical time-domain features. The selected features were then grouped into distinct clusters, allowing for reduced computational complexity, through a comparative study of four different and frequently used categories of unsupervised clustering algorithms in fault classification. The Synthetic Minority Over-sampling Technique (SMOTE) was then employed to balance the classes of the training samples from the various clusters, which then served as inputs for training the supervised GWO-MLP. To validate the proposed hybrid technique (CLUST-SMOTE-GWO-MLP), it was compared with its distinct modifications (variants). The superiority of CLUST-SMOTE-GWO-MLP is demonstrated by outperforming all the distinct modifications in terms of test accuracy and seven other statistical performance evaluation metrics (error rate, sensitivity, specificity, precision, F score, Matthews Correlation Coefficient and geometric mean). The overall analysis indicates that the proposed CLUST-SMOTE-GWO-MLP is efficient and can be used to classify multiclass and imbalanced fault conditions.
Article Highlights
The issue of multiclass and imbalanced class outputs is addressed for improving predictive maintenance.
A multiclass fault classifier based on clustering and optimised multi-layer perceptron with grey wolf is proposed.
The robustness and feasibility of the proposed technique are validated on a complex hydraulic system dataset.
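A minimal sketch of the SMOTE-then-train step of the pipeline above (clustering, feature selection and grey-wolf hyperparameter tuning are omitted; the data are a synthetic placeholder), using the imbalanced-learn package:

import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=1000, n_features=20, weights=[0.9, 0.1],
                           random_state=0)                 # imbalanced stand-in data
print(np.bincount(y))                                      # roughly 900 vs 100 samples
X_bal, y_bal = SMOTE(random_state=0).fit_resample(X, y)    # synthesize minority samples
print(np.bincount(y_bal))                                  # classes balanced after SMOTE
clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
clf.fit(X_bal, y_bal)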
13. Yielding Multi-Fold Training Strategy for Image Classification of Imbalanced Weeds. APPLIED SCIENCES-BASEL 2021. [DOI: 10.3390/app11083331]
Abstract
An imbalanced dataset is a significant challenge when training a deep neural network (DNN) model for deep learning problems such as weed classification. An imbalanced dataset may result in a model that behaves robustly on major classes and is overly sensitive to minor classes. This article proposes a yielding multi-fold training (YMufT) strategy to train a DNN model on an imbalanced dataset. This strategy reduces the bias in training through a min-class-max-bound procedure (MCMB), which divides samples in the training set into multiple folds. The model is consecutively trained on each one of these folds. In practice, we experiment with our proposed strategy on two small (PlantSeedlings, small PlantVillage) and two large (Chonnam National University (CNU), large PlantVillage) weed datasets. With the same training configurations and approximate training steps used in conventional training methods, YMufT helps the DNN model to converge faster, thus requiring less training time. Despite a slight decrease in accuracy on the large dataset, YMufT increases the F1 score of the NASNet model to 0.9708 on the CNU dataset and to 0.9928 when using the Mobilenet model trained on the large PlantVillage dataset. YMufT shows outstanding performance in both accuracy and F1 score on small datasets, with values of (0.9981, 0.9970) using the Mobilenet model trained on the small PlantVillage dataset and (0.9718, 0.9689) using Resnet trained on the PlantSeedlings dataset. Grad-CAM visualization shows that conventional training methods mainly concentrate on high-level features and may capture insignificant features. In contrast, YMufT guides the model to capture essential features on the leaf surface and properly localize the weed targets.
14. González M, Luengo J, Cano JR, García S. Synthetic Sample Generation for Label Distribution Learning. Inf Sci (N Y) 2021. [DOI: 10.1016/j.ins.2020.07.071]
15. Abd Rahman HA, Wah YB, Huat OS. Predictive Performance of Logistic Regression for Imbalanced Data with Categorical Covariate. PERTANIKA JOURNAL OF SCIENCE AND TECHNOLOGY 2021; 29. [DOI: 10.47836/pjst.29.1.10]
Abstract
Logistic regression is often used for the classification of a binary categorical dependent variable using various types of covariates (continuous or categorical). Imbalanced data, which occurs when the number of cases in one category of the binary dependent variable is very much smaller than in the other category, will lead to biased parameter estimates and degrade the classification performance of the logistic regression model. This simulation study investigates the effect of imbalanced data, measured by the imbalance ratio (IR), on the parameter estimate of a binary logistic regression with a categorical covariate. Datasets were simulated with controlled percentages of imbalance ratio, from 1% to 50%, and for various sample sizes. The simulated datasets were then modeled using binary logistic regression, and the bias in the estimates was measured using the Mean Square Error (MSE). The simulation results provided evidence that the effect of the imbalance ratio on the parameter estimate of the covariate decreased as sample size increased. The bias of the estimates depended on sample size: for sample sizes of 100, 500, 1000-2000 and 2500-3500, the estimates were biased for IR below 30%, 10%, 5% and 2%, respectively. Results also showed that parameter estimates were all biased at IR 1% for all sample sizes. An application using a real dataset supported the simulation results.
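A small simulation in the same spirit (not the study's exact design; the sample size, coefficient value and replication count are arbitrary) that tracks the MSE of the estimated covariate coefficient as the imbalance ratio shrinks:

import numpy as np
import statsmodels.api as sm

def coef_mse(n, ir, true_beta=1.0, reps=200, seed=0):
    """MSE of the estimated coefficient of a binary covariate at imbalance ratio ir."""
    rng = np.random.default_rng(seed)
    errs = []
    for _ in range(reps):
        x = rng.integers(0, 2, size=n)                 # binary categorical covariate
        intercept = np.log(ir / (1 - ir))              # roughly controls the event rate
        p = 1.0 / (1.0 + np.exp(-(intercept + true_beta * x)))
        y = rng.binomial(1, p)
        try:
            fit = sm.Logit(y, sm.add_constant(x)).fit(disp=0)
            errs.append((fit.params[1] - true_beta) ** 2)
        except Exception:                              # skip perfectly separated samples
            continue
    return float(np.mean(errs))

for ir in (0.5, 0.2, 0.05):
    print(ir, round(coef_mse(n=500, ir=ir), 3))        # MSE tends to grow as ir shrinks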
16.
Abstract
One of the significant challenges in machine learning is the classification of imbalanced data. In many situations, standard classifiers cannot learn how to distinguish minority class examples from the others. Since many real problems are imbalanced, this issue has become highly relevant and is deeply studied today. This paper presents a new preprocessing method based on Delaunay tessellation and the preprocessing algorithm SMOTE (Synthetic Minority Over-sampling Technique), which we call DTO-SMOTE (Delaunay Tessellation Oversampling SMOTE). DTO-SMOTE constructs a mesh of simplices (in this paper, we use tetrahedrons) for creating synthetic examples. We compare results with five preprocessing algorithms (GEOMETRIC-SMOTE, SVM-SMOTE, SMOTE-BORDERLINE-1, SMOTE-BORDERLINE-2, and SMOTE), eight classification algorithms, and 61 binary-class data sets. For some classifiers, DTO-SMOTE has higher performance than others in terms of Area Under the ROC curve (AUC), Geometric Mean (GEO), and Generalized Index of Balanced Accuracy (IBA).
17. RUESVMs: An Ensemble Method to Handle the Class Imbalance Problem in Land Cover Mapping Using Google Earth Engine. REMOTE SENSING 2020. [DOI: 10.3390/rs12213484]
Abstract
Timely and accurate Land Cover (LC) information is required for various applications, such as climate change analysis and sustainable development. Although machine learning algorithms are most likely successful in LC mapping tasks, the class imbalance problem is known as a common challenge in this regard. This problem occurs during the training phase and reduces classification accuracy for infrequent and rare LC classes. To address this issue, this study proposes a new method by integrating random under-sampling of majority classes and an ensemble of Support Vector Machines, namely Random Under-sampling Ensemble of Support Vector Machines (RUESVMs). The performance of RUESVMs for LC classification was evaluated in Google Earth Engine (GEE) over two different case studies using Sentinel-2 time-series data and five well-known spectral indices, including the Normalized Difference Vegetation Index (NDVI), Green Normalized Difference Vegetation Index (GNDVI), Soil-Adjusted Vegetation Index (SAVI), Normalized Difference Built-up Index (NDBI), and Normalized Difference Water Index (NDWI). The performance of RUESVMs was also compared with the traditional SVM and combination of SVM with three benchmark data balancing techniques namely the Random Over-Sampling (ROS), Random Under-Sampling (RUS), and Synthetic Minority Over-sampling Technique (SMOTE). It was observed that the proposed method considerably improved the accuracy of LC classification, especially for the minority classes. After adopting RUESVMs, the overall accuracy of the generated LC map increased by approximately 4.95 percentage points, and this amount for the geometric mean of producer’s accuracies was almost 3.75 percentage points, in comparison to the most accurate data balancing method (i.e., SVM-SMOTE). Regarding the geometric mean of users’ accuracies, RUESVMs also outperformed the SVM-SMOTE method with an average increase of 6.45 percentage points.
18. Rahman HAA, Wah YB, Huat OS. Predictive Performance of Logistic Regression for Imbalanced Data with Categorical Covariate. PERTANIKA JOURNAL OF SCIENCE AND TECHNOLOGY 2020; 28. [DOI: 10.47836/pjst.28.4.02]
Abstract
Logistic regression is often used for the classification of a binary categorical dependent variable using various types of covariates (continuous or categorical). Imbalanced data will lead to biased parameter estimates and degrade the classification performance of the logistic regression model. Imbalanced data occurs when the number of cases in one category of the binary dependent variable is very much smaller than in the other category. This simulation study investigates the effect of imbalanced data measured by the imbalance ratio on the parameter estimate of the binary logistic regression with a categorical covariate. Datasets were simulated with controlled different percentages of imbalance ratio (IR), from 1% to 50%, and for various sample sizes. The simulated datasets were then modeled using binary logistic regression. The bias in the estimates was measured using the Mean Square Error (MSE). The simulation results provided evidence that the effect of the imbalance ratio on the parameter estimate of the covariate decreased as sample size increased. The bias of the estimates depended on sample size whereby for sample sizes of 100, 500, 1000-2000 and 2500-3500, the estimates were biased for IR below 30%, 10%, 5% and 2%, respectively. Results also showed that parameter estimates were all biased at IR 1% for all sample sizes. An application using a real dataset supported the simulation results.
19. Hernández Farías DI, Prati R, Herrera F, Rosso P. Irony detection in Twitter with imbalanced class distributions. JOURNAL OF INTELLIGENT & FUZZY SYSTEMS 2020. [DOI: 10.3233/jifs-179880]
Abstract
Irony detection is not a trivial problem, and it can help to improve natural language processing tasks such as sentiment analysis. When dealing with social media data in real scenarios, an important issue to address is data skew, i.e., the imbalance between the ironic and non-ironic samples available. In this work, the main objective is to address irony detection in Twitter considering various degrees of imbalanced distribution between classes. We rely on the emotIDM irony detection model. We evaluated it against both benchmark corpora and skewed Twitter datasets collected to simulate a realistic distribution of ironic tweets. We carry out a set of classification experiments aimed at determining the impact of class imbalance on detecting irony, and we evaluate the performance of irony detection when different scenarios are considered. We experiment with a set of classifiers, applying class imbalance techniques to compensate for the class distribution. Our results indicate that by using such techniques, it is possible to improve the performance of irony detection in imbalanced class scenarios.
Affiliation(s)
- Francisco Herrera, Department of Computer Science and Artificial Intelligence, University of Granada, Spain
20. Stegmayer G, Di Persia LE, Rubiolo M, Gerard M, Pividori M, Yones C, Bugnon LA, Rodriguez T, Raad J, Milone DH. Predicting novel microRNA: a comprehensive comparison of machine learning approaches. Brief Bioinform 2020; 20:1607-1620. [PMID: 29800232] [DOI: 10.1093/bib/bby037]
Abstract
MOTIVATION: The importance of microRNAs (miRNAs) is widely recognized in the community nowadays because these short segments of RNA can play several roles in almost all biological processes. The computational prediction of novel miRNAs involves training a classifier for identifying sequences having the highest chance of being precursors of miRNAs (pre-miRNAs). The big issue with this task is that well-known pre-miRNAs are usually few in comparison with the hundreds of thousands of candidate sequences in a genome, which results in high class imbalance. This imbalance has a strong influence on most standard classifiers, and if it is not properly addressed in the model and the experiments, not only can the reported performance be completely unrealistic, but the classifier will also be unable to work properly for pre-miRNA prediction. Another important issue is that most of the machine learning (ML) approaches already used (supervised methods) need both positive and negative examples. The selection of positive examples is straightforward (well-known pre-miRNAs), but it is difficult to build a representative set of negative examples, because they should be sequences with a hairpin structure that do not contain a pre-miRNA.
RESULTS: This review provides a comprehensive study and comparative assessment of methods from these two ML approaches for dealing with the prediction of novel pre-miRNAs: supervised and unsupervised training. We present and analyze the ML proposals that have appeared in the literature during the past 10 years. They have been compared in several prediction tasks involving two model genomes and increasing imbalance levels. This work provides a review of existing ML approaches for pre-miRNA prediction and fair comparisons of the classifiers with the same features and data sets, instead of just a revision of published software tools. The results and the discussion can help the community to select the most adequate bioinformatics approach according to the prediction task at hand. The comparative results obtained suggest that from low to mid-imbalance levels between classes, supervised methods can be the best. However, at very high imbalance levels, closer to real case scenarios, models including unsupervised and deep learning can provide better performance.
Affiliation(s)
- Georgina Stegmayer, sinc(i), Research Institute for Signals, Systems and Computational Intelligence (CONICET-UNL), Ciudad Universitaria, Santa Fe, Argentina
- Leandro E Di Persia, sinc(i), Research Institute for Signals, Systems and Computational Intelligence (CONICET-UNL), Ciudad Universitaria, Santa Fe, Argentina
- Mariano Rubiolo, sinc(i), Research Institute for Signals, Systems and Computational Intelligence (CONICET-UNL), Ciudad Universitaria, Santa Fe, Argentina
- Matias Gerard, sinc(i), Research Institute for Signals, Systems and Computational Intelligence (CONICET-UNL), Ciudad Universitaria, Santa Fe, Argentina
- Milton Pividori, sinc(i), Research Institute for Signals, Systems and Computational Intelligence (CONICET-UNL), Ciudad Universitaria, Santa Fe, Argentina
- Cristian Yones, sinc(i), Research Institute for Signals, Systems and Computational Intelligence (CONICET-UNL), Ciudad Universitaria, Santa Fe, Argentina
- Leandro A Bugnon, sinc(i), Research Institute for Signals, Systems and Computational Intelligence (CONICET-UNL), Ciudad Universitaria, Santa Fe, Argentina
- Tadeo Rodriguez, sinc(i), Research Institute for Signals, Systems and Computational Intelligence (CONICET-UNL), Ciudad Universitaria, Santa Fe, Argentina
- Jonathan Raad, sinc(i), Research Institute for Signals, Systems and Computational Intelligence (CONICET-UNL), Ciudad Universitaria, Santa Fe, Argentina
- Diego H Milone, sinc(i), Research Institute for Signals, Systems and Computational Intelligence (CONICET-UNL), Ciudad Universitaria, Santa Fe, Argentina
21. Ye X, Li H, Imakura A, Sakurai T. An oversampling framework for imbalanced classification based on Laplacian eigenmaps. Neurocomputing 2020. [DOI: 10.1016/j.neucom.2020.02.081]
22. Vong CM, Du J. Accurate and efficient sequential ensemble learning for highly imbalanced multi-class data. Neural Netw 2020; 128:268-278. [PMID: 32454371] [DOI: 10.1016/j.neunet.2020.05.010]
Abstract
Multi-class classification for highly imbalanced data is a challenging task in which multiple issues must be resolved simultaneously, including (i) accuracy on classifying highly imbalanced multi-class data; (ii) training efficiency for large data; and (iii) sensitivity to high imbalance ratio (IR). In this paper, a novel sequential ensemble learning (SEL) framework is designed to simultaneously resolve these issues. SEL framework provides a significant property over traditional AdaBoost, in which the majority samples can be divided into multiple small and disjoint subsets for training multiple weak learners without compromising accuracy (while AdaBoost cannot). To ensure the class balance and majority-disjoint property of subsets, a learning strategy called balanced and majority-disjoint subsets division (BMSD) is developed. Unfortunately it is difficult to derive a general learner combination method (LCM) for any kind of weak learner. In this work, LCM is specifically designed for extreme learning machine, called LCM-ELM. The proposed SEL framework with BMSD and LCM-ELM has been compared with state-of-the-art methods over 16 benchmark datasets. In the experiments, under highly imbalanced multi-class data (IR up to 14K; data size up to 493K), (i) the proposed works improve the performance in different measures including G-mean, macro-F, micro-F, MAUC; (ii) training time is significantly reduced.
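A rough sketch of the balanced, majority-disjoint division idea described above (not the paper's exact BMSD procedure or the LCM-ELM combiner; labels 0 and 1 stand in for a majority and a minority class):

import numpy as np

def majority_disjoint_subsets(y, majority_label=0, minority_label=1, seed=0):
    """Split the majority class into disjoint chunks, each paired with all minority samples."""
    rng = np.random.default_rng(seed)
    maj = rng.permutation(np.flatnonzero(y == majority_label))
    mino = np.flatnonzero(y == minority_label)
    n_chunks = max(1, maj.size // mino.size)
    return [np.concatenate([chunk, mino]) for chunk in np.array_split(maj, n_chunks)]

y = np.array([0] * 90 + [1] * 10)
print([len(s) for s in majority_disjoint_subsets(y)])   # nine roughly balanced subsets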
Affiliation(s)
- Chi-Man Vong, Department of Computer and Information Science, University of Macau, Macau SAR 999078, China
- Jie Du, School of Biomedical Engineering, Health Science Center, Shenzhen University, Shenzhen, 518060, China
23. Blanquero R, Carrizosa E, Ramírez-Cobo P, Sillero-Denamiel MR. A cost-sensitive constrained Lasso. ADV DATA ANAL CLASSI 2020. [DOI: 10.1007/s11634-020-00389-5]
24.
25. Cruz RMO, Souza MA, Sabourin R, Cavalcanti GDC. Dynamic Ensemble Selection and Data Preprocessing for Multi-Class Imbalance Learning. INT J PATTERN RECOGN 2019. [DOI: 10.1142/s0218001419400093]
Abstract
Class imbalance refers to classification problems in which many more instances are available for certain classes than for others. Such imbalanced datasets require special attention because traditional classifiers generally favor the majority class, which has a large number of instances. Ensembles of classifiers have been reported to yield promising results. However, the majority of ensemble methods applied to imbalance learning are static ones. Moreover, they only deal with binary imbalanced problems. Hence, this paper presents an empirical analysis of Dynamic Selection techniques and data preprocessing methods for dealing with multi-class imbalanced problems. We considered five variations of preprocessing methods and 14 Dynamic Selection schemes. Our experiments conducted on 26 multi-class imbalanced problems show that the dynamic ensemble improves the AUC and the G-mean as compared to the static ensemble. Moreover, data preprocessing plays an important role in such cases.
Affiliation(s)
- Rafael M. O. Cruz, Laboratoire d’Imagerie, de Vision et d’Intelligence Artificielle, École de Technologie Supérieure, Université du Québec, Montreal, QC, H3C 1K3, Canada
- Mariana A. Souza, Centro de Informática, Universidade Federal de Pernambuco, Recife, PE 50.670-420, Brazil
- Robert Sabourin, Laboratoire d’Imagerie, de Vision et d’Intelligence Artificielle, École de Technologie Supérieure, Université du Québec, Montreal, QC, H3C 1K3, Canada
26. Braytee A, Liu W, Anaissi A, Kennedy PJ. Correlated Multi-label Classification with Incomplete Label Space and Class Imbalance. ACM T INTEL SYST TEC 2019. [DOI: 10.1145/3342512]
Abstract
Multi-label classification is defined as the problem of identifying the multiple labels or categories of new observations based on labeled training data. Multi-labeled data has several challenges, including class imbalance, label correlation, incomplete multi-label matrices, and noisy and irrelevant features. In this article, we propose an integrated multi-label classification approach with incomplete label space and class imbalance (ML-CIB) for simultaneously training the multi-label classification model and addressing the aforementioned challenges. The model learns a new label matrix and captures new label correlations, because it is difficult to find a complete label vector for each instance in real-world data. We also propose a label regularization to handle the imbalanced multi-labeled issue in the new label, and an l1 regularization norm is incorporated in the objective function to select the relevant sparse features. A multi-label feature selection (ML-CIB-FS) method is presented as a variant of the proposed ML-CIB to show the efficacy of the proposed method in selecting the relevant features. ML-CIB is formulated as a constrained objective function. We use the accelerated proximal gradient method to solve the proposed optimisation problem. Last, extensive experiments are conducted on 19 regular-scale and large-scale imbalanced multi-labeled datasets. The promising results show that our method significantly outperforms the state-of-the-art.
Affiliation(s)
- Ali Braytee, Advanced Analytics Institute, University of Technology Sydney, Ultimo, NSW, Australia
- Wei Liu, Advanced Analytics Institute, University of Technology Sydney, Ultimo, NSW, Australia
- Ali Anaissi, School of IT, Faculty of Engineering and IT, The University of Sydney, Camperdown, NSW, Australia
- Paul J. Kennedy, Centre of Artificial Intelligence, University of Technology Sydney, Camperdown, NSW, Australia
27. González S, García S, Li ST, Herrera F. Chain based sampling for monotonic imbalanced classification. Inf Sci (N Y) 2019. [DOI: 10.1016/j.ins.2018.09.062]
28. Chu WS, De la Torre F, Cohn JF. Learning Facial Action Units with Spatiotemporal Cues and Multi-label Sampling. IMAGE AND VISION COMPUTING 2019; 81:1-14. [PMID: 30524157] [PMCID: PMC6277040] [DOI: 10.1016/j.imavis.2018.10.002]
Abstract
Facial action units (AUs) may be represented spatially, temporally, and in terms of their correlation. Previous research focuses on one or another of these aspects or addresses them disjointly. We propose a hybrid network architecture that jointly models spatial and temporal representations and their correlation. In particular, we use a Convolutional Neural Network (CNN) to learn spatial representations, and a Long Short-Term Memory (LSTM) network to model temporal dependencies among them. The outputs of the CNNs and LSTMs are aggregated into a fusion network to produce per-frame predictions of multiple AUs. The hybrid network was compared to previous state-of-the-art approaches in two large FACS-coded video databases, GFT and BP4D, with over 400,000 AU-coded frames of spontaneous facial behavior in varied social contexts. Relative to standard multi-label CNN and feature-based state-of-the-art approaches, the hybrid system reduced person-specific biases and obtained increased accuracy for AU detection. To address class imbalance within and between batches during training of the network, we introduce multi-label sampling strategies that further increase accuracy when AUs are relatively sparse. Finally, we provide visualization of the learned AU models, which, to the best of our knowledge, reveal for the first time how machines see AUs.
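A generic PyTorch sketch of the CNN-then-LSTM idea only (the paper's backbone, layer sizes and fusion network are not reproduced; the 12-AU output and input resolution are placeholders):

import torch
import torch.nn as nn

class CNNLSTM(nn.Module):
    def __init__(self, n_aus=12, hidden=128):
        super().__init__()
        self.cnn = nn.Sequential(                            # per-frame spatial features
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.lstm = nn.LSTM(32, hidden, batch_first=True)    # temporal dependencies
        self.head = nn.Linear(hidden, n_aus)                 # per-frame multi-label logits

    def forward(self, clips):                                # (batch, time, 3, H, W)
        b, t = clips.shape[:2]
        feats = self.cnn(clips.flatten(0, 1)).view(b, t, -1)
        out, _ = self.lstm(feats)
        return self.head(out)                                # (batch, time, n_aus)

print(CNNLSTM()(torch.randn(2, 8, 3, 64, 64)).shape)         # torch.Size([2, 8, 12])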
Affiliation(s)
- Wen-Sheng Chu, Robotics Institute, Carnegie Mellon University, Pittsburgh, USA
- Jeffrey F Cohn, Department of Psychology, University of Pittsburgh, Pittsburgh, USA
29. REMEDIAL-HwR: Tackling multilabel imbalance through label decoupling and data resampling hybridization. Neurocomputing 2019. [DOI: 10.1016/j.neucom.2017.01.118]
30. Charte F, Rivera AJ, del Jesus MJ, Herrera F. Dealing with difficult minority labels in imbalanced mutilabel data sets. Neurocomputing 2019. [DOI: 10.1016/j.neucom.2016.08.158]
31. Vong CM, Du J, Wong CM, Cao JW. Postboosting Using Extended G-Mean for Online Sequential Multiclass Imbalance Learning. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2018; 29:6163-6177. [PMID: 29993897] [DOI: 10.1109/tnnls.2018.2826553]
Abstract
In this paper, a novel learning method called postboosting using extended G-mean (PBG) is proposed for online sequential multiclass imbalance learning (OS-MIL) in neural networks. PBG is effective due to three reasons. 1) Through postadjusting a classification boundary under extended G-mean, the challenging issue of imbalanced class distribution for sequentially arriving multiclass data can be effectively resolved. 2) A newly derived update rule for online sequential learning is proposed, which produces a high G-mean for current model and simultaneously possesses almost the same information of its previous models. 3) A dynamic adjustment mechanism provided by extended G-mean is valid to deal with the unresolved challenging dense-majority problem and two dynamic changing issues, namely, dynamic changing data scarcity (DCDS) and dynamic changing data diversity (DCDD). Compared to other OS-MIL methods, PBG is highly effective on resolving DCDS, while PBG is the only method to resolve dense-majority and DCDD. Furthermore, PBG can directly and effectively handle unscaled data stream. Experiments have been conducted for PBG and two popular OS-MIL methods for neural networks under massive binary and multiclass data sets. Through the analyses of experimental results, PBG is shown to outperform the other compared methods on all data sets in various aspects including the issues of data scarcity, dense-majority, DCDS, DCDD, and unscaled data.
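For contrast with the paper's extended variant, the conventional multiclass G-mean, the geometric mean of per-class recalls, can be computed as follows (the labels are placeholders):

import numpy as np
from sklearn.metrics import recall_score

def g_mean(y_true, y_pred):
    recalls = recall_score(y_true, y_pred, average=None)   # one recall per class
    return float(np.prod(recalls) ** (1.0 / len(recalls)))

y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 0, 0, 0, 0, 1, 0, 2, 2])
print(round(g_mean(y_true, y_pred), 3))                    # 0.794, dragged down by class 1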
32. A Two-Stage Big Data Analytics Framework with Real World Applications Using Spark Machine Learning and Long Short-Term Memory Network. Symmetry (Basel) 2018. [DOI: 10.3390/sym10100485]
Abstract
Every day we experience unprecedented data growth from numerous sources, which contributes to big data in terms of volume, velocity, and variability. These datasets impose great challenges on analytics frameworks and computational resources, making the overall analysis difficult for extracting meaningful information in a timely manner. Thus, to tackle these challenges, developing an efficient big data analytics framework is an important research topic. Consequently, to address these challenges by exploiting non-linear relationships from very large and high-dimensional datasets, machine learning (ML) and deep learning (DL) algorithms are being used in analytics frameworks. Apache Spark has been in use as the fastest big data processing arsenal, which helps to solve iterative ML tasks using its distributed ML library, Spark MLlib. Considering real-world research problems, DL architectures such as the Long Short-Term Memory (LSTM) network are an effective approach to overcoming practical issues such as reduced accuracy, long-term sequence dependency, and vanishing and exploding gradients in conventional deep architectures. In this paper, we propose an efficient analytics framework, which is technically a progressive machine learning technique merged with Spark-based linear models, a Multilayer Perceptron (MLP) and an LSTM, using a two-stage cascade structure in order to enhance predictive accuracy. Our proposed architecture enables us to organize big data analytics in a scalable and efficient way. To show the effectiveness of our framework, we applied the cascading structure to two different real-life datasets to solve a multiclass and a binary classification problem, respectively. Experimental results show that our analytical framework outperforms state-of-the-art approaches with a high level of classification accuracy.
33. Liu G, Yang Y, Li B. Fuzzy rule-based oversampling technique for imbalanced and incomplete data learning. Knowl Based Syst 2018. [DOI: 10.1016/j.knosys.2018.05.044]
34
|
Decision Support System for Medical Diagnosis Utilizing Imbalanced Clinical Data. APPLIED SCIENCES-BASEL 2018. [DOI: 10.3390/app8091597] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/16/2022]
Abstract
A clinical decision support system provides automatic diagnosis of human diseases, using machine learning techniques to analyze patient features and classify patients according to different diseases. An analysis of real-world electronic health record (EHR) data has revealed that a patient can be diagnosed with more than one disease simultaneously. Therefore, to suggest a list of possible diseases, the task of classifying patients is transformed into a multi-label learning task. For most multi-label learning techniques, the class imbalance that exists in EHR data may cause performance degradation. Cross-Coupling Aggregation (COCOA) is a typical multi-label learning approach aimed at leveraging label correlation and handling class imbalance. For each label, COCOA aggregates the predictive result of a binary-class imbalance classifier for that label with the predictive results of several multi-class imbalance classifiers built on pairs of that label and other labels. However, class imbalance may still affect a multi-class imbalance learner when the number of instances of a coupling label is too small. To improve the performance of COCOA, this paper presents COCOA-RE, a regularized ensemble approach integrated into the multi-class classification process of COCOA. To provide disease diagnoses, COCOA-RE learns from the available laboratory test reports and essential patient information and produces a multi-label predictive model. Experiments were performed to validate the effectiveness of the proposed multi-label learning approach, and the approach was implemented in a system prototype.
Collapse
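To make the aggregation idea concrete, the sketch below scores one label by averaging a per-label "balanced" classifier with a few classifiers coupled to other labels. This is a heavily simplified illustration under stated assumptions: the tri-class target built with `np.where` is one plausible coupling scheme (not necessarily COCOA's exact derivation), the base learners are ordinary random forests rather than imbalance-specific learners, the COCOA-RE regularized ensemble is omitted, and every label column is assumed to contain both positives and negatives.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def cocoa_like_scores(X, Y, x_new, n_couples=2, seed=0):
    """Very simplified COCOA-style relevance scores for each label.
    X: (n, d) features; Y: (n, n_labels) binary label matrix; x_new: (d,) query."""
    rng = np.random.default_rng(seed)
    n_labels = Y.shape[1]
    scores = np.zeros(n_labels)
    for j in range(n_labels):
        parts = []
        # (a) binary classifier for label j with class weighting.
        bin_clf = RandomForestClassifier(class_weight="balanced",
                                         random_state=seed).fit(X, Y[:, j])
        parts.append(bin_clf.predict_proba(x_new[None, :])[0, 1])
        # (b) classifiers coupled with a few other labels (hypothetical scheme).
        others = [l for l in range(n_labels) if l != j]
        for k in rng.choice(others, size=min(n_couples, len(others)), replace=False):
            joint = np.where(Y[:, j] == 1, 2, Y[:, k])   # 3-class coupled target
            clf = RandomForestClassifier(class_weight="balanced",
                                         random_state=seed).fit(X, joint)
            proba = clf.predict_proba(x_new[None, :])[0]
            cls = list(clf.classes_)
            parts.append(proba[cls.index(2)] if 2 in cls else 0.0)
        scores[j] = np.mean(parts)
    return scores
```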
|
35
|
|
36
|
Emerging topics and challenges of learning from noisy data in nonstandard classification: a survey beyond binary class noise. Knowl Inf Syst 2018. [DOI: 10.1007/s10115-018-1244-4] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/28/2022]
|
37
|
Predicting reference soil groups using legacy data: A data pruning and Random Forest approach for tropical environment (Dano catchment, Burkina Faso). Sci Rep 2018; 8:9959. [PMID: 29967391 PMCID: PMC6028482 DOI: 10.1038/s41598-018-28244-w] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/28/2017] [Accepted: 06/18/2018] [Indexed: 12/02/2022] Open
Abstract
Predicting taxonomic classes can be challenging with datasets subject to substantial irregularities due to the involvement of many surveyors. A data pruning approach was used in the present study to reduce such source errors by exploring whether different data pruning methods, which result in different subsets of a major reference soil group (RSG), the Plinthosols, would increase the prediction accuracy of the minor soil groups when using Random Forest (RF). This method was compared to the random oversampling approach. Four datasets were used: the entire dataset and pruned datasets consisting of the 80%, 90%, and standard-deviation core ranges of the Plinthosols data, with all data points belonging to the outer range cut off. The best prediction was achieved when RF was used with recursive feature elimination on the non-oversampled 90% core-range dataset. This model showed substantial agreement with observations, with a kappa value of 0.57, and a 7% to 35% increase in prediction accuracy for the smaller RSGs. The reference soil groups in the Dano catchment appeared to be mainly influenced by the wetness index, a proxy for soil moisture distribution.
Collapse
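A minimal sketch of the workflow this entry describes, pruning the dominant class to a core range and then training Random Forest with recursive feature elimination. The per-feature percentile pruning, the toy dataset, and all parameter values are illustrative assumptions; the study's actual pruning of the Plinthosols data and its soil covariates differ.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import train_test_split

# Toy stand-in for the soil data; class 0 plays the role of the dominant
# Plinthosols group, the other classes the minor reference soil groups.
X, y = make_classification(n_samples=3000, n_features=25, n_informative=10,
                           n_classes=4, weights=[0.7, 0.1, 0.1, 0.1],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Pruning: keep only the central 90% of the dominant class along each feature
# (a rough analogue of the "90% core range" pruning).
maj = (y_tr == 0)
lo, hi = np.percentile(X_tr[maj], [5, 95], axis=0)
core = np.all((X_tr >= lo) & (X_tr <= hi), axis=1) | ~maj
X_pr, y_pr = X_tr[core], y_tr[core]

# Random Forest wrapped in recursive feature elimination with cross-validation.
rf = RandomForestClassifier(n_estimators=200, random_state=0)
selector = RFECV(rf, step=2, cv=3).fit(X_pr, y_pr)
print("selected features:", selector.n_features_)
print("test accuracy:", selector.score(X_te, y_te))
```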
|
38
|
Roy A, Cruz RM, Sabourin R, Cavalcanti GD. A study on combining dynamic selection and data preprocessing for imbalance learning. Neurocomputing 2018. [DOI: 10.1016/j.neucom.2018.01.060] [Citation(s) in RCA: 37] [Impact Index Per Article: 5.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/18/2022]
|
39
|
Zhang ZL, Luo XG, González S, García S, Herrera F. DRCW-ASEG: One-versus-One distance-based relative competence weighting with adaptive synthetic example generation for multi-class imbalanced datasets. Neurocomputing 2018. [DOI: 10.1016/j.neucom.2018.01.039] [Citation(s) in RCA: 24] [Impact Index Per Article: 3.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/05/2023]
|
40
|
Nanda G, Vallmuur K, Lehto M. Improving autocoding performance of rare categories in injury classification: Is more training data or filtering the solution? ACCIDENT; ANALYSIS AND PREVENTION 2018; 110:115-127. [PMID: 29127808 DOI: 10.1016/j.aap.2017.10.020] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 02/06/2017] [Revised: 08/13/2017] [Accepted: 10/21/2017] [Indexed: 06/07/2023]
Abstract
INTRODUCTION: Classical machine learning (ML) models have been found to assign external-cause-of-injury codes (E-codes) based on injury narratives with good overall accuracy, but they often struggle with rare categories, primarily because of the lack of sufficient training cases and the heavily skewed nature of injury data. In this paper, we have: a) studied the effect of increasing the size of the training data on the prediction performance of three classical ML models, Multinomial Naïve Bayes (MNB), Support Vector Machine (SVM), and Logistic Regression (LR); and b) studied the effect of filtering based on the prediction strength of the LR model when the model is trained on very small (10,000 cases) and very large (450,000 cases) training sets. METHOD: Data from the Queensland Injury Surveillance Unit from 2002-2012, categorized into 20 broad E-codes, were used for this study. Eleven randomly chosen training sets ranging in size from 10,000 to 450,000 cases were used to train the ML models, and prediction performance was analyzed on a prediction set of 50,150 cases. The filtering approach was tested on LR models trained on the smallest and largest training sets. Sensitivity was used as the performance measure for individual categories. Weighted average sensitivity (WAvg) and unweighted average sensitivity (UAvg) were used as measures of overall performance. The filtering approach was also tested for estimating category counts and was compared with the approaches of summing prediction probabilities and counting direct predictions by the ML model. RESULTS: The overall performance of all three ML models improved as the size of the training data increased. The overall sensitivities at the maximum training size were similar for the LR and SVM models (∼82%) and higher than for MNB (76%). For all ML models, the sensitivities of rare categories improved with increasing training data but remained considerably lower than the sensitivities of larger categories. With increasing training data size, LR and SVM exhibited diminishing improvement in UAvg, whereas the improvement was relatively steady for MNB. Filtering based on the prediction strength of the LR model (and manual review of the filtered cases) helped improve the sensitivities of rare categories. A sizeable portion of cases still needed to be filtered even when the LR model was trained on the very large training set. For estimating category counts, the filtering approach provided the best estimates for most E-codes, while the summing-prediction-probabilities approach provided better estimates for rare categories. CONCLUSIONS: Increasing the size of the training data alone cannot solve the problem of poor classification performance on rare categories by ML models. Filtering can be an effective strategy for improving the classification performance of rare categories when large training data are not available.
Collapse
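The evaluation quantities in this entry are straightforward to compute. The sketch below shows per-category sensitivity with its weighted (WAvg) and unweighted (UAvg) averages, and a prediction-strength filter that flags low-confidence cases for manual review; the 0.8 threshold is an arbitrary illustration, not a value from the study.

```python
import numpy as np

def category_sensitivities(y_true, y_pred, classes):
    """Per-category sensitivity (recall), plus weighted and unweighted averages."""
    sens, sizes = [], []
    for c in classes:
        mask = (y_true == c)
        sizes.append(mask.sum())
        sens.append((y_pred[mask] == c).mean() if mask.any() else 0.0)
    sens, sizes = np.array(sens), np.array(sizes)
    wavg = float(np.sum(sens * sizes) / sizes.sum())   # WAvg: weighted by category size
    uavg = float(sens.mean())                          # UAvg: plain mean over categories
    return sens, wavg, uavg

def filter_for_review(proba, threshold=0.8):
    """Flag cases whose top predicted probability ('prediction strength') falls
    below a threshold, so they can be routed to manual coding."""
    strength = proba.max(axis=1)
    return np.where(strength < threshold)[0]
```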
Affiliation(s)
- Gaurav Nanda
- School of Industrial Engineering, Purdue University, USA.
| | - Kirsten Vallmuur
- Current: Australian Centre for Health Services Innovation, School of Public Health and Social Work, Queensland University of Technology, Australia; Formerly: Centre for Accident Research and Road Safety-Queensland, School of Psychology and Counselling, Queensland University of Technology, Australia
| | - Mark Lehto
- School of Industrial Engineering, Purdue University, USA
| |
Collapse
|
41
|
Du J, Vong CM, Pun CM, Wong PK, Ip WF. Post-boosting of classification boundary for imbalanced data using geometric mean. Neural Netw 2017; 96:101-114. [PMID: 28987974 DOI: 10.1016/j.neunet.2017.09.004] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/27/2016] [Revised: 06/21/2017] [Accepted: 09/05/2017] [Indexed: 11/24/2022]
Abstract
In this paper, a novel imbalance learning method for binary classes is proposed, named Post-Boosting of the classification boundary for Imbalanced data (PBI), which can significantly improve the performance of the classification boundary of any trained neural network (NN). The procedure of PBI consists of two steps: an (imbalanced) NN learning method is first applied to produce a classification boundary, which is then adjusted by PBI under the geometric mean (G-mean). For imbalanced data, the geometric mean of the accuracies of the minority and majority classes is considered, which is statistically more suitable than the common accuracy metric. PBI also has the following advantages over traditional imbalance methods: (i) PBI can significantly improve classification accuracy on the minority class while improving or maintaining that on the majority class; (ii) PBI is suitable for large data, even with a high degree of imbalance (imbalance ratio up to 0.001). For the evaluation of (i), a new metric called the Majority loss/Minority advance ratio (MMR) is proposed, which evaluates the loss ratio of the majority class to the minority class. Experiments were conducted for PBI and several imbalance learning methods on benchmark datasets of different sizes, imbalance ratios, and dimensionalities. The analysis of the experimental results shows that PBI outperforms the other imbalance learning methods on almost all datasets.
Collapse
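A minimal illustration of the two ideas named in this abstract: post-adjusting an already trained classifier under the binary G-mean, and comparing the majority-class loss against the minority-class gain. The threshold-moving routine is only a stand-in for PBI's boundary post-adjustment, and `mmr` is a literal reading of the metric's name; the paper's exact definitions may differ.

```python
import numpy as np

def gmean_binary(y_true, y_pred):
    """Geometric mean of sensitivity (minority, label 1) and specificity (majority, label 0)."""
    sens = (y_pred[y_true == 1] == 1).mean()
    spec = (y_pred[y_true == 0] == 0).mean()
    return float(np.sqrt(sens * spec))

def post_adjust_threshold(scores, y_true, grid=np.linspace(0.05, 0.95, 19)):
    """Shift the decision threshold of a trained classifier to the value that
    maximises G-mean on a validation set (a simple post-adjustment stand-in)."""
    best_t, best_g = 0.5, -1.0
    for t in grid:
        g = gmean_binary(y_true, (scores >= t).astype(int))
        if g > best_g:
            best_t, best_g = t, g
    return best_t, best_g

def mmr(y_true, y_pred_before, y_pred_after):
    """Drop in majority-class accuracy divided by gain in minority-class
    accuracy after post-adjustment (an illustrative reading of MMR)."""
    maj_loss = (y_pred_before[y_true == 0] == 0).mean() - (y_pred_after[y_true == 0] == 0).mean()
    min_gain = (y_pred_after[y_true == 1] == 1).mean() - (y_pred_before[y_true == 1] == 1).mean()
    return maj_loss / min_gain if min_gain > 0 else np.inf
```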
Affiliation(s)
- Jie Du
- Department of Computer and Information Science, University of Macau, Macau.
| | - Chi-Man Vong
- Department of Computer and Information Science, University of Macau, Macau.
| | - Chi-Man Pun
- Department of Computer and Information Science, University of Macau, Macau.
| | - Pak-Kin Wong
- Department of Electromechanical Engineering, University of Macau, Macau.
| | - Weng-Fai Ip
- Faculty of Science and Technology, University of Macau, Macau.
| |
Collapse
|
42
|
Ortigosa-Hernández J, Inza I, Lozano JA. Measuring the class-imbalance extent of multi-class problems. Pattern Recognit Lett 2017. [DOI: 10.1016/j.patrec.2017.08.002] [Citation(s) in RCA: 38] [Impact Index Per Article: 4.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/26/2022]
|
43
|
Zhang ZL, Luo XG, García S, Herrera F. Cost-Sensitive back-propagation neural networks with binarization techniques in addressing multi-class problems and non-competent classifiers. Appl Soft Comput 2017. [DOI: 10.1016/j.asoc.2017.03.016] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/19/2022]
|
44
|
Xu Y. Maximum Margin of Twin Spheres Support Vector Machine for Imbalanced Data Classification. IEEE TRANSACTIONS ON CYBERNETICS 2017; 47:1540-1550. [PMID: 27116760 DOI: 10.1109/tcyb.2016.2551735] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/05/2023]
Abstract
The twin support vector machine (TSVM) finds two nonparallel planes by solving a pair of smaller-sized quadratic programming problems (QPPs) rather than the single large one of the conventional support vector machine (SVM); this makes the learning speed of TSVM approximately four times faster than that of the standard SVM. One major limitation of TSVM is that it involves an expensive matrix inverse operation when solving the dual problem. In addition, TSVM is less effective when dealing with imbalanced data. In this paper, we propose a maximum margin of twin spheres support vector machine (MMTSSVM) for imbalanced data classification. MMTSSVM only needs to find two homocentric spheres: the small sphere captures as many samples of the majority class as possible, while the large sphere pushes out most samples of the minority class by increasing the margin between the two homocentric spheres. MMTSSVM involves one QPP and one linear programming problem, as opposed to a pair of QPPs in classical TSVM or a larger-sized QPP in SVM, which greatly increases computational speed. More importantly, MMTSSVM avoids the matrix inverse operation. The properties of the parameters in MMTSSVM are discussed and verified in an artificial experiment. Experimental results on nine benchmark datasets demonstrate the effectiveness of the proposed MMTSSVM in comparison with state-of-the-art algorithms. Finally, we apply MMTSSVM to an Alzheimer's disease medical experiment and likewise obtain better experimental results.
Collapse
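For intuition only, the sketch below caricatures the homocentric-spheres idea described in the abstract: a centre and a small radius chosen to cover most of one class, with points beyond the radius (plus a margin) assigned to the other class. The quantile-based fit, the fixed centre, and the hard distance rule are illustrative assumptions; the actual MMTSSVM solves a QPP and a linear program and maximises the margin between the two spheres.

```python
import numpy as np

def fit_toy_spheres(X_maj, coverage=0.9):
    """Toy fit: centre = mean of the enclosed class; small radius chosen so that
    `coverage` of its samples fall inside."""
    center = X_maj.mean(axis=0)
    dists = np.linalg.norm(X_maj - center, axis=1)
    r_small = float(np.quantile(dists, coverage))
    return center, r_small

def predict_toy(X, center, r_small, margin=0.0):
    """Points inside the small sphere -> enclosed class (0); points beyond the
    radius plus an optional margin -> the other class (1)."""
    d = np.linalg.norm(X - center, axis=1)
    return np.where(d <= r_small + margin, 0, 1)
```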
|
45
|
Gutiérrez PD, Lastra M, Benítez JM, Herrera F. SMOTE-GPU: Big Data preprocessing on commodity hardware for imbalanced classification. PROGRESS IN ARTIFICIAL INTELLIGENCE 2017. [DOI: 10.1007/s13748-017-0128-2] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/01/2022]
|
46
|
Fernández A, Carmona CJ, José del Jesus M, Herrera F. A Pareto-based Ensemble with Feature and Instance Selection for Learning from Multi-Class Imbalanced Datasets. Int J Neural Syst 2017. [DOI: 10.1142/s0129065717500289] [Citation(s) in RCA: 36] [Impact Index Per Article: 4.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022]
Abstract
Imbalanced classification concerns problems that have an uneven distribution of instances among classes. In addition, when instances are located in overlapping areas, correctly modeling the problem becomes harder. Current solutions to both issues are often focused on the binary case, as multi-class datasets require additional effort to address. In this research, we overcome these problems by combining feature and instance selection. Feature selection simplifies the overlapping areas, easing the generation of rules to distinguish between the classes. Selecting instances from all classes addresses the imbalance itself by finding the most appropriate class distribution for the learning task, while possibly removing noise and difficult borderline examples. To obtain an optimal joint set of features and instances, we embedded the search for both in a Multi-Objective Evolutionary Algorithm, using the C4.5 decision tree as the baseline classifier in this wrapper approach. The multi-objective scheme provides a double advantage: the search space becomes broader, and a set of different solutions can be provided in order to build an ensemble of classifiers. This proposal has been compared against several state-of-the-art solutions for imbalanced classification, showing excellent results in both binary and multi-class problems.
Collapse
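To illustrate the wrapper idea of jointly selecting features and instances under multiple objectives, the sketch below samples random masks, scores each with a decision tree (a CART proxy for the C4.5 baseline), and keeps the Pareto-optimal candidates, from which an ensemble could be built. The random sampling, the choice of objectives (accuracy versus number of kept features), and the subset rates are illustrative assumptions; the paper evolves the masks with a Multi-Objective Evolutionary Algorithm and its objectives may differ.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def random_pareto_search(X, y, n_candidates=60, seed=0):
    """Sample random joint feature/instance masks and keep the candidates that
    are Pareto-optimal under (accuracy on the full data, -number of features)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    candidates = []
    for _ in range(n_candidates):
        feat = rng.random(d) < 0.5          # random feature subset
        inst = rng.random(n) < 0.7          # random instance subset
        if feat.sum() == 0 or len(np.unique(y[inst])) < 2:
            continue
        clf = DecisionTreeClassifier(random_state=seed).fit(X[inst][:, feat], y[inst])
        acc = clf.score(X[:, feat], y)      # objective 1: maximise accuracy
        n_feat = int(feat.sum())            # objective 2: minimise feature count
        candidates.append((acc, -n_feat, feat, inst))
    pareto = [c for c in candidates
              if not any(o[0] >= c[0] and o[1] >= c[1] and o[:2] != c[:2]
                         for o in candidates)]
    return pareto   # each entry: (accuracy, -n_features, feature_mask, instance_mask)
```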
Affiliation(s)
- Alberto Fernández
- Department of Computer Science and Artificial Intelligence, University of Granada, Granada 18071, Spain
| | - Cristobal José Carmona
- Department of Civil Engineering, University of Burgos, Burgos 09006, Spain
- Leicester School of Pharmacy, De Montfort University, Leicester, LE1 9BH, UK
| | | | - Francisco Herrera
- Department of Computer Science and Artificial Intelligence, University of Granada, Granada 18071, Spain
- Faculty of Computing and Information Technology — North Jeddah, King Abdulaziz University (KAU), Jeddah 80200, Saudi Arabia
| |
Collapse
|
47
|
|
48
|
|
49
|
Zhang J, Sheng VS, Li Q, Wu J, Wu X. Consensus algorithms for biased labeling in crowdsourcing. Inf Sci (N Y) 2017. [DOI: 10.1016/j.ins.2016.12.026] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/20/2022]
|
50
|
|