1
Wang X, Zhang J, Xu Y, Huang Y, Ming W, Jiao Y, Liu B, Fan X, Xu J. Glo-net: A dual task branch based neural network for multi-class glomeruli segmentation. Comput Biol Med 2025; 186:109670. [PMID: 39799830 DOI: 10.1016/j.compbiomed.2025.109670] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/25/2024] [Revised: 11/25/2024] [Accepted: 01/08/2025] [Indexed: 01/15/2025]
Abstract
Accurate segmentation and classification of glomeruli are fundamental to histopathology slide analysis in renal pathology, which helps to characterize individual kidney disease. Accurate segmentation of glomeruli of different types faces two main challenges compared to traditional primitive segmentation in computational image analysis. Limited by small kernel size, traditional convolutional neural networks can hardly capture the complete context information of different glomeruli. Moreover, typical semantic segmentation networks lack adequate attention to difficult glomerular samples during the training process due to serious class imbalance between different glomeruli types. We propose a new deep learning approach, Glo-Net, which accurately segments and classifies glomeruli based on digitized pathology slides. Specifically, Glo-Net divides the traditional semantic segmentation network into two branches, i.e., segmentation and classification. While the segmentation branch specifically aims at localizing and delineating the boundary of individual glomeruli, the classification branch can focus on differentiating the glomerular types based on segmented pixels. In addition, an innovative loss function is added to the classification task to compensate for the class imbalance and minor types of glomeruli. The proposed network's average accuracy and F-score in classification tasks on the multi-institution datasets (including an external validation set) are 0.858 and 0.704, respectively. The average intersection over union (IoU) in segmentation tasks is 0.866. Glo-Net demonstrates a 5% improvement in classification accuracy, with up to 14% increases for minor classes and an average 6% IoU increase for segmentation tasks. Quantitative results show that our network achieves overall higher accuracy for segmentation and classification among nine subtypes of glomeruli compared to previous work, with improved robustness and generalizability.
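For readers comparing the segmentation figures, intersection over union for a single predicted mask can be computed as sketched below. This is a minimal illustration on toy binary masks: the function name `iou`, the nested-list mask format, and the choice to score two empty masks as 1.0 are assumptions made here, not details taken from the Glo-Net paper.

```python
def iou(pred, target):
    """Intersection over union of two binary masks given as nested lists of 0/1."""
    inter = 0
    union = 0
    for row_p, row_t in zip(pred, target):
        for p, t in zip(row_p, row_t):
            inter += 1 if (p and t) else 0   # pixel set in both masks
            union += 1 if (p or t) else 0    # pixel set in at least one mask
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


# Toy 3x3 masks (hypothetical data, not from the paper).
pred = [[0, 1, 1],
        [0, 1, 1],
        [0, 0, 0]]
target = [[0, 1, 1],
          [0, 1, 0],
          [0, 1, 0]]

print(iou(pred, target))  # 3 shared pixels out of 5 covered pixels
```

An average IoU such as the one reported in the abstract would typically be the mean of this quantity over images or classes; which averaging the authors used is not stated in the abstract.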
Affiliation(s)
- Xiangxue Wang
  - Jiangsu Key Laboratory of Intelligent Medical Image Computing, School of Future Technology, Nanjing University of Information Science and Technology, Nanjing, 210044, China
- Jingkai Zhang
  - Jiangsu Key Laboratory of Intelligent Medical Image Computing, School of Future Technology, Nanjing University of Information Science and Technology, Nanjing, 210044, China
- Yuemei Xu
  - Department of Pathology, Nanjing Drum Tower Hospital, The Affiliated Hospital of Nanjing University Medical School, Nanjing, 210008, China
- Yang Huang
  - Institute of Nephrology, Zhong Da Hospital, Southeast University School of Medicine, 210009, China
- Wenlong Ming
  - Jiangsu Key Laboratory of Intelligent Medical Image Computing, School of Future Technology, Nanjing University of Information Science and Technology, Nanjing, 210044, China
- Yiping Jiao
  - Jiangsu Key Laboratory of Intelligent Medical Image Computing, School of Future Technology, Nanjing University of Information Science and Technology, Nanjing, 210044, China
- Bicheng Liu
  - Institute of Nephrology, Zhong Da Hospital, Southeast University School of Medicine, 210009, China
- Xiangshan Fan
  - Department of Pathology, Nanjing Drum Tower Hospital, The Affiliated Hospital of Nanjing University Medical School, Nanjing, 210008, China
- Jun Xu
  - Jiangsu Key Laboratory of Intelligent Medical Image Computing, School of Future Technology, Nanjing University of Information Science and Technology, Nanjing, 210044, China
2
Adegbenjo AO, Ngadi MO. Handling the Imbalanced Problem in Agri-Food Data Analysis. Foods 2024; 13:3300. [PMID: 39456362 PMCID: PMC11507408 DOI: 10.3390/foods13203300] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/07/2024] [Revised: 09/07/2024] [Accepted: 10/15/2024] [Indexed: 10/28/2024]
Abstract
Imbalanced data situations exist in most fields of endeavor. The problem has been identified as a major bottleneck in machine learning/data mining and is becoming a serious issue of concern in food processing applications. Inappropriate analysis of agricultural and food processing data was identified as limiting the robustness of predictive models built from agri-food applications. As a result of rare cases occurring infrequently, classification rules that detect small groups are scarce, so samples belonging to small classes are largely misclassified. Most existing machine learning algorithms including the K-means, decision trees, and support vector machines (SVMs) are not optimal in handling imbalanced data. Consequently, models developed from the analysis of such data are very prone to rejection and non-adoptability in real industrial and commercial settings. This paper showcases the reality of the imbalanced data problem in agri-food applications and therefore proposes some state-of-the-art artificial intelligence algorithm approaches for handling the problem using methods including data resampling, one-class learning, ensemble methods, feature selection, and deep learning techniques. This paper further evaluates existing and newer metrics that are well suited for handling imbalanced data. Rightly analyzing imbalanced data from food processing application research works will improve the accuracy of results and model developments. This will consequently enhance the acceptability and adoptability of innovations/inventions.
Affiliation(s)
- Adeyemi O. Adegbenjo
  - Department of Bioresource Engineering, McGill University, 21111 Lakeshore Road, Ste-Anne-de-Bellevue, Montreal, QC H9X 3V9, Canada
  - Process Quality Engineering, School of Engineering and Technology, Conestoga College Institute of Technology and Advanced Learning, 299 Doon Valley Drive, Kitchener, ON N2G 4M4, Canada
- Michael O. Ngadi
  - Department of Bioresource Engineering, McGill University, 21111 Lakeshore Road, Ste-Anne-de-Bellevue, Montreal, QC H9X 3V9, Canada
3
Shyalika C, Wickramarachchi R, El Kalach F, Harik R, Sheth A. Evaluating the Role of Data Enrichment Approaches towards Rare Event Analysis in Manufacturing. Sensors (Basel, Switzerland) 2024; 24:5009. [PMID: 39124055 PMCID: PMC11315056 DOI: 10.3390/s24155009] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/26/2024] [Revised: 07/24/2024] [Accepted: 07/26/2024] [Indexed: 08/12/2024]
Abstract
Rare events are occurrences that take place with a significantly lower frequency than more common, regular events. These events can be categorized into distinct categories, from frequently rare to extremely rare, based on factors like the distribution of data and significant differences in rarity levels. In manufacturing domains, predicting such events is particularly important, as they lead to unplanned downtime, a shortening of equipment lifespans, and high energy consumption. Usually, the rarity of events is inversely correlated with the maturity of a manufacturing industry. Typically, the rarity of events affects the multivariate data generated within a manufacturing process to be highly imbalanced, which leads to bias in predictive models. This paper evaluates the role of data enrichment techniques combined with supervised machine learning techniques for rare event detection and prediction. We use time series data augmentation and sampling to address the data scarcity, maintaining its patterns, and imputation techniques to handle null values. Evaluating 15 learning models, we find that data enrichment improves the F1 measure by up to 48% in rare event detection and prediction. Our empirical and ablation experiments provide novel insights, and we also investigate model interpretability.
Affiliation(s)
- Chathurangi Shyalika
  - Artificial Intelligence Institute, College of Engineering and Computing, University of South Carolina, Columbia, SC 29208, USA
- Ruwan Wickramarachchi
  - Artificial Intelligence Institute, College of Engineering and Computing, University of South Carolina, Columbia, SC 29208, USA
- Fadi El Kalach
  - McNair Center for Aerospace Innovation and Research, Department of Mechanical Engineering, College of Engineering and Computing, University of South Carolina, Columbia, SC 29201, USA
- Ramy Harik
  - McNair Center for Aerospace Innovation and Research, Department of Mechanical Engineering, College of Engineering and Computing, University of South Carolina, Columbia, SC 29201, USA
- Amit Sheth
  - Artificial Intelligence Institute, College of Engineering and Computing, University of South Carolina, Columbia, SC 29208, USA
4
Nath A, Chaube R. Mining Chemogenomic Spaces for Prediction of Drug-Target Interactions. Methods Mol Biol 2024; 2714:155-169. [PMID: 37676598 DOI: 10.1007/978-1-0716-3441-7_9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/08/2023]
Abstract
The pipeline of drug discovery consists of a number of processes; drug-target interaction determination is one of the salient steps among them. Computational prediction of drug-target interactions can help reduce the search space of experimental wet lab-based verification steps, thus considerably reducing the time and other resources dedicated to the drug discovery pipeline. While machine learning-based methods are more widespread for drug-target interaction prediction, network-centric methods are also evolving. In this chapter, we focus on the process of drug-target interaction prediction from the perspective of using machine learning algorithms and the various stages involved in developing an accurate predictor.
Affiliation(s)
- Abhigyan Nath
  - Department of Biochemistry, Pt. Jawahar Lal Nehru Memorial Medical College, Raipur, India
- Radha Chaube
  - Department of Zoology, Institute of Science, Banaras Hindu University, Varanasi, India
5
Iskender S, Heydarov S, Yalcin M, Faydaci C, Kurt O, Surme S, Kucukbasmaci O. Rapid determination of colistin resistance in Klebsiella pneumoniae by MALDI-TOF peak based machine learning algorithm with MATLAB. Diagn Microbiol Infect Dis 2023; 107:116052. [PMID: 37769565 DOI: 10.1016/j.diagmicrobio.2023.116052] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 03/13/2023] [Revised: 07/05/2023] [Accepted: 08/05/2023] [Indexed: 10/03/2023]
Abstract
INTRODUCTION: To date, limited data exist demonstrating the usefulness of machine learning (ML) algorithms applied to MALDI-TOF in determining colistin resistance among Klebsiella pneumoniae. We aimed to detect colistin resistance in K. pneumoniae using MATLAB on a MALDI-TOF database. MATERIALS AND METHODS: A total of 260 K. pneumoniae isolates were collected. Three ML models, namely linear discriminant analysis (LDA), support vector machine, and Ensemble, were applied to the training data set. RESULTS: The accuracies for the training phase with 200 isolates were found to be 99.3%, 93.1%, and 88.3% for the LDA, support vector machine, and Ensemble models, respectively. Accuracy, sensitivity, specificity, and precision values for LDA in the application test set with 60 K. pneumoniae isolates were 81.6%, 66.7%, 91.7%, and 84.2%, respectively. CONCLUSION: This study provides a rapid and accurate MALDI-TOF MS screening assay for clinical practice in identifying colistin resistance in K. pneumoniae.
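The four test-set metrics quoted above all derive from one binary confusion matrix. The sketch below shows the arithmetic. The counts passed in (16 true positives, 8 false negatives, 33 true negatives, 3 false positives) are hypothetical values chosen here only because they sum to 60 isolates and are consistent with the reported percentages; the abstract does not give the actual counts, and the function name `binary_metrics` is likewise an assumption.

```python
def binary_metrics(tp, fn, tn, fp):
    """Standard metrics derived from a binary confusion matrix."""
    total = tp + fn + tn + fp
    return {
        "accuracy": (tp + tn) / total,      # all correct calls / all isolates
        "sensitivity": tp / (tp + fn),      # resistant isolates correctly flagged
        "specificity": tn / (tn + fp),      # susceptible isolates correctly cleared
        "precision": tp / (tp + fp),        # flagged isolates that are truly resistant
    }


# Hypothetical counts (assumption, not reported in the abstract).
metrics = binary_metrics(tp=16, fn=8, tn=33, fp=3)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

With these assumed counts, sensitivity, specificity, and precision round to the reported 66.7%, 91.7%, and 84.2%. Accuracy evaluates to 49/60, which matches the reported 81.6% only if that figure was truncated rather than rounded.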
Affiliation(s)
- Secil Iskender
  - Department of Medical Microbiology, Istanbul University-Cerrahpasa, Cerrahpasa Medical Faculty, Istanbul, Turkey
- Saddam Heydarov
  - Electronics Technologies, Istanbul Gelisim University, Istanbul, Turkey
- Metin Yalcin
  - Department of Medical Microbiology, Istanbul University-Cerrahpasa, Cerrahpasa Medical Faculty, Istanbul, Turkey
- Cagri Faydaci
  - Electronics Technologies, Istanbul Gelisim University, Istanbul, Turkey
- Ozge Kurt
  - Department of Medical Microbiology, Istanbul University-Cerrahpasa, Cerrahpasa Medical Faculty, Istanbul, Turkey
- Serkan Surme
  - Department of Medical Microbiology, Istanbul University-Cerrahpasa, Cerrahpasa Medical Faculty, Istanbul, Turkey
- Omer Kucukbasmaci
  - Department of Medical Microbiology, Istanbul University-Cerrahpasa, Cerrahpasa Medical Faculty, Istanbul, Turkey
6
Jaiswal A, Chen T, Rousseau JF, Peng Y, Ding Y, Wang Z. Attend Who is Weak: Pruning-assisted Medical Image Localization under Sophisticated and Implicit Imbalances. 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2023; 2023:4976-4985. [PMID: 37051561 PMCID: PMC10089697 DOI: 10.1109/wacv56688.2023.00496] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/08/2023]
Abstract
Deep neural networks (DNNs) have rapidly become a de facto choice for medical image understanding tasks. However, DNNs are notoriously fragile to class imbalance in image classification. We further point out that such imbalance fragility can be amplified when it comes to more sophisticated tasks such as pathology localization, as imbalances in such problems can have highly complex and often implicit forms of presence. For example, different pathologies can have different sizes or colors (w.r.t. the background), different underlying demographic distributions, and in general different difficulty levels to recognize, even in a meticulously curated balanced distribution of training data. In this paper, we propose to use pruning to automatically and adaptively identify hard-to-learn (HTL) training samples, and improve pathology localization by attending to them explicitly during training in supervised, semi-supervised, and weakly-supervised settings. Our main inspiration is drawn from the recent finding that deep classification models have difficult-to-memorize samples and that those may be effectively exposed through network pruning [15], and we extend this observation beyond classification for the first time. We also present an interesting demographic analysis which illustrates the ability of HTLs to capture complex demographic imbalances. Our extensive experiments on the Skin Lesion Localization task in multiple training settings show that paying additional attention to HTLs significantly improves localization performance, by ~2-3%.
7
Ghaderi Zefrehi H, Altınçay H. MaMiPot: a paradigm shift for the classification of imbalanced data. J Intell Inf Syst 2022. [DOI: 10.1007/s10844-022-00763-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/12/2022]
8
Fair evaluation of classifier predictive performance based on binary confusion matrix. Comput Stat 2022. [DOI: 10.1007/s00180-022-01301-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/30/2022]
Abstract
Evaluating the ability of a classifier to make predictions on unseen data and increasing it by tweaking the learning algorithm are two of the main reasons motivating the evaluation of classifier predictive performance. In this study the behavior of Balanced $AC_1$, a novel classifier accuracy measure, is investigated under different class imbalance conditions via a Monte Carlo simulation. The behavior of Balanced $AC_1$ is compared against that of several well-known performance measures based on binary confusion matrix. Study results reveal the suitability of Balanced $AC_1$ with both balanced and imbalanced data sets. A real example of the effects of class imbalance on the behavior of the investigated classifier performance measures is provided by comparing the performance of several machine learning algorithms in a churn prediction problem.
9
Xing M, Zhang Y, Yu H, Yang Z, Li X, Li Q, Zhao Y, Zhao Z, Luo Y. Predict DLBCL patients' recurrence within two years with Gaussian mixture model cluster oversampling and multi-kernel learning. Computer Methods and Programs in Biomedicine 2022; 226:107103. [PMID: 36088813 DOI: 10.1016/j.cmpb.2022.107103] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 10/14/2021] [Revised: 08/05/2022] [Accepted: 08/30/2022] [Indexed: 06/15/2023]
Abstract
BACKGROUND AND OBJECTIVE: Diffuse large B-cell lymphoma (DLBCL) is a common form of non-Hodgkin's lymphoma in adults. Relapse mainly occurs within two years after diagnosis and has a poor prognosis. Relapse after two years is less frequent and has a better prognosis. In this work, we constructed a model to predict relapse within two years in diffuse large B-cell lymphoma patients, expecting to provide a reference for clinicians to implement individualized treatment. METHOD: We propose a secondary-level class imbalance method based on Gaussian mixture model (GMM) clustering resampling to balance the data. We then use a multi-kernel support vector machine (SVM) to characterize heterogeneous clinical data. Finally, we merge the two to identify patients who relapse within two years. RESULTS: Among all the class imbalance methods in this work, Inverse Weighted-GMM+SMOTEENN has the best performance. Compared with NO-GMM (directly using SMOTEENN without the GMM clustering process), its area under the ROC curve (AUC) increases by 8.75%, and its ECE and Brier scores decrease by 2.07% and 3.09%, respectively. Among the four classification algorithms in this work, multiple kernel learning (MKL) has the lowest Brier score and expected calibration error (ECE) and the largest AUC, accuracy, recall, precision and F1, giving the best discrimination and calibration. CONCLUSION: Our inverse weighted-GMM+SMOTEENN+MKL (GMM-SENN-MKL) method can handle class imbalance and heterogeneous clinical data well and can be used to predict recurrence in DLBCL patients.
Affiliation(s)
- Meng Xing
  - Department of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, China; Shanxi Provincial Key Laboratory of Major Diseases Risk Assessment, Taiyuan, China
- Yanbo Zhang
  - Department of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, China; Shanxi Provincial Key Laboratory of Major Diseases Risk Assessment, Taiyuan, China
- Hongmei Yu
  - Department of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, China; Shanxi Provincial Key Laboratory of Major Diseases Risk Assessment, Taiyuan, China
- Zhenhuan Yang
  - Department of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, China; Shanxi Provincial Key Laboratory of Major Diseases Risk Assessment, Taiyuan, China
- Xueling Li
  - Department of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, China; Shanxi Provincial Key Laboratory of Major Diseases Risk Assessment, Taiyuan, China
- Qiong Li
  - Department of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, China; Shanxi Provincial Key Laboratory of Major Diseases Risk Assessment, Taiyuan, China
- Yanlin Zhao
  - Department of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, China; Shanxi Provincial Key Laboratory of Major Diseases Risk Assessment, Taiyuan, China
- Zhiqiang Zhao
  - Department of Hematology, Shanxi Cancer Hospital, Taiyuan, China
- Yanhong Luo
  - Department of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, China; Shanxi Provincial Key Laboratory of Major Diseases Risk Assessment, Taiyuan, China
10
Nasir M, Dag A, Simsek S, Ivanov A, Oztekin A. Improving Imbalanced Machine Learning with Neighborhood-Informed Synthetic Sample Placement. J Manage Inform Syst 2022. [DOI: 10.1080/07421222.2022.2127453] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/14/2022]
Affiliation(s)
- Murtaza Nasir
  - Department of Finance, Real Estate, and Decision Sciences, W. Frank Barton School of Business, Wichita State University, Wichita, KS, USA
- Ali Dag
  - Department of Business Intelligence & Analytics, Heider College of Business, Creighton University, Omaha, NE, USA
- Serhat Simsek
  - Department of Information Management & Business Analytics, Feliciano School of Business, Montclair State University, Montclair, NJ, USA
- Anton Ivanov
  - Department of Business Administration, Gies College of Business, University of Illinois Urbana-Champaign, Champaign, IL, USA
- Asil Oztekin
  - Department of Operations & Information Systems, Manning School of Business, University of Massachusetts Lowell, Lowell, MA, USA
11
Dutta A, Hasan MK, Ahmad M, Awal MA, Islam MA, Masud M, Meshref H. Early Prediction of Diabetes Using an Ensemble of Machine Learning Models. International Journal of Environmental Research and Public Health 2022; 19:12378. [PMID: 36231678 PMCID: PMC9566114 DOI: 10.3390/ijerph191912378] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 07/28/2022] [Revised: 09/20/2022] [Accepted: 09/24/2022] [Indexed: 05/15/2023]
Abstract
Diabetes is one of the most rapidly spreading diseases in the world, resulting in an array of significant complications, including cardiovascular disease, kidney failure, diabetic retinopathy, and neuropathy, among others, which contribute to an increase in morbidity and mortality rate. If diabetes is diagnosed at an early stage, its severity and underlying risk factors can be significantly reduced. However, there is a shortage of labeled data and the occurrence of outliers or data missingness in clinical datasets that are reliable and effective for diabetes prediction, making it a challenging endeavor. Therefore, we introduce a newly labeled diabetes dataset from a South Asian nation (Bangladesh). In addition, we suggest an automated classification pipeline that includes a weighted ensemble of machine learning (ML) classifiers: Naive Bayes (NB), Random Forest (RF), Decision Tree (DT), XGBoost (XGB), and LightGBM (LGB). Grid search hyperparameter optimization is employed to tune the critical hyperparameters of these ML models. Furthermore, missing value imputation, feature selection, and K-fold cross-validation are included in the framework design. A statistical analysis of variance (ANOVA) test reveals that the performance of diabetes prediction significantly improves when the proposed weighted ensemble (DT + RF + XGB + LGB) is executed with the introduced preprocessing, with the highest accuracy of 0.735 and an area under the ROC curve (AUC) of 0.832. In conjunction with the suggested ensemble model, our statistical imputation and RF-based feature selection techniques produced the best results for early diabetes prediction. Moreover, the presented new dataset will contribute to developing and implementing robust ML models for diabetes prediction utilizing population-level data.
Affiliation(s)
- Aishwariya Dutta
  - Department of Biomedical Engineering (BME), Khulna University of Engineering & Technology (KUET), Khulna 9203, Bangladesh
  - Department of Biomedical Engineering (BME), Military Institute of Science and Technology (MIST), Mirpur Cantonment, Dhaka 1216, Bangladesh
- Md. Kamrul Hasan
  - Department of Electrical and Electronic Engineering (EEE), Khulna University of Engineering & Technology (KUET), Khulna 9203, Bangladesh
- Mohiuddin Ahmad
  - Department of Electrical and Electronic Engineering (EEE), Khulna University of Engineering & Technology (KUET), Khulna 9203, Bangladesh
- Md. Abdul Awal
  - School of Information Technology and Electrical Engineering, The University of Queensland, Brisbane, QLD 4072, Australia
  - Electronics and Communication Engineering (ECE) Discipline, Khulna University (KU), Khulna 9208, Bangladesh
  - Correspondence:
- Mehedi Masud
  - Department of Computer Science, College of Computers and Information Technology, Taif University, P.O. Box 11099, Taif 21944, Saudi Arabia
- Hossam Meshref
  - Department of Computer Science, College of Computers and Information Technology, Taif University, P.O. Box 11099, Taif 21944, Saudi Arabia
12
Zhao T, Chen H, Bai Y, Zhao Y, Zhao S. A Hierarchical Ensemble Deep Learning Activity Recognition Approach with Wearable Sensors Based on Focal Loss. International Journal of Environmental Research and Public Health 2022; 19:11706. [PMID: 36141976 PMCID: PMC9517260 DOI: 10.3390/ijerph191811706] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/31/2022] [Revised: 09/05/2022] [Accepted: 09/13/2022] [Indexed: 06/16/2023]
Abstract
Abnormal activity in daily life is a relatively common symptom of chronic diseases, such as dementia. There will probably be a variety of repetitive activities in dementia patients' daily life, such as repeated handling of objects and repeated packing of clothes. It is particularly important to recognize the daily activities of the elderly, which can be further used to predict and monitor chronic diseases. In this paper, we propose a hierarchical ensemble deep learning activity recognition approach with wearable sensors based on focal loss. Seven basic everyday life activities including cooking, keyboarding, reading, brushing teeth, washing one's face, washing dishes and writing are considered in order to show its performance. Based on hold-out cross-validation results on a dataset collected from elderly volunteers, the average accuracy, precision, recall and F1-score of our approach are 98.69%, 98.05%, 98.01% and 97.99%, respectively, in identifying the activities of daily life for the elderly.
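The focal loss named in this entry down-weights well-classified samples so that training concentrates on hard ones. The sketch below implements the standard per-sample formulation, FL(p_t) = -alpha * (1 - p_t)^gamma * log(p_t). It is not the authors' hierarchical ensemble, and the function names, the default `gamma=2.0` and `alpha=1.0`, and the toy probabilities are assumptions chosen for illustration.

```python
import math


def focal_loss(probs, target, gamma=2.0, alpha=1.0):
    """Focal loss for one sample.

    probs  -- predicted class probabilities (should sum to 1)
    target -- index of the true class
    """
    # Clamp to avoid log(0) when the true class gets zero probability.
    p_t = min(max(probs[target], 1e-12), 1.0)
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)


def cross_entropy(probs, target):
    """Plain cross-entropy, which is focal loss with gamma = 0 and alpha = 1."""
    return focal_loss(probs, target, gamma=0.0, alpha=1.0)


# Toy predictions for true class 0 (hypothetical values).
easy = [0.90, 0.05, 0.05]   # confidently correct
hard = [0.20, 0.50, 0.30]   # true class is under-predicted

for name, probs in (("easy", easy), ("hard", hard)):
    print(name, cross_entropy(probs, 0), focal_loss(probs, 0))
```

With gamma = 2, the easy sample's loss is scaled by (1 - 0.9)^2 = 0.01 relative to cross-entropy, while the hard sample's loss is scaled by (1 - 0.2)^2 = 0.64, which is the re-weighting effect the loss is designed to produce.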
13
Abstract
Synthetic data consists of artificially generated data. When data are scarce, or of poor quality, synthetic data can be used, for example, to improve the performance of machine learning models. Generative adversarial networks (GANs) are state-of-the-art deep generative models that can generate novel synthetic samples that follow the underlying data distribution of the original dataset. Reviews on synthetic data generation and on GANs have already been written. However, none in the relevant literature, to the best of our knowledge, has explicitly combined these two topics. This survey aims to fill this gap and provide useful material to new researchers in this field. That is, we aim to provide a survey that combines synthetic data generation and GANs, and that can act as a good and strong starting point for new researchers in the field, so that they have a general overview of the key contributions and useful references. We have conducted a review of the state-of-the-art by querying four major databases: Web of Science (WoS), Scopus, IEEE Xplore, and ACM Digital Library. This allowed us to gain insights into the most relevant authors, the most relevant scientific journals in the area, the most cited papers, the most significant research areas, the most important institutions, and the most relevant GAN architectures. GANs were thoroughly reviewed, as well as their most common training problems and their most important breakthroughs, with a focus on GAN architectures for tabular data. Further, the main algorithms for generating synthetic data, their applications and our thoughts on these methods are also expressed. Finally, we reviewed the main techniques for evaluating the quality of synthetic data (especially tabular data) and provided a schematic overview of the information presented in this paper.
14
Li DC, Wang SY, Huang KC, Tsai TI. Learning class-imbalanced data with region-impurity synthetic minority oversampling technique. Inf Sci (N Y) 2022. [DOI: 10.1016/j.ins.2022.06.067] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/05/2022]
15
Accurate Evaluation of Feature Contributions for Sentinel Lymph Node Status Classification in Breast Cancer. Applied Sciences-Basel 2022. [DOI: 10.3390/app12147227] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 12/15/2022]
Abstract
The current guidelines recommend the sentinel lymph node biopsy to evaluate the lymph node involvement for breast cancer patients with clinically negative lymph nodes on clinical or radiological examination. Machine learning (ML) models have significantly improved the prediction of lymph nodes status based on clinical features, thus avoiding expensive, time-consuming and invasive procedures. However, the classification of sentinel lymph node status represents a typical example of an unbalanced classification problem. In this work, we developed a ML framework to explore the effects of unbalanced populations on the performance and stability of feature ranking for sentinel lymph node status classification in breast cancer. Our results indicate state-of-the-art AUC (Area under the Receiver Operating Characteristic curve) values on a hold-out set (67%) while providing particularly stable features related to tumor size, histological subtype and estrogen receptor expression, which should therefore be considered as potential biomarkers.
16
Nieto-del-Amor F, Prats-Boluda G, Garcia-Casado J, Diaz-Martinez A, Diago-Almela VJ, Monfort-Ortiz R, Hao D, Ye-Lin Y. Combination of Feature Selection and Resampling Methods to Predict Preterm Birth Based on Electrohysterographic Signals from Imbalance Data. Sensors 2022; 22:5098. [PMID: 35890778 PMCID: PMC9319575 DOI: 10.3390/s22145098] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 06/13/2022] [Revised: 07/01/2022] [Accepted: 07/05/2022] [Indexed: 02/01/2023]
Abstract
Due to its high sensitivity, electrohysterography (EHG) has emerged as an alternative technique for predicting preterm labor. The main obstacle in designing preterm labor prediction models is the inherent preterm/term imbalance ratio, which can give rise to relatively low performance. Numerous studies obtained promising preterm labor prediction results using the synthetic minority oversampling technique. However, these studies generally overestimate mathematical models’ real generalization capacity by generating synthetic data before splitting the dataset, leaking information between the training and testing partitions and thus reducing the complexity of the classification task. In this work, we analyzed the effect of combining feature selection and resampling methods to overcome the class imbalance problem for predicting preterm labor by EHG. We assessed undersampling, oversampling, and hybrid methods applied to the training and validation dataset during feature selection by genetic algorithm, and analyzed the resampling effect on training data after obtaining the optimized feature subset. The best strategy consisted of undersampling the majority class of the validation dataset to 1:1 during feature selection, without subsequent resampling of the training data, achieving an AUC of 94.5 ± 4.6%, average precision of 84.5 ± 11.7%, maximum F1-score of 79.6 ± 13.8%, and recall of 89.8 ± 12.1%. Our results outperformed the techniques currently used in clinical practice, suggesting the EHG could be used to predict preterm labor in clinics.
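The leakage problem described in this abstract comes from resampling before the train/test split. A minimal sketch of the safer ordering follows: split first, then resample only the training partition. It uses plain random oversampling as a stand-in for the synthetic minority oversampling technique and does not reproduce the authors' genetic-algorithm feature selection or their undersampling strategy. The function names, the 10:90 toy class ratio, and the 30% test fraction are assumptions for illustration.

```python
import random
from collections import Counter


def train_test_split(samples, test_fraction, rng):
    """Shuffle a copy of the samples and split off a held-out test partition."""
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def oversample_minority(train, rng):
    """Random oversampling with replacement until every class present
    in `train` matches the size of the largest class."""
    by_label = {}
    for features, label in train:
        by_label.setdefault(label, []).append((features, label))
    largest = max(len(items) for items in by_label.values())
    balanced = []
    for items in by_label.values():
        balanced.extend(items)
        balanced.extend(rng.choice(items) for _ in range(largest - len(items)))
    return balanced


rng = random.Random(0)
# Toy data: 10 minority (label 1) and 90 majority (label 0) samples.
data = [((i,), 1 if i < 10 else 0) for i in range(100)]

# Split FIRST, so the test partition never influences resampling.
train, test = train_test_split(data, 0.3, rng)
train_balanced = oversample_minority(train, rng)

class_counts = Counter(label for _, label in train_balanced)
print(class_counts, len(test))
```

Because resampled records are drawn only from `train`, no test record, and no synthetic neighbor of one, can appear in the data the model is fitted on. The test partition also keeps its original class ratio, so the reported metrics reflect the real imbalance.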
Affiliation(s)
- Félix Nieto-del-Amor
- Centro de Investigación e Innovación en Bioingeniería, Universitat Politècnica de València, 46022 Valencia, Spain; (F.N.-d.-A.); (J.G.-C.); (A.D.-M.); (Y.Y.-L.)
- Gema Prats-Boluda
- Centro de Investigación e Innovación en Bioingeniería, Universitat Politècnica de València, 46022 Valencia, Spain; (F.N.-d.-A.); (J.G.-C.); (A.D.-M.); (Y.Y.-L.)
- Correspondence:
- Javier Garcia-Casado
- Centro de Investigación e Innovación en Bioingeniería, Universitat Politècnica de València, 46022 Valencia, Spain; (F.N.-d.-A.); (J.G.-C.); (A.D.-M.); (Y.Y.-L.)
- Alba Diaz-Martinez
- Centro de Investigación e Innovación en Bioingeniería, Universitat Politècnica de València, 46022 Valencia, Spain; (F.N.-d.-A.); (J.G.-C.); (A.D.-M.); (Y.Y.-L.)
- Rogelio Monfort-Ortiz
- Servicio de Obstetricia, H.U.P. La Fe, 46026 Valencia, Spain; (V.J.D.-A.); (R.M.-O.)
- Dongmei Hao
- Faculty of Environment and Life, Beijing University of Technology, Beijing International Science and Technology Cooperation Base for Intelligent Physiological Measurement and Clinical Transformation, Beijing 100124, China
- Yiyao Ye-Lin
- Centro de Investigación e Innovación en Bioingeniería, Universitat Politècnica de València, 46022 Valencia, Spain; (F.N.-d.-A.); (J.G.-C.); (A.D.-M.); (Y.Y.-L.)
|
17
|
Wei G, Mu W, Song Y, Dou J. An improved and random synthetic minority oversampling technique for imbalanced data. Knowl Based Syst 2022. [DOI: 10.1016/j.knosys.2022.108839] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/30/2022]
|
18
|
Fernando KRM, Tsokos CP. Dynamically Weighted Balanced Loss: Class Imbalanced Learning and Confidence Calibration of Deep Neural Networks. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2022; 33:2940-2951. [PMID: 33444149 DOI: 10.1109/tnnls.2020.3047335] [Citation(s) in RCA: 33] [Impact Index Per Article: 11.0] [Indexed: 06/12/2023]
Abstract
Imbalanced class distribution is an inherent problem in many real-world classification tasks where the minority class is the class of interest. Many conventional statistical and machine learning classification algorithms are subject to frequency bias, and learning discriminating boundaries between the minority and majority classes could be challenging. To address the class distribution imbalance in deep learning, we propose a class rebalancing strategy based on a class-balanced dynamically weighted loss function where weights are assigned based on the class frequency and predicted probability of ground-truth class. The ability of dynamic weighting scheme to self-adapt its weights depending on the prediction scores allows the model to adjust for instances with varying levels of difficulty resulting in gradient updates driven by hard minority class samples. We further show that the proposed loss function is classification calibrated. Experiments conducted on highly imbalanced data across different applications of cyber intrusion detection (CICIDS2017 data set) and medical imaging (ISIC2019 data set) show robust generalization. Theoretical results supported by superior empirical performance provide justification for the validity of the proposed dynamically weighted balanced (DWB) loss function.
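The idea of a weight that depends on both class frequency and the predicted probability of the ground-truth class can be sketched as follows. This is an illustrative form only, not necessarily the formula used in the paper: the class weight `log(n_max / n_c) + 1`, the exponent `1 - p`, and the function name `dynamically_weighted_loss` are all assumptions chosen to show the behaviour described in the abstract.

```python
import numpy as np

def dynamically_weighted_loss(probs, labels, class_counts):
    """Cross-entropy with a per-sample weight (illustrative, assumed form).

    probs: (n_samples, n_classes) predicted probabilities.
    labels: (n_samples,) integer ground-truth classes.
    class_counts: (n_classes,) training-set frequency of each class.
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    counts = np.asarray(class_counts, dtype=float)

    # Assumed class weight: 1 for the most frequent class, larger for rarer ones.
    class_weight = np.log(counts.max() / counts) + 1.0

    # Predicted probability of the ground-truth class, clipped for log stability.
    p_true = np.clip(probs[np.arange(len(labels)), labels], 1e-12, 1.0)

    # The weight shrinks towards 1 as the prediction becomes confident and correct,
    # so hard minority-class samples contribute the largest terms.
    weight = class_weight[labels] ** (1.0 - p_true)
    return float(np.mean(-weight * np.log(p_true)))
```

With this form, a sample from the most frequent class reduces to ordinary cross-entropy, while a poorly predicted minority sample is up-weighted.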
|
19
|
Deep instance envelope network-based imbalance learning algorithm with multilayer fuzzy C-means clustering and minimum interlayer discrepancy. Appl Soft Comput 2022. [DOI: 10.1016/j.asoc.2022.108846] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/21/2022]
|
20
|
Ji Z, Yu X, Yu Y, Pang Y, Zhang Z. Semantic-Guided Class-Imbalance Learning Model for Zero-Shot Image Classification. IEEE TRANSACTIONS ON CYBERNETICS 2022; 52:6543-6554. [PMID: 34043516 DOI: 10.1109/tcyb.2020.3004641] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/12/2023]
Abstract
In this article, we focus on the task of zero-shot image classification (ZSIC) that equips a learning system with the ability to recognize visual images from unseen classes. In contrast to the traditional image classification, ZSIC more easily suffers from the class-imbalance issue since it is more concerned with the class-level knowledge transferring capability. In the real world, the sample numbers of different categories generally follow a long-tailed distribution, and the discriminative information in the sample-scarce seen classes is hard to transfer to the related unseen classes in the traditional batch-based training manner, which degrades the overall generalization ability a lot. To alleviate the class-imbalance issue in ZSIC, we propose a sample-balanced training process to encourage all training classes to contribute equally to the learned model. Specifically, we randomly select the same number of images from each class across all training classes to form a training batch to ensure that the sample-scarce classes contribute equally as those classes with sufficient samples during each iteration. Considering that the instances from the same class differ in class representativeness, we further develop an efficient semantic-guided feature fusion model to obtain the discriminative class visual prototype for the following visual-semantic interaction process via distributing different weights to the selected samples based on their class representativeness. Extensive experiments on three imbalanced ZSIC benchmark datasets for both traditional ZSIC and generalized ZSIC tasks demonstrate that our approach achieves promising results, especially for the unseen categories that are closely related to the sample-scarce seen categories. Besides, the experimental results on two class-balanced datasets show that the proposed approach also improves the classification performance against the baseline model.
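The sample-balanced batch construction described in this abstract (the same number of images drawn from every training class) can be sketched in a few lines. The function name `balanced_batch`, the `per_class` parameter, and the fallback to sampling with replacement for very small classes are assumptions for illustration; the abstract does not say how classes with fewer samples than the quota are handled.

```python
import random
from collections import defaultdict

def balanced_batch(labels, per_class, seed=None):
    """Return sample indices forming a batch with `per_class` items from each class."""
    rng = random.Random(seed)

    # Group sample indices by their class label.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    batch = []
    for label in sorted(by_class):
        indices = by_class[label]
        if len(indices) >= per_class:
            # Enough samples: draw without replacement.
            batch.extend(rng.sample(indices, per_class))
        else:
            # Assumed fallback for sample-scarce classes: draw with replacement.
            batch.extend(rng.choices(indices, k=per_class))
    return batch
```

Every class then contributes equally to each iteration regardless of how many samples it has in the training set.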
|
21
|
Majority-to-minority resampling for boosting-based classification under imbalanced data. APPL INTELL 2022. [DOI: 10.1007/s10489-022-03585-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/02/2022]
|
22
|
A Highly Adaptive Oversampling Approach to Address the Issue of Data Imbalance. COMPUTERS 2022. [DOI: 10.3390/computers11050073] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/04/2022]
Abstract
Data imbalance is a serious problem in machine learning that can be alleviated at the data level by balancing the class distribution with sampling. In the last decade, several sampling methods have been published to address the shortcomings of the initial ones, such as noise sensitivity and incorrect neighbor selection. Based on the review of the literature, it has become clear to us that the algorithms achieve varying performance on different data sets. In this paper, we present a new oversampler that has been developed based on the key steps and sampling strategies identified by analyzing dozens of existing methods and that can be fitted to various data sets through an optimization process. Experiments were performed on a number of data sets, which show that the proposed method had a similar or better effect on the performance of SVM, DTree, kNN and MLP classifiers compared with other well-known samplers found in the literature. The results were also confirmed by statistical tests.
|
23
|
Mostafaei S, Ahmadi A, Shahrabi J. Dealing with data intrinsic difficulties by learning an interPretable Ensemble Rule Learning (PERL) model. Inf Sci (N Y) 2022. [DOI: 10.1016/j.ins.2022.02.048] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/25/2022]
|
24
|
Borderline-margin loss based deep metric learning framework for imbalanced data. APPL INTELL 2022. [DOI: 10.1007/s10489-022-03494-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/02/2022]
|
25
|
Tao X, Zheng Y, Chen W, Zhang X, Qi L, Fan Z, Huang S. SVDD-based weighted oversampling technique for imbalanced and overlapped dataset learning. Inf Sci (N Y) 2022. [DOI: 10.1016/j.ins.2021.12.066] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Indexed: 11/28/2022]
|
26
|
Santos MS, Abreu PH, Japkowicz N, Fernández A, Soares C, Wilk S, Santos J. On the joint-effect of class imbalance and overlap: a critical review. Artif Intell Rev 2022. [DOI: 10.1007/s10462-022-10150-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/30/2022]
|
27
|
Li DC, Shi QS, Lin YS, Lin LS. A Boundary-Information-Based Oversampling Approach to Improve Learning Performance for Imbalanced Datasets. ENTROPY 2022; 24:e24030322. [PMID: 35327833 PMCID: PMC8947752 DOI: 10.3390/e24030322] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/04/2022] [Revised: 02/19/2022] [Accepted: 02/21/2022] [Indexed: 11/16/2022]
Abstract
Oversampling is the most popular data preprocessing technique. It makes traditional classifiers available for learning from imbalanced data. Through an overall review of oversampling techniques (oversamplers), we find that some of them can be regarded as danger-information-based oversamplers (DIBOs) that create samples near danger areas to make it possible for these positive examples to be correctly classified, and others are safe-information-based oversamplers (SIBOs) that create samples near safe areas to increase the correct rate of predicted positive values. However, DIBOs cause misclassification of too many negative examples in the overlapped areas, and SIBOs cause incorrect classification of too many borderline positive examples. Based on their advantages and disadvantages, a boundary-information-based oversampler (BIBO) is proposed. First, a concept of boundary information that considers safe information and dangerous information at the same time is proposed that makes created samples near decision boundaries. The experimental results show that DIBOs and BIBO perform better than SIBOs on the basic metrics of recall and negative class precision; SIBOs and BIBO perform better than DIBOs on the basic metrics for specificity and positive class precision, and BIBO is better than both of DIBOs and SIBOs in terms of integrated metrics.
Affiliation(s)
- Der-Chiang Li
- Department of Industrial and Information Management, National Cheng Kung University, University Road, Tainan 70101, Taiwan; (D.-C.L.); (Q.-S.S.)
- Qi-Shi Shi
- Department of Industrial and Information Management, National Cheng Kung University, University Road, Tainan 70101, Taiwan; (D.-C.L.); (Q.-S.S.)
- Yao-San Lin
- Singapore Centre for Chinese Language, Nanyang Technological University, Ghim Moh Road, Singapore 279623, Singapore
- Liang-Sian Lin
- Department of Information Management, National Taipei University of Nursing and Health Sciences, Ming-te Road, Taipei 112303, Taiwan
- Correspondence: Tel.: +886-2822-7101 (ext. 1234)
|
28
|
DEIDS: a novel intrusion detection system for industrial control systems. Neural Comput Appl 2022. [DOI: 10.1007/s00521-022-06965-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/19/2022]
|
29
|
Guido R, Groccia MC, Conforti D. A hyper-parameter tuning approach for cost-sensitive support vector machine classifiers. Soft comput 2022. [DOI: 10.1007/s00500-022-06768-8] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 10/19/2022]
Abstract
In machine learning, hyperparameter tuning is strongly useful to improve model performance. In our research, we concentrate our attention on classifying imbalanced data by cost-sensitive support vector machines. We propose a multi-objective approach that optimizes the model's hyper-parameters. The approach is devised for imbalanced data. Three performance measures of the SVM model are optimized. We present the algorithm in a basic version based on genetic algorithms, and as an improved version based on genetic algorithms combined with decision trees. We tested the basic and the improved approach on benchmark datasets in both serial and parallel versions. The improved version strongly reduces the computational time needed for finding optimized hyper-parameters. The results empirically show that suitable evaluation measures should be used in assessing the classification performance of classification models with imbalanced data.
|
30
|
Saad M, He S, Thorstad W, Gay H, Barnett D, Zhao Y, Ruan S, Wang X, Li H. Learning-based Cancer Treatment Outcome Prognosis using Multimodal Biomarkers. IEEE TRANSACTIONS ON RADIATION AND PLASMA MEDICAL SCIENCES 2022; 6:231-244. [PMID: 35520102 PMCID: PMC9066560 DOI: 10.1109/trpms.2021.3104297] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 02/03/2023]
Abstract
Predicting early in treatment whether a tumor is likely to be responsive is a difficult yet important task to support clinical decision-making. Studies have shown that multimodal biomarkers could provide complementary information and lead to more accurate treatment outcome prognosis than unimodal biomarkers. However, the prognosis accuracy could be affected by multimodal data heterogeneity and incompleteness. The small-sized and imbalance datasets also bring additional challenges for training a designed prognosis model. In this study, a modular framework employing multimodal biomarkers for cancer treatment outcome prediction was proposed. It includes four modules of synthetic data generation, deep feature extraction, multimodal feature fusion, and classification to address the challenges described above. The feasibility and advantages of the designed framework were demonstrated through an example study, in which the goal was to stratify oropharyngeal squamous cell carcinoma (OPSCC) patients with low- and high-risks of treatment failures by use of positron emission tomography (PET) image data and microRNA (miRNA) biomarkers. The superior prognosis performance and the comparison with other methods demonstrated the efficiency of the proposed framework and its ability of enabling seamless integration, validation and comparison of various algorithms in each module of the framework. The limitation and future work was discussed as well.
Affiliation(s)
- Maliazurina Saad
- Department of Bioengineering, University of Illinois at Urbana-Champaign, Urbana, IL, USA. She is now with the MD Anderson Cancer Center, Houston, TX, USA
- Shenghua He
- Department of Computer Science and Engineering, Washington University, Saint Louis, MO, USA
- Wade Thorstad
- Department of Radiation Oncology, Washington University School of Medicine, Saint Louis, MO, USA
- Hiram Gay
- Department of Radiation Oncology, Washington University School of Medicine, Saint Louis, MO, USA
- Daniel Barnett
- Carle Cancer Center, Carle Foundation Hospital, Urbana, IL, USA
- Yujie Zhao
- Mayo Clinic at Florida, Jacksonville, FL, USA
- Su Ruan
- Laboratoire LITIS (EA 4108), Equipe Quantif, University of Rouen, France
- Xiaowei Wang
- Department of Pharmacology and Bioengineering, University of Illinois at Chicago, Chicago, IL, USA
- Hua Li
- Department of Bioengineering, University of Illinois at Urbana-Champaign, Cancer Center at Illinois, and Carle Foundation Hospital, Urbana, IL, USA
|
31
|
SVDD boundary and DPC clustering technique-based oversampling approach for handling imbalanced and overlapped data. Knowl Based Syst 2021. [DOI: 10.1016/j.knosys.2021.107588] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Indexed: 11/21/2022]
|
32
|
Upadhyay K, Kaur P, Verma DK. Evaluating the Performance of Data Level Methods Using KEEL Tool to Address Class Imbalance Problem. ARABIAN JOURNAL FOR SCIENCE AND ENGINEERING 2021. [DOI: 10.1007/s13369-021-06377-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/25/2022]
|
33
|
Dudjak M, Martinović G. An empirical study of data intrinsic characteristics that make learning from imbalanced data difficult. EXPERT SYSTEMS WITH APPLICATIONS 2021; 182:115297. [DOI: 10.1016/j.eswa.2021.115297] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 09/02/2023]
|
34
|
Khoda ME, Kamruzzaman J, Gondal I, Imam T, Rahman A. Malware detection in edge devices with fuzzy oversampling and dynamic class weighting. Appl Soft Comput 2021. [DOI: 10.1016/j.asoc.2021.107783] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Indexed: 11/25/2022]
|
35
|
Yang L, Heiselman C, Quirk JG, Djurić PM. CLASS-IMBALANCED CLASSIFIERS USING ENSEMBLES OF GAUSSIAN PROCESSES AND GAUSSIAN PROCESS LATENT VARIABLE MODELS. PROCEEDINGS OF THE ... IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING. ICASSP (CONFERENCE) 2021; 2021. [PMID: 34712104 DOI: 10.1109/icassp39728.2021.9414754] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/07/2022]
Abstract
Classification with imbalanced data is a common and challenging problem in many practical machine learning problems. Ensemble learning is a popular solution where the results from multiple base classifiers are synthesized to reduce the effect of a possibly skewed distribution of the training set. In this paper, binary classifiers based on Gaussian processes are chosen as bases for inferring the predictive distributions of test latent variables. We apply a Gaussian process latent variable model where the outputs of the Gaussian processes are used for making the final decision. The tests of the new method in both synthetic and real data sets show improved performance over standard approaches.
Affiliation(s)
- Liu Yang
- Department of Electrical and Computer Engineering, Stony Brook University, Stony Brook, NY 11794, USA
- Cassandra Heiselman
- Department of Obstetrics, Gynecology and Reproductive Medicine, Stony Brook, NY 11794, USA
- J Gerald Quirk
- Department of Obstetrics, Gynecology and Reproductive Medicine, Stony Brook, NY 11794, USA
- Petar M Djurić
- Department of Electrical and Computer Engineering, Stony Brook University, Stony Brook, NY 11794, USA
|
36
|
Falcone R, Anderlucci L, Montanari A. Matrix sketching for supervised classification with imbalanced classes. Data Min Knowl Discov 2021. [DOI: 10.1007/s10618-021-00791-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/24/2022]
Abstract
The presence of imbalanced classes is more and more common in practical applications and it is known to heavily compromise the learning process. In this paper we propose a new method aimed at addressing this issue in binary supervised classification. Re-balancing the class sizes has turned out to be a fruitful strategy to overcome this problem. Our proposal performs re-balancing through matrix sketching. Matrix sketching is a recently developed data compression technique that is characterized by the property of preserving most of the linear information that is present in the data. This property is guaranteed by the Johnson-Lindenstrauss Lemma (1984) and allows an n-dimensional space to be embedded into a reduced one without distorting, within an $\epsilon$-size interval, the distances between any pair of points. We propose to use matrix sketching as an alternative to the standard re-balancing strategies that are based on random under-sampling the majority class or random over-sampling the minority one. We assess the properties of our method when combined with linear discriminant analysis (LDA), classification trees (C4.5) and Support Vector Machines (SVM) on simulated and real data. Results show that sketching can represent a sound alternative to the most widely used rebalancing methods.
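As a rough illustration of how sketching can shrink one class while keeping its linear structure, the sketch below multiplies the rows of a class matrix by a Gaussian random matrix. It is only one common sketching choice and is not claimed to be the authors' exact procedure; the function name `sketch_rows` and the use of a Gaussian sketch with variance 1/k are assumptions.

```python
import numpy as np

def sketch_rows(X, k, seed=0):
    """Compress the n rows of X into k sketched rows with a Gaussian sketch.

    Entries of S are drawn from N(0, 1/k), so E[S.T @ S] is the identity and
    the Gram matrix (S @ X).T @ (S @ X) equals X.T @ X in expectation.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    S = rng.normal(loc=0.0, scale=1.0 / np.sqrt(k), size=(k, n))
    return S @ X
```

For re-balancing, one could, for example, sketch the majority-class rows down to the number of minority-class rows before fitting a classifier such as LDA, which depends on the data mainly through means and covariances.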
|
37
|
Viscaino M, Torres Bustos J, Muñoz P, Auat Cheein C, Cheein FA. Artificial intelligence for the early detection of colorectal cancer: A comprehensive review of its advantages and misconceptions. World J Gastroenterol 2021; 27:6399-6414. [PMID: 34720530 PMCID: PMC8517786 DOI: 10.3748/wjg.v27.i38.6399] [Citation(s) in RCA: 22] [Impact Index Per Article: 5.5] [Received: 02/28/2021] [Revised: 04/26/2021] [Accepted: 09/14/2021] [Indexed: 02/06/2023]
Abstract
Colorectal cancer (CRC) was the second-ranked worldwide type of cancer during 2020 due to the crude mortality rate of 12.0 per 100000 inhabitants. It can be prevented if glandular tissue (adenomatous polyps) is detected early. Colonoscopy has been strongly recommended as a screening test for both early cancer and adenomatous polyps. However, it has some limitations that include the high polyp miss rate for smaller (< 10 mm) or flat polyps, which are easily missed during visual inspection. Due to the rapid advancement of technology, artificial intelligence (AI) has been a thriving area in different fields, including medicine. Particularly, in gastroenterology AI software has been included in computer-aided systems for diagnosis and to improve the assertiveness of automatic polyp detection and its classification as a preventive method for CRC. This article provides an overview of recent research focusing on AI tools and their applications in the early detection of CRC and adenomatous polyps, as well as an insightful analysis of the main advantages and misconceptions in the field.
Affiliation(s)
- Michelle Viscaino
- Department of Electronic Engineering, Universidad Tecnica Federico Santa Maria, Valparaiso 2340000, Chile
- Javier Torres Bustos
- Department of Electronic Engineering, Universidad Tecnica Federico Santa Maria, Valparaiso 2340000, Chile
- Pablo Muñoz
- Hospital Clinico, University of Chile, Santiago 8380456, Chile
- Cecilia Auat Cheein
- Facultad de Medicina, Universidad Nacional de Santiago del Estero, Santiago del Estero 4200, Argentina
- Fernando Auat Cheein
- Department of Electronic Engineering, Universidad Técnica Federico Santa María, Valparaiso 2340000, Chile
|
38
|
|
39
|
Hossain MS, Betts JM, Paplinski AP. Dual Focal Loss to address class imbalance in semantic segmentation. Neurocomputing 2021. [DOI: 10.1016/j.neucom.2021.07.055] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 10/20/2022]
|
40
|
Leveraging network using controlled weight learning approach for thyroid cancer lymph node detection. Biocybern Biomed Eng 2021. [DOI: 10.1016/j.bbe.2021.10.003] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 02/01/2023]
|
41
|
Singh D, Saha A, Gosain A. wCM based hybrid pre-processing algorithm for class imbalanced dataset. JOURNAL OF INTELLIGENT & FUZZY SYSTEMS 2021. [DOI: 10.3233/jifs-210624] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/15/2022]
Abstract
Imbalanced dataset classification is challenging because of the severely skewed class distribution. The traditional machine learning algorithms show degraded performance for these skewed datasets. However, there are additional characteristics of a classification dataset that are not only challenging for the traditional machine learning algorithms but also increase the difficulty when constructing a model for imbalanced datasets. Data complexity metrics identify these intrinsic characteristics, which cause substantial deterioration of the learning algorithms’ performance. Though many research efforts have been made to deal with class noise, none of them focused on imbalanced datasets coupled with other intrinsic factors. This paper presents a novel hybrid pre-processing algorithm focusing on treating the class-label noise in the imbalanced dataset, which suffers from other intrinsic factors such as class overlapping, non-linear class boundaries, small disjuncts, and borderline examples. This algorithm uses the wCM complexity metric (proposed for imbalanced dataset) to identify noisy, borderline, and other difficult instances of the dataset and then intelligently handles these instances. Experiments on synthetic datasets and real-world datasets with different levels of imbalance, noise, small disjuncts, class overlapping, and borderline examples are conducted to check the effectiveness of the proposed algorithm. The experimental results show that the proposed algorithm offers an interesting alternative to popular state-of-the-art pre-processing algorithms for effectively handling imbalanced datasets along with noise and other difficulties.
Affiliation(s)
- Deepika Singh
- USICT, Guru Gobind Singh Indraprastha University, Sector-16, C, Dwarka, New Delhi, India
- Anju Saha
- USICT, Guru Gobind Singh Indraprastha University, Sector-16, C, Dwarka, New Delhi, India
- Anjana Gosain
- USICT, Guru Gobind Singh Indraprastha University, Sector-16, C, Dwarka, New Delhi, India
|
42
|
Selecting the Suitable Resampling Strategy for Imbalanced Data Classification Regarding Dataset Properties. An Approach Based on Association Models. APPLIED SCIENCES-BASEL 2021. [DOI: 10.3390/app11188546] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Indexed: 12/20/2022]
Abstract
In many application domains such as medicine, information retrieval, cybersecurity, social media, etc., datasets used for inducing classification models often have an unequal distribution of the instances of each class. This situation, known as imbalanced data classification, causes low predictive performance for the minority class examples. Thus, the prediction model is unreliable although the overall model accuracy can be acceptable. Oversampling and undersampling techniques are well-known strategies to deal with this problem by balancing the number of examples of each class. However, their effectiveness depends on several factors mainly related to data intrinsic characteristics, such as imbalance ratio, dataset size and dimensionality, overlapping between classes or borderline examples. In this work, the impact of these factors is analyzed through a comprehensive comparative study involving 40 datasets from different application areas. The objective is to obtain models for automatic selection of the best resampling strategy for any dataset based on its characteristics. These models allow us to check several factors simultaneously considering a wide range of values since they are induced from very varied datasets that cover a broad spectrum of conditions. This differs from most studies that focus on the individual analysis of the characteristics or cover a small range of values. In addition, the study encompasses both basic and advanced resampling strategies that are evaluated by means of eight different performance metrics, including new measures specifically designed for imbalanced data classification. The general nature of the proposal allows the choice of the most appropriate method regardless of the domain, avoiding the search for special purpose techniques that could be valid for the target data.
|
43
|
Xiao J, Wang Y, Chen J, Xie L, Huang J. Impact of resampling methods and classification models on the imbalanced credit scoring problems. Inf Sci (N Y) 2021. [DOI: 10.1016/j.ins.2021.05.029] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Indexed: 12/17/2022]
|
44
|
Diallo M, Xiong S, Emiru ED, Fesseha A, Abdulsalami AO, Elaziz MA. A Hybrid MultiLayer Perceptron Under-Sampling with Bagging Dealing with a Real-Life Imbalanced Rice Dataset. INFORMATION 2021; 12:291. [DOI: 10.3390/info12080291] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/02/2023]
Abstract
Classification algorithms have shown exceptional prediction results in the supervised learning area. These classification algorithms are not always efficient when it comes to real-life datasets due to class distributions. As a result, datasets for real-life applications are generally imbalanced. Several methods have been proposed to solve the problem of class imbalance. In this paper, we propose a hybrid method combining the preprocessing techniques and those of ensemble learning. The original training set is undersampled by evaluating the samples by stochastic measurement (SM) and then training these samples selected by Multilayer Perceptron to return a balanced training set. The MLPUS (Multilayer perceptron undersampling) balanced training set is aggregated using the bagging ensemble method. We applied our method to the real-life Niger_Rice dataset and forty-four other imbalanced datasets from the KEEL repository in this study. We also compared our method with six other existing methods in the literature, such as the MLP classifier on the original imbalance dataset, MLPUS, UnderBagging (combining random under-sampling and bagging), RUSBoost, SMOTEBagging (Synthetic Minority Oversampling Technique and bagging), SMOTEBoost. The results show that our method is competitive compared to other methods. The Niger_Rice real-life dataset results are 75.6, 0.73, 0.76, and 0.86, respectively, for accuracy, F-measure, G-mean, and ROC with our proposed method. In contrast, the MLP classifier on the original imbalance Niger_Rice dataset gives results 72.44, 0.82, 0.59, and 0.76 respectively for accuracy, F-measure, G-mean, and ROC.
|
45
|
A Review of Fuzzy and Pattern-Based Approaches for Class Imbalance Problems. APPLIED SCIENCES-BASEL 2021. [DOI: 10.3390/app11146310] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 11/17/2022]
Abstract
Imbalanced databases are a recurrent problem in real-world applications such as medical diagnosis, fraud detection, and pattern recognition. In class imbalance problems, classifiers are commonly biased toward the class with more objects (majority class) and ignore the class with fewer objects (minority class). There are different ways to solve the class imbalance problem, and there has been a trend towards the use of pattern-based and fuzzy approaches due to their favorable results. In this paper, we provide an in-depth review of popular pattern-based and fuzzy methods for imbalanced databases. The reviewed papers cover classifiers, data preprocessing, and evaluation metrics. We identify different application domains and describe how the methods are used. Finally, we suggest further research directions based on the analysis of the reviewed papers and the trend of the state of the art.
|
46
|
Improving Imbalanced Land Cover Classification with K-Means SMOTE: Detecting and Oversampling Distinctive Minority Spectral Signatures. INFORMATION 2021. [DOI: 10.3390/info12070266] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022] Open
Abstract
Land cover maps are a critical tool to support informed policy development, planning, and resource management decisions. With significant upsides, the automatic production of Land Use/Land Cover (LULC) maps has been a topic of interest for the remote sensing community for several years, but it is still fraught with technical challenges. One such challenge is the imbalanced nature of most remotely sensed data. The asymmetric class distribution negatively impacts the performance of classifiers and adds a new source of error to the production of these maps. In this paper, we address the imbalanced learning problem by using K-means and the Synthetic Minority Oversampling Technique (SMOTE) as an improved oversampling algorithm. K-means SMOTE improves the quality of newly created artificial data by addressing not only the between-class imbalance, as traditional oversamplers do, but also the within-class imbalance, avoiding the generation of noisy data while effectively overcoming data imbalance. The performance of K-means SMOTE is compared to three popular oversampling methods (Random Oversampling, SMOTE and Borderline-SMOTE) using seven remote sensing benchmark datasets, three classifiers (Logistic Regression, K-Nearest Neighbors and Random Forest Classifier) and three evaluation metrics, with five-fold cross-validation and three different initialization seeds. The statistical analysis of the results shows that the proposed method consistently outperforms the remaining oversamplers, producing higher quality land cover classifications. These results suggest that LULC data can benefit significantly from the use of more sophisticated oversamplers, as spectral signatures for the same class can vary according to geographical distribution.
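For orientation, the interpolation step that plain SMOTE uses to create synthetic minority samples can be sketched as below. This is not the K-means SMOTE variant the abstract describes, which additionally clusters the data before oversampling; the function names, the toy points, and the choice of `k` are all illustrative assumptions.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_neighbors(point, candidates, k):
    """Return the k candidates closest to point, excluding point itself."""
    others = [c for c in candidates if c is not point]
    others.sort(key=lambda c: squared_distance(point, c))
    return others[:k]


def smote_oversample(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a randomly
    chosen minority sample and one of its k nearest minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        neighbor = rng.choice(nearest_neighbors(x, minority, k))
        u = rng.random()  # position along the segment from x to neighbor
        synthetic.append([a + u * (b - a) for a, b in zip(x, neighbor)])
    return synthetic


# Hypothetical minority-class samples with two features each.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_oversample(minority, n_new=5)
print(new_points)
```

Each synthetic point lies on a line segment between two existing minority samples, which is why interpolating across distant or noisy regions can produce unhelpful data; restricting the interpolation to clusters is one way to limit that.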
|
47
|
Chennuru VK, Timmappareddy SR. Simulated annealing based undersampling (SAUS): a hybrid multi-objective optimization method to tackle class imbalance. APPL INTELL 2021. [DOI: 10.1007/s10489-021-02369-4] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Indexed: 11/29/2022]
|
48
|
Shahee SA, Ananthakumar U. An overlap sensitive neural network for class imbalanced data. Data Min Knowl Discov 2021. [DOI: 10.1007/s10618-021-00766-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/21/2022]
|
49
|
Piernik M, Morzy T. A study on using data clustering for feature extraction to improve the quality of classification. Knowl Inf Syst 2021. [DOI: 10.1007/s10115-021-01572-6] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Indexed: 10/21/2022]
Abstract
There is a certain belief among data science researchers and enthusiasts alike that clustering can be used to improve classification quality. Although this belief is fairly uncontroversial, it is also very general and therefore produces a lot of confusion around the subject. There are many ways of using clustering in classification, and it obviously cannot always improve the quality of predictions, so a question arises: in which scenarios exactly does it help? Since we were unable to find a rigorous study addressing this question, in this paper we try to shed some light on the concept of using clustering for classification. To do so, we first put forward a framework for incorporating clustering as a method of feature extraction for classification. The framework is generic w.r.t. similarity measures, clustering algorithms, classifiers, and datasets and serves as a platform to answer ten essential questions regarding the studied subject. Each answer is formulated based on a separate experiment on 16 publicly available datasets, followed by an appropriate statistical analysis. After performing the experiments and analyzing the results separately, we discuss them from a global perspective and form general conclusions regarding using clustering as feature extraction for classification.
|
50
|
Adedigba AP, Adeshina SA, Aina OE, Aibinu AM. Optimal hyperparameter selection of deep learning models for COVID-19 chest X-ray classification. INTELLIGENCE-BASED MEDICINE 2021; 5:100034. [PMID: 33899036 PMCID: PMC8057926 DOI: 10.1016/j.ibmed.2021.100034] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 07/02/2020] [Revised: 03/05/2021] [Accepted: 04/08/2021] [Indexed: 02/06/2023]
Abstract
The first and most critical response to curbing the spread of the novel coronavirus disease (COVID-19) is to deploy effective techniques to test potentially infected patients, isolate them, and commence immediate treatment. However, several test kits currently in use are slow and in short supply. This paper presents techniques for diagnosing COVID-19 from chest X-ray (CXR) images and addresses the problems of training deep models on small datasets with class imbalance, as found in most available CXR datasets on COVID-19. We used the discriminative fine-tuning approach, which dynamically assigns different learning rates to each layer of the network. The learning rate is set using the cyclical learning rate policy, which changes per iteration. This flexibility ensured rapid convergence and avoided getting stuck on saddle-point plateaus. In addition, we addressed the high computational demand of deep models by implementing our algorithm using memory- and compute-efficient mixed-precision training. Despite the scarcity of data, our model achieved high performance and generalisation. A validation accuracy of 96.83%, a sensitivity of 96.26%, and a specificity of 95.54% were obtained. When tested on an entirely new dataset, the model achieves 97% accuracy without further training. Lastly, we presented a visual interpretation of the model’s output to prove that the model can aid radiologists in rapidly screening for the symptoms of COVID-19.
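To make the two learning-rate ideas in this abstract concrete, here is a minimal sketch that assumes the triangular form of the cyclical learning rate policy and a simple geometric decay of the rate from the last layer toward the first. The abstract does not give the exact schedule, bounds, step size, or per-layer factor, so every numeric value and function name below is a hypothetical choice for illustration only.

```python
import math


def triangular_lr(iteration, base_lr, max_lr, step_size):
    """Triangular cyclical learning rate: rises linearly from base_lr to
    max_lr over step_size iterations, then falls back, and repeats."""
    cycle = math.floor(1 + iteration / (2 * step_size))
    x = abs(iteration / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)


def layer_learning_rates(top_lr, n_layers, decay):
    """Discriminative rates: the last layer gets top_lr, and each earlier
    layer gets the rate of the layer after it divided by decay."""
    return [top_lr / decay ** (n_layers - 1 - i) for i in range(n_layers)]


# Hypothetical settings, not taken from the paper.
for it in (0, 50, 100, 150, 200):
    lr = triangular_lr(it, base_lr=0.001, max_lr=0.01, step_size=100)
    rates = layer_learning_rates(lr, n_layers=3, decay=2.0)
    print(it, [round(r, 5) for r in rates])
```

In this sketch the cyclical schedule sets the rate for the final layer at each iteration, and earlier layers receive proportionally smaller rates, so pretrained low-level features change more slowly than the task-specific head.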
Affiliation(s)
- Adeyinka P Adedigba: Department of Mechatronics Engineering, Federal University of Technology, Minna, Nigeria
- Steve A Adeshina: Department of Computer Engineering, Nile University of Nigeria, Abuja, Nigeria
- Oluwatomisin E Aina: Department of Computer Engineering, Nile University of Nigeria, Abuja, Nigeria
- Abiodun M Aibinu: Department of Mechatronics Engineering, Federal University of Technology, Minna, Nigeria
|