1. Zhang Y, Liu H, Huang Q, Qu W, Shi Y, Zhang T, Li J, Chen J, Shi Y, Deng R, Chen Y, Zhang Z. Predictive value of machine learning for in-hospital mortality risk in acute myocardial infarction: A systematic review and meta-analysis. Int J Med Inform 2025; 198:105875. [PMID: 40073650] [DOI: 10.1016/j.ijmedinf.2025.105875]
Abstract
BACKGROUND Machine learning (ML) models have been constructed to predict the risk of in-hospital mortality in patients with myocardial infarction (MI). Because of the diversity of ML models and modeling variables, along with the severe imbalance in the underlying data, the predictive accuracy of these models remains controversial. OBJECTIVE This study aimed to review the accuracy of ML in predicting in-hospital mortality risk in MI patients and to provide evidence-based advice for the development or updating of clinical tools. METHODS The PubMed, Embase, Cochrane, and Web of Science databases were searched up to June 4, 2024. PROBAST and the ChAMAI checklist were used to assess the risk of bias in the included studies. Because the included studies constructed models on severely imbalanced datasets, subgroup analyses were conducted by dataset type (balanced vs. imbalanced data) and model type. RESULTS This meta-analysis included 32 studies. In the validation set, the pooled C-index, sensitivity, and specificity of prediction models based on balanced data were 0.83 (95% CI: 0.795-0.866), 0.81 (95% CI: 0.79-0.84), and 0.82 (95% CI: 0.78-0.86), respectively. In the validation set, the pooled C-index, sensitivity, and specificity of ML models based on imbalanced data were 0.815 (95% CI: 0.789-0.842), 0.66 (95% CI: 0.60-0.72), and 0.84 (95% CI: 0.83-0.85), respectively. CONCLUSIONS ML models such as LR, SVM, and RF exhibit high sensitivity and specificity in predicting in-hospital mortality in MI patients. However, their sensitivity is not superior to that of well-established scoring tools. Mitigating the impact of imbalanced data on ML models remains challenging.
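The three pooled metrics reported above (C-index, sensitivity, specificity) can be computed from a binary mortality classifier's predictions as follows. This is a generic sketch on synthetic labels and scores, not the review's data; the 0.5 decision threshold is an illustrative assumption.

```python
# Illustrative sketch: C-index (AUC for a binary outcome), sensitivity, and
# specificity for an in-hospital mortality classifier. Labels and scores are
# synthetic, not taken from the meta-analysis.
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

rng = np.random.default_rng(0)
n = 1000
y_true = (rng.random(n) < 0.1).astype(int)            # ~10% mortality: imbalanced
# Hypothetical model scores: positives score higher on average
scores = np.clip(0.3 * y_true + rng.normal(0.3, 0.15, n), 0, 1)
y_pred = (scores >= 0.5).astype(int)                  # illustrative threshold

auc = roc_auc_score(y_true, scores)                   # C-index for a binary outcome
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)                          # recall on the minority (death) class
specificity = tn / (tn + fp)                          # recall on the majority (survival) class
print(f"AUC={auc:.3f} sens={sensitivity:.3f} spec={specificity:.3f}")
```

Note how, on imbalanced data, a high specificity can coexist with a modest sensitivity, which is exactly the pattern the pooled imbalanced-data estimates show.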

Affiliations
- Yuan Zhang, College of Traditional Chinese Medicine, Changchun University of Chinese Medicine, Changchun, Jilin 130000, China
- Huan Liu, College of Traditional Chinese Medicine, Changchun University of Chinese Medicine, Changchun, Jilin 130000, China
- Qingxia Huang, Research Center of Traditional Chinese Medicine, The First Affiliated Hospital of Changchun University of Chinese Medicine, Changchun, Jilin 130117, China
- Wantong Qu, Department of Cardiology, The First Affiliated Hospital of Changchun University of Chinese Medicine, Changchun, Jilin 130000, China
- Yanyu Shi, College of Traditional Chinese Medicine, Changchun University of Chinese Medicine, Changchun, Jilin 130000, China
- Tianyang Zhang, College of Traditional Chinese Medicine, Changchun University of Chinese Medicine, Changchun, Jilin 130000, China
- Jing Li, College of Traditional Chinese Medicine, Changchun University of Chinese Medicine, Changchun, Jilin 130000, China
- Jinjin Chen, College of Traditional Chinese Medicine, Changchun University of Chinese Medicine, Changchun, Jilin 130000, China
- Yuqing Shi, College of Traditional Chinese Medicine, Changchun University of Chinese Medicine, Changchun, Jilin 130000, China
- Ruixue Deng, College of Traditional Chinese Medicine, Changchun University of Chinese Medicine, Changchun, Jilin 130000, China
- Ying Chen, Department of Cardiology, The First Affiliated Hospital of Changchun University of Chinese Medicine, Changchun, Jilin 130000, China
- Zepeng Zhang, Research Center of Traditional Chinese Medicine, The First Affiliated Hospital of Changchun University of Chinese Medicine, Changchun, Jilin 130117, China

2. Gurcan F, Soylu A. Learning from Imbalanced Data: Integration of Advanced Resampling Techniques and Machine Learning Models for Enhanced Cancer Diagnosis and Prognosis. Cancers (Basel) 2024; 16:3417. [PMID: 39410036] [PMCID: PMC11476323] [DOI: 10.3390/cancers16193417]
Abstract
BACKGROUND/OBJECTIVES This study aims to evaluate the performance of various classification algorithms and resampling methods across multiple diagnostic and prognostic cancer datasets, addressing the challenges of class imbalance. METHODS A total of five datasets were analyzed, including three diagnostic datasets (Wisconsin Breast Cancer Database, Cancer Prediction Dataset, Lung Cancer Detection Dataset) and two prognostic datasets (Seer Breast Cancer Dataset, Differentiated Thyroid Cancer Recurrence Dataset). Nineteen resampling methods from three categories were employed, and ten classifiers from four distinct categories were utilized for comparison. RESULTS The results demonstrated that hybrid sampling methods, particularly SMOTEENN, achieved the highest mean performance at 98.19%, followed by IHT (97.20%) and RENN (96.48%). In terms of classifiers, Random Forest showed the best performance with a mean value of 94.69%, with Balanced Random Forest and XGBoost following closely. The baseline method (no resampling) yielded a significantly lower performance of 91.33%, highlighting the effectiveness of resampling techniques in improving model outcomes. CONCLUSIONS This research underscores the importance of resampling methods in enhancing classification performance on imbalanced datasets, providing valuable insights for researchers and healthcare professionals. The findings serve as a foundation for future studies aimed at integrating machine learning techniques in cancer diagnosis and prognosis, with recommendations for further research on hybrid models and clinical applications.
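The resampling families compared above (hybrid methods such as SMOTEENN, and classifiers such as Random Forest) are provided by the imbalanced-learn package. As a dependency-light sketch of the underlying principle, the following uses plain scikit-learn with simple random oversampling of the minority class; the dataset, parameters, and oversampling routine are illustrative, not the study's exact setup.

```python
# Minimal sketch of resampling before classification: duplicate minority-class
# rows until the classes are balanced, then fit a Random Forest. The study's
# hybrid methods (e.g., SMOTEENN) come from imbalanced-learn; this baseline
# only illustrates why resampling helps on skewed data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

def random_oversample(X, y, seed=0):
    """Resample each class (with replacement) up to the largest class size."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    idx = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_max, replace=True)
        for c in classes
    ])
    return X[idx], y[idx]

X_bal, y_bal = random_oversample(X_tr, y_tr)
baseline = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
resampled = RandomForestClassifier(random_state=0).fit(X_bal, y_bal)
for name, model in [("baseline", baseline), ("oversampled", resampled)]:
    print(name, round(balanced_accuracy_score(y_te, model.predict(X_te)), 3))
```

Balanced accuracy is used for scoring because plain accuracy rewards always predicting the majority class on data this skewed.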

Affiliations
- Fatih Gurcan, Department of Management Information Systems, Faculty of Economics and Administrative Sciences, Karadeniz Technical University, 61080 Trabzon, Turkey
- Ahmet Soylu, Department of Computer Science, Faculty of Information Technology and Electrical Engineering, Norwegian University of Science and Technology, 2815 Gjøvik, Norway

3. Zhang H, Zhu L, Wang X, Yang Y. Divide and Retain: A Dual-Phase Modeling for Long-Tailed Visual Recognition. IEEE Trans Neural Netw Learn Syst 2024; 35:13538-13549. [PMID: 37276091] [DOI: 10.1109/tnnls.2023.3269907]
Abstract
This work explores visual recognition models on real-world datasets exhibiting a long-tailed distribution. Most previous works take a holistic perspective in which the overall gradient for training the model is obtained by considering all classes jointly. However, due to the extreme data imbalance in long-tailed datasets, joint consideration of different classes tends to induce the gradient distortion problem; i.e., the overall gradient suffers from a direction shifted toward data-rich classes and enlarged variance caused by data-poor classes. Gradient distortion impairs model training. To avoid these drawbacks, we propose to disentangle the overall gradient and consider the gradient on data-rich classes and that on data-poor classes separately. We tackle the long-tailed visual recognition problem via a dual-phase method. In the first phase, only data-rich classes are used to update model parameters, so only the separated gradient on data-rich classes is involved. In the second phase, the remaining data-poor classes are involved to learn a complete classifier for all classes. More importantly, to ensure a smooth transition from phase I to phase II, we propose an exemplar bank and a memory-retentive loss. The exemplar bank reserves a few representative examples from data-rich classes and is used to maintain the information of those classes during the transition. The memory-retentive loss constrains the change of model parameters from phase I to phase II based on the exemplar bank and the data-poor classes. Extensive experimental results on four commonly used long-tailed benchmarks, including CIFAR100-LT, Places-LT, ImageNet-LT, and iNaturalist 2018, highlight the excellent performance of the proposed method.

4. Huang X, Xie X, Huang S, Wu S, Huang L. Predicting non-chemotherapy drug-induced agranulocytosis toxicity through ensemble machine learning approaches. Front Pharmacol 2024; 15:1431941. [PMID: 39206259] [PMCID: PMC11349714] [DOI: 10.3389/fphar.2024.1431941]
Abstract
Agranulocytosis, induced by non-chemotherapy drugs, is a serious medical condition that presents a formidable challenge in predictive toxicology due to its idiosyncratic nature and complex mechanisms. In this study, we assembled a dataset of 759 compounds and applied a rigorous feature selection process prior to employing ensemble machine learning classifiers to forecast non-chemotherapy drug-induced agranulocytosis (NCDIA) toxicity. The balanced bagging classifier combined with a gradient boosting decision tree (BBC + GBDT), utilizing the combined descriptor set of DS and RDKit comprising 237 features, emerged as the top-performing model, with an external validation AUC of 0.9164, ACC of 83.55%, and MCC of 0.6095. The model's predictive reliability was further substantiated by an applicability domain analysis. Feature importance, assessed through permutation importance within the BBC + GBDT model, highlighted key molecular properties that significantly influence NCDIA toxicity. Additionally, 16 structural alerts identified by SARpy software further revealed potential molecular signatures associated with toxicity, enriching our understanding of the underlying mechanisms. We also applied the constructed models to assess the NCDIA toxicity of novel drugs approved by FDA. This study advances predictive toxicology by providing a framework to assess and mitigate agranulocytosis risks, ensuring the safety of pharmaceutical development and facilitating post-market surveillance of new drugs.
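The BBC + GBDT combination above can be understood as balanced bagging with gradient-boosted base learners: each learner trains on all minority samples plus an equal-sized random undersample of the majority class. The sketch below hand-rolls that balanced bootstrap on synthetic data; it mirrors imbalanced-learn's BalancedBaggingClassifier conceptually and is not the authors' pipeline, so function names and parameters are illustrative.

```python
# Sketch of a balanced-bagging ensemble of gradient-boosted trees: every base
# learner sees the full minority class plus a fresh, equal-sized undersample of
# the majority class; predictions average the learners' probabilities.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1500, weights=[0.9, 0.1], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

def balanced_bagging_gbdt(X, y, n_estimators=10, seed=1):
    rng = np.random.default_rng(seed)
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    models = []
    for _ in range(n_estimators):
        # Balanced bootstrap: all minority rows + equal-sized majority undersample
        maj_sample = rng.choice(majority, size=minority.size, replace=False)
        idx = np.concatenate([minority, maj_sample])
        models.append(GradientBoostingClassifier(random_state=0).fit(X[idx], y[idx]))
    return models

models = balanced_bagging_gbdt(X_tr, y_tr)
# Average the base learners' probabilities for the positive (toxic) class
proba = np.mean([m.predict_proba(X_te)[:, 1] for m in models], axis=0)
print("ensemble AUC:", round(roc_auc_score(y_te, proba), 3))
```

Because every base learner trains on balanced data while the ensemble still sees every majority sample across bootstraps, this design trades little information for much less majority-class bias.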

Affiliations
- Xiaojie Huang, Department of Clinical Pharmacy, Jieyang People’s Hospital, Jieyang, China

5. Nie F, Xie F, Yu W, Li X. Parameter-Insensitive Min Cut Clustering With Flexible Size Constrains. IEEE Trans Pattern Anal Mach Intell 2024; 46:5479-5492. [PMID: 38376965] [DOI: 10.1109/tpami.2024.3367912]
Abstract
Clustering is a fundamental topic in machine learning, and various methods have been proposed, of which K-Means (KM) and min cut clustering are typical examples. However, both may produce empty or skewed clusters, which is undesirable. Constrained clustering has been studied extensively for KM, but for min cut clustering it still needs development. In this paper, we propose a parameter-insensitive min cut clustering method with flexible size constraints. Specifically, we add lower limits on the number of samples in each cluster, which avoids the trivial solution in min cut clustering. To the best of our knowledge, this is the first attempt to incorporate size constraints directly into min cut. The resulting problem is NP-hard and difficult to solve, and adding upper limits does not make it tractable either. Therefore, an additional variable equivalent to the label matrix is introduced, and the augmented Lagrangian multiplier (ALM) method is used to decouple the constraints. In the experiments, we find that our algorithm is insensitive to the lower bound and is practical for image segmentation. A large number of experiments demonstrate the effectiveness of the proposed algorithm.

6. Pu X, Liu L, Zhou Y, Xu Z. Determination of the rat estrous cycle based on EfficientNet. Front Vet Sci 2024; 11:1434991. [PMID: 39119352] [PMCID: PMC11306968] [DOI: 10.3389/fvets.2024.1434991]
Abstract
In the field of biomedical research, rats are widely used as experimental animals due to their short gestation period and strong reproductive ability. Accurate monitoring of the estrous cycle is crucial for the success of experiments. Traditional methods are time-consuming and rely on the subjective judgment of professionals, which limits the efficiency and accuracy of experiments. This study proposes an EfficientNet model to automate the recognition of the estrous cycle of female rats using deep learning techniques. The model optimizes performance through systematic scaling of the network depth, width, and image resolution. A large dataset of physiological data from female rats was used for training and validation. The improved EfficientNet model effectively recognized different stages of the estrous cycle. The model demonstrated high-precision feature capture and significantly improved recognition accuracy compared to conventional methods. The proposed technique enhances experimental efficiency and reduces human error in recognizing the estrous cycle. This study highlights the potential of deep learning to optimize data processing and achieve high-precision recognition in biomedical research. Future work should focus on further validation with larger datasets and integration into experimental workflows.

Affiliations
- Xiaodi Pu, Reproductive Section, Huaihua City Maternal and Child Health Care Hospital, Huaihua, China
- Longyi Liu, Shenyang Institute of Computing Technology, Chinese Academy of Sciences, Shenyang, China; University of Chinese Academy of Sciences, Beijing, China
- Yonglai Zhou, Reproductive Section, Huaihua City Maternal and Child Health Care Hospital, Huaihua, China
- Zihan Xu, College of Biological Sciences, China Agricultural University, Beijing, China

7. Atto AM. Altruistic Collaborative Learning. IEEE Trans Neural Netw Learn Syst 2024; 35:1954-1964. [PMID: 35771785] [DOI: 10.1109/tnnls.2022.3185961]
Abstract
This article proposes a new learning paradigm based on the concept of concordant gradients for ensemble learning strategies. In this paradigm, learners update their weights if and only if the gradients of their cost functions are mutually concordant, in a sense defined in the paper. The objective of the proposed concordant optimization framework is robustness against uncertainties, achieved by postponing to a later epoch the consideration of examples associated with discordant directions during training. Concordance-constrained collaboration is shown to be relevant, especially in intricate classification problems where exclusive class labeling involves information bias due to correlated disturbances affecting almost all training examples. The first learning paradigm applies a gradient descent strategy based on allied agents, subjected to concordance checking before moving forward in training epochs. The second learning paradigm relates to multivariate dense neural matrix fusion, where the fusion operator is itself a learnable neural operator. In addition to these paradigms, this article proposes a new categorical probability transform to enrich the existing collection, along with an alternative scenario for integrating penalized SoftMax information. Finally, the article assesses the relevance of the above contributions on several deep learning frameworks and a collaborative classification task involving dependent classes.

8. Liu X, Xing F, You J, Lu J, Kuo CCJ, Fakhri GE, Woo J. Subtype-Aware Dynamic Unsupervised Domain Adaptation. IEEE Trans Neural Netw Learn Syst 2024; 35:2820-2834. [PMID: 35895653] [DOI: 10.1109/tnnls.2022.3192315]
Abstract
Unsupervised domain adaptation (UDA) has been successfully applied to transfer knowledge from a labeled source domain to target domains without their labels. Recently introduced transferable prototypical networks (TPNs) further address class-wise conditional alignment. In TPN, while the closeness of class centers between source and target domains is explicitly enforced in a latent space, the underlying fine-grained subtype structure and the cross-domain within-class compactness have not been fully investigated. To counter this, we propose a new approach to adaptively perform a fine-grained subtype-aware alignment to improve the performance in the target domain without the subtype label in both domains. The insight of our approach is that the unlabeled subtypes in a class have the local proximity within a subtype while exhibiting disparate characteristics because of different conditional and label shifts. Specifically, we propose to simultaneously enforce subtype-wise compactness and class-wise separation, by utilizing intermediate pseudo-labels. In addition, we systematically investigate various scenarios with and without prior knowledge of subtype numbers and propose to exploit the underlying subtype structure. Furthermore, a dynamic queue framework is developed to evolve the subtype cluster centroids steadily using an alternative processing scheme. Experimental results, carried out with multiview congenital heart disease data and VisDA and DomainNet, show the effectiveness and validity of our subtype-aware UDA, compared with state-of-the-art UDA methods.

9. Du G, Zhang J, Jiang M, Long J, Lin Y, Li S, Tan KC. Graph-Based Class-Imbalance Learning With Label Enhancement. IEEE Trans Neural Netw Learn Syst 2023; 34:6081-6095. [PMID: 34928806] [DOI: 10.1109/tnnls.2021.3133262]
Abstract
Class imbalance is a common issue in the machine learning and data mining communities. A class-imbalanced distribution can make most classical classification algorithms neglect the significance of the minority class and favor the majority class. In this article, we propose a label enhancement method that solves the class-imbalance problem in a graph manner, estimating the numerical label and training the inductive model simultaneously. It offers a new perspective on class-imbalance learning based on numerical labels rather than the original logical labels. We also present an iterative optimization algorithm and analyze its computational complexity and convergence. To demonstrate the superiority of the proposed method, several single-label and multilabel datasets are used in the experiments. The experimental results show that the proposed method achieves promising performance and outperforms several state-of-the-art single-label and multilabel class-imbalance learning methods.

10. Xu Y, Yu Z, Chen CLP, Liu Z. Adaptive Subspace Optimization Ensemble Method for High-Dimensional Imbalanced Data Classification. IEEE Trans Neural Netw Learn Syst 2023; 34:2284-2297. [PMID: 34469316] [DOI: 10.1109/tnnls.2021.3106306]
Abstract
It is hard to construct an optimal classifier for high-dimensional imbalanced data, on which classifier performance degrades seriously. Although many approaches, such as resampling, cost-sensitive, and ensemble learning methods, have been proposed to deal with skewed data, they are constrained by high-dimensional data with noise and redundancy. In this study, we propose an adaptive subspace optimization ensemble method (ASOEM) for high-dimensional imbalanced data classification to overcome these limitations. To construct accurate and diverse base classifiers, a novel adaptive subspace optimization (ASO) method based on an adaptive subspace generation (ASG) process and a rotated subspace optimization (RSO) process is designed to generate multiple robust and discriminative subspaces. A resampling scheme is then applied to each optimized subspace to build class-balanced data for each base classifier. To verify its effectiveness, ASOEM is implemented with different resampling strategies on 24 real-world high-dimensional imbalanced datasets. Experimental results demonstrate that the proposed methods outperform other mainstream imbalance learning approaches and classifier ensemble methods.

11. Kim HS. Geospatial data-driven assessment of earthquake-induced liquefaction impact mapping using classifier and cluster ensembles. Appl Soft Comput 2023. [DOI: 10.1016/j.asoc.2023.110266]

12. Ding H, Sun Y, Huang N, Shen Z, Wang Z, Iftekhar A, Cui X. RVGAN-TL: A generative adversarial networks and transfer learning-based hybrid approach for imbalanced data classification. Inf Sci (N Y) 2023. [DOI: 10.1016/j.ins.2023.01.147]

13. Han M, Guo H, Li J, Wang W. Global-local information based oversampling for multi-class imbalanced data. Int J Mach Learn Cybern 2022. [DOI: 10.1007/s13042-022-01746-w]

14. Dai Q, Liu J, Yang J. Multi-armed bandit heterogeneous ensemble learning for imbalanced data. Comput Intell 2022. [DOI: 10.1111/coin.12566]
Affiliations
- Qi Dai, Department of Automation, College of Information Science and Engineering, Beijing, China
- Jian-wei Liu, Department of Automation, College of Information Science and Engineering, Beijing, China
- Jiapeng Yang, College of Science, North China University of Science and Technology, Tangshan, China

15. Class-imbalanced positive instances augmentation via three-line hybrid. Knowl Based Syst 2022. [DOI: 10.1016/j.knosys.2022.109902]

16. Chen Z, Duan J, Kang L, Qiu G. Class-Imbalanced Deep Learning via a Class-Balanced Ensemble. IEEE Trans Neural Netw Learn Syst 2022; 33:5626-5640. [PMID: 33900923] [DOI: 10.1109/tnnls.2021.3071122]
Abstract
Class imbalance is a prevalent phenomenon in various real-world applications, and it presents significant challenges to model learning, including deep learning. In this work, we embed ensemble learning into deep convolutional neural networks (CNNs) to tackle the class-imbalanced learning problem. An ensemble of auxiliary classifiers branching out from various hidden layers of a CNN is trained together with the CNN in an end-to-end manner. To that end, we design a new loss function that rectifies the bias toward the majority classes by forcing the CNN's hidden layers and their associated auxiliary classifiers to focus on the samples misclassified by previous layers, enabling subsequent layers to develop diverse behavior and fix the errors of previous layers in a batch-wise manner. A unique feature of the new method is that the ensemble of auxiliary classifiers can work together with the main CNN to form a more powerful combined classifier, or can be removed after training, in which case it serves only to assist the CNN's class-imbalance learning and enhance its capability on class-imbalanced data. Comprehensive experiments are conducted on four benchmark datasets of increasing complexity (CIFAR-10, CIFAR-100, iNaturalist, and CelebA), and the results demonstrate significant performance improvements over state-of-the-art deep imbalance learning methods.
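The paper's loss reweights learning layer by layer toward previously misclassified samples. As a simpler, standard point of reference for countering majority-class bias (not the paper's exact loss), the following sketch shows inverse-frequency class weighting of cross-entropy; all numbers are illustrative.

```python
# Inverse-frequency weighted cross-entropy: a common baseline for rectifying
# majority-class bias. Each sample's loss is scaled by 1 / frequency of its
# class, so minority-class errors dominate the average. Illustrative only;
# this is not the ensemble loss proposed in the cited paper.
import numpy as np

def weighted_cross_entropy(probs, labels, class_counts):
    """Weighted mean cross-entropy with per-class inverse-frequency weights."""
    weights = class_counts.sum() / (len(class_counts) * class_counts)
    sample_w = weights[labels]
    nll = -np.log(probs[np.arange(len(labels)), labels])
    return float(np.sum(sample_w * nll) / np.sum(sample_w))

# Two classes with 90/10 imbalance; predictions favor the majority class
probs = np.array([[0.9, 0.1], [0.8, 0.2], [0.7, 0.3]])
labels = np.array([0, 0, 1])
counts = np.array([90, 10])
print(round(weighted_cross_entropy(probs, labels, counts), 4))
```

Compared with the unweighted mean of the same negative log-likelihoods, the weighted loss is larger here because the single minority-class mistake is up-weighted, which is exactly the corrective pressure class weighting provides.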

17. Zhang Y, Lin M, Yang Y, Ding C. A Hybrid Ensemble and Evolutionary Algorithm for Imbalanced Classification and its Application on Bioinformatics. Comput Biol Chem 2022; 98:107646. [DOI: 10.1016/j.compbiolchem.2022.107646]

18. Binary imbalanced big data classification based on fuzzy data reduction and classifier fusion. Soft Comput 2022. [DOI: 10.1007/s00500-021-06654-9]

19. Ge Z, Jiang X, Tong Z, Feng P, Zhou B, Xu M, Wang Z, Pang Y. Multi-label correlation guided feature fusion network for abnormal ECG diagnosis. Knowl Based Syst 2021. [DOI: 10.1016/j.knosys.2021.107508]

20. Fu C, Zhan Q, Liu W. Evidential reasoning based ensemble classifier for uncertain imbalanced data. Inf Sci (N Y) 2021. [DOI: 10.1016/j.ins.2021.07.027]

21. Qiao S, Han N, Huang F, Yue K, Wu T, Yi Y, Mao R, Yuan CA. LMNNB: Two-in-One imbalanced classification approach by combining metric learning and ensemble learning. Appl Intell 2021. [DOI: 10.1007/s10489-021-02901-6]

22. Wang X, Jing L, Lyu Y, Guo M, Zeng T. Smooth Soft-Balance Discriminative Analysis for imbalanced data. Knowl Based Syst 2021. [DOI: 10.1016/j.knosys.2020.106604]

23. Wong TT, Tsai HC. Multinomial naïve Bayesian classifier with generalized Dirichlet priors for high-dimensional imbalanced data. Knowl Based Syst 2021. [DOI: 10.1016/j.knosys.2021.107288]

24.
Abstract
Tuberculosis (TB) is an airborne infectious disease caused by organisms of the Mycobacterium tuberculosis (Mtb) complex. In many low- and middle-income countries, TB remains a major cause of morbidity and mortality. Once a patient has been diagnosed with TB, it is critical that healthcare workers make the most appropriate treatment decision given the individual conditions of the patient and the likely course of the disease based on medical experience. Depending on the prognosis, delayed or inappropriate treatment can result in unsatisfactory outcomes, including exacerbation of clinical symptoms, poor quality of life, and increased risk of death. This work benchmarks machine learning models to aid TB prognosis using a Brazilian health database of confirmed cases and deaths related to TB in the State of Amazonas. The goal is to predict the probability of death by TB, thus aiding the prognosis of TB and the associated treatment decision-making process. In its original form, the dataset comprised 36,228 records and 130 fields but suffered from missing, incomplete, or incorrect data. Following data cleaning and preprocessing, a revised dataset was generated comprising 24,015 records and 38 fields, including 22,876 reported cured TB patients and 1139 deaths by TB. To explore how data imbalance impacts model performance, two controlled experiments were designed using (1) imbalanced and (2) balanced datasets. The best result for predicting TB mortality is achieved by the Gradient Boosting (GB) model using the balanced dataset, and the ensemble model composed of the Random Forest (RF), GB, and Multi-Layer Perceptron (MLP) models is the best model to predict the cure class.

25.

26. Data Sampling Methods to Deal With the Big Data Multi-Class Imbalance Problem. Appl Sci (Basel) 2020. [DOI: 10.3390/app10041276]
Abstract
The class imbalance problem has been a hot topic in the machine learning community in recent years, and in the era of big data and deep learning it remains in force. Much work has been done to deal with the class imbalance problem, with random sampling methods (over- and under-sampling) being the most widely employed approaches. More sophisticated sampling methods have also been developed, including the Synthetic Minority Over-sampling Technique (SMOTE), and they have been combined with cleaning techniques such as Edited Nearest Neighbor or Tomek’s Links (SMOTE+ENN and SMOTE+TL, respectively). In the big data context, the class imbalance problem has mostly been addressed by adapting traditional techniques, while intelligent approaches have been relatively ignored. This work therefore analyzes the capabilities of heuristic sampling methods for deep learning neural networks in the big data domain, with particular attention to cleaning strategies. The study is conducted on big, multi-class imbalanced datasets obtained from hyper-spectral remote sensing images. A hybrid approach is analyzed in which the dataset is cleaned by SMOTE, an Artificial Neural Network (ANN) is trained on the resulting data, the ANN output is processed with ENN to eliminate output noise, and the ANN is then trained again on the resultant dataset. The results suggest that the best classification outcome is achieved when the cleaning strategies are applied to the ANN output rather than to the input feature space only. Consequently, the classifier's nature must be considered when classical class imbalance approaches are adapted to deep learning and big data scenarios.
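The SMOTE step described above generates synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbors. A minimal NumPy sketch of that idea follows; production code would use imbalanced-learn's SMOTE, and the function and parameter names here are illustrative.

```python
# Minimal SMOTE sketch: each synthetic point lies on the segment between a
# randomly chosen minority sample and one of its k nearest minority neighbors.
import numpy as np

def smote(X_min, n_synthetic, k=5, seed=0):
    """Generate n_synthetic points along segments joining minority neighbors."""
    rng = np.random.default_rng(seed)
    out = np.empty((n_synthetic, X_min.shape[1]))
    # Pairwise distances among minority samples (fine for small sketches)
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    neighbors = np.argsort(d, axis=1)[:, :k]       # k nearest minority neighbors
    for i in range(n_synthetic):
        j = rng.integers(len(X_min))               # pick a minority sample
        nb = X_min[rng.choice(neighbors[j])]       # pick one of its neighbors
        gap = rng.random()                         # interpolation factor in [0, 1)
        out[i] = X_min[j] + gap * (nb - X_min[j])
    return out

rng = np.random.default_rng(1)
X_min = rng.normal(size=(20, 3))                   # 20 minority samples, 3 features
X_syn = smote(X_min, n_synthetic=80)
print(X_syn.shape)                                 # prints (80, 3)
```

Because every synthetic point is an interpolation of two real minority points, the augmented minority class stays inside the region the original samples span, which is what distinguishes SMOTE from simple duplication.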