1
Zhang Y, Liu H, Huang Q, Qu W, Shi Y, Zhang T, Li J, Chen J, Shi Y, Deng R, Chen Y, Zhang Z. Predictive value of machine learning for in-hospital mortality risk in acute myocardial infarction: A systematic review and meta-analysis. Int J Med Inform 2025; 198:105875. [PMID: 40073650 DOI: 10.1016/j.ijmedinf.2025.105875] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/04/2024] [Revised: 02/25/2025] [Accepted: 03/07/2025] [Indexed: 03/14/2025]
Abstract
BACKGROUND Machine learning (ML) models have been constructed to predict the risk of in-hospital mortality in patients with myocardial infarction (MI). Due to diverse ML models and modeling variables, along with the significant imbalance in data, the predictive accuracy of these models remains controversial. OBJECTIVE This study aimed to review the accuracy of ML in predicting in-hospital mortality risk in MI patients and to provide evidence-based advice for the development or updating of clinical tools. METHODS PubMed, Embase, Cochrane, and Web of Science databases were searched up to June 4, 2024. PROBAST and the ChAMAI checklist were used to assess the risk of bias in the included studies. Since the included studies constructed models based on severely unbalanced datasets, subgroup analyses were conducted by the type of dataset (balanced data, unbalanced data, model type). RESULTS This meta-analysis included 32 studies. In the validation set, the pooled C-index, sensitivity, and specificity of prediction models based on balanced data were 0.83 (95% CI: 0.795-0.866), 0.81 (95% CI: 0.79-0.84), and 0.82 (95% CI: 0.78-0.86), respectively. In the validation set, the pooled C-index, sensitivity, and specificity of ML models based on imbalanced data were 0.815 (95% CI: 0.789-0.842), 0.66 (95% CI: 0.60-0.72), and 0.84 (95% CI: 0.83-0.85), respectively. CONCLUSIONS ML models such as LR, SVM, and RF exhibit high sensitivity and specificity in predicting in-hospital mortality in MI patients. However, their sensitivity is not superior to well-established scoring tools. Mitigating the impact of imbalanced data on ML models remains challenging.
Affiliation(s)
- Yuan Zhang
  - College of Traditional Chinese Medicine, Changchun University of Chinese Medicine, Changchun, Jilin 130000, China
- Huan Liu
  - College of Traditional Chinese Medicine, Changchun University of Chinese Medicine, Changchun, Jilin 130000, China
- Qingxia Huang
  - Research Center of Traditional Chinese Medicine, The First Affiliated Hospital of Changchun University of Chinese Medicine, Changchun, Jilin 130117, China
- Wantong Qu
  - Department of Cardiology, The First Affiliated Hospital of Changchun University of Chinese Medicine, Changchun 130000 Jilin, China
- Yanyu Shi
  - College of Traditional Chinese Medicine, Changchun University of Chinese Medicine, Changchun, Jilin 130000, China
- Tianyang Zhang
  - College of Traditional Chinese Medicine, Changchun University of Chinese Medicine, Changchun, Jilin 130000, China
- Jing Li
  - College of Traditional Chinese Medicine, Changchun University of Chinese Medicine, Changchun, Jilin 130000, China
- Jinjin Chen
  - College of Traditional Chinese Medicine, Changchun University of Chinese Medicine, Changchun, Jilin 130000, China
- Yuqing Shi
  - College of Traditional Chinese Medicine, Changchun University of Chinese Medicine, Changchun, Jilin 130000, China
- Ruixue Deng
  - College of Traditional Chinese Medicine, Changchun University of Chinese Medicine, Changchun, Jilin 130000, China
- Ying Chen
  - Department of Cardiology, The First Affiliated Hospital of Changchun University of Chinese Medicine, Changchun 130000 Jilin, China
- Zepeng Zhang
  - Research Center of Traditional Chinese Medicine, The First Affiliated Hospital of Changchun University of Chinese Medicine, Changchun, Jilin 130117, China
2
Li Y, Liu X, Zhou J, Li F, Wang Y, Liu Q. Artificial intelligence in traditional Chinese medicine: advances in multi-metabolite multi-target interaction modeling. Front Pharmacol 2025; 16:1541509. [PMID: 40303920 PMCID: PMC12037568 DOI: 10.3389/fphar.2025.1541509] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/07/2024] [Accepted: 03/25/2025] [Indexed: 05/02/2025] Open
Abstract
Traditional Chinese Medicine (TCM) utilizes multi-metabolite and multi-target interventions to address complex diseases, providing advantages over single-target therapies. However, the active metabolites, therapeutic targets, and especially the combination mechanisms remain unclear. The integration of advanced data analysis and nonlinear modeling capabilities of artificial intelligence (AI) is driving the transformation of TCM into precision medicine. This review concentrates on the application of AI in TCM target prediction, including multi-omics techniques, TCM-specialized databases, machine learning (ML), deep learning (DL), and cross-modal fusion strategies. It also critically analyzes persistent challenges such as data heterogeneity, limited model interpretability, causal confounding, and insufficient robustness validation in practical applications. To enhance the reliability and scalability of AI in TCM target prediction, future research should prioritize continuous optimization of the AI algorithms using zero-shot learning, end-to-end architectures, and self-supervised contrastive learning.
Affiliation(s)
- Qingzhong Liu
  - Department of Clinical Laboratory, Shanghai Municipal Hospital of Traditional Chinese Medicine, Shanghai University of Traditional Chinese Medicine, Shanghai, China
3
Salehi A, Khedmati M. Hybrid clustering strategies for effective oversampling and undersampling in multiclass classification. Sci Rep 2025; 15:3460. [PMID: 39870706 PMCID: PMC11772689 DOI: 10.1038/s41598-024-84786-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/30/2024] [Accepted: 12/27/2024] [Indexed: 01/29/2025] Open
Abstract
Multiclass imbalance is a challenging problem in real-world datasets, where certain classes may have a low number of samples because they correspond to rare occurrences. To address the challenge of multiclass imbalance, this paper introduces a novel hybrid cluster-based oversampling and undersampling (HCBOU) technique. By clustering and separating classes into majority and minority categories, this algorithm retains the most information during undersampling while generating efficient data in the minority class. The classification is carried out using one-vs-one and one-vs-all decomposition schemes. Extensive experimentation was carried out on 30 datasets to evaluate the proposed algorithm's performance. The results were subsequently compared with those of several state-of-the-art algorithms. Based on the results, the proposed algorithm outperforms the competing algorithms under different scenarios. Finally, the HCBOU algorithm demonstrated robust performance across varying class imbalance levels, highlighting its effectiveness in handling imbalanced datasets.
Affiliation(s)
- Amirreza Salehi
  - Department of Industrial Engineering, Sharif University of Technology, Tehran, Iran
- Majid Khedmati
  - Department of Industrial Engineering, Sharif University of Technology, Azadi Ave., Tehran, 1458889694, Iran
4
Yang Y, Khorshidi HA, Aickelin U. A review on over-sampling techniques in classification of multi-class imbalanced datasets: insights for medical problems. Front Digit Health 2024; 6:1430245. [PMID: 39131184 PMCID: PMC11310152 DOI: 10.3389/fdgth.2024.1430245] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/09/2024] [Accepted: 07/12/2024] [Indexed: 08/13/2024] Open
Abstract
There has been growing attention to multi-class classification problems, particularly the challenges posed by imbalanced class distributions. To address these challenges, various strategies, including data-level re-sampling treatment and ensemble methods, have been introduced to bolster the performance of predictive models and Artificial Intelligence (AI) algorithms in scenarios where an excessive level of imbalance is present. While most research and algorithm development have focused on binary classification problems, in health informatics there is increasing interest in addressing the problem of multi-class classification in imbalanced datasets. Multi-class imbalance problems bring forth more complex challenges, as a delicate approach is required to generate synthetic data and simultaneously maintain the relationship between the multiple classes. The aim of this review paper is to examine over-sampling methods tailored for medical and other datasets with multi-class imbalance. Out of 2,076 peer-reviewed papers identified through searches, 197 eligible papers were chosen and thoroughly reviewed for inclusion, narrowing to 37 studies being selected for in-depth analysis. These studies are categorised into four categories: metric, adaptive, structure-based, and hybrid approaches. The most significant finding is the emerging trend toward hybrid resampling methods that combine the strengths of various techniques to effectively address the problem of imbalanced data. This paper provides an extensive analysis of each selected study, discusses their findings, and outlines directions for future research.
Affiliation(s)
- Yuxuan Yang
  - School of Computing and Information Systems, The University of Melbourne, Parkville, VIC, Australia
- Hadi Akbarzadeh Khorshidi
  - School of Computing and Information Systems, The University of Melbourne, Parkville, VIC, Australia
  - Cancer Health Services Research, Melbourne School of Population and Global Health, The University of Melbourne, Parkville, VIC, Australia
- Uwe Aickelin
  - School of Computing and Information Systems, The University of Melbourne, Parkville, VIC, Australia
5
Jia Q, Chen C, Xu A, Wang S, He X, Shen G, Luo Y, Tu H, Sun T, Wu X. A biological age model based on physical examination data to predict mortality in a Chinese population. iScience 2024; 27:108891. [PMID: 38384842 PMCID: PMC10879664 DOI: 10.1016/j.isci.2024.108891] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/23/2023] [Revised: 09/02/2023] [Accepted: 01/09/2024] [Indexed: 02/23/2024] Open
Abstract
Biological age could be reflective of an individual's health status and aging degree. Limited estimations of biological aging based on physical examination data in the Chinese population have been developed to quantify the rate of aging. We developed and validated a novel aging measure (Balanced-AGE) based on readily available physical health examination data. In this study, a repeated sub-sampling approach was applied to address the data imbalance issue, and this approach significantly improved the performance of biological age (Balanced-AGE) in predicting all-cause mortality, yielding a 10-year time-dependent AUC of 0.908. This mortality prediction tool was found to be effective across different subgroups by age, sex, smoking, and alcohol consumption status. Additionally, this study revealed that individuals who were underweight, smokers, or drinkers had a higher extent of age acceleration. The Balanced-AGE may serve as an effective and generally applicable tool for health assessment and management among the elderly population.
Affiliation(s)
- Qingqing Jia
  - Department of Big Data in Health Science School of Public Health, Center of Clinical Big Data and Analytics of The Second Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou 310058, China
- Chen Chen
  - Department of Big Data in Health Science School of Public Health, Center of Clinical Big Data and Analytics of The Second Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou 310058, China
- Andi Xu
  - Department of Big Data in Health Science School of Public Health, Center of Clinical Big Data and Analytics of The Second Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou 310058, China
- Sicong Wang
  - Department of Big Data in Health Science School of Public Health, Center of Clinical Big Data and Analytics of The Second Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou 310058, China
- Xiaojie He
  - Health Management Center, The Second Affiliated Hospital of Zhejiang University School of Medicine, Hangzhou 310009, China
- Guoli Shen
  - Health Management Center, The Second Affiliated Hospital of Zhejiang University School of Medicine, Hangzhou 310009, China
- Yihong Luo
  - Department of Big Data in Health Science School of Public Health, Center of Clinical Big Data and Analytics of The Second Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou 310058, China
- Huakang Tu
  - Department of Big Data in Health Science School of Public Health, Center of Clinical Big Data and Analytics of The Second Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou 310058, China
- Ting Sun
  - Health Management Center, The Second Affiliated Hospital of Zhejiang University School of Medicine, Hangzhou 310009, China
- Xifeng Wu
  - Department of Big Data in Health Science School of Public Health, Center of Clinical Big Data and Analytics of The Second Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou 310058, China
  - National Institute for Data Science in Health and Medicine, Zhejiang University, Hangzhou, Zhejiang, China
  - The Key Laboratory of Intelligent Preventive Medicine of Zhejiang Province, Hangzhou, Zhejiang, China
  - Cancer Center, Zhejiang University, Hangzhou, Zhejiang, China
  - School of Medicine and Health Science, George Washington University, Washington, DC, USA
6
Kang N, Chang H, Ma B, Shan S. A Comprehensive Framework for Long-Tailed Learning via Pretraining and Normalization. IEEE Transactions on Neural Networks and Learning Systems 2024; 35:3437-3449. [PMID: 35895650 DOI: 10.1109/tnnls.2022.3192475] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/15/2023]
Abstract
Data in the visual world often present long-tailed distributions. However, learning high-quality representations and classifiers for imbalanced data is still challenging for data-driven deep learning models. In this work, we aim at improving the feature extractor and classifier for long-tailed recognition via contrastive pretraining and feature normalization, respectively. First, we carefully study the influence of contrastive pretraining under different conditions, showing that current self-supervised pretraining for long-tailed learning is still suboptimal in both performance and speed. We thus propose a new balanced contrastive loss and a fast contrastive initialization scheme to improve previous long-tailed pretraining. Second, motivated by an analysis of normalization in the classifier, we propose a novel generalized normalization classifier that consists of generalized normalization and grouped learnable scaling. It outperforms the traditional inner-product classifier as well as the cosine classifier. Both proposed components improve recognition of tail classes without sacrificing head classes. Finally, we build a unified framework that achieves competitive performance compared with state-of-the-art methods on several long-tailed recognition benchmarks while maintaining high efficiency.
7
Dablain D, Krawczyk B, Chawla NV. DeepSMOTE: Fusing Deep Learning and SMOTE for Imbalanced Data. IEEE Transactions on Neural Networks and Learning Systems 2023; 34:6390-6404. [PMID: 35085094 DOI: 10.1109/tnnls.2021.3136503] [Citation(s) in RCA: 36] [Impact Index Per Article: 18.0] [Indexed: 06/14/2023]
Abstract
Despite over two decades of progress, imbalanced data is still considered a significant challenge for contemporary machine learning models. Modern advances in deep learning have further magnified the importance of the imbalanced data problem, especially when learning from images. Therefore, there is a need for an oversampling method that is specifically tailored to deep learning models, can work on raw images while preserving their properties, and is capable of generating high-quality, artificial images that can enhance minority classes and balance the training set. We propose DeepSMOTE, a novel oversampling algorithm for deep learning models that leverages the properties of the successful synthetic minority oversampling technique (SMOTE) algorithm. It is simple, yet effective in its design. It consists of three major components: 1) an encoder/decoder framework; 2) SMOTE-based oversampling; and 3) a dedicated loss function that is enhanced with a penalty term. An important advantage of DeepSMOTE over generative adversarial network (GAN)-based oversampling is that DeepSMOTE does not require a discriminator, and it generates high-quality artificial images that are both information-rich and suitable for visual inspection. DeepSMOTE code is publicly available at https://github.com/dd1github/DeepSMOTE.
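As a rough illustration of the SMOTE-based oversampling component named in the abstract, the sketch below shows only the classic SMOTE interpolation step on plain feature vectors. It is not the DeepSMOTE implementation: the encoder/decoder and the penalized loss are omitted, and the function name `smote_interpolate` and its parameters are hypothetical. In a DeepSMOTE-style pipeline the input vectors would presumably be encoded representations rather than raw features.

```python
import numpy as np

def smote_interpolate(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority
    samples and one of their k nearest minority neighbours.
    Assumes at least two minority samples."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    n = len(minority)
    k = min(k, n - 1)  # cannot have more neighbours than other points
    # Pairwise Euclidean distances between minority samples.
    dists = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]
    # Pick a random base point and a random neighbour of it for each new sample.
    base = rng.integers(0, n, size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    # Interpolate: x_new = x_base + lambda * (x_neighbour - x_base), lambda in [0, 1).
    lam = rng.random((n_new, 1))
    return minority[base] + lam * (minority[picked] - minority[base])
```

Because every synthetic point lies on a segment between two existing minority samples, the output stays inside the bounding box of the minority class, which is one reason the interpolation is usually applied in a feature space where such segments are meaningful.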
8
Yue Y, Cao L, Chen H, Chen Y, Su Z. Towards an Optimal KELM Using the PSO-BOA Optimization Strategy with Applications in Data Classification. Biomimetics (Basel) 2023; 8:306. [PMID: 37504194 PMCID: PMC10807650 DOI: 10.3390/biomimetics8030306] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Received: 06/24/2023] [Revised: 07/09/2023] [Accepted: 07/09/2023] [Indexed: 07/29/2023] Open
Abstract
The features of the kernel extreme learning machine (KELM), namely efficient processing, improved performance, and less manual parameter setting, have allowed it to be used effectively in batch multi-label classification tasks. These classic classification algorithms must at present contend with accuracy and space-time issues arising from the large-volume, fast, multi-label, and concept-drift characteristics of emerging data streams in practical applications. The KELM training procedure still has the difficulty that it must be repeated many times independently in order to optimize the model's generalization performance or the number of nodes in the hidden layer. In this paper, a kernel extreme learning machine multi-label data classification method based on the butterfly algorithm optimized by particle swarm optimization is proposed. The proposed algorithm, which fully accounts for the optimization of the model generalization ability and the number of hidden layer nodes, can train multiple KELM hidden layer networks at once while maintaining the algorithm's current time complexity and avoiding a significant number of repeated calculations. The simulation results demonstrate that, in comparison to the PSO-KELM, BBA-KELM, and BOA-KELM algorithms, the PSOBOA-KELM algorithm proposed in this paper can more effectively search the kernel extreme learning machine parameters and more effectively balance the global and local performance, resulting in a KELM prediction model with a higher prediction accuracy.
Affiliation(s)
- Yinggao Yue
  - School of Intelligent Manufacturing and Electronic Engineering, Wenzhou University of Technology, Wenzhou 325035, China
  - Intelligent Information Systems Institute, Wenzhou University, Wenzhou 325035, China
- Li Cao
  - School of Intelligent Manufacturing and Electronic Engineering, Wenzhou University of Technology, Wenzhou 325035, China
- Haishao Chen
  - School of Intelligent Manufacturing and Electronic Engineering, Wenzhou University of Technology, Wenzhou 325035, China
- Yaodan Chen
  - School of Intelligent Manufacturing and Electronic Engineering, Wenzhou University of Technology, Wenzhou 325035, China
- Zhonggen Su
  - Taishun Research Institute, Wenzhou University of Technology, Wenzhou 325035, China
9
Luo J, Qiao H, Zhang B. A Minimax Probability Machine for Nondecomposable Performance Measures. IEEE Transactions on Neural Networks and Learning Systems 2023; 34:2353-2365. [PMID: 34473631 DOI: 10.1109/tnnls.2021.3106484] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 05/04/2023]
Abstract
Imbalanced classification tasks are widespread in many real-world applications. For such classification tasks, in comparison with the accuracy rate (AR), it is usually much more appropriate to use nondecomposable performance measures such as the area under the receiver operating characteristic curve (AUC) and the Fβ measure as the classification criterion since the label class is imbalanced. On the other hand, the minimax probability machine is a popular method for binary classification problems and aims at learning a linear classifier by maximizing the AR, which makes it unsuitable for imbalanced classification tasks. The purpose of this article is to develop a new minimax probability machine for the Fβ measure, called the minimax probability machine for the Fβ-measures (MPMF), which can be used to deal with imbalanced classification tasks. A brief discussion is also given on how to extend the MPMF model to several other nondecomposable performance measures listed in the article. To solve the MPMF model effectively, we derive its equivalent form, which can then be solved by an alternating descent method to learn a linear classifier. Further, the kernel trick is employed to derive a nonlinear MPMF model to learn a nonlinear classifier. Several experiments on real-world benchmark datasets demonstrate the effectiveness of our new model.
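For reference, the Fβ measure mentioned in the abstract is conventionally defined from precision P and recall R as below. This is the standard textbook definition rather than a formula quoted from the article; TP, FP, and FN denote true positives, false positives, and false negatives.

```latex
F_\beta = \frac{(1+\beta^2)\, P\, R}{\beta^2 P + R},
\qquad
P = \frac{TP}{TP + FP},
\qquad
R = \frac{TP}{TP + FN}
```

The measure is called nondecomposable because it is a ratio of dataset-level counts and cannot be written as a sum of per-example losses, unlike the accuracy rate.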
10
Fatlawi HK, Kiss A. An Elastic Self-Adjusting Technique for Rare-Class Synthetic Oversampling Based on Cluster Distortion Minimization in Data Stream. Sensors (Basel, Switzerland) 2023; 23:s23042061. [PMID: 36850659 PMCID: PMC9963940 DOI: 10.3390/s23042061] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 01/24/2023] [Revised: 02/08/2023] [Accepted: 02/10/2023] [Indexed: 06/12/2023]
Abstract
Adaptive machine learning has increasing importance due to its ability to classify a data stream and handle the changes in the data distribution. Various resources, such as wearable sensors and medical devices, can generate a data stream with an imbalanced distribution of classes. Many popular oversampling techniques have been designed for imbalanced batch data rather than a continuous stream. This work proposes a self-adjusting window to improve the adaptive classification of an imbalanced data stream based on minimizing cluster distortion. It includes two models; the first chooses only the previous data instances that preserve the coherence of the current chunk's samples. The second model relaxes the strict filter by excluding the examples of the last chunk. Both models include generating synthetic points for oversampling rather than the actual data points. The evaluation of the proposed models using the Siena EEG dataset showed their ability to improve the performance of several adaptive classifiers. The best results have been obtained using Adaptive Random Forest in which Sensitivity reached 96.83% and Precision reached 99.96%.
Affiliation(s)
- Hayder K. Fatlawi
  - Department of Information Systems, ELTE Eötvös Loránd University, 1117 Budapest, Hungary
  - Center of Information Technology Research and Development, University of Kufa, Najaf 540011, Iraq
- Attila Kiss
  - Department of Information Systems, ELTE Eötvös Loránd University, 1117 Budapest, Hungary
  - Department of Informatics, J. Selye University, 94501 Komárno, Slovakia
11
Han M, Guo H, Li J, Wang W. Global-local information based oversampling for multi-class imbalanced data. Int J Mach Learn Cyb 2022. [DOI: 10.1007/s13042-022-01746-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/24/2022]
12
An imbalanced binary classification method via space mapping using normalizing flows with class discrepancy constraints. Inf Sci (N Y) 2022. [DOI: 10.1016/j.ins.2022.12.029] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/23/2022]
13
Han M, Li A, Gao Z, Mu D, Liu S. A survey of multi-class imbalanced data classification methods. Journal of Intelligent & Fuzzy Systems 2022. [DOI: 10.3233/jifs-221902] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/06/2022]
Abstract
In reality, the data generated in many fields are often imbalanced, such as fraud detection, network intrusion detection and disease diagnosis. The class with fewer instances in the data is called the minority class, and the minority class in some applications contains the significant information. So far, many classification methods and strategies for binary imbalanced data have been proposed, but there are still many problems and challenges in multi-class imbalanced data that need to be solved urgently. The classification methods for multi-class imbalanced data are analyzed and summarized in terms of data preprocessing methods and algorithm-level classification methods, and the performance of the algorithms using the same dataset is compared separately. In the data preprocessing methods, the methods of oversampling, under-sampling, hybrid sampling and feature selection are mainly introduced. Algorithm-level classification methods are comprehensively introduced in four aspects: ensemble learning, neural network, support vector machine and multi-class decomposition technique. At the same time, all data preprocessing methods and algorithm-level classification methods are analyzed in detail in terms of the techniques used, comparison algorithms, pros and cons, respectively. Moreover, the evaluation metrics commonly used for multi-class imbalanced data classification methods are described comprehensively. Finally, the future directions of multi-class imbalanced data classification are given.
Affiliation(s)
- Meng Han
  - School of Computer Science and Engineering, North Minzu University, Yinchuan, China
- Ang Li
  - School of Computer Science and Engineering, North Minzu University, Yinchuan, China
- Zhihui Gao
  - School of Computer Science and Engineering, North Minzu University, Yinchuan, China
- Dongliang Mu
  - School of Computer Science and Engineering, North Minzu University, Yinchuan, China
- Shujuan Liu
  - School of Computer Science and Engineering, North Minzu University, Yinchuan, China
14
Solving the class imbalance problem using a counterfactual method for data augmentation. Machine Learning with Applications 2022. [DOI: 10.1016/j.mlwa.2022.100375] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/22/2022] Open
15
Chen W, Yang K, Yu Z, Zhang W. Double-kernel based class-specific broad learning system for multiclass imbalance learning. Knowl Based Syst 2022. [DOI: 10.1016/j.knosys.2022.109535] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/16/2022]
16
Guo Y, Jiao B, Tan Y, Zhang P, Tang F. A transfer weighted extreme learning machine for imbalanced classification. Int J Intell Syst 2022. [DOI: 10.1002/int.22899] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/09/2022]
Affiliation(s)
- Yinan Guo
  - School of Mechanical Electronic and Information Engineering, China University of Mining and Technology (Beijing), Beijing, China
  - School of Information and Control Engineering, China University of Mining and Technology, Xuzhou, China
- Botao Jiao
  - School of Information and Control Engineering, China University of Mining and Technology, Xuzhou, China
- Ying Tan
  - School of Artificial Intelligence, Key Laboratory of Machine Perception (MOE), Institute for Artificial Intelligence, Peking University, Beijing, China
- Pei Zhang
  - School of Information and Control Engineering, China University of Mining and Technology, Xuzhou, China
- Fengzhen Tang
  - State Key Laboratory of Robotics, Shenyang Institute of Automation, Chinese Academy of Sciences, Shenyang, China
  - Institute for Robotics and Intelligent Manufacturing, Chinese Academy of Sciences, Shenyang, China
17
RDPVR: Random Data Partitioning with Voting Rule for Machine Learning from Class-Imbalanced Datasets. Electronics 2022. [DOI: 10.3390/electronics11020228] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Indexed: 12/22/2022]
Abstract
Since most classifiers are biased toward the dominant class, class imbalance is a challenging problem in machine learning. The most popular approaches to solving this problem include oversampling minority examples and undersampling majority examples. Oversampling may increase the probability of overfitting, whereas undersampling eliminates examples that may be crucial to the learning process. We present a linear time resampling method based on random data partitioning and a majority voting rule to address both concerns, where an imbalanced dataset is partitioned into a number of small subdatasets, each of which must be class balanced. After that, a specific classifier is trained for each subdataset, and the final classification result is established by applying the majority voting rule to the results of all of the trained models. We compared the performance of the proposed method to some of the most well-known oversampling and undersampling methods, employing a range of classifiers, on 33 benchmark machine learning class-imbalanced datasets. The classification results produced by the classifiers employed on the generated data by the proposed method were comparable to most of the resampling methods tested, with the exception of SMOTEFUNA, which is an oversampling method that increases the probability of overfitting. The proposed method produced results that were comparable to the Easy Ensemble (EE) undersampling method. As a result, for solving the challenge of machine learning from class-imbalanced datasets, we advocate using either EE or our method.
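The partition-then-vote idea described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' RDPVR code: it assumes a binary, imbalanced problem with labels 0 and 1, splits the shuffled majority class into chunks of roughly minority size and pairs each chunk with all minority samples (one possible reading of "class balanced" sub-datasets), and uses a toy nearest-centroid learner as a stand-in for whatever classifier is trained per sub-dataset. All function names are hypothetical.

```python
import numpy as np

def fit_centroids(X, y):
    """Toy base learner (an assumption): one mean vector per class."""
    classes = np.unique(y)
    return classes, np.stack([X[y == c].mean(axis=0) for c in classes])

def predict_centroids(model, X):
    """Assign each row of X to the class with the nearest centroid."""
    classes, centroids = model
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return classes[d.argmin(axis=1)]

def balanced_partitions(X, y, seed=0):
    """Yield approximately balanced sub-datasets: each holds one random
    chunk of the majority class plus every minority sample."""
    rng = np.random.default_rng(seed)
    labels, counts = np.unique(y, return_counts=True)
    minority, majority = labels[counts.argmin()], labels[counts.argmax()]
    min_idx = np.flatnonzero(y == minority)
    maj_idx = rng.permutation(np.flatnonzero(y == majority))
    n_parts = max(1, len(maj_idx) // len(min_idx))
    for chunk in np.array_split(maj_idx, n_parts):
        idx = np.concatenate([chunk, min_idx])
        yield X[idx], y[idx]

def fit_ensemble(X, y, seed=0):
    """Train one base learner per balanced sub-dataset."""
    return [fit_centroids(Xp, yp) for Xp, yp in balanced_partitions(X, y, seed)]

def predict_majority_vote(models, X):
    """Combine the per-model predictions by majority vote (ties go to 0)."""
    votes = np.stack([predict_centroids(m, X) for m in models])
    return (votes.mean(axis=0) > 0.5).astype(int)

# Usage on synthetic data: 90 majority points near (0, 0), 10 minority near (5, 5).
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 0.5, (90, 2)), rng.normal(5.0, 0.5, (10, 2))])
y = np.array([0] * 90 + [1] * 10)
models = fit_ensemble(X, y)
predictions = predict_majority_vote(models, np.array([[0.0, 0.0], [5.0, 5.0]]))
```

Each sub-dataset reuses the full minority class, so no minority example is discarded and no synthetic example is created, which matches the abstract's stated goal of avoiding the drawbacks of both oversampling and undersampling.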
18
Dou J, Song Y, Wei G, Zhang Y. Fuzzy information decomposition incorporated and weighted Relief-F feature selection: When imbalanced data meet incompletion. Inf Sci (N Y) 2022. [DOI: 10.1016/j.ins.2021.10.057] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Indexed: 11/16/2022]
|
19
|
RB-CCR: Radial-Based Combined Cleaning and Resampling algorithm for imbalanced data classification. Mach Learn 2021. [DOI: 10.1007/s10994-021-06012-8] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/25/2022]
Abstract
Real-world classification domains, such as medicine, health and safety, and finance, often exhibit imbalanced class priors and have asynchronous misclassification costs. In such cases, the classification model must achieve a high recall without significantly impacting precision. Resampling the training data is the standard approach to improving classification performance on imbalanced binary data. However, the state-of-the-art methods ignore the local joint distribution of the data or correct it as a post-processing step. This can cause sub-optimal shifts in the training distribution, particularly when the target data distribution is complex. In this paper, we propose Radial-Based Combined Cleaning and Resampling (RB-CCR). RB-CCR utilizes the concept of class potential to refine the energy-based resampling approach of CCR. In particular, RB-CCR exploits the class potential to accurately locate sub-regions of the data-space for synthetic oversampling. The category sub-region for oversampling can be specified as an input parameter to meet domain-specific needs or be automatically selected via cross-validation. Our $5\times 2$ cross-validated results on 57 benchmark binary datasets with 9 classifiers show that RB-CCR achieves a better precision-recall trade-off than CCR and generally outperforms the state-of-the-art resampling methods in terms of AUC and G-mean.
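The abstract reports G-mean without defining it. As a rough sketch, assuming the common definition (the geometric mean of sensitivity and specificity for a binary problem), it can be computed as follows. The function name `g_mean` and the toy labels are illustrative only and do not come from the paper.

```python
import math


def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity and specificity (assumed definition)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return math.sqrt(sensitivity * specificity)


# Toy imbalanced labels: 2 positives, 8 negatives.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
always_negative = [0] * 10                       # 80% accuracy, finds no positives
mixed = [1, 0, 1, 1, 0, 0, 0, 0, 0, 0]           # sensitivity 0.5, specificity 0.75
print(g_mean(y_true, always_negative), g_mean(y_true, mixed))
```

A classifier that always predicts the majority class scores zero on this metric despite its high accuracy, which is why G-mean is often preferred on imbalanced data.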
|
20
|
Ghadermarzi S, Krawczyk B, Song J, Kurgan L. XRRpred: Accurate Predictor of Crystal Structure Quality from Protein Sequence. Bioinformatics 2021; 37:4366-4374. [PMID: 34247234 DOI: 10.1093/bioinformatics/btab509] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 03/26/2021] [Revised: 06/10/2021] [Accepted: 07/06/2021] [Indexed: 11/12/2022] Open
Abstract
MOTIVATION X-ray crystallography was used to produce nearly 90% of protein structures. These efforts were supported by numerous sequence-based tools that accurately predict crystallizable proteins. However, protein structures vary widely in their quality, typically measured with resolution and R-free. This impacts the ability to use these structures for some applications, including rational drug design and molecular docking, and motivates the development of methods that accurately predict structure quality. RESULTS We introduce XRRpred, the first predictor of the resolution and R-free values from protein sequences. XRRpred relies on original sequence profiles, hand-crafted features, empirically selected and parametrized regressors, and modern resampling techniques. Using an independent test dataset, we show that XRRpred provides accurate predictions of resolution and R-free. We demonstrate that XRRpred's predictions correctly model the relationship between the resolution and R-free and reproduce structure quality relations between structural classes of proteins. We also show that XRRpred significantly outperforms indirect alternative ways to predict the structure quality that include predictors of crystallization propensity and an alignment-based approach. XRRpred is available as a convenient webserver that allows batch predictions and offers informative visualization of the results. AVAILABILITY http://biomine.cs.vcu.edu/servers/XRRPred/.
Affiliation(s)
- Sina Ghadermarzi
  - Department of Computer Science, Virginia Commonwealth University, Richmond, VA 23284, USA
- Bartosz Krawczyk
  - Department of Computer Science, Virginia Commonwealth University, Richmond, VA 23284, USA
- Jiangning Song
  - Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia; Monash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne, VIC 3800, Australia
- Lukasz Kurgan
  - Department of Computer Science, Virginia Commonwealth University, Richmond, VA 23284, USA
|
21
|
Classifying multiclass imbalanced data using generalized class-specific extreme learning machine. Progress in Artificial Intelligence 2021. [DOI: 10.1007/s13748-021-00236-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/22/2022]
|
22
|
|
23
|
Kim KH, Sohn SY. Hybrid neural network with cost-sensitive support vector machine for class-imbalanced multimodal data. Neural Netw 2020; 130:176-184. [DOI: 10.1016/j.neunet.2020.06.026] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 03/08/2020] [Revised: 06/13/2020] [Accepted: 06/30/2020] [Indexed: 10/23/2022]
|
24
|
Koziarski M, Woźniak M, Krawczyk B. Combined Cleaning and Resampling algorithm for multi-class imbalanced data with label noise. Knowl Based Syst 2020. [DOI: 10.1016/j.knosys.2020.106223] [Citation(s) in RCA: 39] [Impact Index Per Article: 7.8] [Indexed: 11/30/2022]
|