1
Ahsan MM, Ali MS, Siddique Z. Enhancing and improving the performance of imbalanced class data using novel GBO and SSG: A comparative analysis. Neural Netw 2024; 173:106157. [PMID: 38335796 DOI: 10.1016/j.neunet.2024.106157] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/18/2023] [Revised: 01/01/2024] [Accepted: 02/01/2024] [Indexed: 02/12/2024]
Abstract
The class imbalance problem (CIP) in a dataset is a major challenge that significantly affects the performance of Machine Learning (ML) models, resulting in biased predictions. Numerous techniques have been proposed to address CIP, including, but not limited to, oversampling, undersampling, and cost-sensitive approaches. Due to their ability to generate synthetic data, oversampling techniques such as the Synthetic Minority Oversampling Technique (SMOTE) are the methodology most widely used by researchers. However, one of SMOTE's potential disadvantages is that newly created minor samples overlap with major samples. Therefore, the probability of ML models' biased performance toward major classes increases. Generative adversarial networks (GANs) have recently garnered much attention due to their ability to create realistic samples. However, GANs are hard to train even though they have much potential. Considering these opportunities, this work proposes two novel techniques: GAN-based Oversampling (GBO) and Support Vector Machine-SMOTE-GAN (SSG) to overcome the limitations of the existing approaches. The preliminary results show that SSG and GBO performed better on the nine imbalanced benchmark datasets than several existing SMOTE-based approaches. Additionally, it can be observed that the proposed SSG and GBO methods can accurately classify the minor class with more than 90% accuracy when tested with 20%, 30%, and 40% of the test data. The study also revealed that the minor samples generated by SSG demonstrate Gaussian distributions, which is often difficult to achieve using original SMOTE and SVM-SMOTE.
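To make the overlap concern above concrete, here is a minimal sketch of the basic SMOTE interpolation step, using only the Python standard library. It is not the GBO or SSG method from the paper. The function name `smote_sketch`, its parameters, and the toy points are assumptions chosen for illustration.

```python
import math
import random

def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (basic SMOTE idea).
    Hypothetical helper for illustration only."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point, excluding itself
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic

# Toy minority class (made-up values)
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_sketch(minority, n_new=5)
```

Because the interpolation only looks at minority samples, a synthetic point can land in a region occupied by the majority class, which is the overlap the abstract describes.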
Affiliation(s)
- Md Manjurul Ahsan
  - School of Industrial and Systems Engineering, University of Oklahoma, Norman, OK 73019, USA.
- Md Shahin Ali
  - Department of Biomedical Engineering, Islamic University, Kushtia 7003, Bangladesh.
- Zahed Siddique
  - School of Aerospace and Mechanical Engineering, University of Oklahoma, Norman, OK 73019, USA.
2
An adaptive multi-class imbalanced classification framework based on ensemble methods and deep network. Neural Comput Appl 2023. [DOI: 10.1007/s00521-023-08290-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/25/2023]
3
Jiang J, Chan L, Nadkarni GN. The promise of artificial intelligence for kidney pathophysiology. Curr Opin Nephrol Hypertens 2022; 31:380-386. [PMID: 35703218 PMCID: PMC10309072 DOI: 10.1097/mnh.0000000000000808] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/01/2023]
Abstract
PURPOSE OF REVIEW We seek to determine recent advances in kidney pathophysiology that have been enabled or enhanced by artificial intelligence. We describe some of the challenges in the field as well as future directions. RECENT FINDINGS We first provide an overview of artificial intelligence terminologies and methodologies. We then describe the use of artificial intelligence in kidney diseases to discover risk factors from clinical data for disease progression, annotate whole slide imaging and decipher multiomics data. We delineate key examples of risk stratification and prognostication in acute kidney injury (AKI) and chronic kidney disease (CKD). We contextualize these applications in kidney disease oncology, one of the subfields to benefit demonstrably from artificial intelligence using all of these approaches. We conclude by elucidating technical challenges and ethical considerations and briefly considering future directions. SUMMARY The integration of clinical data, patient derived data, histology and proteomics and genomics can enhance the work of clinicians in providing more accurate diagnoses and elevating understanding of disease progression. Implementation research needs to be performed to translate these algorithms to the clinical setting.
Affiliation(s)
- Joy Jiang
  - Division of Data Driven and Digital Medicine (D3M), Icahn School of Medicine at Mount Sinai, New York, New York, USA
- Lili Chan
  - Division of Data Driven and Digital Medicine (D3M), Icahn School of Medicine at Mount Sinai, New York, New York, USA
  - Division of Nephrology, Department of Medicine, Icahn School of Medicine at Mount Sinai, New York, New York, USA
  - The Charles Bronfman Institute for Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- Girish N. Nadkarni
  - Division of Data Driven and Digital Medicine (D3M), Icahn School of Medicine at Mount Sinai, New York, New York, USA
  - Division of Nephrology, Department of Medicine, Icahn School of Medicine at Mount Sinai, New York, New York, USA
  - The Charles Bronfman Institute for Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, New York, USA
  - Mount Sinai Clinical Intelligence Center, Icahn School of Medicine at Mount Sinai, New York, New York, USA
  - The Hasso Plattner Institute for Digital Health at Mount Sinai, Icahn School of Medicine at Mount Sinai, New York, New York, USA
4
Tasci E, Zhuge Y, Camphausen K, Krauze AV. Bias and Class Imbalance in Oncologic Data-Towards Inclusive and Transferrable AI in Large Scale Oncology Data Sets. Cancers (Basel) 2022; 14:2897. [PMID: 35740563 PMCID: PMC9221277 DOI: 10.3390/cancers14122897] [Citation(s) in RCA: 37] [Impact Index Per Article: 12.3] [Received: 05/27/2022] [Revised: 06/07/2022] [Accepted: 06/09/2022] [Indexed: 02/06/2023] Open
Abstract
Recent technological developments have led to an increase in the size and types of data in the medical field derived from multiple platforms such as proteomic, genomic, imaging, and clinical data. Many machine learning models have been developed to support precision/personalized medicine initiatives such as computer-aided detection, diagnosis, prognosis, and treatment planning by using large-scale medical data. Bias and class imbalance represent two of the most pressing challenges for machine learning-based problems, particularly in medical (e.g., oncologic) data sets, due to the limitations in patient numbers, cost, privacy, and security of data sharing, and the complexity of generated data. Depending on the data set and the research question, the methods applied to address class imbalance problems can provide more effective, successful, and meaningful results. This review discusses the essential strategies for addressing and mitigating the class imbalance problems for different medical data types in the oncologic domain.
Affiliation(s)
- Erdal Tasci
  - Center for Cancer Research, National Cancer Institute, NIH, Building 10, Bethesda, MD 20892, USA
  - Department of Computer Engineering, Ege University, Izmir 35100, Turkey
- Ying Zhuge
  - Center for Cancer Research, National Cancer Institute, NIH, Building 10, Bethesda, MD 20892, USA
- Kevin Camphausen
  - Center for Cancer Research, National Cancer Institute, NIH, Building 10, Bethesda, MD 20892, USA
- Andra V. Krauze
  - Center for Cancer Research, National Cancer Institute, NIH, Building 10, Bethesda, MD 20892, USA
5
Al-Obeidat F, Rocha Á, Akram M, Razzaq S, Maqbool F. (CDRGI)-Cancer detection through relevant genes identification. Neural Comput Appl 2022. [DOI: 10.1007/s00521-021-05739-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/22/2022]
6
Li X, Li K. Imbalanced data classification based on improved EIWAPSO-AdaBoost-C ensemble algorithm. Appl Intell 2022. [DOI: 10.1007/s10489-021-02708-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/20/2022]
7
Song B, Li S, Sunny S, Gurushanth K, Mendonca P, Mukhia N, Patrick S, Gurudath S, Raghavan S, Tsusennaro I, Leivon ST, Kolur T, Shetty V, Bushan V, Ramesh R, Peterson T, Pillai V, Wilder-Smith P, Sigamani A, Suresh A, Kuriakose MA, Birur P, Liang R. Classification of imbalanced oral cancer image data from high-risk population. J Biomed Opt 2021; 26:JBO-210246R. [PMID: 34689442 PMCID: PMC8536945 DOI: 10.1117/1.jbo.26.10.105001] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 08/03/2021] [Accepted: 09/28/2021] [Indexed: 06/13/2023]
Abstract
SIGNIFICANCE Early detection of oral cancer is vital for high-risk patients, and machine learning-based automatic classification is ideal for disease screening. However, current datasets collected from high-risk populations are unbalanced and often have detrimental effects on the performance of classification. AIM To reduce the class bias caused by data imbalance. APPROACH We collected 3851 polarized white light cheek mucosa images using our customized oral cancer screening device. We use weight balancing, data augmentation, undersampling, focal loss, and ensemble methods to improve the neural network performance of oral cancer image classification with the imbalanced multi-class datasets captured from high-risk populations during oral cancer screening in low-resource settings. RESULTS By applying both data-level and algorithm-level approaches to the deep learning training process, the performance of the minority classes, which were difficult to distinguish at the beginning, has been improved. The accuracy of the "premalignancy" class is also increased, which is ideal for screening applications. CONCLUSIONS Experimental results show that the class bias induced by imbalanced oral cancer image datasets could be reduced using both data- and algorithm-level methods. Our study may provide an important basis for helping understand the influence of unbalanced datasets on oral cancer deep learning classifiers and how to mitigate it.
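Two of the algorithm-level techniques named above, weight balancing and focal loss, can be sketched in a few lines. This is a generic illustration, not the authors' training code. The function names, the inverse-frequency weighting scheme, the default `gamma=2.0`, and the class counts are assumptions.

```python
import math

def inverse_frequency_weights(counts):
    """Per-class weights proportional to 1 / class frequency, scaled so a
    perfectly balanced dataset would give every class a weight of 1."""
    total = sum(counts)
    return [total / (len(counts) * c) for c in counts]

def focal_loss(probs, target, gamma=2.0, alpha=None):
    """Focal loss for one sample: -alpha_t * (1 - p_t)^gamma * log(p_t).
    probs is the predicted class distribution, target the true class index."""
    p_t = probs[target]
    weight = 1.0 if alpha is None else alpha[target]
    return -weight * (1.0 - p_t) ** gamma * math.log(p_t)

# Made-up class counts: 90 majority samples, 10 minority samples
weights = inverse_frequency_weights([90, 10])

# A confidently correct prediction contributes far less than a wrong one
easy = focal_loss([0.9, 0.1], target=0, alpha=weights)
hard = focal_loss([0.9, 0.1], target=1, alpha=weights)
```

With `gamma=0` and no `alpha`, the expression reduces to ordinary cross-entropy. Larger `gamma` values shift the training signal toward hard, often minority-class, examples.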
Affiliation(s)
- Bofan Song
  - The University of Arizona, Wyant College of Optical Sciences, Tucson, Arizona, United States
- Shaobai Li
  - The University of Arizona, Wyant College of Optical Sciences, Tucson, Arizona, United States
- Nirza Mukhia
  - KLE Society Institute of Dental Sciences, Bangalore, India
- Trupti Kolur
  - Mazumdar Shaw Medical Foundation, Bangalore, India
- Vivek Shetty
  - Mazumdar Shaw Medical Foundation, Bangalore, India
- Vidya Bushan
  - Mazumdar Shaw Medical Foundation, Bangalore, India
- Rohan Ramesh
  - Christian Institute of Health Sciences and Research, Dimapur, India
- Tyler Peterson
  - The University of Arizona, Wyant College of Optical Sciences, Tucson, Arizona, United States
- Vijay Pillai
  - Mazumdar Shaw Medical Foundation, Bangalore, India
- Petra Wilder-Smith
  - University of California Beckman Laser Institute and Medical Clinic, Irvine, California, United States
- Amritha Suresh
  - Mazumdar Shaw Medical Centre, Bangalore, India
  - Mazumdar Shaw Medical Foundation, Bangalore, India
- Praveen Birur
  - KLE Society Institute of Dental Sciences, Bangalore, India
  - Biocon Foundation, Bangalore, India
- Rongguang Liang
  - The University of Arizona, Wyant College of Optical Sciences, Tucson, Arizona, United States
8
Abstract
In recent years, the demand for alternative medical diagnostics of the human kidney has been growing, partly because such methods are non-invasive, early, real-time, and pain-free. Chronic kidney problems are among the major kidney problems and require early-stage diagnosis. Therefore, in this work, we have proposed and developed an Intelligent Iris-based Chronic Kidney Identification System (ICKIS). The ICKIS takes an image of the human iris as input and, on the basis of iridology, applies a deep neural network model on a GPU-based supercomputing machine. The deep neural network models are trained using 2000 subjects, both healthy and with chronic kidney problems. When the proposed ICKIS is tested on 2000 separate subjects (1000 healthy and 1000 with chronic kidney problems), the system achieves iris-based chronic kidney assessment with an accuracy of 96.8%. In the future, we will work to improve our AI algorithm and try data-set cleaning, so that accuracy can be increased by learning the features more efficiently.
9
Deperlioglu O, Kose U, Gupta D, Khanna A, Sangaiah AK. Diagnosis of heart diseases by a secure Internet of Health Things system based on Autoencoder Deep Neural Network. Comput Commun 2020; 162:31-50. [PMID: 32843778 PMCID: PMC7434639 DOI: 10.1016/j.comcom.2020.08.011] [Citation(s) in RCA: 17] [Impact Index Per Article: 3.4] [Received: 04/03/2020] [Revised: 08/01/2020] [Accepted: 08/17/2020] [Indexed: 05/04/2023]
Abstract
The objective of this study is to introduce a secure IoHT system, which acts as a clinical decision support system for the diagnosis of cardiovascular diseases. In this sense, it was emphasized that the accuracy rate of diagnosis (classification) can be improved via deep learning algorithms, without needing hybrid, complex models, and that secure data processing can be achieved with a multi-authentication and Tangle based approach. In detail, heart sounds were classified with Autoencoder Neural Networks (AEN) and the IoHT system was built for supporting doctors in real-time. For developing the diagnosis infrastructure by the AEN, PASCAL B-Training and Physiobank-PhysioNet A-Training heart sound datasets were used accordingly. For the PASCAL dataset, the AEN provided a diagnosis-classification performance with an accuracy of 100%, a sensitivity of 100%, and a specificity of 100%, whereas the rates were respectively 99.8%, 99.65%, and 99.13% for the PhysioNet dataset. It was seen that the findings by the developed AEN based solution were better than the alternative solutions from the literature. Additionally, usability of the whole IoHT system was found positive by the doctors, and according to the 479 real-case applications, the system was able to achieve accuracy rates of 96.03% for normal heart sounds, 91.91% for extrasystole, and 90.11% for murmur. In terms of security approach, the system was also robust against several attacking methods including synthetic data impute as well as attempts to penetrate the system via the central system or mobile devices.
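The three rates reported above are all derived from a binary confusion matrix. The sketch below shows the standard definitions. The function name `binary_metrics` and the counts are made up for illustration and are not taken from the study.

```python
def binary_metrics(tp, fn, tn, fp):
    """Accuracy, sensitivity (true positive rate) and specificity
    (true negative rate) from binary confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

# Hypothetical counts: 100 diseased and 100 healthy recordings
m = binary_metrics(tp=90, fn=10, tn=80, fp=20)
```

Sensitivity and specificity are computed per class, so unlike accuracy they are not inflated when one class dominates the test set.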
Affiliation(s)
- Utku Kose
  - Suleyman Demirel University, Isparta, Turkey
- Deepak Gupta
  - Maharaja Agrasen Institute of Technology, Delhi, India
- Ashish Khanna
  - Maharaja Agrasen Institute of Technology, Delhi, India
- Arun Kumar Sangaiah
  - School of Computing Science and Engineering, Vellore Institute of Technology, Vellore, India
  - Department of Industrial Engineering and Management, National Yunlin University of Science and Technology, Taiwan
10
Applied Identification of Industry Data Science Using an Advanced Multi-Componential Discretization Model. Symmetry (Basel) 2020. [DOI: 10.3390/sym12101620] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 11/16/2022] Open
Abstract
Applied human large-scale data are collected from heterogeneous science or industry databases for the purposes of achieving data utilization in complex application environments, such as in financial applications. This has posed great opportunities and challenges to all kinds of scientific data researchers. Thus, finding an intelligent hybrid model that solves financial application problems of the stock market is an important issue for financial analysts. In practice, classification applications that focus on the earnings per share (EPS) with financial ratios from an industry database often demonstrate that the data meet the abovementioned standards and have particularly high application value. This study proposes several advanced multicomponential discretization models, named Models A–E, where each model identifies and presents a positive/negative diagnosis based on the experiences of the latest financial statements from six different industries. The varied components of the model test performance measurements comparatively by using data-preprocessing, data-discretization, feature-selection, two data split methods, machine learning, rule-based decision tree knowledge, time-lag effects, different times of running experiments, and two different class types. The experimental dataset had 24 condition features and a decision feature EPS that was used to classify the data into two and three classes for comparison. Empirically, the analytical results of this study showed that three main determinants were identified: total asset growth rate, operating income per share, and times interest earned. The core components of the following techniques are as follows: data-discretization and feature-selection, with some noted classifiers that had significantly better accuracy. 
Total solution results demonstrated the following key points: (1) The highest accuracy, 92.46%, occurred in Model C from the use of decision tree learning with a percentage-split method for two classes in one run; (2) the highest accuracy mean, 91.44%, occurred in Models D and E from the use of naïve Bayes learning for cross-validation and percentage-split methods for each class for 10 runs; (3) the highest average accuracy mean, 87.53%, occurred in Models D and E with a cross-validation method for each class; (4) the highest accuracy, 92.46%, occurred in Model C from the use of decision tree learning-C4.5 with the percentage-split method and no time-lag for each class. This study concludes that its contribution is regarded as managerial implication and technical direction for practical finance in which a multicomponential discretization model has limited use and is rarely seen as applied by scientific industry data due to various restrictions.
11
Data Sampling Methods to Deal With the Big Data Multi-Class Imbalance Problem. Appl Sci (Basel) 2020. [DOI: 10.3390/app10041276] [Citation(s) in RCA: 23] [Impact Index Per Article: 4.6] [Indexed: 12/14/2022]
Abstract
The class imbalance problem has been a hot topic in the machine learning community in recent years. Nowadays, in the time of big data and deep learning, this problem remains in force. Much work has been performed to deal with the class imbalance problem, the random sampling methods (over and under sampling) being the most widely employed approaches. Moreover, sophisticated sampling methods have been developed, including the Synthetic Minority Over-sampling Technique (SMOTE), and these have also been combined with cleaning techniques such as Editing Nearest Neighbor or Tomek’s Links (SMOTE+ENN and SMOTE+TL, respectively). In the big data context, it is noticeable that the class imbalance problem has been addressed by adaptation of traditional techniques, relatively ignoring intelligent approaches. Thus, the capabilities and possibilities of heuristic sampling methods on deep learning neural networks in the big data domain are analyzed in this work, and the cleaning strategies are particularly analyzed. This study is developed on big data, multi-class imbalanced datasets obtained from hyper-spectral remote sensing images. The effectiveness of a hybrid approach on these datasets is analyzed, in which the dataset is cleaned by SMOTE followed by the training of an Artificial Neural Network (ANN) with those data, while the neural network output noise is processed with ENN to eliminate output noise; after that, the ANN is trained again with the resultant dataset. Obtained results suggest that the best classification outcome is achieved when the cleaning strategies are applied on an ANN output instead of input feature space only. Consequently, the need to consider the classifier’s nature when the classical class imbalance approaches are adapted in deep learning and big data scenarios is clear.
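The cleaning step that the abstract calls Editing Nearest Neighbor (ENN) can be sketched as follows: a sample is dropped when its label disagrees with the majority label of its k nearest neighbours. This is a minimal standard-library sketch applied in the input feature space. It is not the paper's variant that operates on the ANN output. The function name and the toy data are assumptions.

```python
import math
from collections import Counter

def edited_nearest_neighbours(points, labels, k=3):
    """Return the indices of samples whose label agrees with the majority
    label among their k nearest neighbours (the basic ENN rule)."""
    keep = []
    for i, (p, y) in enumerate(zip(points, labels)):
        # Indices of the k closest other samples
        neighbours = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )[:k]
        majority_label, _ = Counter(labels[j] for j in neighbours).most_common(1)[0]
        if majority_label == y:
            keep.append(i)
    return keep

# Two toy clusters plus one class-1 point sitting inside the class-0 cluster
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
          (0.05, 0.05)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
kept = edited_nearest_neighbours(points, labels)
# The last point (index 7) is surrounded by class 0 and is removed
```

In a SMOTE+ENN pipeline this filter would run after oversampling, so that synthetic points falling into the other class's region are discarded.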