Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

Total Articles: 76 (from Reference Citation Analysis)

Article PDFs (24)

Cited by > 0 (49)

Searched Name: Imbalanced data

Kasprzak J, Westphalen CB, Frey S, Schmitt Y, Heinemann V, Fey T, Nasseh D. Supporting the decision to perform molecular profiling for cancer patients based on routinely collected data through the use of machine learning. Clin Exp Med 2024;24:73. [PMID: 38598013 PMCID: PMC11006770 DOI: 10.1007/s10238-024-01336-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/24/2023] [Accepted: 03/21/2024] [Indexed: 04/11/2024]

John M, Shaiba H. Identification of self-care problem in children using machine learning. Heliyon 2024;10:e26977. [PMID: 38463780 PMCID: PMC10923687 DOI: 10.1016/j.heliyon.2024.e26977] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/16/2023] [Revised: 02/14/2024] [Accepted: 02/22/2024] [Indexed: 03/12/2024] Open

Liu Q, Chen Y, Xie P, Luo Y, Wang B, Meng Y, Zhong J, Mei J, Zou W. Development of a predictive machine learning model for pathogen profiles in patients with secondary immunodeficiency. BMC Med Inform Decis Mak 2024;24:48. [PMID: 38350899 PMCID: PMC10863296 DOI: 10.1186/s12911-024-02447-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/16/2023] [Accepted: 01/30/2024] [Indexed: 02/15/2024] Open

Severinsen I, Yu W, Walmsley T, Young B. COVERT: A classless approach to generating balanced datasets for process modelling. ISA Trans 2024;144:1-10. [PMID: 37951753 DOI: 10.1016/j.isatra.2023.10.031] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/27/2022] [Revised: 09/04/2023] [Accepted: 10/27/2023] [Indexed: 11/14/2023]

Zhang W, Guan X, Jiao S, Wang G, Wang X. Development and validation of an artificial intelligence prediction model and a survival risk stratification for lung metastasis in colorectal cancer from highly imbalanced data: A multicenter retrospective study. Eur J Surg Oncol 2023;49:107107. [PMID: 37883884 DOI: 10.1016/j.ejso.2023.107107] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/14/2023] [Revised: 09/08/2023] [Accepted: 10/02/2023] [Indexed: 10/28/2023]

Liu X, Lu J, Chen X, Fong YHC, Ma X, Zhang F. Attention based spatio-temporal graph convolutional network with focal loss for crash risk evaluation on urban road traffic network based on multi-source risks. Accid Anal Prev 2023;192:107262. [PMID: 37598458 DOI: 10.1016/j.aap.2023.107262] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/29/2022] [Revised: 04/17/2023] [Accepted: 08/06/2023] [Indexed: 08/22/2023]

Abstract

Urban road transportation has a high probability of crash occurrence, and the aim of the present study is to evaluate crash risk for urban road networks. However, the irregular structure of urban road networks, the high-dimensional spatio-temporal correlations among multi-source risks (i.e., the contributing risks from traffic flow, meteorological conditions, road design, and so forth), and the issue of data imbalance make this evaluation difficult. To address these issues, an Attention based Spatio-Temporal Graph Convolutional Network (ASTGCN) model with a focal loss function is used for the first time to evaluate crash risk on an urban road network. This work can be summarized as (1) adopting the spatio-temporal graph convolution structure to capture the spatio-temporal properties and characterize the multi-source risks; (2) utilizing an attention mechanism network to address the critical contributing risks during crash risk evaluation; (3) introducing the focal loss function to improve the model performance impacted by the imbalanced data; and (4) investigating the different contributions of multi-source risks to model performance. The evaluation performance is tested in a real-world urban road traffic network. The raw data consist of 1239 crash records with corresponding datasets of traffic flow characteristics, meteorological conditions, road attributes and the topological structure of the road network. Three baseline models, Artificial Neural Network (ANN), Random Forest (RF), and Deep Spatio-Temporal Graph Convolutional Network (DSTGCN), are compared to the proposed ASTGCN on the same datasets. Overall, the results show that ASTGCN outperforms the baseline models in several evaluation metrics. ASTGCN with the focal loss function further improves performance by tackling dataset imbalance. Additionally, traffic flow risk is found to be the most crucial to model performance. The findings of the present study indicate that the proposed model can efficiently evaluate dynamic crash risk for urban road networks, which will benefit the safety management of urban road transportation.
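
The abstract above names a focal loss function as its remedy for class imbalance but does not give its form. As a minimal sketch, assuming the standard binary focal loss FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t), the snippet below shows how confidently classified examples are down-weighted relative to hard ones. The function names and the values gamma = 2.0 and alpha = 0.25 are illustrative assumptions, not settings reported by the authors.

```python
import math


def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Mean binary cross-entropy, used here only as a reference point."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)      # keep log() finite
        p_t = p if y == 1 else 1.0 - p       # probability assigned to the true class
        total += -math.log(p_t)
    return total / len(y_true)


def binary_focal_loss(y_true, p_pred, gamma=2.0, alpha=0.25, eps=1e-12):
    """Mean binary focal loss; gamma and alpha are assumed, not taken from the paper."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)
        p_t = p if y == 1 else 1.0 - p
        alpha_t = alpha if y == 1 else 1.0 - alpha
        # (1 - p_t) ** gamma shrinks the contribution of easy examples
        total += -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(y_true)


if __name__ == "__main__":
    easy = binary_focal_loss([1], [0.9])   # well-classified positive
    hard = binary_focal_loss([1], [0.1])   # badly misclassified positive
    print(f"easy example loss: {easy:.5f}, hard example loss: {hard:.5f}")
```

With gamma set to 0 and alpha set to 0.5 the expression reduces to half the ordinary cross-entropy, which is a convenient sanity check when implementing it.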

Li Y, Yang Z, Xing L, Yuan C, Liu F, Wu D, Yang H. Crash injury severity prediction considering data imbalance: A Wasserstein generative adversarial network with gradient penalty approach. Accid Anal Prev 2023;192:107271. [PMID: 37659275 DOI: 10.1016/j.aap.2023.107271] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/28/2022] [Revised: 07/29/2023] [Accepted: 08/24/2023] [Indexed: 09/04/2023]

Abstract

For each road crash event, it is necessary to predict its injury severity. However, predicting crash injury severity from imbalanced data frequently results in an ineffective classifier. Because severe injuries are rare in road traffic crashes, crash data are extremely imbalanced among injury severity classes, which makes training prediction models challenging. To achieve interclass balance, it is possible to generate minority class samples using data augmentation techniques. To address the imbalance in crash injury severity data, this study applies a novel deep learning method, the Wasserstein generative adversarial network with gradient penalty (WGAN-GP), to a massive amount of crash data, generating synthetic injury severity data linked to traffic crashes in order to rebalance the dataset. To evaluate the effectiveness of the WGAN-GP model, we systematically compare the performance of various commonly used sampling techniques (random under-sampling, random over-sampling, synthetic minority over-sampling technique and adaptive synthetic sampling) with respect to dataset balance and crash injury severity prediction. After rebalancing the dataset, this study categorizes crash injury severity using logistic regression, multilayer perceptron, random forest, AdaBoost and XGBoost. The AUC, specificity and sensitivity are employed as evaluation indicators to compare prediction performance. Results demonstrate that sampling techniques can considerably improve the prediction performance for minority classes in an imbalanced dataset, and the combination of XGBoost and WGAN-GP performs best, with an AUC of 0.794 and a sensitivity of 0.698. Finally, the interpretability of the model is improved by the explainable machine learning technique SHAP (SHapley Additive exPlanation), allowing for a deeper understanding of the effects of each variable on crash injury severity. The findings of this study shed light on the prediction of crash injury severity with data imbalance using data-driven approaches.
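
Two of the baseline sampling techniques listed in the abstract above, random over-sampling and random under-sampling, are simple enough to sketch directly. The code below is a hedged illustration of the general idea only; the function names, the fixed seed, and the toy data are assumptions and do not reproduce the authors' implementation.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen samples until every class matches the largest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        pool = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            X_out.append(rng.choice(pool))   # sampling with replacement
            y_out.append(label)
    return X_out, y_out


def random_undersample(X, y, seed=0):
    """Keep a random subset of each class so every class matches the smallest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = min(counts.values())
    X_out, y_out = [], []
    for label in counts:
        pool = [x for x, lab in zip(X, y) if lab == label]
        for x in rng.sample(pool, target):   # sampling without replacement
            X_out.append(x)
            y_out.append(label)
    return X_out, y_out


if __name__ == "__main__":
    # Toy data: 8 majority samples (label 0) and 2 minority samples (label 1).
    X = [[float(i)] for i in range(10)]
    y = [0] * 8 + [1] * 2
    print(Counter(random_oversample(X, y)[1]))
    print(Counter(random_undersample(X, y)[1]))
```

Resampling of this kind is normally applied to the training split only, so that the evaluation data keep their original class distribution.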

Li X, Luo G, Wang W, Wang K, Li S. Curriculum label distribution learning for imbalanced medical image segmentation. Med Image Anal 2023;89:102911. [PMID: 37542795 DOI: 10.1016/j.media.2023.102911] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/28/2022] [Revised: 04/27/2023] [Accepted: 07/25/2023] [Indexed: 08/07/2023]

Zeinolabedini Rezaabad M, Lacey H, Marshall L, Johnson F. Influence of resampling techniques on Bayesian network performance in predicting increased algal activity. Water Res 2023;244:120558. [PMID: 37666153 DOI: 10.1016/j.watres.2023.120558] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/23/2023] [Revised: 08/10/2023] [Accepted: 08/30/2023] [Indexed: 09/06/2023]

Abstract

Early warning of increased algal activity is important to mitigate potential impacts on aquatic life and human health. While many methods have been developed to predict increased algal activity, an ongoing issue is that severe algal blooms often occur with low frequency in water bodies. This results in imbalanced data sets available for model specification, leading to poor predictions of the frequency of increased algal activity. One approach to address this is to resample data sets of increased algal activity to increase the prevalence of higher than normal algal activity in calibration data and ultimately improve model predictions. This study aims to investigate the use of resampling techniques to address the imbalanced dataset and determine if such methods can improve the prediction of increased algal activity. Three techniques were investigated, Kmeans under-sampling (US_Kmeans), synthetic minority over-sampling technique (SMOTE), and 'SMOTE and cluster-based under-sampling technique' (SCUT). The resampling methods were applied to a Bayesian network (BN) model of Lake Burragorang in New South Wales, Australia. The model was developed to predict chlorophyll-a (chl-a) using a range of water quality parameters as predictors. The original data and each of the balanced datasets were used for BN structures and parameter learning. The results showed that the best graphical structure was obtained by adding synthetic data from SMOTE with the highest true positive rate (TPR) and area under the curve (AUC). When compared using a fixed graphical structure for the BN, all resampling techniques increased the ability of the BN to detect events with higher probability of increased algal activity. The resampling model results can also be used to better understand the most important influences on high chl-a concentrations and suggest future data collection and model development priorities.
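
SMOTE, one of the three resampling techniques investigated above, creates synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. The following is a simplified sketch of that core step, assuming numeric features and Euclidean distance; the function name, the neighbour count k, and the toy points are illustrative assumptions rather than the configuration used in the study.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points from a list of minority-class feature vectors."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        x = minority[i]
        # k nearest minority neighbours of x (excluding x itself)
        others = [m for j, m in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda m: math.dist(x, m))[:k]
        nb = rng.choice(neighbours)
        u = rng.random()                      # interpolation factor in [0, 1)
        synthetic.append([a + u * (b - a) for a, b in zip(x, nb)])
    return synthetic


if __name__ == "__main__":
    minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    for point in smote_sketch(minority, n_new=3, k=2):
        print(point)
```

Because each synthetic point lies on a line segment between two real minority samples, the new data stay inside the region already occupied by the minority class.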

Ghavidel A, Pazos P. Machine learning (ML) techniques to predict breast cancer in imbalanced datasets: a systematic review. J Cancer Surviv 2023:10.1007/s11764-023-01465-3. [PMID: 37749361 DOI: 10.1007/s11764-023-01465-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/02/2023] [Accepted: 09/09/2023] [Indexed: 09/27/2023]

Ebrahimi A, Wiil UK, Baskaran R, Peimankar A, Andersen K, Nielsen AS. AUD-DSS: a decision support system for early detection of patients with alcohol use disorder. BMC Bioinformatics 2023;24:329. [PMID: 37658294 PMCID: PMC10474761 DOI: 10.1186/s12859-023-05450-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/20/2023] [Accepted: 08/21/2023] [Indexed: 09/03/2023] Open

Abstract

BACKGROUND

Alcohol use disorder (AUD) causes significant morbidity, mortality, and injuries. According to reports, approximately 5% of all registered deaths in Denmark could be due to AUD. The problem is compounded by the late identification of patients with AUD, a situation that can cause enormous problems, from psychological to physical to economic problems. Many individuals suffering from AUD never undergo specialist treatment during their addiction due to obstacles such as taboo and the poor performance of current screening tools. Therefore, there is a lack of rapid intervention. This can be mitigated by the early detection of patients with AUD. A clinical decision support system (DSS) powered by machine learning (ML) methods can be used to diagnose patients' AUD status earlier.

METHODS

This study proposes an effective AUD prediction model (AUDPM), which can be used in a DSS. The proposed model consists of four distinct components: (1) imputation to address missing values using the k-nearest neighbours approach, (2) recursive feature elimination with cross validation to select the most relevant subset of features, (3) a hybrid synthetic minority oversampling technique-edited nearest neighbour approach to remove noise and balance the distribution of the training data, and (4) an ML model for the early detection of patients with AUD. Two data sources, including a questionnaire and electronic health records of 2571 patients, were collected from Odense University Hospital in the Region of Southern Denmark for the AUD-Dataset. Then, the AUD-Dataset was used to build ML models. The results of different ML models, such as support vector machine, K-nearest neighbour, decision tree, random forest, and extreme gradient boosting, were compared. Finally, a combination of all these models in an ensemble learning approach was selected for the AUDPM.

RESULTS

The results revealed that the proposed ensemble AUDPM outperformed other single models and our previous study results, achieving 0.96, 0.94, 0.95, and 0.97 precision, recall, F1-score, and accuracy, respectively. In addition, we designed and developed an AUD-DSS prototype.

CONCLUSION

It was shown that our proposed AUDPM achieved high classification performance. In addition, we identified clinical factors related to the early detection of patients with AUD. The designed AUD-DSS is intended to be integrated into the existing Danish health care system to provide novel information to clinical staff if a patient shows signs of harmful alcohol use; in other words, it gives staff a good reason for having a conversation with patients for whom a conversation is relevant.
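
The RESULTS section above reports precision, recall, F1-score, and accuracy. As a brief sketch of how these four figures relate to one another, the function below derives them from binary confusion-matrix counts. The counts in the usage example are hypothetical and are not the study's data.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return precision, recall, F1-score and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)       # harmonic mean of precision and recall
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


if __name__ == "__main__":
    # Hypothetical counts: 10 actual positives and 90 actual negatives.
    for name, value in classification_metrics(tp=8, fp=2, fn=2, tn=88).items():
        print(f"{name}: {value:.3f}")
```

In this hypothetical case accuracy (about 0.96) is noticeably higher than precision, recall, and F1 (each about 0.80), which illustrates why accuracy alone can flatter a model on imbalanced data.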

Liu S, Roemer F, Ge Y, Bedrick EJ, Li ZM, Guermazi A, Sharma L, Eaton C, Hochberg MC, Hunter DJ, Nevitt MC, Wirth W, Kent Kwoh C, Sun X. Comparison of evaluation metrics of deep learning for imbalanced imaging data in osteoarthritis studies. Osteoarthritis Cartilage 2023;31:1242-1248. [PMID: 37209993 PMCID: PMC10524686 DOI: 10.1016/j.joca.2023.05.006] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/14/2022] [Revised: 04/14/2023] [Accepted: 05/12/2023] [Indexed: 05/22/2023]

Affiliation(s)

Shen Liu: Department of Epidemiology and Biostatistics, University of Arizona, 1295 N. Martin Ave., Tucson, AZ 85724, USA.
Frank Roemer: Department of Radiology, University of Erlangen - Nuremberg, Erlangen, Germany; Department of Radiology, Boston University School of Medicine, MA, USA.
Yong Ge: Department of Management Information Systems, University of Arizona, AZ, USA.
Edward J Bedrick: Department of Epidemiology and Biostatistics, University of Arizona, 1295 N. Martin Ave., Tucson, AZ 85724, USA.
Zong-Ming Li: University of Arizona Arthritis Center, University of Arizona College of Medicine, Tucson, AZ, USA.
Ali Guermazi: Department of Radiology, Boston University School of Medicine, MA, USA.
Leena Sharma: Feinberg School of Medicine, Northwestern University, IL, USA.
Charles Eaton: Kent Memorial Hospital, and Department of Family Medicine, Warren Alpert Medical School, and Department of Epidemiology, School of Public Health, Brown University, RI, USA.
Marc C Hochberg: School of Medicine, University of Maryland, and Medical Care Clinical Center, VA Maryland Health Care System, Baltimore, MD, USA.
David J Hunter: Sydney Musculoskeletal Health, Kolling Institute, Faculty of Medicine and Health, The University of Sydney, Sydney, 2065 NSW, Australia, and Rheumatology Department, Royal North Shore Hospital, St Leonards, NSW 2065 Australia.
Michael C Nevitt: Department of Epidemiology and Biostatistics, University of California San Francisco, CA, USA.
Wolfgang Wirth: Department of Imaging & Functional Musculoskeletal Research, Institute of Anatomy & Cell Biology, Paracelsus Medical University Salzburg & Nuremberg, Salzburg, Austria, and Ludwig Boltzmann Inst. for Arthritis and Rehabilitation, Paracelsus Medical University Salzburg & Nuremberg, Salzburg, Austria, and Chondrometrics GmbH, Ainring, Germany.
C Kent Kwoh: University of Arizona Arthritis Center, University of Arizona College of Medicine, Tucson, AZ, USA.
Xiaoxiao Sun: Department of Epidemiology and Biostatistics, University of Arizona, 1295 N. Martin Ave., Tucson, AZ 85724, USA.

Wang Z, An T, Wang W, Fan S, Chen L, Tian X. Qualitative and quantitative detection of aflatoxins B1 in maize kernels with fluorescence hyperspectral imaging based on the combination method of boosting and stacking. Spectrochim Acta A Mol Biomol Spectrosc 2023;296:122679. [PMID: 37011441 DOI: 10.1016/j.saa.2023.122679] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/26/2023] [Revised: 03/17/2023] [Accepted: 03/26/2023] [Indexed: 06/19/2023]

Hassanzadeh R, Farhadian M, Rafieemehr H. Hospital mortality prediction in traumatic injuries patients: comparing different SMOTE-based machine learning algorithms. BMC Med Res Methodol 2023;23:101. [PMID: 37087425 PMCID: PMC10122327 DOI: 10.1186/s12874-023-01920-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/06/2022] [Accepted: 04/13/2023] [Indexed: 04/24/2023] Open

Abstract

BACKGROUND

Trauma is one of the most critical public health issues worldwide, leading to death and disability and influencing all age groups. Therefore, there is great interest in models for predicting mortality in trauma patients admitted to the ICU. The main objective of the present study is to develop and evaluate SMOTE-based machine-learning tools for predicting hospital mortality in trauma patients with imbalanced data.

METHODS

This retrospective cohort study was conducted on 126 trauma patients admitted to an intensive care unit at Besat hospital in Hamadan Province, western Iran, from March 2020 to March 2021. Data were extracted from the medical information records of patients. Because the data were imbalanced, SMOTE techniques, namely SMOTE, Borderline-SMOTE1, Borderline-SMOTE2, SMOTE-NC, and SVM-SMOTE, were used for primary preprocessing. Then, the Decision Tree (DT), Random Forest (RF), Naive Bayes (NB), Artificial Neural Network (ANN), Support Vector Machine (SVM), and Extreme Gradient Boosting (XGBoost) methods were used to predict hospital mortality in patients with traumatic injuries. The performance of the methods was evaluated by sensitivity, specificity, Positive Predictive Value (PPV), Negative Predictive Value (NPV), accuracy, Area Under the Curve (AUC), Geometric Mean (G-means), F1 score, and P-value of McNemar's test.

RESULTS

Of the 126 patients admitted to an ICU, 117 (92.9%) survived and 9 (7.1%) died. The mean follow-up time from the date of trauma to the date of outcome was 3.98 ± 4.65 days. The performance of ML algorithms is not good with imbalanced data, whereas the performance of SMOTE-based ML algorithms is significantly improved. The mean area under the ROC curve (AUC) of all SMOTE-based models was more than 91%. F1-score and G-means before balancing the dataset were below 70% for all ML models except ANN. In contrast, F1-score and G-means for the balanced datasets reached more than 90% for all SMOTE-based models. Among all SMOTE-based ML methods, RF and ANN based on SMOTE and XGBoost based on SMOTE-NC achieved the highest value for all evaluation criteria.

CONCLUSIONS

This study has shown that SMOTE-based ML algorithms predict outcomes in traumatic injuries better than the same ML algorithms trained on the imbalanced data. They have the potential to assist ICU physicians in making clinical decisions.
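
The class split reported above (117 survivors, 9 deaths) shows why accuracy is a weak criterion here and why the study also reports G-means. The sketch below, which treats death as the positive class (an assumption made for illustration), evaluates a trivial classifier that predicts "survived" for every patient. G-mean is computed as the square root of sensitivity times specificity.

```python
import math


def sensitivity_specificity_gmean(tp, fp, fn, tn):
    """Return (sensitivity, specificity, G-mean) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity, math.sqrt(sensitivity * specificity)


if __name__ == "__main__":
    # Trivial baseline on the reported cohort: predict "survived" for all 126 patients.
    # Positive class = died (9 patients), negative class = survived (117 patients).
    tp, fp, fn, tn = 0, 0, 9, 117
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    sens, spec, gmean = sensitivity_specificity_gmean(tp, fp, fn, tn)
    print(f"accuracy={accuracy:.3f} sensitivity={sens:.3f} "
          f"specificity={spec:.3f} g_mean={gmean:.3f}")
```

The baseline reaches about 92.9% accuracy while detecting none of the deaths, so its G-mean is zero. This is the failure mode that balancing methods such as SMOTE are meant to counter.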

Wang X, Ren H, Ren J, Song W, Qiao Y, Ren Z, Zhao Y, Linghu L, Cui Y, Zhao Z, Chen L, Qiu L. Machine learning-enabled risk prediction of chronic obstructive pulmonary disease with unbalanced data. Comput Methods Programs Biomed 2023;230:107340. [PMID: 36640604 DOI: 10.1016/j.cmpb.2023.107340] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/09/2022] [Revised: 11/25/2022] [Accepted: 01/04/2023] [Indexed: 06/17/2023]

Abstract

BACKGROUND AND OBJECTIVE

Since the early symptoms of chronic obstructive pulmonary disease (COPD) are not obvious, patients are not easily identified, which delays prevention and treatment. In the present study, machine learning (ML) methods were employed to construct a risk prediction model for COPD to improve its prediction efficiency.

METHODS

We collected data from a sample of 5807 cases with a complete COPD diagnosis from the 2019 COPD Surveillance Program in Shanxi Province and extracted 34 potentially relevant variables from the dataset. Firstly, we used feature selection methods (i.e., Generalized elastic net, Lasso and Adaptive lasso) to select ten variables. Afterwards, we employed supervised classifiers for class imbalanced data by combining the cost-sensitive learning and SMOTE resampling methods with the ML methods (Logistic Regression, SVM, Random Forest, XGBoost, LightGBM, NGBoost and Stacking), respectively. Last, we assessed their performance.

RESULTS

Frequent cough at or before age 14 and nine other variables were significant predictors of COPD. The Stacking heterogeneous ensemble model showed relatively good performance on the unbalanced datasets. Logistic Regression with class weighting achieved the best classification performance on the balanced data when the composite indicators (AUC, F1-Score and G-mean) were used as criteria for model comparison. The values of F1-Score and G-mean for the top three ML models were 0.290/0.660 for Logistic Regression with class weighting, 0.288/0.649 for Stacking with synthetic minority oversampling technique (SMOTE), and 0.285/0.648 for LightGBM with SMOTE.

CONCLUSIONS

By combining feature selection methods, unbalanced data processing methods and machine learning methods with data from disease surveillance questionnaires and physical measurements to identify people at risk of COPD, this paper concludes that machine learning models based on survey questionnaires could provide automated identification of patients at risk of COPD and a simple, scientific aid for early identification of COPD.
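
The best-performing model above used class weighting, a form of cost-sensitive learning in which errors on the rare class are penalised more heavily. The abstract does not state how the weights were chosen; the sketch below assumes the common "balanced" heuristic, weight_c = n_samples / (n_classes * n_c), and uses hypothetical class counts.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency: n / (k * n_c)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * n_c) for c, n_c in counts.items()}


if __name__ == "__main__":
    # Hypothetical cohort: 900 negatives and 100 positives (not the study's data).
    labels = [0] * 900 + [1] * 100
    weights = balanced_class_weights(labels)
    for cls in sorted(weights):
        print(f"class {cls}: weight {weights[cls]:.4f}")
```

Under this heuristic each class contributes the same total weight to the training loss, so the minority class is no longer outvoted simply because it has fewer samples.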

Sato H, Kimura Y, Ohba M, Ara Y, Wakabayashi S, Watanabe H. Prediction of Prednisolone Dose Correction Using Machine Learning. J Healthc Inform Res 2023;7:84-103. [PMID: 36910914 DOI: 10.1007/s41666-023-00128-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/21/2022] [Revised: 11/20/2022] [Accepted: 02/03/2023] [Indexed: 02/17/2023]

Abstract

Wrong dose, a common prescription error, can cause serious patient harm, especially in the case of high-risk drugs like oral corticosteroids. This study aims to build a machine learning model to predict dose-related prescription modifications for oral prednisolone tablets (i.e., highly imbalanced data with very few positive cases). Prescription data were obtained from the electronic medical records at a single institute. Cluster analysis classified the clinical departments into six clusters with similar patterns of prednisolone prescription. Two patterns of training datasets were created, with and without preprocessing by the SMOTE method. Five ML models (SVM, KNN, GB, RF, and BRF) and logistic regression (LR) models were constructed in Python. The models were internally validated by five-fold stratified cross-validation and then validated with a 30% holdout test dataset. A total of 82,553 prescription records for prednisolone tablets, containing 135 dose-corrected positive cases, were obtained. In the original dataset (without SMOTE), only the BRF model showed a good performance (in test dataset, ROC-AUC: 0.917, recall: 0.951). In the training dataset preprocessed by SMOTE, performance was improved on all models. The highest performance models with SMOTE were SVM (in test dataset, ROC-AUC: 0.820, recall: 0.659) and BRF (ROC-AUC: 0.814, recall: 0.634). Although the prescribing data for dose-related correction are highly imbalanced, the following techniques allowed us to build high-performance prediction models: data preprocessing by SMOTE, stratified cross-validation, and a BRF classifier suited to imbalanced data. ML is useful in complicated dose audits such as those for oral prednisolone.

Supplementary Information

The online version contains supplementary material available at 10.1007/s41666-023-00128-3.
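
The abstract above relies on stratified five-fold cross-validation, which matters when only 135 of 82,553 records are positive: plain random folds could leave some folds with very few positive cases. The sketch below is an illustrative stratified fold assignment, not the authors' code. It applies the reported totals to the full dataset purely for demonstration and ignores the 30% holdout split, whose exact composition is not given.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=5, seed=0):
    """Assign each sample a fold index so every fold keeps the class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    fold_of = [None] * len(labels)
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of each class round-robin across the folds.
        for position, idx in enumerate(indices):
            fold_of[idx] = position % n_folds
    return fold_of


if __name__ == "__main__":
    n_total, n_positive = 82553, 135          # totals reported in the abstract
    labels = [1] * n_positive + [0] * (n_total - n_positive)
    folds = stratified_folds(labels, n_folds=5)
    for k in range(5):
        size = sum(1 for f in folds if f == k)
        positives = sum(1 for f, lab in zip(folds, labels) if f == k and lab == 1)
        print(f"fold {k}: {size} records, {positives} positives")
```

Since 135 divides evenly by five, each fold receives 27 positive records under this assignment.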

Wang Z, Stavrakis S, Yao B. Hierarchical deep learning with Generative Adversarial Network for automatic cardiac diagnosis from ECG signals. Comput Biol Med 2023;155:106641. [PMID: 36773553 DOI: 10.1016/j.compbiomed.2023.106641] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 10/27/2022] [Revised: 01/11/2023] [Accepted: 02/05/2023] [Indexed: 02/10/2023]

Mafarja M, Thaher T, Al-Betar MA, Too J, Awadallah MA, Abu Doush I, Turabieh H. Classification framework for faulty-software using enhanced exploratory whale optimizer-based feature selection scheme and random forest ensemble learning. APPL INTELL 2023;53:1-43. [PMID: 36785593 PMCID: PMC9909674 DOI: 10.1007/s10489-022-04427-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 12/23/2022] [Indexed: 02/11/2023]

Abstract

Software Fault Prediction (SFP) is an important process for detecting faulty classes or modules early in the software development life cycle. In this paper, a machine learning framework is proposed for SFP. Initially, pre-processing and re-sampling techniques are applied to make the SFP datasets ready to be used by ML techniques. Thereafter seven classifiers are compared, namely K-Nearest Neighbors (KNN), Naive Bayes (NB), Linear Discriminant Analysis (LDA), Linear Regression (LR), Decision Tree (DT), Support Vector Machine (SVM), and Random Forest (RF). The RF classifier outperforms all other classifiers in terms of eliminating irrelevant/redundant features. The performance of RF is improved further using a dimensionality reduction method called binary whale optimization algorithm (BWOA) to eliminate the irrelevant/redundant features. Finally, the performance of BWOA is enhanced by hybridizing the exploration strategies of the grey wolf optimizer (GWO) and harris hawks optimization (HHO) algorithms. The proposed method is called SBEWOA. The SFP datasets are sixteen datasets from the PROMISE repository, covering software projects of different sizes and complexity. The comparative evaluation against nine well-established feature selection methods shows that the proposed SBEWOA produces significantly better results for several of the evaluated datasets. The algorithms' performance is compared in terms of accuracy, the number of features, and fitness function. This is also supported by the two-tailed P-values of the Wilcoxon signed-rank test. In conclusion, the proposed method is an efficient alternative ML method for SFP that can be used for similar problems in the software engineering domain.

Huang G, Jafari AH. Enhanced balancing GAN: minority-class image generation. Neural Comput Appl 2023;35:5145-54. [PMID: 34177125 DOI: 10.1007/s00521-021-06163-8] [Citation(s) in RCA: 9] [Impact Index Per Article: 9.0] [Received: 08/26/2020] [Accepted: 05/24/2021] [Indexed: 01/25/2023]

Werner de Vargas V, Schneider Aranda JA, dos Santos Costa R, da Silva Pereira PR, Victória Barbosa JL. Imbalanced data preprocessing techniques for machine learning: a systematic mapping study. Knowl Inf Syst 2023;65:31-57. [PMID: 36405957 PMCID: PMC9645765 DOI: 10.1007/s10115-022-01772-8] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 10/08/2021] [Revised: 09/27/2022] [Accepted: 10/02/2022] [Indexed: 11/10/2022]

Sowjanya AM, Mrudula O. Effective treatment of imbalanced datasets in health care using modified SMOTE coupled with stacked deep learning algorithms. Appl Nanosci 2023;13:1829-1840. [PMID: 35132368 PMCID: PMC8811587 DOI: 10.1007/s13204-021-02063-4] [Citation(s) in RCA: 7] [Impact Index Per Article: 7.0] [Received: 07/21/2021] [Accepted: 08/28/2021] [Indexed: 12/03/2022]

Duong HTH, Tran LTM, To HQ, Van Nguyen K. Academic performance warning system based on data driven for higher education. Neural Comput Appl 2023;35:5819-5837. [PMID: 36408289 PMCID: PMC9640845 DOI: 10.1007/s00521-022-07997-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/21/2022] [Accepted: 10/25/2022] [Indexed: 11/09/2022]

Zhang X, Liu K, Yuan B, Wang H, Chen S, Xue Y, Chen W, Liu M, Hu Y. A hybrid adaptive approach for instance transfer learning with dynamic and imbalanced data. Int J Intell Syst 2022;37:11582-11599. [PMID: 36816520 PMCID: PMC9936919 DOI: 10.1002/int.23055] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/02/2022] [Accepted: 08/16/2022] [Indexed: 11/06/2022]

Roumani YF. Sports analytics in the NFL: classifying the winner of the superbowl. Ann Oper Res 2022;325:715-730. [PMID: 36467004 PMCID: PMC9684891 DOI: 10.1007/s10479-022-05063-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 11/02/2022] [Indexed: 06/03/2023]

Tang J, Wang X, Wan H, Lin C, Shao Z, Chang Y, Wang H, Wu Y, Zhang T, Du Y. Joint modeling strategy for using electronic medical records data to build machine learning models: an example of intracerebral hemorrhage. BMC Med Inform Decis Mak 2022;22:278. [PMID: 36284327 PMCID: PMC9594939 DOI: 10.1186/s12911-022-02018-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/08/2022] [Accepted: 10/10/2022] [Indexed: 11/28/2022] Open

Abstract

Background

Outliers and class imbalance in medical data could affect the accuracy of machine learning models. For physicians who want to apply predictive models, how to use the data at hand to build a model and what model to choose are very thorny problems. Therefore, it is necessary to consider outliers, imbalanced data, model selection, and parameter tuning when modeling.

Methods

This study used a joint modeling strategy consisting of outlier detection and removal, data balancing, model fitting and prediction, and performance evaluation. We collected medical record data for all intracerebral hemorrhage (ICH) patients admitted in 2017–2019 in Sichuan Province. Clinical and radiological variables were used to construct models to predict mortality outcomes 90 days after discharge. We used stacking ensemble learning to combine logistic regression (LR), random forest (RF), artificial neural network (ANN), support vector machine (SVM), and k-nearest neighbors (KNN) models. Accuracy, sensitivity, specificity, AUC, precision, and F1 score were used to evaluate model performance. Finally, we compared all 84 combinations of the joint modeling strategy: training sets with and without the cross-validated committees filter (CVCF); five resampling techniques (random under-sampling (RUS), random over-sampling (ROS), adaptive synthetic sampling (ADASYN), Borderline synthetic minority oversampling technique (Borderline SMOTE), and synthetic minority oversampling technique with edited nearest neighbor (SMOTEENN)) plus no resampling; and seven models (LR, RF, ANN, SVM, KNN, Stacking, AdaBoost).
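The count of 84 follows from crossing two outlier-handling options, six resampling options (five techniques plus no resampling), and seven models. A short sketch that enumerates the grid, using the abbreviations from the text as plain labels (no models are fitted here):

```python
from itertools import product

outlier_filters = ["none", "CVCF"]
resamplers = ["none", "RUS", "ROS", "ADASYN", "Borderline SMOTE", "SMOTEENN"]
models = ["LR", "RF", "ANN", "SVM", "KNN", "Stacking", "AdaBoost"]

# Every (filter, resampler, model) triple is one joint modeling strategy.
combinations = list(product(outlier_filters, resamplers, models))
print(len(combinations))  # 2 * 6 * 7 = 84
```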

Results

Among 4207 patients with ICH, 2909 (69.15%) survived 90 days after discharge, and 1298 (30.85%) died within 90 days after discharge. Removing outliers with CVCF improved every performance measure except sensitivity for all models. For data balancing, training without resampling outperformed training with resampling in terms of accuracy, specificity, and precision, while ROS gave the best AUC. Across the seven models, RF had the highest average accuracy, specificity, AUC, and precision, and Stacking performed best in F1 score. Among all 84 combinations of the joint modeling strategy, eight combinations performed best in terms of accuracy (0.816). For sensitivity, the best performance was SMOTEENN + Stacking (0.662). For specificity, the best performance was CVCF + KNN (0.987). Stacking and AdaBoost had the best performances in AUC (0.756) and F1 score (0.602), respectively. For precision, the best performance was CVCF + SVM (0.938).

Conclusion

This study proposed a joint modeling strategy including outlier detection and removal, data balancing, model fitting and prediction, and performance evaluation, in order to provide a reference for physicians and researchers who want to build their own models. This study illustrated the importance of outlier detection and removal for machine learning and showed that ensemble learning might be a good modeling strategy. Due to the low imbalance ratio (IR, the ratio of the majority class to the minority class) in this study, we did not find any improvement in models with resampling in terms of accuracy, specificity, and precision, while ROS performed best on AUC.
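The imbalance ratio can be checked directly from the class counts reported in the Results. A minimal sketch (the function name is illustrative, not from the study):

```python
def imbalance_ratio(class_counts):
    """Ratio of the largest class size to the smallest class size."""
    majority = max(class_counts.values())
    minority = min(class_counts.values())
    if minority == 0:
        raise ValueError("every class needs at least one sample")
    return majority / minority


# Counts reported in the Results section.
counts = {"survived": 2909, "died": 1298}
total = sum(counts.values())
shares = {label: round(100 * n / total, 2) for label, n in counts.items()}
ir = imbalance_ratio(counts)
print(shares, round(ir, 2))
```

An IR a little above 2 is mild compared with many imbalanced-learning benchmarks, which is consistent with the authors' remark that resampling brought little benefit here.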

Supplementary Information

The online version contains supplementary material available at 10.1186/s12911-022-02018-x.

Affiliation(s)

Jianxiang Tang Department of Epidemiology and Health Statistics, West China School of Public Health and West China Fourth Hospital, Sichuan University, Chengdu, Sichuan, People's Republic of China
Xiaoyu Wang Department of Neurosurgery, West China Hospital of Sichuan University, Chengdu, Sichuan, People's Republic of China
Hongli Wan Department of Epidemiology and Health Statistics, West China School of Public Health and West China Fourth Hospital, Sichuan University, Chengdu, Sichuan, People's Republic of China; Health Emergency Management Research Center, West China-PUMC C.C. Chen Institute of Health, Sichuan University, Chengdu, Sichuan, People's Republic of China
Chunying Lin Department of Epidemiology and Health Statistics, West China School of Public Health and West China Fourth Hospital, Sichuan University, Chengdu, Sichuan, People's Republic of China; Health Emergency Management Research Center, West China-PUMC C.C. Chen Institute of Health, Sichuan University, Chengdu, Sichuan, People's Republic of China
Zilun Shao Department of Epidemiology and Health Statistics, West China School of Public Health and West China Fourth Hospital, Sichuan University, Chengdu, Sichuan, People's Republic of China; Health Emergency Management Research Center, West China-PUMC C.C. Chen Institute of Health, Sichuan University, Chengdu, Sichuan, People's Republic of China
Yang Chang Department of Epidemiology and Health Statistics, West China School of Public Health and West China Fourth Hospital, Sichuan University, Chengdu, Sichuan, People's Republic of China; Health Emergency Management Research Center, West China-PUMC C.C. Chen Institute of Health, Sichuan University, Chengdu, Sichuan, People's Republic of China
Hexuan Wang Department of Epidemiology and Health Statistics, West China School of Public Health and West China Fourth Hospital, Sichuan University, Chengdu, Sichuan, People's Republic of China; Health Emergency Management Research Center, West China-PUMC C.C. Chen Institute of Health, Sichuan University, Chengdu, Sichuan, People's Republic of China
Yi Wu Department of Epidemiology and Health Statistics, West China School of Public Health and West China Fourth Hospital, Sichuan University, Chengdu, Sichuan, People's Republic of China; Health Emergency Management Research Center, West China-PUMC C.C. Chen Institute of Health, Sichuan University, Chengdu, Sichuan, People's Republic of China
Tao Zhang Department of Epidemiology and Health Statistics, West China School of Public Health and West China Fourth Hospital, Sichuan University, Chengdu, Sichuan, People's Republic of China; Health Emergency Management Research Center, West China-PUMC C.C. Chen Institute of Health, Sichuan University, Chengdu, Sichuan, People's Republic of China
Yu Du Health Emergency Management Research Center, West China-PUMC C.C. Chen Institute of Health, Sichuan University, Chengdu, Sichuan, People's Republic of China; Department of Emergency and Critical Care Medicine, West China School of Public Health, West China Fourth Hospital, Sichuan University, Chengdu, Sichuan, People's Republic of China

Hajek P, Abedin MZ, Sivarajah U. Fraud Detection in Mobile Payment Systems using an XGBoost-based Framework. Inf Syst Front 2022;25:1-19. [PMID: 36258679 PMCID: PMC9560719 DOI: 10.1007/s10796-022-10346-6] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Accepted: 09/23/2022] [Indexed: 06/16/2023]

Wibowo P, Fatichah C. Pruning-based oversampling technique with smoothed bootstrap resampling for imbalanced clinical dataset of Covid-19. J King Saud Univ Comput Inf Sci 2022;34:7830-7839. [PMID: 38620726 PMCID: PMC8482553 DOI: 10.1016/j.jksuci.2021.09.021] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 05/01/2021] [Revised: 07/28/2021] [Accepted: 09/25/2021] [Indexed: 11/25/2022]

Khodabandelu S, Basirat Z, Khaleghi S, Khafri S, Montazery Kordy H, Golsorkhtabaramiri M. Developing machine learning-based models to predict intrauterine insemination (IUI) success by address modeling challenges in imbalanced data and providing modification solutions for them. BMC Med Inform Decis Mak 2022;22:228. [PMID: 36050710 PMCID: PMC9434923 DOI: 10.1186/s12911-022-01974-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/06/2022] [Accepted: 08/24/2022] [Indexed: 12/03/2022] Open

Abstract

Background

This study sought to provide machine learning-based classification models to predict the success of intrauterine insemination (IUI) therapy. Additionally, we sought to illustrate the effect of fitting models to balanced data versus the original data with imbalanced labels, using two different types of resampling methods. Finally, we compared models fitted with all features against models fitted with optimized feature sets obtained from various feature selection techniques.

Methods

The data for the cross-sectional study were collected from 546 infertile couples with IUI at the Fatemehzahra Infertility Research Center, Babol, North of Iran. Logistic regression (LR), support vector classification (SVC), random forest (RF), Extreme Gradient Boosting (XGBoost), and stacking generalization (Stack) were used as the machine learning classifiers to predict IUI success, implemented in Python v3.7. We employed the Smote-Tomek (Stomek) and Smote-ENN (SENN) resampling methods to address the imbalance problem in the original dataset. Furthermore, to increase the performance of the models, mutual information classification (MIC-FS), genetic algorithm (GA-FS), and random forest (RF-FS) were used to select the ideal feature sets for model development.
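Smote-Tomek pairs SMOTE over-sampling with a cleaning step that removes Tomek links: pairs of samples from opposite classes that are each other's nearest neighbor. The sketch below illustrates only the link-detection step on made-up points with Euclidean distance; it is not the study's code, and the SMOTE step is omitted.

```python
import math


def nearest_neighbor(i, points):
    """Index of the point closest to points[i], excluding i itself."""
    others = (j for j in range(len(points)) if j != i)
    return min(others, key=lambda j: math.dist(points[i], points[j]))


def tomek_links(points, labels):
    """Return index pairs (i, j), i < j, that are mutual nearest neighbors with different labels."""
    links = []
    for i in range(len(points)):
        j = nearest_neighbor(i, points)
        if i < j and labels[i] != labels[j] and nearest_neighbor(j, points) == i:
            links.append((i, j))
    return links


# Toy data: points 2 and 3 sit next to each other but belong to different classes.
points = [(0.0, 0.0), (0.1, 0.0), (1.0, 0.0), (1.2, 0.0), (5.0, 5.0)]
labels = [0, 0, 0, 1, 1]
print(tomek_links(points, labels))
```

A cleaning step would then drop one or both members of each link, depending on the variant, before the classifier is trained.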

Results

In this study, 28% of patients undergoing IUI treatment obtained a successful pregnancy. The average ages of women and men were 24.98 and 29.85 years, respectively. The calibration plots for IUI success prediction showed that RF-FS among the feature selection methods, and the Stomek-balanced dataset among the datasets used to fit the models, gave better-calibrated predictions than the alternatives. The Brier scores for the LR, SVC, RF, XGBoost, and Stack models fitted on the Stomek dataset with the feature set chosen by the Random Forest technique were 0.202, 0.183, 0.158, 0.129, and 0.134, respectively. The results identified duration of infertility, male and female age, sperm concentration, and sperm motility grading score as the most predictive factors of IUI success.
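The Brier score used above is the mean squared difference between predicted probabilities and the 0/1 outcomes, so lower values indicate better-calibrated, sharper predictions. A minimal sketch with made-up predictions (not the study's data):

```python
def brier_score(y_true, y_prob):
    """Mean squared error between predicted probabilities and binary outcomes."""
    if len(y_true) != len(y_prob) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


# Hypothetical outcomes and predicted success probabilities.
y_true = [1, 0, 0, 1]
y_prob = [0.9, 0.2, 0.4, 0.6]
print(round(brier_score(y_true, y_prob), 4))

# A constant forecast equal to the base rate p scores p * (1 - p).
base_rate_outcomes = [1] * 28 + [0] * 72
print(round(brier_score(base_rate_outcomes, [0.28] * 100), 4))
```

The second call is a generic reference point: any model worth using should score below the constant base-rate forecast on the same evaluation data.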

Conclusion

The XGBoost prediction model from this study can be used to predict the individual success of IUI for each couple before initiating therapy.

Supplementary Information

The online version contains supplementary material available at 10.1186/s12911-022-01974-8.

Xu X, Wang C, Gui B, Yuan X, Li C, Zhao Y, Martyniuk CJ, Su L. Application of machine learning to predict the inhibitory activity of organic chemicals on thyroid stimulating hormone receptor. Environ Res 2022;212:113175. [PMID: 35351457 DOI: 10.1016/j.envres.2022.113175] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 12/27/2021] [Revised: 03/04/2022] [Accepted: 03/22/2022] [Indexed: 06/14/2023]

Liu X, Guo L, Wang H, Guo J, Yang S, Duan L. Research on imbalance machine learning methods for MR[Formula: see text]WI soft tissue sarcoma data. BMC Med Imaging 2022;22:149. [PMID: 36028803 PMCID: PMC9417078 DOI: 10.1186/s12880-022-00876-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/19/2022] [Accepted: 08/08/2022] [Indexed: 11/10/2022] Open

Boo Y, Choi Y. Comparison of mortality prediction models for road traffic accidents: an ensemble technique for imbalanced data. BMC Public Health 2022;22:1476. [PMID: 35918672 PMCID: PMC9344638 DOI: 10.1186/s12889-022-13719-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/01/2021] [Accepted: 06/27/2022] [Indexed: 11/10/2022] Open

Abstract

BACKGROUND

Injuries caused by road traffic accidents (RTA) are classified under the International Classification of Diseases-10 as 'S00-T99' and represent imbalanced samples, with a mortality rate of only 1.2% among all RTA victims. To predict the characteristics of external causes of RTA injuries and mortality, we compared performances based on differences in the correction and classification techniques for imbalanced samples.

METHODS

The present study extracted and utilized data spanning a 5-year period (2013-2017) from the Korean National Hospital Discharge In-depth Injury Survey (KNHDS), a national-level survey conducted by the Korea Disease Control and Prevention Agency. A total of eight variables were used in the prediction, including patient, accident, and injury/disease characteristics. As the data were imbalanced, a sample consisting of only severe injuries was constructed and compared against the total sample. Preprocessing was performed according to the characteristics of the samples. The samples were standardized first, because they contained many variables with different units. Among the ensemble techniques for classification, the present study utilized Random Forest, Extra-Trees, and XGBoost. Four different over- and under-sampling techniques were used to compare the performance of algorithms using "accuracy", "precision", "recall", "F1", and "MCC".
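The abstract does not name the four sampling techniques apart from SMOTE. As an illustration of the simplest over- and under-sampling variants, the sketch below balances a toy dataset with the 1.2% positive rate mentioned in the Background. Random sampling is an assumption here, not necessarily one of the study's four techniques, and the data are synthetic.

```python
import random


def _group_by_class(X, y):
    """Collect the rows of X under their class label."""
    groups = {}
    for row, label in zip(X, y):
        groups.setdefault(label, []).append(row)
    return groups


def random_undersample(X, y, seed=0):
    """Keep a random subset of each class, sized to the smallest class."""
    rng = random.Random(seed)
    groups = _group_by_class(X, y)
    n_min = min(len(rows) for rows in groups.values())
    X_out, y_out = [], []
    for label, rows in groups.items():
        X_out.extend(rng.sample(rows, n_min))
        y_out.extend([label] * n_min)
    return X_out, y_out


def random_oversample(X, y, seed=0):
    """Duplicate random rows of each class until it matches the largest class."""
    rng = random.Random(seed)
    groups = _group_by_class(X, y)
    n_max = max(len(rows) for rows in groups.values())
    X_out, y_out = [], []
    for label, rows in groups.items():
        extra = rng.choices(rows, k=n_max - len(rows))
        X_out.extend(rows + extra)
        y_out.extend([label] * n_max)
    return X_out, y_out


# Synthetic data: 12 deaths among 1000 victims (1.2%).
X = [[i] for i in range(1000)]
y = [1 if i < 12 else 0 for i in range(1000)]
X_u, y_u = random_undersample(X, y)
X_o, y_o = random_oversample(X, y)
print(len(y_u), len(y_o))
```

Resampling should be applied to the training split only, so that the evaluation data keep the original class distribution.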

RESULTS

The results showed that among the prediction techniques, XGBoost had the best performance. While the synthetic minority oversampling technique (SMOTE), a type of over-sampling, also demonstrated a certain level of performance, under-sampling was superior. Overall, prediction by the XGBoost model with samples using SMOTE produced the best results.

CONCLUSION

This study presented the results of an empirical comparison of the validity of sampling techniques and classification algorithms that affect the accuracy of imbalanced samples by combining two techniques. The findings could be used as reference data in classification analyses of imbalanced data in the medical field.

Bonannella C, Hengl T, Heisig J, Parente L, Wright MN, Herold M, de Bruin S. Forest tree species distribution for Europe 2000-2020: mapping potential and realized distributions using spatiotemporal machine learning. PeerJ 2022;10:e13728. [PMID: 35910765 PMCID: PMC9332400 DOI: 10.7717/peerj.13728] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 01/11/2022] [Accepted: 06/22/2022] [Indexed: 01/17/2023] Open

Abstract

This article describes a data-driven framework based on spatiotemporal machine learning to produce distribution maps for 16 tree species (Abies alba Mill., Castanea sativa Mill., Corylus avellana L., Fagus sylvatica L., Olea europaea L., Picea abies L. H. Karst., Pinus halepensis Mill., Pinus nigra J. F. Arnold, Pinus pinea L., Pinus sylvestris L., Prunus avium L., Quercus cerris L., Quercus ilex L., Quercus robur L., Quercus suber L. and Salix caprea L.) at high spatial resolution (30 m). Tree occurrence data for a total of three million points were used to train different algorithms: random forest, gradient-boosted trees, generalized linear models, k-nearest neighbors, CART and an artificial neural network. A stack of 305 coarse and high resolution covariates representing spectral reflectance, different biophysical conditions and biotic competition was used as predictors for realized distributions, while potential distribution was modelled with environmental predictors only. Logloss and computing time were used to select the three best algorithms to tune and train an ensemble model based on stacking with a logistic regressor as a meta-learner. An ensemble model was trained for each species: probability and model uncertainty maps of realized distribution were produced for each species using a time window of 4 years for a total of six distribution maps per species, while for potential distributions only one map per species was produced. Results of spatial cross validation show that the ensemble model consistently outperformed or performed as well as the best individual model in both potential and realized distribution tasks, with potential distribution models achieving higher predictive performances (TSS = 0.898, R2 logloss = 0.857) than realized distribution ones on average (TSS = 0.874, R2 logloss = 0.839). Ensemble models for Q. suber achieved the best performances in both potential (TSS = 0.968, R2 logloss = 0.952) and realized (TSS = 0.959, R2 logloss = 0.949) distribution, while P. sylvestris (TSS = 0.731, 0.785, R2 logloss = 0.585, 0.670, respectively, for potential and realized distribution) and P. nigra (TSS = 0.658, 0.686, R2 logloss = 0.623, 0.664) achieved the worst. Importance of predictor variables differed across species and models, with the green band for summer and the Normalized Difference Vegetation Index (NDVI) for fall for realized distribution and the diffuse irradiation and precipitation of the driest quarter (BIO17) being the most frequent and important for potential distribution. On average, fine-resolution models outperformed coarse resolution models (250 m) for realized distribution (TSS = +6.5%, R2 logloss = +7.5%). The framework shows how combining continuous and consistent Earth Observation time series data with state-of-the-art machine learning can be used to derive dynamic distribution maps. The produced predictions can be used to quantify temporal trends of potential forest degradation and species composition change.
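The true skill statistic (TSS) reported above is sensitivity plus specificity minus one, so it ranges from -1 to 1 and does not depend on class prevalence. A minimal sketch with made-up presence/absence counts (not the study's results):

```python
def true_skill_statistic(tp, fn, tn, fp):
    """TSS = sensitivity + specificity - 1, from confusion-matrix counts."""
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative sample")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity + specificity - 1


# Hypothetical counts for one species: 100 presence and 100 absence points.
tss = true_skill_statistic(tp=90, fn=10, tn=80, fp=20)
print(round(tss, 3))
```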

Zhang S, Mi T, Wu Q, Luo Y, Grieneisen ML, Shi G, Yang F, Zhan Y. A data-augmentation approach to deriving long-term surface SO₂ across Northern China: Implications for interpretable machine learning. Sci Total Environ 2022;827:154278. [PMID: 35248628 DOI: 10.1016/j.scitotenv.2022.154278] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 11/22/2021] [Revised: 02/22/2022] [Accepted: 02/27/2022] [Indexed: 06/14/2023]

Liu R, Wang M, Zheng T, Zhang R, Li N, Chen Z, Yan H, Shi Q. An artificial intelligence-based risk prediction model of myocardial infarction. BMC Bioinformatics 2022;23:217. [PMID: 35672659 PMCID: PMC9175344 DOI: 10.1186/s12859-022-04761-4] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 04/01/2022] [Accepted: 05/30/2022] [Indexed: 02/08/2023] Open

Abstract

BACKGROUND

Myocardial infarction can lead to malignant arrhythmia, heart failure, and sudden death. Clinical studies have shown that early identification of and timely intervention for acute MI can significantly reduce mortality. The traditional MI risk assessment models are subjective, and the data that go into them are difficult to obtain. Generally, the assessment is only conducted among high-risk patient groups.

OBJECTIVE

To construct an artificial intelligence-based risk prediction model of myocardial infarction (MI) for continuous and active monitoring of inpatients, especially those in noncardiovascular departments, and early warning of MI.

METHODS

The imbalanced data contain 59 features, which were constructed into a specific dataset through proportional division, upsampling, downsampling, easy ensemble, and w-easy ensemble. Then, the dataset was traversed using supervised machine learning, with recursive feature elimination as the top-layer algorithm and random forest, gradient boosting decision tree (GBDT), logistic regression, and support vector machine as the bottom-layer algorithms, to select the best model out of many through a variety of evaluation indices.

RESULTS

GBDT was the best bottom-layer algorithm, and downsampling was the best dataset construction method. In the validation set, the F1 score and accuracy of the 24-feature downsampling GBDT model were both 0.84. In the test set, the F1 score and accuracy of the 24-feature downsampling GBDT model were both 0.83, and the area under the curve was 0.91.

CONCLUSION

Compared with traditional models, artificial intelligence-based machine learning models have better accuracy and real-time performance and can reduce the occurrence of in-hospital MI from a data-driven perspective, thereby increasing the cure rate of patients and improving their prognosis.

Liu S, Yao W. Prediction of lung cancer using gene expression and deep learning with KL divergence gene selection. BMC Bioinformatics 2022;23:175. [PMID: 35549644 PMCID: PMC9103042 DOI: 10.1186/s12859-022-04689-9] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 06/19/2021] [Accepted: 04/13/2022] [Indexed: 11/24/2022] Open

Tarimo CS, Bhuyan SS, Zhao Y, Ren W, Mohammed A, Li Q, Gardner M, Mahande MJ, Wang Y, Wu J. Prediction of low Apgar score at five minutes following labor induction intervention in vaginal deliveries: machine learning approach for imbalanced data at a tertiary hospital in North Tanzania. BMC Pregnancy Childbirth 2022;22:275. [PMID: 35365129 PMCID: PMC8976377 DOI: 10.1186/s12884-022-04534-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/18/2021] [Accepted: 02/28/2022] [Indexed: 11/18/2022] Open

Abstract

Background

Prediction of low Apgar score for vaginal deliveries following labor induction intervention is critical for improving neonatal health outcomes. We set out to investigate important attributes and train popular machine learning (ML) algorithms to correctly classify neonates with low Apgar scores from an imbalanced learning perspective.

Methods

We analyzed 7716 induced vaginal deliveries from the electronic birth registry of the Kilimanjaro Christian Medical Centre (KCMC), of which 733 (9.5%) were neonates with low (< 7) Apgar scores. The ‘extra-tree classifier’ was used to assess features’ importance. We used Area Under Curve (AUC), recall, precision, F-score, Matthews Correlation Coefficient (MCC), balanced accuracy (BA), bookmaker informedness (BM), and markedness (MK) to evaluate the performance of the six selected machine learning classifiers. To address class imbalance, we examined three widely used resampling techniques: the Synthetic Minority Oversampling Technique (SMOTE), Random Oversampling Examples (ROS), and random undersampling (RUS). We applied Decision Curve Analysis (DCA) to evaluate the net benefit of the selected classifiers.
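Several of the listed metrics are simple functions of the four confusion-matrix counts. A minimal sketch using the standard definitions, with made-up counts rather than the study's results:

```python
import math


def imbalance_metrics(tp, fn, tn, fp):
    """Balanced accuracy, informedness, markedness and MCC from confusion-matrix counts."""
    tpr = tp / (tp + fn)  # recall / sensitivity
    tnr = tn / (tn + fp)  # specificity
    ppv = tp / (tp + fp)  # precision
    npv = tn / (tn + fn)  # negative predictive value
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "BA": (tpr + tnr) / 2,
        "BM": tpr + tnr - 1,
        "MK": ppv + npv - 1,
        "MCC": (tp * tn - fp * fn) / mcc_denominator,
    }


# Hypothetical classifier output on 1000 deliveries with 50 low-Apgar cases.
m = imbalance_metrics(tp=40, fn=10, tn=900, fp=50)
print({name: round(value, 3) for name, value in m.items()})
```

For a 2x2 table, MCC squared equals BM times MK, so MCC summarizes both the recall-side and the precision-side view of the same predictions.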

Results

Birth weight, maternal age, and gestational age were found to be important predictors of a low Apgar score following induced vaginal delivery. The SMOTE, ROS, and RUS techniques were more effective at improving recall than the other metrics in all the models under investigation. A slight improvement was observed in the F1 score, BA, and BM. DCA revealed potential benefits of applying the Boosting method for predicting low Apgar scores among the tested models.

Conclusion

There is an opportunity for more algorithms to be tested to come up with theoretical guidance on more effective rebalancing techniques suitable for this particular imbalance ratio. Future research should prioritize a debate on which performance indicators to rely on when dealing with imbalanced or skewed data.

Supplementary Information

The online version contains supplementary material available at 10.1186/s12884-022-04534-0.

Chu C, Wang S, Rudd JC, Ibrahim AMH, Xue Q, Devkota RN, Baker JA, Baker S, Simoneaux B, Opena G, Dong H, Liu X, Jessup KE, Chen MS, Hui K, Metz R, Johnson CD, Zhang ZS, Liu S. A new strategy for using historical imbalanced yield data to conduct genome-wide association studies and develop genomic prediction models for wheat breeding. Mol Breed 2022;42:18. [PMID: 37309459 PMCID: PMC10248704 DOI: 10.1007/s11032-022-01287-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/31/2021] [Accepted: 03/02/2022] [Indexed: 06/14/2023]

Abstract

Using imbalanced historical yield data to predict performance and select new lines is an arduous breeding task. Genome-wide association studies (GWAS) and high throughput genotyping based on sequencing techniques can increase prediction accuracy. An association mapping panel of 227 Texas elite (TXE) wheat breeding lines was used for GWAS and as a training population to develop prediction models for grain yield selection. An imbalanced set of yield data collected from 102 environments (year-by-location) over 10 years, through testing yield in 40-66 lines each year at 6-14 locations with 38-41 lines repeated in the test in any two consecutive years, was used. Based on correlations among data from different environments within two adjacent years and heritability estimated in each environment, yield data from 87 environments were selected and assigned to two correlation-based groups. The yield best linear unbiased estimation (BLUE) from each group, along with reaction to greenbug and Hessian fly in each line, was used for GWAS to reveal genomic regions associated with yield and insect resistance. A total of 74 genomic regions were associated with grain yield and two of them were commonly detected in both correlation-based groups. Greenbug resistance in TXE lines was mainly controlled by Gb3 on chromosome 7DL in addition to two novel regions on 3DL and 6DS, and Hessian fly resistance was conferred by the region on 1AS. Genomic prediction models developed in two correlation-based groups were validated using a set of 105 new advanced breeding lines and the model from correlation-based group G2 was more reliable for prediction. This research not only identified genomic regions associated with yield and insect resistance but also established the method of using historical imbalanced breeding data to develop a genomic prediction model for crop improvement.

Supplementary Information

The online version contains supplementary material available at 10.1007/s11032-022-01287-8.

Sadeghi S, Khalili D, Ramezankhani A, Mansournia MA, Parsaeian M. Diabetes mellitus risk prediction in the presence of class imbalance using flexible machine learning methods. BMC Med Inform Decis Mak 2022;22:36. [PMID: 35139846 PMCID: PMC8830137 DOI: 10.1186/s12911-022-01775-z] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 10/10/2021] [Accepted: 02/07/2022] [Indexed: 12/24/2022] Open

Abstract

Background

Early detection and prediction of type two diabetes mellitus incidence by baseline measurements could reduce associated complications in the future. The low incidence rate of diabetes in comparison with non-diabetes makes accurate prediction of minority diabetes class more challenging.

Methods

The performance of deep neural network (DNN), extreme gradient boosting (XGBoost), and random forest (RF) models was compared in predicting the minority diabetes class in Tehran Lipid and Glucose Study (TLGS) cohort data. Threshold changing, cost-sensitive learning, and over- and under-sampling strategies were compared as solutions to class imbalance for improving algorithm performance.
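The abstract does not say how the misclassification costs were set. One common heuristic for cost-sensitive learning weights each class inversely to its frequency, w_c = n / (k * n_c), where n is the sample count, k the number of classes, and n_c the size of class c. The sketch below computes these weights for a hypothetical label vector; the formula and data are assumptions for illustration, not the study's configuration.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Hypothetical cohort: 10 incident diabetes cases among 100 participants.
labels = [1] * 10 + [0] * 90
weights = balanced_class_weights(labels)
print(weights)
```

With these weights, each class contributes the same total weight to a weighted loss, so errors on the rare class are no longer outvoted by the majority.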

Results

DNN, with the highest accuracy in predicting diabetes (54.8%), outperformed XGBoost and RF in terms of AUROC, g-mean, and f1-measure on the original imbalanced data. Changing the threshold to maximize the f1-measure improved g-mean and f1-measure in all three algorithms. Repeated edited nearest neighbors (RENN) under-sampling in DNN and cost-sensitive learning in tree-based algorithms were the best solutions to tackle the imbalance issue. RENN increased ROC and Precision-Recall AUCs, g-mean, and f1-measure from 0.857, 0.603, 0.713, 0.575 to 0.862, 0.608, 0.773, 0.583, respectively, in DNN. Weighting improved g-mean and f1-measure from 0.667, 0.554 to 0.776, 0.588 in XGBoost, and from 0.659, 0.543 to 0.775, 0.566 in RF, respectively. Also, ROC and Precision-Recall AUCs in RF increased from 0.840, 0.578 to 0.846, 0.591, respectively.
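Threshold changing keeps the fitted model as it is and only moves the probability cut-off used to call a case positive. A minimal sketch that picks the cut-off with the highest F1 among the observed scores; the labels and probabilities are made up, and the helper names are illustrative.

```python
def f1_at_threshold(y_true, y_prob, threshold):
    """F1 score when probabilities >= threshold are predicted positive."""
    tp = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p >= threshold)
    fp = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p >= threshold)
    fn = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p < threshold)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_f1_threshold(y_true, y_prob):
    """Search the observed scores for the cut-off that maximizes F1."""
    candidates = sorted(set(y_prob))
    return max(candidates, key=lambda t: f1_at_threshold(y_true, y_prob, t))


# Hypothetical validation scores for an imbalanced problem (3 positives in 10).
y_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
y_prob = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.60]
best = best_f1_threshold(y_true, y_prob)
print(best, round(f1_at_threshold(y_true, y_prob, 0.5), 3),
      round(f1_at_threshold(y_true, y_prob, best), 3))
```

The cut-off should be chosen on validation data and then applied unchanged to the test set, otherwise the reported F1 is optimistically biased.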

Conclusion

G-mean showed the largest increase under all imbalance solutions. Weighting and threshold changing are efficient strategies that handle class imbalance faster than resampling methods. Among sampling strategies, under-sampling methods performed better than the others.
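
The threshold-changing strategy and the g-mean metric named in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names (`confusion`, `f1_and_gmean`, `best_f1_threshold`) and the ten toy scores are hypothetical, g-mean is taken as the square root of sensitivity times specificity, and the candidate thresholds are simply the distinct predicted scores.

```python
import math


def confusion(y_true, scores, threshold):
    """Count TP, FP, TN, FN when a score >= threshold is called positive."""
    tp = fp = tn = fn = 0
    for y, s in zip(y_true, scores):
        pred = 1 if s >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def f1_and_gmean(y_true, scores, threshold):
    """Return (f1-measure, g-mean) at the given decision threshold."""
    tp, fp, tn, fn = confusion(y_true, scores, threshold)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    gmean = math.sqrt(recall * specificity)
    return f1, gmean


def best_f1_threshold(y_true, scores):
    """Pick the candidate threshold (a distinct score) that maximizes f1."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: f1_and_gmean(y_true, scores, t)[0])


# Toy data (not from the study): 8 negatives, 2 positives, predicted scores.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.45, 0.40, 0.60]

default_f1, default_gmean = f1_and_gmean(y_true, scores, 0.5)
tuned = best_f1_threshold(y_true, scores)
tuned_f1, tuned_gmean = f1_and_gmean(y_true, scores, tuned)

print(f"threshold 0.50: f1={default_f1:.3f}, g-mean={default_gmean:.3f}")
print(f"threshold {tuned:.2f}: f1={tuned_f1:.3f}, g-mean={tuned_gmean:.3f}")
```

On this toy data, lowering the threshold below the default 0.5 recovers the missed positive at the cost of one false positive, which raises both f1-measure and g-mean. In practice the threshold would be tuned on validation data rather than on the test set.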

Chen W, Han X, Wang J, Cao Y, Jia X, Zheng Y, Zhou J, Zeng W, Wang L, Shi H, Feng J. Deep diagnostic agent forest (DDAF): A deep learning pathogen recognition system for pneumonia based on CT. Comput Biol Med 2021;141:105143. [PMID: 34953357 DOI: 10.1016/j.compbiomed.2021.105143] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 10/19/2021] [Revised: 12/05/2021] [Accepted: 12/12/2021] [Indexed: 11/03/2022]

Affiliation(s)

Weixiang Chen: Department of Automation, Beijing National Research Center for Information Science and Technology, Tsinghua University, Beijing, China
Xiaoyu Han: Department of Radiology, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China; Department of Laboratory Medicine, Liyuan Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China
Jian Wang: Department of Clinical Laboratory, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China; Research Center for Tissue Engineering and Regenerative Medicine, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China
Yukun Cao: Department of Radiology, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China; Department of Laboratory Medicine, Liyuan Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China
Xi Jia: Department of Radiology, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China; Department of Laboratory Medicine, Liyuan Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China
Yuting Zheng: Department of Radiology, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China; Department of Laboratory Medicine, Liyuan Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China
Jie Zhou: Department of Automation, Beijing National Research Center for Information Science and Technology, Tsinghua University, Beijing, China
Wenjuan Zeng: Department of Clinical Laboratory, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China; Research Center for Tissue Engineering and Regenerative Medicine, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China
Lin Wang: Department of Clinical Laboratory, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China; Research Center for Tissue Engineering and Regenerative Medicine, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China
Heshui Shi: Department of Radiology, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China; Department of Laboratory Medicine, Liyuan Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China
Jianjiang Feng: Department of Automation, Beijing National Research Center for Information Science and Technology, Tsinghua University, Beijing, China

Wu J, Shen J, Xu M, Shao M. A novel combined dynamic ensemble selection model for imbalanced data to detect COVID-19 from complete blood count. Comput Methods Programs Biomed 2021;211:106444. [PMID: 34614451 PMCID: PMC8479386 DOI: 10.1016/j.cmpb.2021.106444] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 06/27/2021] [Accepted: 09/22/2021] [Indexed: 06/01/2023]

Sayed GI, Soliman MM, Hassanien AE. A novel melanoma prediction model for imbalanced data using optimized SqueezeNet by bald eagle search optimization. Comput Biol Med 2021;136:104712. [PMID: 34388470 DOI: 10.1016/j.compbiomed.2021.104712] [Citation(s) in RCA: 24] [Impact Index Per Article: 8.0] [Received: 04/23/2021] [Revised: 07/28/2021] [Accepted: 07/28/2021] [Indexed: 10/20/2022]

Gnip P, Vokorokos L, Drotár P. Selective oversampling approach for strongly imbalanced data. PeerJ Comput Sci 2021;7:e604. [PMID: 34239981 PMCID: PMC8237317 DOI: 10.7717/peerj-cs.604] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 03/05/2021] [Accepted: 05/31/2021] [Indexed: 06/03/2023]

Xiao Y, Wu J, Lin Z. Cancer diagnosis using generative adversarial networks based on deep learning from imbalanced data. Comput Biol Med 2021;135:104540. [PMID: 34153791 DOI: 10.1016/j.compbiomed.2021.104540] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Received: 12/22/2020] [Revised: 05/14/2021] [Accepted: 05/26/2021] [Indexed: 11/19/2022]

Chan JH, Li C. Learning from imbalanced COVID-19 chest X-ray (CXR) medical imaging data. Methods 2021;202:31-39. [PMID: 34090971 DOI: 10.1016/j.ymeth.2021.06.002] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 02/02/2021] [Revised: 05/07/2021] [Accepted: 06/01/2021] [Indexed: 12/24/2022] Open

Chiong R, Budhi GS, Dhakal S, Chiong F. A textual-based featuring approach for depression detection using machine learning classifiers and social media texts. Comput Biol Med 2021;135:104499. [PMID: 34174760 DOI: 10.1016/j.compbiomed.2021.104499] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 02/12/2021] [Revised: 05/12/2021] [Accepted: 05/14/2021] [Indexed: 10/21/2022]

Ma JH, Feng Z, Wu JY, Zhang Y, Di W. Learning from imbalanced fetal outcomes of systemic lupus erythematosus in artificial neural networks. BMC Med Inform Decis Mak 2021;21:127. [PMID: 33845834 PMCID: PMC8042715 DOI: 10.1186/s12911-021-01486-x] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 01/11/2021] [Accepted: 03/31/2021] [Indexed: 11/10/2022] Open

Srinivasan R, Subalalitha CN. Sentimental analysis from imbalanced code-mixed data using machine learning approaches. Distrib Parallel Databases 2021;41:37-52. [PMID: 33776212 PMCID: PMC7980744 DOI: 10.1007/s10619-021-07331-4] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Accepted: 03/03/2021] [Indexed: 05/29/2023]

Wang X, Zhai M, Ren Z, Ren H, Li M, Quan D, Chen L, Qiu L. Exploratory study on classification of diabetes mellitus through a combined Random Forest Classifier. BMC Med Inform Decis Mak 2021;21:105. [PMID: 33743696 PMCID: PMC7980612 DOI: 10.1186/s12911-021-01471-4] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Received: 10/26/2020] [Accepted: 03/11/2021] [Indexed: 12/23/2022] Open

Abstract

BACKGROUND

Diabetes mellitus (DM) has become the third chronic non-communicable disease affecting patients, after tumors and cardiovascular and cerebrovascular diseases, and is one of the major public health problems in the world. It is therefore of great importance to identify individuals at high risk of DM in order to establish prevention strategies.

METHODS

Medical data often have a high-dimensional feature space with high feature redundancy, and they are often imbalanced. To address these problems, this study explored different supervised classifiers, combined with SVM-SMOTE and two feature dimensionality reduction methods (logistic stepwise regression and LASSO), to classify diabetes survey sample data with unbalanced categories and complex related factors. The classification results of 4 supervised classifiers under 4 data processing methods were analyzed and discussed. Five indicators (Accuracy, Precision, Recall, F1-Score, and AUC) were selected as the key indicators for evaluating the performance of the classification model.

RESULTS

According to the results, the Random Forest Classifier combined with SVM-SMOTE resampling and LASSO feature screening (Accuracy = 0.890, Precision = 0.869, Recall = 0.919, F1-Score = 0.893, AUC = 0.948) proved the best way to identify those at high risk of DM. The combined algorithm also helps enhance classification performance for predicting people at high risk of DM. In addition, age, region, heart rate, hypertension, hyperlipidemia, and BMI are the six most critical characteristic variables affecting diabetes.

CONCLUSIONS

The Random Forest Classifier combined with SVM-SMOTE and LASSO feature reduction performs best in identifying individuals at high risk of DM. The combined method proposed in the study would be a good tool for early screening of DM.
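
As a quick consistency check on the indicators reported in the Results of this abstract, the F1-Score can be recomputed from the reported Precision and Recall using the standard harmonic-mean definition. The sketch below is illustrative only; the function name `f1_score` is hypothetical and the inputs are the rounded values from the abstract, so agreement is expected only to about three decimal places.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (standard F1 definition)."""
    return 2 * precision * recall / (precision + recall)


# Rounded values as reported in the abstract for the best model
# (Random Forest with SVM-SMOTE and LASSO).
reported = {"precision": 0.869, "recall": 0.919, "f1": 0.893}

computed = f1_score(reported["precision"], reported["recall"])
print(f"computed F1 = {computed:.3f}, reported F1 = {reported['f1']:.3f}")
```

The recomputed value rounds to 0.893, which matches the reported F1-Score.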

Zunair H, Hamza AB. Synthesis of COVID-19 chest X-rays using unpaired image-to-image translation. Soc Netw Anal Min 2021;11:23. [PMID: 33643491 PMCID: PMC7903408 DOI: 10.1007/s13278-021-00731-5] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Received: 10/13/2020] [Revised: 01/05/2021] [Accepted: 02/04/2021] [Indexed: 12/28/2022]

Sampath V, Maurtua I, Aguilar Martín JJ, Gutierrez A. A survey on generative adversarial networks for imbalance problems in computer vision tasks. J Big Data 2021;8:27. [PMID: 33552840 PMCID: PMC7845583 DOI: 10.1186/s40537-021-00414-0] [Citation(s) in RCA: 22] [Impact Index Per Article: 7.3] [Received: 07/30/2020] [Accepted: 01/16/2021] [Indexed: 05/21/2023]

Abstract

The development of any computer vision application starts with acquiring images and data, followed by preprocessing and pattern recognition steps to perform a task. When the acquired images are highly imbalanced and inadequate, the desired task may not be achievable. Unfortunately, imbalance problems in acquired image datasets are inevitable in certain complex real-world problems such as anomaly detection, emotion recognition, medical image analysis, fraud detection, metallic surface defect detection, and disaster prediction. The performance of computer vision algorithms can deteriorate significantly when the training dataset is imbalanced. In recent years, Generative Adversarial Neural Networks (GANs) have gained immense attention from researchers across a variety of application domains due to their capability to model complex real-world image data. Importantly, GANs can be used not only to generate synthetic images; their adversarial learning idea has also shown good potential for restoring balance in imbalanced datasets. In this paper, we examine the most recent developments in GAN-based techniques for addressing imbalance problems in image data. The real-world challenges and implementations of synthetic image generation based on GANs are extensively covered in this survey. Our survey first introduces various imbalance problems in computer vision tasks and their existing solutions, and then examines key concepts such as deep generative image models and GANs. After that, we propose a taxonomy that summarizes GAN-based techniques for addressing imbalance problems in computer vision tasks into three major categories: (1) image-level imbalances in classification, (2) object-level imbalances in object detection, and (3) pixel-level imbalances in segmentation tasks. We elaborate on the imbalance problems of each group and provide GAN-based solutions for each. Readers will understand how GAN-based techniques can handle imbalance problems and boost the performance of computer vision algorithms.
