1
Zeinolabedini Rezaabad M, Lacey H, Marshall L, Johnson F. Influence of resampling techniques on Bayesian network performance in predicting increased algal activity. Water Research 2023; 244:120558. [PMID: 37666153 DOI: 10.1016/j.watres.2023.120558] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/23/2023] [Revised: 08/10/2023] [Accepted: 08/30/2023] [Indexed: 09/06/2023]
Abstract
Early warning of increased algal activity is important to mitigate potential impacts on aquatic life and human health. While many methods have been developed to predict increased algal activity, an ongoing issue is that severe algal blooms often occur with low frequency in water bodies. This results in imbalanced datasets available for model specification, leading to poor predictions of the frequency of increased algal activity. One approach to address this is to resample datasets of increased algal activity to increase the prevalence of higher than normal algal activity in calibration data and ultimately improve model predictions. This study aims to investigate the use of resampling techniques to address the imbalanced dataset and determine if such methods can improve the prediction of increased algal activity. Three techniques were investigated: K-means under-sampling (US_Kmeans), synthetic minority over-sampling technique (SMOTE), and 'SMOTE and cluster-based under-sampling technique' (SCUT). The resampling methods were applied to a Bayesian network (BN) model of Lake Burragorang in New South Wales, Australia. The model was developed to predict chlorophyll-a (chl-a) using a range of water quality parameters as predictors. The original data and each of the balanced datasets were used for BN structure and parameter learning. The results showed that the best graphical structure was obtained by adding synthetic data from SMOTE, which gave the highest true positive rate (TPR) and area under the curve (AUC). When compared using a fixed graphical structure for the BN, all resampling techniques increased the ability of the BN to detect events with higher probability of increased algal activity. The resampling model results can also be used to better understand the most important influences on high chl-a concentrations and suggest future data collection and model development priorities.
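The over-sampling idea named in this abstract can be illustrated with a minimal sketch. It shows only the core interpolation step of SMOTE (a synthetic point placed on the segment between a minority sample and one of its nearest minority neighbours); it is not the authors' implementation, and the function name `smote_like`, the toy data, and the parameter defaults are assumptions made for illustration.

```python
import random
from math import dist

def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point, excluding itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: dist(base, p),
        )[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> mate
        synthetic.append(tuple(b + gap * (m - b) for b, m in zip(base, mate)))
    return synthetic

# Hypothetical toy data: two predictors for rare "high chl-a" observations
high_chla = [(1.0, 2.0), (1.2, 2.1), (0.9, 1.8), (1.1, 2.3)]
extra = smote_like(high_chla, n_new=6)
balanced_minority = high_chla + extra
```

Because every synthetic point lies between two observed minority samples, the sketch raises the prevalence of the rare class without leaving the region the observed events already occupy.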
Affiliation(s)
- Maryam Zeinolabedini Rezaabad
- Water Research Centre, School of Civil and Environmental Engineering, University of New South Wales, Kensington, New South Wales, Australia; ARC Training Centre Data Analytics for Resources and Environments, School of Life and Environmental Sciences, The University of Sydney, Camperdown, New South Wales, Australia.
- Lucy Marshall
- Water Research Centre, School of Civil and Environmental Engineering, University of New South Wales, Kensington, New South Wales, Australia; ARC Training Centre Data Analytics for Resources and Environments, School of Life and Environmental Sciences, The University of Sydney, Camperdown, New South Wales, Australia; Faculty of Science and Engineering, Macquarie University, North Ryde, New South Wales, Australia
- Fiona Johnson
- Water Research Centre, School of Civil and Environmental Engineering, University of New South Wales, Kensington, New South Wales, Australia; ARC Training Centre Data Analytics for Resources and Environments, School of Life and Environmental Sciences, The University of Sydney, Camperdown, New South Wales, Australia
2
Ren J, Wang Y, Mao M, Cheung YM. Equalization ensemble for large scale highly imbalanced data classification. Knowl Based Syst 2022. [DOI: 10.1016/j.knosys.2022.108295] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 10/19/2022]
3
He F, Zhang W, Yan Z. A novel multi-stage ensemble model for credit scoring based on synthetic sampling and feature transformation. Journal of Intelligent & Fuzzy Systems 2022. [DOI: 10.3233/jifs-211467] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/15/2022]
Abstract
Credit scoring has become increasingly important for financial institutions. With the advancement of artificial intelligence, machine learning methods, especially ensemble learning methods, have become increasingly popular for credit scoring. However, the problems of imbalanced data distribution and underutilized feature information have not been sufficiently addressed. To make the credit scoring model more adaptable to imbalanced datasets, the original model-based synthetic sampling method is extended herein to balance the datasets by generating appropriate minority samples to alleviate class overlap. To enable the credit scoring model to extract inherent correlations from features, a new bagging-based feature transformation method is proposed, which transforms features using a tree-based algorithm and selects features using the chi-square statistic. Furthermore, a two-layer ensemble method that combines the advantages of dynamic ensemble selection and stacking is proposed to improve the classification performance of the proposed multi-stage ensemble model. Finally, four standardized datasets are used to evaluate the performance of the proposed ensemble model using six evaluation metrics. The experimental results confirm that the proposed ensemble model is effective in improving classification performance and is superior to other benchmark models.
Affiliation(s)
- Fang He
- School of Information Management and Artificial Intelligence, Zhejiang University of Finance and Economics, Hangzhou, China
- Wenyu Zhang
- School of Information Management and Artificial Intelligence, Zhejiang University of Finance and Economics, Hangzhou, China
- Zhijia Yan
- School of Information Technology, Zhejiang Yuyin College of Vocational Technology, Hangzhou, China
4
Shogrkhodaei SZ, Razavi-Termeh SV, Fathnia A. Spatio-temporal modeling of PM 2.5 risk mapping using three machine learning algorithms. Environmental Pollution (Barking, Essex: 1987) 2021; 289:117859. [PMID: 34340183 DOI: 10.1016/j.envpol.2021.117859] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Received: 04/18/2021] [Revised: 06/29/2021] [Accepted: 07/26/2021] [Indexed: 06/13/2023]
Abstract
Urban air pollution is one of the most critical issues that affect the environment, community health, economy, and management of urban areas. From a public health perspective, PM2.5 is one of the primary air pollutants, especially in the metropolis of Tehran. Owing to the different patterns of PM2.5 in different seasons, spatio-temporal modeling and identification of high-risk areas to reduce its effects seem necessary. The purpose of this study was spatio-temporal modeling and preparation of PM2.5 risk mapping using three machine learning algorithms (random forest (RF), AdaBoost, and stochastic gradient descent (SGD)) in the metropolis of Tehran, Iran. Therefore, in the first step, to prepare the dependent variable data, the PM2.5 average was used for the four seasons of spring, summer, autumn, and winter. Then, using remote sensing (RS) and a geographic information system (GIS), independent data such as temperature, maximum temperature, minimum temperature, wind speed, rainfall, humidity, normalized difference vegetation index (NDVI), population density, street density, and distance to industrial centers were prepared as a seasonal average. For spatio-temporal modeling using machine learning algorithms, 70% of the data were used for training and 30% for validation. The frequency ratio (FR) model was used as input to machine learning algorithms to calculate the spatial relationship between PM2.5 and the effective parameters. Finally, spatio-temporal modeling and PM2.5 risk mapping were performed using three machine learning algorithms. The receiver operating characteristic (ROC) area under the curve (AUC) results showed that the RF algorithm had the greatest modeling accuracy, with values of 0.926, 0.94, 0.949, and 0.949 for spring, summer, autumn, and winter, respectively. According to the RF model, the most important variable in spring and autumn was NDVI. Temperature and distance to industrial centers were the most important variables in the summer and winter, respectively. The results showed that the risk of PM2.5 was highest in autumn, followed by winter, summer, and spring.
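The evaluation protocol mentioned in this abstract (a 70/30 training and validation split, scored by ROC AUC) can be sketched in a few lines. This is an illustrative sketch only, not the study's workflow: the helper names `train_validation_split` and `roc_auc` are hypothetical, and AUC is computed here through its rank interpretation (the probability that a randomly chosen positive sample scores above a randomly chosen negative one, with ties counted as one half).

```python
import random

def train_validation_split(rows, train_fraction=0.7, seed=0):
    """Shuffle the rows and split them into training and validation parts."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# Hypothetical example: ten sample indices split 70/30, and four scored samples
train_rows, validation_rows = train_validation_split(range(10))
example_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

The pairwise form is quadratic in the number of samples, which is acceptable for a sketch; a sorted-rank version would be preferable for large validation sets.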
Affiliation(s)
- Seyed Vahid Razavi-Termeh
- Geoinformation Tech. Center of Excellence, Faculty of Geodesy and Geomatics Engineering, K.N. Toosi University of Technology, Tehran, 19697, Iran.
- Amanollah Fathnia
- Department of Geography, Faculty of Literature and Humanities, Razi University, Kermanshah, Iran.
5
Affiliation(s)
- Yang Feng
- Department of Biostatistics, School of Global Public Health, New York University, New York, New York, USA
- Min Zhou
- Division of Science and Technology, Beijing Normal University‐Hong Kong Baptist University United International College, Zhuhai, China
- Xin Tong
- Department of Data Sciences and Operations, Marshall School of Business, University of Southern California, Los Angeles, California, USA
6
Esposito C, Landrum GA, Schneider N, Stiefl N, Riniker S. GHOST: Adjusting the Decision Threshold to Handle Imbalanced Data in Machine Learning. J Chem Inf Model 2021; 61:2623-2640. [PMID: 34100609 DOI: 10.1021/acs.jcim.1c00160] [Citation(s) in RCA: 46] [Impact Index Per Article: 11.5] [Indexed: 12/21/2022]
Abstract
Machine learning classifiers trained on class-imbalanced data are prone to overpredict the majority class. This leads to a larger misclassification rate for the minority class, which in many real-world applications is the class of interest. For binary data, the classification threshold is set by default to 0.5 which, however, is often not ideal for imbalanced data. Adjusting the decision threshold is a good strategy to deal with the class imbalance problem. In this work, we present two different automated procedures for the selection of the optimal decision threshold for imbalanced classification. A major advantage of our procedures is that they do not require retraining of the machine learning models or resampling of the training data. The first approach is specific to random forest (RF), while the second approach, named GHOST, can be potentially applied to any machine learning classifier. We tested these procedures on 138 public drug discovery data sets containing structure-activity data for a variety of pharmaceutical targets. We show that both thresholding methods significantly improve the performance of RF. We tested the use of GHOST with four different classifiers in combination with two molecular descriptors, and we found that most classifiers benefit from threshold optimization. GHOST also outperformed other strategies, including random undersampling and conformal prediction. Finally, we show that our thresholding procedures can be effectively applied to real-world drug discovery projects, where the imbalance and characteristics of the data vary greatly between the training and test sets.
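The general idea of moving the decision threshold away from 0.5 can be sketched as a simple scan over candidate thresholds. This is not the GHOST procedure described in the paper; it is a simplified stand-in that assumes predicted probabilities are already available and picks the candidate that maximizes balanced accuracy, a metric chosen here only for illustration. The names `balanced_accuracy` and `best_threshold` and the toy data are hypothetical.

```python
def balanced_accuracy(labels, scores, threshold):
    """Mean of sensitivity and specificity when predicting 1 for scores >= threshold."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return 0.5 * (tp / n_pos + tn / n_neg)

def best_threshold(labels, scores, candidates=None):
    """Return the candidate threshold with the highest balanced accuracy."""
    if candidates is None:
        # 0.05, 0.10, ..., 0.95
        candidates = [round(0.05 * i, 2) for i in range(1, 20)]
    return max(candidates, key=lambda t: balanced_accuracy(labels, scores, t))

# Hypothetical imbalanced example: 8 negatives, 2 positives, and a model
# whose predicted probabilities for the minority class stay below 0.5
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.05, 0.10, 0.10, 0.15, 0.20, 0.20, 0.25, 0.30, 0.36, 0.45]

default_score = balanced_accuracy(labels, scores, 0.5)
chosen = best_threshold(labels, scores)
```

With the default threshold every minority sample in this toy set is missed, while a lower threshold separates the classes without retraining the model or resampling the data, which is the property the abstract highlights.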
Affiliation(s)
- Carmen Esposito
- Laboratory of Physical Chemistry, ETH Zurich, Vladimir-Prelog-Weg 2, 8093 Zurich, Switzerland
- Gregory A Landrum
- Laboratory of Physical Chemistry, ETH Zurich, Vladimir-Prelog-Weg 2, 8093 Zurich, Switzerland; T5 Informatics GmbH, Spalenring 11, 4055 Basel, Switzerland
- Nadine Schneider
- Novartis Institutes for BioMedical Research, Novartis Pharma AG, Novartis Campus, 4002 Basel, Switzerland
- Nikolaus Stiefl
- Novartis Institutes for BioMedical Research, Novartis Pharma AG, Novartis Campus, 4002 Basel, Switzerland
- Sereina Riniker
- Laboratory of Physical Chemistry, ETH Zurich, Vladimir-Prelog-Weg 2, 8093 Zurich, Switzerland
7
Sun F, Fang F, Wang R, Wan B, Guo Q, Li H, Wu X. An Impartial Semi-Supervised Learning Strategy for Imbalanced Classification on VHR Images. Sensors 2020; 20:s20226699. [PMID: 33238513 PMCID: PMC7700671 DOI: 10.3390/s20226699] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 10/27/2020] [Revised: 11/20/2020] [Accepted: 11/21/2020] [Indexed: 11/20/2022]
Abstract
Imbalanced learning is a common problem in remote sensing imagery-based land-use and land-cover classifications. Imbalanced learning can lead to a reduction in classification accuracy and even the omission of the minority class. In this paper, an impartial semi-supervised learning strategy based on extreme gradient boosting (ISS-XGB) is proposed to classify very high resolution (VHR) images with imbalanced data. ISS-XGB solves multi-class classification by using several semi-supervised classifiers. It first employs multi-group unlabeled data to eliminate the imbalance of training samples and then utilizes gradient boosting-based regression to simulate the target classes with positive and unlabeled samples. In this study, experiments were conducted on eight study areas with different imbalanced situations. The results showed that ISS-XGB provided a comparable but more stable performance than most commonly used classification approaches (i.e., random forest (RF), XGB, multilayer perceptron (MLP), and support vector machine (SVM)), positive and unlabeled learning (PU-Learning) methods (PU-BP and PU-SVM), and typical synthetic sample-based imbalanced learning methods. Especially under extremely imbalanced situations, ISS-XGB can provide high accuracy for the minority class without losing overall performance (the average overall accuracy achieves 85.92%). The proposed strategy has great potential in solving the imbalanced classification problems in remote sensing.
Affiliation(s)
- Fei Sun
- School of Geography and Information Engineering, China University of Geosciences, Wuhan 430078, China
- Academy of Computer, Huanggang Normal University, No. 146 Xinggang 2nd Road, Huanggang 438000, China
- Fang Fang
- School of Geography and Information Engineering, China University of Geosciences, Wuhan 430078, China
- National Engineering Research Center for Geographic Information System, China University of Geosciences, Wuhan 430078, China
- Run Wang
- School of Geography and Information Engineering, China University of Geosciences, Wuhan 430078, China
- Key Laboratory of Geological Survey and Evaluation of Ministry of Education, China University of Geosciences, Wuhan 430078, China
- Correspondence: Tel.: +86-027-6788-3728
- Bo Wan
- School of Geography and Information Engineering, China University of Geosciences, Wuhan 430078, China
- National Engineering Research Center for Geographic Information System, China University of Geosciences, Wuhan 430078, China
- Qinghua Guo
- State Key Laboratory of Vegetation and Environmental Change, Institute of Botany, Chinese Academy of Sciences, Beijing 100093, China
- Hong Li
- School of Geography and Information Engineering, China University of Geosciences, Wuhan 430078, China
- National Engineering Research Center for Geographic Information System, China University of Geosciences, Wuhan 430078, China
- Xincai Wu
- School of Geography and Information Engineering, China University of Geosciences, Wuhan 430078, China
- National Engineering Research Center for Geographic Information System, China University of Geosciences, Wuhan 430078, China
8
Feng S, Zhao C, Fu P. A cluster-based hybrid sampling approach for imbalanced data classification. The Review of Scientific Instruments 2020; 91:055101. [PMID: 32486749 DOI: 10.1063/5.0008935] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 03/27/2020] [Accepted: 04/15/2020] [Indexed: 06/11/2023]
Abstract
When processing instrumental data by using classification approaches, the imbalanced dataset problem is usually challenging. As the minority class instances could be overwhelmed by the majority class instances, training a typical classifier with such a dataset directly might get poor results in classifying the minority class. We propose a cluster-based hybrid sampling approach CUSS (Cluster-based Under-sampling and SMOTE) for imbalanced dataset classification, which belongs to the type of data-level methods and is different from previously proposed hybrid methods. A new cluster-based under-sampling method is designed for CUSS, and a new strategy to set the expected instance number according to data distribution in the original training dataset is also proposed in this paper. The proposed method is compared with five other popular resampling methods on 15 datasets with different instance numbers and different imbalance ratios. The experimental results show that the CUSS method has good performance and outperforms other state-of-the-art methods.
Affiliation(s)
- Shou Feng
- College of Information and Communication Engineering, Harbin Engineering University, Harbin 150001, China
- Chunhui Zhao
- College of Information and Communication Engineering, Harbin Engineering University, Harbin 150001, China
- Ping Fu
- School of Electronics and Information Engineering, Harbin Institute of Technology, Harbin 150001, China
9
Brøndum RF, Michaelsen TY, Bøgsted M. Regression on imperfect class labels derived by unsupervised clustering. Brief Bioinform 2020; 22:2012-2019. [PMID: 32124917 PMCID: PMC7986660 DOI: 10.1093/bib/bbaa014] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 09/03/2019] [Revised: 01/23/2020] [Accepted: 01/24/2020] [Indexed: 12/29/2022]
Abstract
Regressing an outcome on class labels identified by unsupervised clustering is customary in many applications. However, it is common to ignore the misclassification of class labels caused by the learning algorithm, which potentially leads to serious bias of the estimated effect parameters. Because of their generality, we suggest addressing the problem with regression calibration or the misclassification simulation and extrapolation method. Performance is illustrated by simulated data from Gaussian mixture models, documenting a reduced bias and improved coverage of confidence intervals when adjusting for misclassification with either method. Finally, we apply our method to data from a previous study, which regressed overall survival on class labels derived from unsupervised clustering of gene expression data from bone marrow samples of multiple myeloma patients.
Affiliation(s)
- Rasmus Froberg Brøndum
- Corresponding author: Rasmus Froberg Brøndum, Sdr skovvej 15, DK-9000 Aalborg, Denmark