301
|
Asniar, Maulidevi NU, Surendro K. SMOTE-LOF for noise identification in imbalanced data classification. JOURNAL OF KING SAUD UNIVERSITY - COMPUTER AND INFORMATION SCIENCES 2021. [DOI: 10.1016/j.jksuci.2021.01.014] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Indexed: 10/22/2022]
|
302
|
SMOTE-Based Weighted Deep Rotation Forest for the Imbalanced Hyperspectral Data Classification. REMOTE SENSING 2021. [DOI: 10.3390/rs13030464] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Indexed: 12/25/2022]
Abstract
Conventional classification algorithms have shown great success in balanced hyperspectral data classification. However, imbalanced class distribution is a fundamental problem of hyperspectral data, and it is regarded as one of the great challenges in classification tasks. To solve this problem, a non-ANN-based deep learning method, namely the SMOTE-Based Weighted Deep Rotation Forest (SMOTE-WDRoF), is proposed in this paper. First, the neighboring pixels of instances are introduced as spatial information, and balanced datasets are created using the SMOTE algorithm. Second, these datasets are fed into the WDRoF model, which consists of a rotation forest and multi-level cascaded random forests. Specifically, the rotation forest is used to generate rotation feature vectors, which are input into the subsequent cascade forest. Furthermore, the output probability of each level and the original data are stacked as the dataset of the next level, and the sample weights are automatically adjusted according to a dynamic weight function constructed from the classification results of each level. Compared with traditional deep learning approaches, the proposed method consumes much less training time. Experimental results on four public hyperspectral datasets demonstrate that the proposed method performs better than support vector machine, random forest, rotation forest, SMOTE combined rotation forest, convolutional neural network, and rotation-based deep forest in multiclass imbalance learning.
|
303
|
A Rebalancing Framework for Classification of Imbalanced Medical Appointment No-show Data. JOURNAL OF DATA AND INFORMATION SCIENCE 2021. [DOI: 10.2478/jdis-2021-0011] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/20/2022]
Abstract
Purpose
This paper aims to improve the classification performance when the data is imbalanced by applying different sampling techniques available in Machine Learning.
Design/methodology/approach
The medical appointment no-show dataset is imbalanced, and when classification algorithms are applied directly to the dataset, it is biased towards the majority class, ignoring the minority class. To avoid this issue, multiple sampling techniques such as Random Over Sampling (ROS), Random Under Sampling (RUS), Synthetic Minority Oversampling TEchnique (SMOTE), ADAptive SYNthetic Sampling (ADASYN), Edited Nearest Neighbor (ENN), and Condensed Nearest Neighbor (CNN) are applied in order to make the dataset balanced. The performance is assessed by the Decision Tree classifier with the listed sampling techniques and the best performance is identified.
Findings
This study compares the performance metrics of various widely used sampling methods. Compared to other techniques, Recall is high when ENN is applied. CNN and ADASYN performed equally well on the imbalanced data.
Research limitations
The testing was carried out with a limited dataset and needs to be repeated with a larger dataset.
Practical implications
This framework will be useful whenever the data is imbalanced in real world scenarios, which ultimately improves the performance.
Originality/value
This paper applies the rebalancing framework to the medical appointment no-show dataset to predict no-shows and remove the bias against the minority class.
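The two simplest techniques named in this abstract, Random Over Sampling (ROS) and Random Under Sampling (RUS), can be illustrated with a minimal sketch. It is not the paper's implementation: the function names `random_oversample` and `random_undersample`, the seed, and the toy labels (0 for "show", 1 for "no-show") are assumptions made for illustration only.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows of smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        rows = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            X_out.append(rng.choice(rows))
            y_out.append(label)
    return X_out, y_out


def random_undersample(X, y, seed=0):
    """Keep a random subset of every class, sized to the smallest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = min(counts.values())
    X_out, y_out = [], []
    for label in counts:
        rows = [x for x, lab in zip(X, y) if lab == label]
        for x in rng.sample(rows, target):
            X_out.append(x)
            y_out.append(label)
    return X_out, y_out


# Hypothetical toy data: 8 majority rows (label 0) and 2 minority rows (label 1).
X = [[i] for i in range(10)]
y = [0] * 8 + [1] * 2
X_ros, y_ros = random_oversample(X, y)
X_rus, y_rus = random_undersample(X, y)
```

ROS keeps every original row and grows the minority class by duplication, while RUS discards majority rows, which is why the two are usually compared on the same classifier as the abstract describes.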
|
304
|
Chongomweru H, Kasem A. A novel ensemble method for classification in imbalanced datasets using split balancing technique based on instance hardness (sBal_IH). Neural Comput Appl 2021. [DOI: 10.1007/s00521-020-05570-7] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 10/22/2022]
|
305
|
Raghuwanshi BS, Shukla S. Classifying imbalanced data using SMOTE based class-specific kernelized ELM. INT J MACH LEARN CYB 2021. [DOI: 10.1007/s13042-020-01232-1] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Indexed: 12/01/2022]
|
306
|
Jain S, Saha A. Rank-based univariate feature selection methods on machine learning classifiers for code smell detection. EVOLUTIONARY INTELLIGENCE 2021. [DOI: 10.1007/s12065-020-00536-z] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Indexed: 11/30/2022]
|
307
|
Shen F, Zhao X, Kou G, Alsaadi FE. A new deep learning ensemble credit risk evaluation model with an improved synthetic minority oversampling technique. Appl Soft Comput 2021. [DOI: 10.1016/j.asoc.2020.106852] [Citation(s) in RCA: 24] [Impact Index Per Article: 6.0] [Indexed: 10/23/2022]
|
308
|
Béquignon OJ, Pawar G, van de Water B, Cronin MT, van Westen GJ. Computational Approaches for Drug-Induced Liver Injury (DILI) Prediction: State of the Art and Challenges. SYSTEMS MEDICINE 2021. [DOI: 10.1016/b978-0-12-801238-3.11535-1] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 12/16/2022]
|
309
|
Performance analysis of cost-sensitive learning methods with application to imbalanced medical data. INFORMATICS IN MEDICINE UNLOCKED 2021. [DOI: 10.1016/j.imu.2021.100690] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Indexed: 11/22/2022]
|
310
|
Fraiwan L, Hassanin O, Fraiwan M, Khassawneh B, Ibnian AM, Alkhodari M. Automatic identification of respiratory diseases from stethoscopic lung sound signals using ensemble classifiers. Biocybern Biomed Eng 2021. [DOI: 10.1016/j.bbe.2020.11.003] [Citation(s) in RCA: 20] [Impact Index Per Article: 5.0] [Indexed: 12/14/2022]
|
311
|
Jing XY, Zhang X, Zhu X, Wu F, You X, Gao Y, Shan S, Yang JY. Multiset Feature Learning for Highly Imbalanced Data Classification. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 2021; 43:139-156. [PMID: 31331881 DOI: 10.1109/tpami.2019.2929166] [Citation(s) in RCA: 27] [Impact Index Per Article: 6.8] [Indexed: 06/10/2023]
Abstract
With the expansion of data, more and more imbalanced data has emerged. When the imbalance ratio (IR) of data is high, most existing imbalanced learning methods decline seriously in classification performance. In this paper, we systematically investigate the highly imbalanced data classification problem, and propose an uncorrelated cost-sensitive multiset learning (UCML) approach for it. Specifically, UCML first constructs multiple balanced subsets through random partition, and then employs multiset feature learning (MFL) to learn discriminant features from the constructed multiset. To enhance the usability of each subset and deal with the non-linearity issue existing in each subset, we further propose a deep metric based UCML (DM-UCML) approach. DM-UCML introduces the generative adversarial network technique into the multiset constructing process, such that each subset has a distribution similar to that of the original dataset. To cope with the non-linearity issue, DM-UCML integrates deep metric learning with MFL, such that more favorable performance can be achieved. In addition, DM-UCML designs a new discriminant term to enhance the discriminability of learned metrics. Experiments on eight traditional highly class-imbalanced datasets and two large-scale datasets indicate that the proposed approaches outperform state-of-the-art highly imbalanced learning methods and are more robust to high IR.
|
312
|
Abd Rahman HA, Wah YB, Huat OS. Predictive Performance of Logistic Regression for Imbalanced Data with Categorical Covariate. PERTANIKA JOURNAL OF SCIENCE AND TECHNOLOGY 2021; 29. [DOI: 10.47836/pjst.29.1.10] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 09/01/2023]
Abstract
Logistic regression is often used for the classification of a binary categorical dependent variable using various types of covariates (continuous or categorical). Imbalanced data will lead to biased parameter estimates and classification performance of the logistic regression model. Imbalanced data occurs when the number of cases in one category of the binary dependent variable is very much smaller than in the other category. This simulation study investigates the effect of imbalanced data, measured by the imbalance ratio, on the parameter estimate of the binary logistic regression with a categorical covariate. Datasets were simulated with controlled percentages of imbalance ratio (IR), from 1% to 50%, and for various sample sizes. The simulated datasets were then modeled using binary logistic regression. The bias in the estimates was measured using MSE (Mean Square Error). The simulation results provided evidence that the effect of imbalance ratio on the parameter estimate of the covariate decreased as sample size increased. The bias of the estimates depended on sample size whereby for sample size 100, 500, 1000 – 2000 and 2500 – 3500, the estimates were biased for IR below 30%, 10%, 5% and 2% respectively. Results also showed that parameter estimates were all biased at IR 1% for all sample sizes. An application using a real dataset supported the simulation results.
|
313
|
EnGRNT: Inference of gene regulatory networks using ensemble methods and topological feature extraction. INFORMATICS IN MEDICINE UNLOCKED 2021. [DOI: 10.1016/j.imu.2021.100773] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/20/2022]
|
314
|
Guo Y, Chu Y, Jiao B, Cheng J, Yu Z, Cui N, Ma L. Evolutionary Dual-Ensemble Class Imbalance Learning for Human Activity Recognition. IEEE TRANSACTIONS ON EMERGING TOPICS IN COMPUTATIONAL INTELLIGENCE 2021. [DOI: 10.1109/tetci.2021.3079966] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Indexed: 11/09/2022]
|
315
|
Pirgazi J, Pirmohammadi A, Shams R. A New Optimal Ensemble Algorithm Based on SVDD Sampling for Imbalanced Data Classification. INT J PATTERN RECOGN 2020. [DOI: 10.1142/s0218001421500208] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 11/18/2022]
Abstract
Nowadays, imbalanced data classification is a hot topic in data mining and recently, several valuable researches have been conducted to overcome certain difficulties in the field. Moreover, those approaches, which are based on ensemble classifiers, have achieved reasonable results. Despite the success of these works, there are still many unsolved issues such as disregarding the importance of samples in balancing, determination of proper number of classifiers and optimizing weights of base classifiers in voting stage of ensemble methods. This paper intends to find an admissible solution for these challenges. The solution suggested in this paper applies the support vector data descriptor (SVDD) for sampling both minority and majority classes. After determining the optimal number of base classifiers, the selected samples are utilized to adjust base classifiers. Finally, genetic algorithm optimization is used in order to find the optimum weights of each base classifier in the voting stage. The proposed method is compared with some existing algorithms. The results of experiments confirm its effectiveness.
Affiliation(s)
- Jamshid Pirgazi
- Department of Electrical and Computer Engineering, University of Science and Technology of Mazandaran, Behshahr, Iran
- Reza Shams
- Faculty of Information Technology and Computer Engineering, Shahrood University of Technology, Shahrood, Iran
|
316
|
Wang MWH, Goodman JM, Allen TEH. Machine Learning in Predictive Toxicology: Recent Applications and Future Directions for Classification Models. Chem Res Toxicol 2020; 34:217-239. [PMID: 33356168 DOI: 10.1021/acs.chemrestox.0c00316] [Citation(s) in RCA: 48] [Impact Index Per Article: 9.6] [Indexed: 02/07/2023]
Abstract
In recent times, machine learning has become increasingly prominent in predictive toxicology as it has shifted from in vivo studies toward in silico studies. Currently, in vitro methods together with other computational methods such as quantitative structure-activity relationship modeling and absorption, distribution, metabolism, and excretion calculations are being used. An overview of machine learning and its applications in predictive toxicology is presented here, including support vector machines (SVMs), random forest (RF) and decision trees (DTs), neural networks, regression models, naïve Bayes, k-nearest neighbors, and ensemble learning. The recent successes of these machine learning methods in predictive toxicology are summarized, and a comparison of some models used in predictive toxicology is presented. In predictive toxicology, SVMs, RF, and DTs are the dominant machine learning methods due to the characteristics of the data available. Lastly, this review describes the current challenges facing the use of machine learning in predictive toxicology and offers insights into the possible areas of improvement in the field.
Affiliation(s)
- Marcus W H Wang
- Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, United Kingdom
- Jonathan M Goodman
- Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, United Kingdom
- Timothy E H Allen
- Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, United Kingdom; MRC Toxicology Unit, University of Cambridge, Hodgkin Building, Lancaster Road, Leicester LE1 7HB, United Kingdom
|
317
|
Antelo-Collado A, Carrasco-Velar R, García-Pedrajas N, Cerruela-García G. Effective Feature Selection Method for Class-Imbalance Datasets Applied to Chemical Toxicity Prediction. J Chem Inf Model 2020; 61:76-94. [PMID: 33350301 DOI: 10.1021/acs.jcim.0c00908] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Indexed: 11/29/2022]
Abstract
During the drug development process, it is common to carry out toxicity tests and adverse effect studies, which are essential to guarantee patient safety and the success of the research. The use of in silico quantitative structure-activity relationship (QSAR) approaches for this task involves processing a huge amount of data that, in many cases, have an imbalanced distribution of active and inactive samples. This is usually termed the class-imbalance problem and may have a significant negative effect on the performance of the learned models. The performance of feature selection (FS) for QSAR models is usually damaged by the class-imbalance nature of the involved datasets. This paper proposes the use of an FS method focused on dealing with the class-imbalance problems. The method is based on the use of FS ensembles constructed by boosting and using two well-known FS methods, fast clustering-based FS and the fast correlation-based filter. The experimental results demonstrate the efficiency of the proposal in terms of the classification performance compared to standard methods. The proposal can be extended to other FS methods and applied to other problems in cheminformatics.
Affiliation(s)
- Nicolás García-Pedrajas
- Department of Computing and Numerical Analysis, University of Córdoba, Campus de Rabanales, Albert Einstein Building, E-14071 Córdoba, Spain
- Gonzalo Cerruela-García
- Department of Computing and Numerical Analysis, University of Córdoba, Campus de Rabanales, Albert Einstein Building, E-14071 Córdoba, Spain
|
318
|
Wu X, Yang Y, Ren L. Entropy difference and kernel-based oversampling technique for imbalanced data learning. INTELL DATA ANAL 2020. [DOI: 10.3233/ida-194761] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Indexed: 11/15/2022]
Abstract
Class imbalance is often a problem in various real-world datasets, where one class contains a small number of data and the other contains a large number of data. It is notably difficult to develop an effective model using traditional data mining and machine learning algorithms without using data preprocessing techniques to balance the dataset. Oversampling is often used as a pretreatment method for imbalanced datasets. Specifically, synthetic oversampling techniques focus on balancing the number of training instances between the majority class and the minority class by generating extra artificial minority class instances. However, the current oversampling techniques simply consider the imbalance of quantity and pay no attention to whether the distribution is balanced or not. Therefore, this paper proposes an entropy difference and kernel-based SMOTE (EDKS) which considers the imbalance degree of dataset from distribution by entropy difference and overcomes the limitation of SMOTE for nonlinear problems by oversampling in the feature space of support vector machine classifier. First, the EDKS method maps the input data into a feature space to increase the separability of the data. Then EDKS calculates the entropy difference in kernel space, determines the majority class and minority class, and finds the sparse regions in the minority class. Moreover, the proposed method balances the data distribution by synthesizing new instances and evaluating its retention capability. Our algorithm can effectively distinguish those datasets with the same imbalance ratio but different distribution. The experimental study evaluates and compares the performance of our method against state-of-the-art algorithms, and then demonstrates that the proposed approach is competitive with the state-of-art algorithms on multiple benchmark imbalanced datasets.
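As background for the synthetic oversampling this abstract builds on, the sketch below shows the basic SMOTE interpolation step in input space: a new minority instance is placed at a random point on the segment between a minority instance and one of its nearest minority neighbours. It is plain SMOTE, not the EDKS method, which additionally works in kernel space and uses entropy difference. The function name `smote_sketch`, the value `k=3`, the seed, and the toy points are assumptions for illustration.

```python
import math
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the base point, excluding the base itself.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical minority-class points in a 2-D feature space.
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote_sketch(minority, n_new=5)
```

Because every synthetic point lies between two existing minority points, this step balances class counts but cannot by itself account for how the minority class is distributed, which is the limitation the abstract raises.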
|
319
|
Mortaz E. Imbalance accuracy metric for model selection in multi-class imbalance classification problems. Knowl Based Syst 2020. [DOI: 10.1016/j.knosys.2020.106490] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Indexed: 11/26/2022]
|
320
|
Wang Z, Cao C, Zhu Y. Entropy and Confidence-Based Undersampling Boosting Random Forests for Imbalanced Problems. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2020; 31:5178-5191. [PMID: 31995503 DOI: 10.1109/tnnls.2020.2964585] [Citation(s) in RCA: 21] [Impact Index Per Article: 4.2] [Indexed: 06/10/2023]
Abstract
In this article, we propose a novel entropy and confidence-based undersampling boosting (ECUBoost) framework to solve imbalanced problems. The boosting-based ensemble is combined with a new undersampling method to improve the generalization performance. To avoid losing informative samples during the data preprocessing of the boosting-based ensemble, both confidence and entropy are used in ECUBoost as benchmarks to ensure the validity and structural distribution of the majority samples during the undersampling. Furthermore, different from other iterative dynamic resampling methods, ECUBoost based on confidence can be applied to algorithms without iterations such as decision trees. Meanwhile, random forests are used as base classifiers in ECUBoost. Furthermore, experimental results on both artificial data sets and KEEL data sets prove the effectiveness of the proposed method.
|
321
|
Zhu Y, Yan Y, Zhang Y, Zhang Y. EHSO: Evolutionary Hybrid Sampling in overlapping scenarios for imbalanced learning. Neurocomputing 2020. [DOI: 10.1016/j.neucom.2020.08.060] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Indexed: 10/23/2022]
|
322
|
Abstract
One of the significant challenges in machine learning is the classification of imbalanced data. In many situations, standard classifiers cannot learn how to distinguish minority class examples from the others. Since many real problems are unbalanced, this problem has become very relevant and deeply studied today. This paper presents a new preprocessing method based on Delaunay tessellation and the preprocessing algorithm SMOTE (Synthetic Minority Over-sampling Technique), which we call DTO-SMOTE (Delaunay Tessellation Oversampling SMOTE). DTO-SMOTE constructs a mesh of simplices (in this paper, we use tetrahedrons) for creating synthetic examples. We compare results with five preprocessing algorithms (GEOMETRIC-SMOTE, SVM-SMOTE, SMOTE-BORDERLINE-1, SMOTE-BORDERLINE-2, and SMOTE), eight classification algorithms, and 61 binary-class data sets. For some classifiers, DTO-SMOTE has higher performance than others in terms of Area Under the ROC curve (AUC), Geometric Mean (GEO), and Generalized Index of Balanced Accuracy (IBA).
|
323
|
Sun F, Fang F, Wang R, Wan B, Guo Q, Li H, Wu X. An Impartial Semi-Supervised Learning Strategy for Imbalanced Classification on VHR Images. SENSORS 2020; 20:s20226699. [PMID: 33238513 PMCID: PMC7700671 DOI: 10.3390/s20226699] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 10/27/2020] [Revised: 11/20/2020] [Accepted: 11/21/2020] [Indexed: 11/20/2022]
Abstract
Imbalanced learning is a common problem in remote sensing imagery-based land-use and land-cover classifications. Imbalanced learning can lead to a reduction in classification accuracy and even the omission of the minority class. In this paper, an impartial semi-supervised learning strategy based on extreme gradient boosting (ISS-XGB) is proposed to classify very high resolution (VHR) images with imbalanced data. ISS-XGB solves multi-class classification by using several semi-supervised classifiers. It first employs multi-group unlabeled data to eliminate the imbalance of training samples and then utilizes gradient boosting-based regression to simulate the target classes with positive and unlabeled samples. In this study, experiments were conducted on eight study areas with different imbalanced situations. The results showed that ISS-XGB provided a comparable but more stable performance than most commonly used classification approaches (i.e., random forest (RF), XGB, multilayer perceptron (MLP), and support vector machine (SVM)), positive and unlabeled learning (PU-Learning) methods (PU-BP and PU-SVM), and typical synthetic sample-based imbalanced learning methods. Especially under extremely imbalanced situations, ISS-XGB can provide high accuracy for the minority class without losing overall performance (the average overall accuracy achieves 85.92%). The proposed strategy has great potential in solving the imbalanced classification problems in remote sensing.
Affiliation(s)
- Fei Sun
- School of Geography and Information Engineering, China University of Geosciences, Wuhan 430078, China
- Academy of Computer, Huanggang Normal University, No. 146 Xinggang 2nd Road, Huanggang 438000, China
- Fang Fang
- School of Geography and Information Engineering, China University of Geosciences, Wuhan 430078, China
- National Engineering Research Center for Geographic Information System, China University of Geosciences, Wuhan 430078, China
- Run Wang
- School of Geography and Information Engineering, China University of Geosciences, Wuhan 430078, China
- Key Laboratory of Geological Survey and Evaluation of Ministry of Education, China University of Geosciences, Wuhan 430078, China
- Correspondence: Tel.: +86-027-6788-3728
- Bo Wan
- School of Geography and Information Engineering, China University of Geosciences, Wuhan 430078, China
- National Engineering Research Center for Geographic Information System, China University of Geosciences, Wuhan 430078, China
- Qinghua Guo
- State Key Laboratory of Vegetation and Environmental Change, Institute of Botany, Chinese Academy of Sciences, Beijing 100093, China
- Hong Li
- School of Geography and Information Engineering, China University of Geosciences, Wuhan 430078, China
- National Engineering Research Center for Geographic Information System, China University of Geosciences, Wuhan 430078, China
- Xincai Wu
- School of Geography and Information Engineering, China University of Geosciences, Wuhan 430078, China
- National Engineering Research Center for Geographic Information System, China University of Geosciences, Wuhan 430078, China
|
324
|
Bedi S, Samal A, Ray C, Snow D. Comparative evaluation of machine learning models for groundwater quality assessment. ENVIRONMENTAL MONITORING AND ASSESSMENT 2020; 192:776. [PMID: 33219864 DOI: 10.1007/s10661-020-08695-3] [Citation(s) in RCA: 28] [Impact Index Per Article: 5.6] [Received: 02/27/2020] [Accepted: 10/20/2020] [Indexed: 06/11/2023]
Abstract
Contamination from pesticides and nitrate in groundwater is a significant threat to water quality in general and agriculturally intensive regions in particular. Three widely used machine learning models, namely, artificial neural networks (ANN), support vector machines (SVM), and extreme gradient boosting (XGB), were evaluated for their efficacy in predicting contamination levels using sparse data with non-linear relationships. The predictive ability of the models was assessed using a dataset consisting of 303 wells across 12 Midwestern states in the USA. Multiple hydrogeologic, water quality, and land use features were chosen as the independent variables, and classes were based on measured concentration ranges of nitrate and pesticide. This study evaluates the classification performance of the models for two, three, and four class scenarios and compares them with the corresponding regression models. The study also examines the issue of class imbalance and tests the efficacy of three class imbalance mitigation techniques: oversampling, weighting, and oversampling and weighting, for all the scenarios. The models' performance is reported using multiple metrics, both insensitive to class imbalance (accuracy) and sensitive to class imbalance (F1 score and MCC). Finally, the study assesses the importance of features using game-theoretic Shapley values to rank features consistently and offer model interpretability.
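The contrast this abstract draws between accuracy and the F1 score or Matthews correlation coefficient (MCC) can be shown with a short worked example. The sketch below uses the standard confusion-matrix definitions of the three metrics; the helper names and the toy labels (95 negatives, 5 positives, and a classifier that always predicts the majority class) are assumptions for illustration and do not come from the study.

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, tn, fp, fn) for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)


def f1_score(tp, tn, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp, tn, fp, fn):
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention MCC is reported as 0 when any marginal total is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical imbalanced test set and a classifier that never predicts the minority class.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100
counts = confusion_counts(y_true, y_pred)
acc, f1, m = accuracy(*counts), f1_score(*counts), mcc(*counts)
```

Here accuracy comes out at 0.95 even though no minority instance is found, while F1 and MCC are both 0, which is why the latter two are preferred when classes are imbalanced.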
Affiliation(s)
- Shine Bedi
- Computer Science and Engineering, University of Nebraska, Lincoln, NE, USA
- Ashok Samal
- Computer Science and Engineering, University of Nebraska, Lincoln, NE, USA
- Daniel Snow
- Water Sciences Laboratory, University of Nebraska, Lincoln, NE, USA
|
325
|
Zhang R, Zhang Z, Wang D. RFCL: A new under-sampling method of reducing the degree of imbalance and overlap. Pattern Anal Appl 2020. [DOI: 10.1007/s10044-020-00929-x] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Indexed: 12/01/2022]
|
326
|
Ensembles of feature selectors for dealing with class-imbalanced datasets: A proposal and comparative study. Inf Sci (N Y) 2020. [DOI: 10.1016/j.ins.2020.05.077] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Indexed: 11/24/2022]
|
327
|
Idakwo G, Thangapandian S, Luttrell J, Li Y, Wang N, Zhou Z, Hong H, Yang B, Zhang C, Gong P. Structure-activity relationship-based chemical classification of highly imbalanced Tox21 datasets. J Cheminform 2020; 12:66. [PMID: 33372637 PMCID: PMC7592558 DOI: 10.1186/s13321-020-00468-x] [Citation(s) in RCA: 34] [Impact Index Per Article: 6.8] [Received: 12/13/2019] [Accepted: 10/13/2020] [Indexed: 12/14/2022]
Abstract
The specificity of toxicant-target biomolecule interactions lends to the very imbalanced nature of many toxicity datasets, causing poor performance in Structure–Activity Relationship (SAR)-based chemical classification. Undersampling and oversampling are representative techniques for handling such an imbalance challenge. However, removing inactive chemical compound instances from the majority class using an undersampling technique can result in information loss, whereas increasing active toxicant instances in the minority class by interpolation tends to introduce artificial minority instances that often cross into the majority class space, giving rise to class overlapping and a higher false prediction rate. In this study, in order to improve the prediction accuracy of imbalanced learning, we employed SMOTEENN, a combination of Synthetic Minority Over-sampling Technique (SMOTE) and Edited Nearest Neighbor (ENN) algorithms, to oversample the minority class by creating synthetic samples, followed by cleaning the mislabeled instances. We chose the highly imbalanced Tox21 dataset, which consisted of 12 in vitro bioassays for > 10,000 chemicals that were distributed unevenly between binary classes. With Random Forest (RF) as the base classifier and bagging as the ensemble strategy, we applied four hybrid learning methods, i.e., RF without imbalance handling (RF), RF with Random Undersampling (RUS), RF with SMOTE (SMO), and RF with SMOTEENN (SMN). The performance of the four learning methods was compared using nine evaluation metrics, among which F1 score, Matthews correlation coefficient and Brier score provided a more consistent assessment of the overall performance across the 12 datasets. The Friedman’s aligned ranks test and the subsequent Bergmann-Hommel post hoc test showed that SMN significantly outperformed the other three methods. 
We also found that a strong negative correlation existed between the prediction accuracy and the imbalance ratio (IR), which is defined as the number of inactive compounds divided by the number of active compounds. SMN became less effective when IR exceeded a certain threshold (e.g., > 28). The ability to separate the few active compounds from the vast amounts of inactive ones is of great importance in computational toxicology. This work demonstrates that the performance of SAR-based, imbalanced chemical toxicity classification can be significantly improved through the use of data rebalancing.
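The imbalance ratio defined in this abstract (inactive count divided by active count) can be illustrated with a minimal Python sketch. The function name, the 0/1 label encoding, and the example counts are assumptions made for illustration and are not taken from the paper; the value 28 is the example threshold the abstract mentions.

```python
from collections import Counter

def imbalance_ratio(labels, active=1, inactive=0):
    """Return IR = (# inactive compounds) / (# active compounds).

    The 0/1 label encoding is an assumption for this sketch.
    """
    counts = Counter(labels)
    if counts[active] == 0:
        raise ValueError("no active compounds; IR is undefined")
    return counts[inactive] / counts[active]

# Hypothetical assay: 290 inactive and 10 active compounds.
labels = [0] * 290 + [1] * 10
ir = imbalance_ratio(labels)
print(ir)        # 29.0
print(ir > 28)   # True: above the example threshold quoted in the abstract
```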
Affiliation(s)
- Gabriel Idakwo
- School of Computing Sciences and Computer Engineering, University of Southern Mississippi, Hattiesburg, MS, 39406, USA
| | - Sundar Thangapandian
- Environmental Laboratory, U.S. Army Engineer Research and Development Center, Vicksburg, MS, 39180, USA
| | - Joseph Luttrell
- School of Computing Sciences and Computer Engineering, University of Southern Mississippi, Hattiesburg, MS, 39406, USA
| | - Yan Li
- Bennett Aerospace Inc, Cary, NC, 27518, USA
| | - Nan Wang
- Department of Computer Science, New Jersey City University, Jersey City, NJ, 07305, USA
| | - Zhaoxian Zhou
- School of Computing Sciences and Computer Engineering, University of Southern Mississippi, Hattiesburg, MS, 39406, USA
| | - Huixiao Hong
- Division of Bioinformatics and Biostatistics, National Centre for Toxicological Research, U.S. Food and Drug Administration, Jefferson, AR, 72079, USA
| | - Bei Yang
- School of Information & Engineering, Zhengzhou University, Zhengzhou, 450000, China
| | - Chaoyang Zhang
- School of Computing Sciences and Computer Engineering, University of Southern Mississippi, Hattiesburg, MS, 39406, USA.
| | - Ping Gong
- Environmental Laboratory, U.S. Army Engineer Research and Development Center, Vicksburg, MS, 39180, USA.
| |
|
328
|
RUESVMs: An Ensemble Method to Handle the Class Imbalance Problem in Land Cover Mapping Using Google Earth Engine. REMOTE SENSING 2020. [DOI: 10.3390/rs12213484] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Indexed: 11/16/2022]
Abstract
Timely and accurate Land Cover (LC) information is required for various applications, such as climate change analysis and sustainable development. Although machine learning algorithms are generally successful in LC mapping tasks, class imbalance is a common challenge in this regard. This problem occurs during the training phase and reduces classification accuracy for infrequent and rare LC classes. To address this issue, this study proposes a new method that integrates random under-sampling of majority classes with an ensemble of Support Vector Machines, namely Random Under-sampling Ensemble of Support Vector Machines (RUESVMs). The performance of RUESVMs for LC classification was evaluated in Google Earth Engine (GEE) over two different case studies using Sentinel-2 time-series data and five well-known spectral indices: the Normalized Difference Vegetation Index (NDVI), Green Normalized Difference Vegetation Index (GNDVI), Soil-Adjusted Vegetation Index (SAVI), Normalized Difference Built-up Index (NDBI), and Normalized Difference Water Index (NDWI). The performance of RUESVMs was also compared with the traditional SVM and with combinations of SVM and three benchmark data balancing techniques, namely Random Over-Sampling (ROS), Random Under-Sampling (RUS), and Synthetic Minority Over-sampling Technique (SMOTE). It was observed that the proposed method considerably improved the accuracy of LC classification, especially for the minority classes. After adopting RUESVMs, the overall accuracy of the generated LC map increased by approximately 4.95 percentage points, and the geometric mean of producer’s accuracies increased by almost 3.75 percentage points, in comparison to the most accurate data balancing method (i.e., SVM-SMOTE). Regarding the geometric mean of users’ accuracies, RUESVMs also outperformed the SVM-SMOTE method with an average increase of 6.45 percentage points.
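The accuracy measures named in this abstract can be illustrated with a short Python sketch. It uses the standard remote-sensing definitions: producer's accuracy is the per-class fraction of reference samples mapped correctly, and user's accuracy is the per-class fraction of mapped samples that are correct. The function names and the two-class confusion matrix are hypothetical and are not results from the paper.

```python
import math

def geometric_mean(values):
    """Geometric mean of a list of positive numbers."""
    return math.prod(values) ** (1.0 / len(values))

def producers_users_accuracy(confusion):
    """confusion[i][j] = number of samples of reference class i mapped to class j.

    Assumes every class has at least one reference sample and one mapped sample.
    """
    n = len(confusion)
    producers = [confusion[i][i] / sum(confusion[i]) for i in range(n)]
    users = [
        confusion[j][j] / sum(confusion[i][j] for i in range(n))
        for j in range(n)
    ]
    return producers, users

# Hypothetical map with one majority and one minority land-cover class.
confusion = [[90, 10],
             [5, 5]]
producers, users = producers_users_accuracy(confusion)
overall = sum(confusion[i][i] for i in range(2)) / sum(map(sum, confusion))
print(overall, geometric_mean(producers), geometric_mean(users))
```

In this made-up example the overall accuracy is dominated by the majority class, while the geometric means drop sharply because the minority class is mapped poorly, which is why such measures are reported alongside overall accuracy for imbalanced data.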
|
329
|
Jiang F, Yu X, Zhao H, Gong D, Du J. Ensemble learning based on random super-reduct and resampling. Artif Intell Rev 2020. [DOI: 10.1007/s10462-020-09922-6] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Indexed: 01/15/2023]
|
330
|
Rahman HAA, Wah YB, Huat OS. Predictive Performance of Logistic Regression for Imbalanced Data with Categorical Covariate. PERTANIKA JOURNAL OF SCIENCE AND TECHNOLOGY 2020; 28. [DOI: 10.47836/pjst.28.4.02] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/01/2023]
Abstract
Logistic regression is often used to classify a binary categorical dependent variable using various types of covariates (continuous or categorical). Imbalanced data lead to biased parameter estimates and affect the classification performance of the logistic regression model. Imbalanced data occur when the number of cases in one category of the binary dependent variable is much smaller than in the other category. This simulation study investigates the effect of imbalanced data, measured by the imbalance ratio, on the parameter estimate of binary logistic regression with a categorical covariate. Datasets were simulated with controlled percentages of imbalance ratio (IR), from 1% to 50%, and for various sample sizes. The simulated datasets were then modeled using binary logistic regression. The bias in the estimates was measured using Mean Square Error (MSE). The simulation results provided evidence that the effect of imbalance ratio on the parameter estimate of the covariate decreased as sample size increased. The bias of the estimates depends on sample size: for sample sizes of 100, 500, 1000-2000, and 2500-3500, the estimates were biased for IR below 30%, 10%, 5%, and 2%, respectively. Results also showed that parameter estimates were biased at IR 1% for all sample sizes. An application using a real dataset supported the simulation results.
|
331
|
Yang Z, Zhu Y, Liu T, Zhao S, Wang Y, Tao D. Output Layer Multiplication for Class Imbalance Problem in Convolutional Neural Networks. Neural Process Lett 2020. [DOI: 10.1007/s11063-020-10366-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/23/2022]
|
332
|
An effective method using clustering-based adaptive decomposition and editing-based diversified oversamping for multi-class imbalanced datasets. APPL INTELL 2020. [DOI: 10.1007/s10489-020-01883-1] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Indexed: 02/05/2023]
|
333
|
Prediction of Antidepressant Treatment Response and Remission Using an Ensemble Machine Learning Framework. Pharmaceuticals (Basel) 2020; 13:ph13100305. [PMID: 33065962 PMCID: PMC7599952 DOI: 10.3390/ph13100305] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.8] [Received: 09/12/2020] [Revised: 10/08/2020] [Accepted: 10/12/2020] [Indexed: 12/19/2022]
Abstract
In the wake of recent advances in machine learning research, the study of pharmacogenomics using predictive algorithms serves as a new paradigmatic application. In this work, our goal was to explore an ensemble machine learning approach which aims to predict probable antidepressant treatment response and remission in major depressive disorder (MDD). To discover the status of antidepressant treatments, we established an ensemble predictive model with a feature selection algorithm resulting from the analysis of genetic variants and clinical variables of 421 patients who were treated with selective serotonin reuptake inhibitors. We also compared our ensemble machine learning framework with other state-of-the-art models including multi-layer feedforward neural networks (MFNNs), logistic regression, support vector machine, C4.5 decision tree, naïve Bayes, and random forests. Our data revealed that the ensemble predictive algorithm with feature selection (using fewer biomarkers) performed comparably to other predictive algorithms (such as MFNNs and logistic regression) to derive the perplexing relationship between biomarkers and the status of antidepressant treatments. Our study demonstrates that the ensemble machine learning framework may present a useful technique to create bioinformatics tools for discriminating non-responders from responders prior to antidepressant treatments.
|
334
|
A Hybrid Data Balancing Method for Classification of Imbalanced Training Data within Google Earth Engine: Case Studies from Mountainous Regions. REMOTE SENSING 2020. [DOI: 10.3390/rs12203301] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Indexed: 11/16/2022]
Abstract
The distribution of Land Cover (LC) classes is mostly imbalanced in mountainous areas, with a few majority LC classes dominating the minority classes. Although standard Machine Learning (ML) classifiers can achieve high accuracies for majority classes, they largely fail to provide reasonable accuracies for minority classes. This is mainly due to the class imbalance problem. In this study, a hybrid data balancing method, called the Partial Random Over-Sampling and Random Under-Sampling (PROSRUS), was proposed to resolve the class imbalance issue. Unlike most data balancing techniques, which seek to fully balance datasets, PROSRUS uses a partial balancing approach with hundreds of fractions for majority and minority classes to balance datasets. For this, time-series of Landsat-8 and SRTM topographic data along with various spectral indices and topographic data were used over three mountainous sites within the Google Earth Engine (GEE) cloud platform. It was observed that PROSRUS had better performance than several other balancing methods and increased the accuracy of minority classes without a reduction in overall classification accuracy. Furthermore, adopting complementary information, particularly topographic data, considerably increased the accuracy of minority classes in mountainous areas. Finally, the results obtained from PROSRUS indicated that every imbalanced dataset requires its own specific fraction(s) for addressing the class imbalance problem, because different datasets have different characteristics.
|
335
|
Kim KH, Sohn SY. Hybrid neural network with cost-sensitive support vector machine for class-imbalanced multimodal data. Neural Netw 2020; 130:176-184. [DOI: 10.1016/j.neunet.2020.06.026] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 03/08/2020] [Revised: 06/13/2020] [Accepted: 06/30/2020] [Indexed: 10/23/2022]
|
336
|
Rahim T, Usman MA, Shin SY. A survey on contemporary computer-aided tumor, polyp, and ulcer detection methods in wireless capsule endoscopy imaging. Comput Med Imaging Graph 2020; 85:101767. [DOI: 10.1016/j.compmedimag.2020.101767] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Received: 12/09/2019] [Revised: 07/13/2020] [Accepted: 07/18/2020] [Indexed: 12/12/2022]
|
337
|
Dey L, Chakraborty S, Mukhopadhyay A. Machine learning techniques for sequence-based prediction of viral-host interactions between SARS-CoV-2 and human proteins. Biomed J 2020; 43:438-450. [PMID: 33036956 PMCID: PMC7470713 DOI: 10.1016/j.bj.2020.08.003] [Citation(s) in RCA: 50] [Impact Index Per Article: 10.0] [Received: 05/13/2020] [Revised: 07/22/2020] [Accepted: 08/05/2020] [Indexed: 12/23/2022]
Abstract
BACKGROUND: COVID-19 (Coronavirus Disease-19), a disease caused by the SARS-CoV-2 virus, was declared a pandemic by the World Health Organization on March 11, 2020. Over 15 million people have already been affected worldwide by COVID-19, resulting in more than 0.6 million deaths. Protein-protein interactions (PPIs) play a key role in the cellular process of SARS-CoV-2 virus infection in the human body. A recent study has reported some SARS-CoV-2 proteins that interact with several human proteins, while many potential interactions remain to be identified. METHOD: In this article, various machine learning models are built to predict the PPIs between the virus and human proteins that are further validated using biological experiments. The classification models are prepared based on different sequence-based features of human proteins, such as amino acid composition, pseudo amino acid composition, and conjoint triad. RESULT: We have built an ensemble voting classifier using SVMRadial, SVMPolynomial, and Random Forest techniques that gives greater accuracy, precision, specificity, recall, and F1 score than all other models used in the work. A total of 1326 potential human target proteins of SARS-CoV-2 have been predicted by the proposed ensemble model and validated using gene ontology and KEGG pathway enrichment analysis. Several repurposable drugs targeting the predicted interactions are also reported. CONCLUSION: This study may encourage the identification of potential targets for more effective anti-COVID drug discovery.
Affiliation(s)
- Lopamudra Dey
- Department of Computer Science & Engineering, Heritage Institute of Technology, Kolkata, India; Department of Information Technology, Techno Main, Saltlake, Kolkata, India; Department of Computer Science & Engineering, University of Kalyani, Kalyani, India
| | - Sanjay Chakraborty
- Department of Computer Science & Engineering, Heritage Institute of Technology, Kolkata, India; Department of Information Technology, Techno Main, Saltlake, Kolkata, India; Department of Computer Science & Engineering, University of Kalyani, Kalyani, India
| | - Anirban Mukhopadhyay
- Department of Computer Science & Engineering, Heritage Institute of Technology, Kolkata, India; Department of Information Technology, Techno Main, Saltlake, Kolkata, India; Department of Computer Science & Engineering, University of Kalyani, Kalyani, India.
| |
|
338
|
Gradient descent evolved imbalanced data gravitation classification with an application on Internet video traffic identification. Inf Sci (N Y) 2020. [DOI: 10.1016/j.ins.2020.05.141] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Indexed: 11/18/2022]
|
339
|
Rodríguez A, Mendoza D, Ascuntar J, Jaimes F. Supervised classification techniques for prediction of mortality in adult patients with sepsis. Am J Emerg Med 2020; 45:392-397. [PMID: 33036848 DOI: 10.1016/j.ajem.2020.09.013] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Received: 06/01/2020] [Revised: 08/14/2020] [Accepted: 09/06/2020] [Indexed: 01/31/2023]
Abstract
BACKGROUND: Sepsis mortality is still unacceptably high, and an appropriate prognostic tool may increase the accuracy of clinical decisions. OBJECTIVE: To evaluate several supervised techniques of Artificial Intelligence (AI) for classification and prediction of mortality in adult patients hospitalized by emergency services with a sepsis diagnosis. METHODS: Secondary data analysis of a prospective cohort in three university hospitals in Medellín, Colombia. We included patients >18 years hospitalized for suspected or confirmed infection and any organ dysfunction according to the Sepsis-related Organ Failure Assessment. The outcome variable was hospital mortality, and the prediction variables were grouped into those related to the initial clinical treatment and care or to the direct measurement of physiological disturbances. Four supervised classification techniques were analyzed: the C4.5 Decision Tree, Random Forest, artificial neural networks (ANN) and support vector machine (SVM) models. Their performance was evaluated by the concordance between the observed and predicted outcomes and by the discrimination according to AUC-ROC. RESULTS: A total of 2510 patients were included, with a median age of 62 years (IQR = 46-74) and an overall hospital mortality rate of 11.5% (n = 289). The best discrimination was provided by the SVM and ANN using physiological variables, with an AUC-ROC of 0.69 (95%CI: 0.62; 0.76) and an AUC-ROC of 0.69 (95%CI: 0.61; 0.76), respectively. CONCLUSION: Deep learning and AI are increasingly used as support tools in clinical medicine. Their performance in a syndrome as complex and heterogeneous as sepsis may be a new horizon in clinical research. SVM and ANN seem promising for improving sepsis classification and prognosis.
Affiliation(s)
| | - Deibie Mendoza
- School of Medicine, Universidad de Antioquia, Medellín, Colombia
| | - Johana Ascuntar
- GRAEPIC - Clinical Epidemiology Academic Group (Grupo Académico de Epidemiología Clínica), Universidad de Antioquia, Medellín, Colombia
| | - Fabián Jaimes
- GRAEPIC - Clinical Epidemiology Academic Group (Grupo Académico de Epidemiología Clínica), Universidad de Antioquia, Medellín, Colombia; Department of Internal Medicine; Universidad de Antioquia; Medellín, Colombia; Hospital San Vicente Fundación, Medellín, Colombia.
| |
|
340
|
Adaptive Decision Threshold-Based Extreme Learning Machine for Classifying Imbalanced Multi-label Data. Neural Process Lett 2020. [DOI: 10.1007/s11063-020-10343-3] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Indexed: 11/25/2022]
|
341
|
The Prediction of Oceanic Mesoscale Eddy Properties and Propagation Trajectories Based on Machine Learning. WATER 2020. [DOI: 10.3390/w12092521] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Indexed: 11/17/2022]
Abstract
Mesoscale eddies play an important role in ocean circulation, material energy exchange and variation of ocean environments. Machine learning methods can efficiently process massive amounts of data and automatically learn the implicit features, thus providing a new approach to eddy prediction research. Using the mesoscale eddy trajectory data derived from multimission satellite altimetry, we propose relevant machine learning models based on long short-term memory network (LSTM) and the extra trees (ET) algorithm for the prediction of eddy properties and propagation trajectories. Characteristic factors, including attribute features and past eddy displacements, were exploited to construct prediction models with high effectiveness and few predictors. To study their effects at different forecasting times, we separately trained the models by rebuilding the corresponding relationship between eddy samples and labels. In addition, the variation characteristics and the predictability of eddy properties and propagation trajectories were discussed from the prediction results. Cross-validation shows that at different prediction times, our method is superior to previous methods in terms of the mean absolute error (MAE) of eddy properties and the root mean square error (RMSE) of propagation. The stable variation in eddy properties makes the prediction more dependent on the historical time series than that of a propagation forecast. The short-term propagation prediction of eddies contained more noise than long-term predictions, and the long-term predictions revealed a more significant trend. Finally, we discuss the effect of eddy properties on the prediction ability of the eddy propagation trajectory.
|
342
|
Ni Q, Fan Z, Zhang L, Nugent CD, Cleland I, Zhang Y, Zhou N. Leveraging Wearable Sensors for Human Daily Activity Recognition with Stacked Denoising Autoencoders. SENSORS 2020; 20:s20185114. [PMID: 32911780 PMCID: PMC7570862 DOI: 10.3390/s20185114] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Received: 08/03/2020] [Revised: 09/05/2020] [Accepted: 09/06/2020] [Indexed: 11/16/2022]
Abstract
Activity recognition has received considerable attention in many research fields, such as industry and healthcare. However, most studies in the current literature focus on static and dynamic activities, while transitional activities, such as stand-to-sit and sit-to-stand, are more difficult to recognize than either and may be important in real applications. Thus, a novel framework is proposed in this paper to recognize static, dynamic, and transitional activities using stacked denoising autoencoders (SDAE), a deep learning model that extracts features automatically rather than relying on the manual features used by conventional machine learning methods. Moreover, a resampling technique (random oversampling) is used to mitigate the problem of unbalanced samples caused by the relatively short duration of transitional activities. The experimental protocol was designed to collect twelve daily activities (of three types) using wearable sensors from 10 adults in the smart lab of Ulster University. The experimental results show strong performance on transitional activity recognition and an overall accuracy of 94.88% across the three types of activities. Comparisons with other methods and results on three other public datasets verify the feasibility and superiority of our framework. This paper also explores the effect of multiple sensors (accelerometer and gyroscope) to determine the optimal combination for activity recognition.
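The random oversampling step mentioned in this abstract can be sketched in a few lines of Python. The abstract does not give the resampling settings, so balancing every class up to the size of the largest class is an assumption of this sketch, as are the function name, the class labels, and the toy data.

```python
import random
from collections import Counter

def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen samples of smaller classes until every
    class has as many samples as the largest one (assumed target)."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    target = max(len(xs) for xs in by_class.values())
    out_x, out_y = [], []
    for y, xs in by_class.items():
        # Draw the missing samples with replacement from the class itself.
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        for x in xs + extra:
            out_x.append(x)
            out_y.append(y)
    return out_x, out_y

# Hypothetical windows: 9 from a static activity, 3 from a short transition.
samples = list(range(12))
labels = ["static"] * 9 + ["transition"] * 3
new_x, new_y = random_oversample(samples, labels)
print(Counter(new_y))  # both classes now have 9 samples
```

Oversampling of this kind should be applied to the training split only, so that duplicated samples do not leak into the evaluation data.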
Affiliation(s)
- Qin Ni
- College of Information, Mechanical and Electrical Engineering, Shanghai Normal University, Shanghai 201418, China; (Q.N.); (Z.F.); (Y.Z.); (N.Z.)
| | - Zhuo Fan
- College of Information, Mechanical and Electrical Engineering, Shanghai Normal University, Shanghai 201418, China; (Q.N.); (Z.F.); (Y.Z.); (N.Z.)
| | - Lei Zhang
- College of Information Science and Technology, Donghua University, Shanghai 201620, China
- Correspondence:
| | - Chris D. Nugent
- School of Computing and Mathematics, University of Ulster, Belfast BT370QB, UK; (C.D.N.); (I.C.)
| | - Ian Cleland
- School of Computing and Mathematics, University of Ulster, Belfast BT370QB, UK; (C.D.N.); (I.C.)
| | - Yuping Zhang
- College of Information, Mechanical and Electrical Engineering, Shanghai Normal University, Shanghai 201418, China; (Q.N.); (Z.F.); (Y.Z.); (N.Z.)
| | - Nan Zhou
- College of Information, Mechanical and Electrical Engineering, Shanghai Normal University, Shanghai 201418, China; (Q.N.); (Z.F.); (Y.Z.); (N.Z.)
| |
|
343
|
Lu Y, Cheung YM, Tang YY. Bayes Imbalance Impact Index: A Measure of Class Imbalanced Data Set for Classification Problem. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2020; 31:3525-3539. [PMID: 31689217 DOI: 10.1109/tnnls.2019.2944962] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Indexed: 06/10/2023]
Abstract
Recent studies of imbalanced data classification have shown that the imbalance ratio (IR) is not the only cause of performance loss in a classifier, as other data factors, such as small disjuncts, noise, and overlapping, can also make the problem difficult. The relationship between the IR and other data factors has been demonstrated, but to the best of our knowledge, there is no measurement of the extent to which class imbalance influences the classification performance of imbalanced data. In addition, it is also unknown which data factor serves as the main barrier for classification in a data set. In this article, we focus on the Bayes optimal classifier and examine the influence of class imbalance from a theoretical perspective. We propose an instance measure called the Individual Bayes Imbalance Impact Index (IBI3) and a data measure called the Bayes Imbalance Impact Index (BI3). IBI3 and BI3 reflect the extent of influence using only the imbalance factor, in terms of each minority class sample and the whole data set, respectively. Therefore, IBI3 can be used as an instance complexity measure of imbalance and BI3 as a criterion to demonstrate the degree to which imbalance deteriorates the classification of a data set. We can, therefore, use BI3 to assess whether it is worth using imbalance recovery methods, such as sampling or cost-sensitive methods, to recover the performance loss of a classifier. The experiments show that IBI3 is highly consistent with the increase of the prediction score obtained by the imbalance recovery methods and that BI3 is highly consistent with the improvement in the F1 score obtained by the imbalance recovery methods on both synthetic and real benchmark data sets.
|
344
|
Classification of Dermoscopy Skin Lesion Color-Images Using Fractal-Deep Learning Features. APPLIED SCIENCES-BASEL 2020. [DOI: 10.3390/app10175954] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Indexed: 12/16/2022]
Abstract
The detection of skin diseases is becoming a priority worldwide due to the increasing incidence of skin cancer. Computer-aided diagnosis is a useful tool to help dermatologists detect these kinds of illnesses. This work proposes a computer-aided diagnosis based on 1D fractal signatures of texture-based features combined with deep-learning features obtained by transfer learning based on DenseNet-201. The proposal works with three 1D fractal signatures built per color image. The energy, variance, and entropy of the fractal signatures are combined with 100 features extracted from DenseNet-201 to construct the feature vector. Because the classes in skin lesion image datasets are commonly imbalanced, we use an ensemble of classifiers: K-nearest neighbors and two types of support vector machines. The computer-aided diagnosis output was determined by a linear plurality vote. In this work, we obtained an average accuracy of 97.35%, an average precision of 91.61%, an average sensitivity of 66.45%, and an average specificity of 97.85% in the eight-class classification on the International Skin Imaging Collaboration (ISIC) archive-2019.
|
345
|
Ding R, Wang R, Ding Y, Yin W, Liu Y, Li J, Liu J. Designing AI‐Aided Analysis and Prediction Models for Nonprecious Metal Electrocatalyst‐Based Proton‐Exchange Membrane Fuel Cells. Angew Chem Int Ed Engl 2020. [DOI: 10.1002/ange.202006928] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Indexed: 01/08/2023]
Affiliation(s)
- Rui Ding
- National Laboratory of Solid State Microstructures College of Engineering and Applied Sciences Collaborative Innovation Center of Advanced Microstructures Nanjing University 22 Hankou Road Nanjing 210093 China
| | - Ran Wang
- National Laboratory of Solid State Microstructures College of Engineering and Applied Sciences Collaborative Innovation Center of Advanced Microstructures Nanjing University 22 Hankou Road Nanjing 210093 China
| | - Yiqin Ding
- National Laboratory of Solid State Microstructures College of Engineering and Applied Sciences Collaborative Innovation Center of Advanced Microstructures Nanjing University 22 Hankou Road Nanjing 210093 China
| | - Wenjuan Yin
- National Laboratory of Solid State Microstructures College of Engineering and Applied Sciences Collaborative Innovation Center of Advanced Microstructures Nanjing University 22 Hankou Road Nanjing 210093 China
| | - Yide Liu
- National Laboratory of Solid State Microstructures College of Engineering and Applied Sciences Collaborative Innovation Center of Advanced Microstructures Nanjing University 22 Hankou Road Nanjing 210093 China
| | - Jia Li
- National Laboratory of Solid State Microstructures College of Engineering and Applied Sciences Collaborative Innovation Center of Advanced Microstructures Nanjing University 22 Hankou Road Nanjing 210093 China
| | - Jianguo Liu
- National Laboratory of Solid State Microstructures College of Engineering and Applied Sciences Collaborative Innovation Center of Advanced Microstructures Nanjing University 22 Hankou Road Nanjing 210093 China
| |
|
346
|
Ding R, Wang R, Ding Y, Yin W, Liu Y, Li J, Liu J. Designing AI‐Aided Analysis and Prediction Models for Nonprecious Metal Electrocatalyst‐Based Proton‐Exchange Membrane Fuel Cells. Angew Chem Int Ed Engl 2020; 59:19175-19183. [DOI: 10.1002/anie.202006928] [Citation(s) in RCA: 25] [Impact Index Per Article: 5.0] [Received: 05/13/2020] [Revised: 07/02/2020] [Indexed: 11/10/2022]
Affiliation(s)
- Rui Ding
- National Laboratory of Solid State Microstructures College of Engineering and Applied Sciences Collaborative Innovation Center of Advanced Microstructures Nanjing University 22 Hankou Road Nanjing 210093 China
| | - Ran Wang
- National Laboratory of Solid State Microstructures College of Engineering and Applied Sciences Collaborative Innovation Center of Advanced Microstructures Nanjing University 22 Hankou Road Nanjing 210093 China
| | - Yiqin Ding
- National Laboratory of Solid State Microstructures College of Engineering and Applied Sciences Collaborative Innovation Center of Advanced Microstructures Nanjing University 22 Hankou Road Nanjing 210093 China
| | - Wenjuan Yin
- National Laboratory of Solid State Microstructures College of Engineering and Applied Sciences Collaborative Innovation Center of Advanced Microstructures Nanjing University 22 Hankou Road Nanjing 210093 China
| | - Yide Liu
- National Laboratory of Solid State Microstructures College of Engineering and Applied Sciences Collaborative Innovation Center of Advanced Microstructures Nanjing University 22 Hankou Road Nanjing 210093 China
| | - Jia Li
- National Laboratory of Solid State Microstructures College of Engineering and Applied Sciences Collaborative Innovation Center of Advanced Microstructures Nanjing University 22 Hankou Road Nanjing 210093 China
| | - Jianguo Liu
- National Laboratory of Solid State Microstructures College of Engineering and Applied Sciences Collaborative Innovation Center of Advanced Microstructures Nanjing University 22 Hankou Road Nanjing 210093 China
| |
|
347
|
A Generalized Flow for B2B Sales Predictive Modeling: An Azure Machine-Learning Approach. FORECASTING 2020. [DOI: 10.3390/forecast2030015] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Indexed: 11/16/2022]
Abstract
Predicting the outcome of sales opportunities is a core part of successful business management. Conventionally, undertaking this prediction has relied mostly on subjective human evaluations in the process of sales decision-making. In this paper, we addressed the problem of forecasting the outcome of Business to Business (B2B) sales by proposing a thorough data-driven Machine-Learning (ML) workflow on a cloud-based computing platform: Microsoft Azure Machine-Learning Service (Azure ML). This workflow consists of two pipelines: (1) An ML pipeline to train probabilistic predictive models on the historical sales opportunities data. In this pipeline, data is enriched with an extensive feature enhancement step and then used to train an ensemble of ML classification models in parallel. (2) A prediction pipeline to use the trained ML model and infer the likelihood of winning new sales opportunities along with calculating optimal decision boundaries. The effectiveness of the proposed workflow was evaluated on a real sales dataset of a major global B2B consulting firm. Our results implied that decision-making based on the ML predictions is more accurate and brings a higher monetary value.
|
348
|
Seow D, Graham I, Massey A. Prediction models for musculoskeletal injuries in professional sporting activities: A systematic review. TRANSLATIONAL SPORTS MEDICINE 2020. [DOI: 10.1002/tsm2.181] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Indexed: 11/08/2022]
|
349
|
Predicting Acute Kidney Injury: A Machine Learning Approach Using Electronic Health Records. INFORMATION 2020. [DOI: 10.3390/info11080386] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Indexed: 01/26/2023]
Abstract
Acute kidney injury (AKI) is a common complication in hospitalized patients and can result in increased hospital stay, health-related costs, mortality and morbidity. A number of recent studies have shown that AKI is predictable and avoidable if early risk factors can be identified by analyzing Electronic Health Records (EHRs). In this study, we employ machine learning techniques to identify older patients who have a risk of readmission with AKI to the hospital or emergency department within 90 days after discharge. The study includes the records of one million patients who visited a hospital or emergency department in Ontario between 2014 and 2016. The predictor variables include patient demographics, comorbid conditions, medications and diagnosis codes. We developed 31 prediction models based on different combinations of two sampling techniques, three ensemble methods, and eight classifiers. These models were evaluated through 10-fold cross-validation and compared based on the AUROC metric. The performances of these models were consistent, and the AUROC ranged between 0.61 and 0.88 for predicting AKI among the 31 prediction models. In general, the performances of ensemble-based methods were higher than that of the cost-sensitive logistic regression. We also validated the features that are most relevant in predicting AKI with a healthcare expert to improve the performance and reliability of the models. This study predicts the risk of AKI for a patient after discharge, which gives healthcare providers enough time to intervene before the onset of AKI.
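The evaluation protocol named in the abstract, 10-fold cross-validation scored by AUROC, can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' code: the helper names `auroc` and `k_fold_indices` are hypothetical, the folds are contiguous rather than stratified, and the scores and labels are toy values.

```python
def auroc(scores, labels):
    """AUROC as the probability that a positive outranks a negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def k_fold_indices(n, k=10):
    """Yield (train, test) index lists for k contiguous folds over n samples."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size


# Toy check: three of the four positive/negative pairs are ranked correctly.
example_auc = auroc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])  # 0.75
folds = list(k_fold_indices(25, k=10))
```

With a rare outcome such as AKI readmission, folds would normally be stratified so that every test fold contains positives; that refinement is omitted here for brevity.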
|
350
|
Asadi H, Zhou G, Lee JJ, Aggarwal V, Yu D. A computer vision approach for classifying isometric grip force exertion levels. ERGONOMICS 2020; 63:1010-1026. [PMID: 32202214 DOI: 10.1080/00140139.2020.1745898] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Received: 07/19/2019] [Accepted: 02/13/2020] [Indexed: 06/10/2023]
Abstract
Exposure to high and/or repetitive force exertions can lead to musculoskeletal injuries. However, measuring worker force exertion levels is challenging, and existing techniques can be intrusive, interfere with the human-machine interface, and/or be limited by subjectivity. In this work, computer vision techniques are developed to detect isometric grip exertions using facial videos and a wearable photoplethysmogram. Eighteen participants (19-24 years) performed isometric grip exertions at varying levels of maximum voluntary contraction. Novel features that predict forces were identified and extracted from video and photoplethysmogram data. Two experiments with two (High/Low) and three (0%MVC/50%MVC/100%MVC) labels were performed to classify exertions. The Deep Neural Network classifier performed the best with 96% and 87% accuracy for two- and three-level classifications, respectively. This approach was robust to leaving subjects out during cross-validation (86% accuracy when three subjects were left out) and robust to noise (i.e. 89% accuracy for correctly classifying talking activities as low force exertions). Practitioner summary: Forceful exertions are contributing factors to musculoskeletal injuries, yet they remain difficult to measure in work environments. This paper presents an approach to estimate force exertion levels, which is less distracting to workers, easier to implement by practitioners, and could potentially be used in a wide variety of workplaces. Abbreviations: MSD: musculoskeletal disorders; ACGIH: American Conference of Governmental Industrial Hygienists; HAL: hand activity level; MVC: maximum voluntary contraction; PPG: photoplethysmogram; DNN: deep neural networks; LOSO: leave-one-subject-out; ROC: receiver operating characteristic; AUC: area under curve.
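The subject-wise validation mentioned in the abstract (accuracy when three subjects were left out) relies on splits in which no participant appears in both training and test data. A minimal sketch of such a split is given below. The helper name `leave_subjects_out` is hypothetical, and enumerating every three-subject combination is an assumption; the abstract does not say how many splits the authors evaluated.

```python
from itertools import combinations


def leave_subjects_out(subject_ids, n_out=3):
    """Yield (train_idx, test_idx) pairs, holding out n_out whole subjects.

    Every sample from a held-out subject goes to the test set, so no
    subject contributes to both sides of a split.
    """
    subjects = sorted(set(subject_ids))
    for held_out in combinations(subjects, n_out):
        held = set(held_out)
        test = [i for i, s in enumerate(subject_ids) if s in held]
        train = [i for i, s in enumerate(subject_ids) if s not in held]
        yield train, test


# Toy data: 18 subjects (as in the study) with two samples each.
subject_ids = [s for s in range(18) for _ in range(2)]
splits = list(leave_subjects_out(subject_ids, n_out=3))
```

Setting `n_out=1` gives the leave-one-subject-out (LOSO) scheme listed among the abbreviations.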
Affiliation(s)
- Hamed Asadi
- School of Industrial Engineering, Purdue University, West Lafayette, IN, USA
| | - Guoyang Zhou
- School of Industrial Engineering, Purdue University, West Lafayette, IN, USA
| | - Jae Joong Lee
- Department of Computer Science, Purdue University, West Lafayette, IN, USA
| | - Vaneet Aggarwal
- School of Industrial Engineering, Purdue University, West Lafayette, IN, USA
- School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN, USA
| | - Denny Yu
- School of Industrial Engineering, Purdue University, West Lafayette, IN, USA
| |
|