201
|
Upadhyay K, Kaur P, Verma DK. Evaluating the Performance of Data Level Methods Using KEEL Tool to Address Class Imbalance Problem. ARABIAN JOURNAL FOR SCIENCE AND ENGINEERING 2021. [DOI: 10.1007/s13369-021-06377-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/25/2022]
|
202
|
Smart U, Ingrasci MJ, Sarker GC, Lalremsanga H, Murphy RW, Ota H, Tu MC, Shouche Y, Orlov NL, Smith EN. A comprehensive appraisal of evolutionary diversity in venomous Asian coralsnakes of the genus Sinomicrurus (Serpentes: Elapidae) using Bayesian coalescent inference and supervised machine learning. J ZOOL SYST EVOL RES 2021. [DOI: 10.1111/jzs.12547] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/29/2022]
Affiliation(s)
- Utpal Smart
  - Center for Human Identification, University of North Texas Health Science Center, Fort Worth, Texas, USA
  - Department of Biology, The Amphibian and Reptile Diversity Research Center, The University of Texas at Arlington, Arlington, Texas, USA
- Matthew J. Ingrasci
  - Department of Biology, The Amphibian and Reptile Diversity Research Center, The University of Texas at Arlington, Arlington, Texas, USA
- Goutam C. Sarker
  - Department of Biology, The Amphibian and Reptile Diversity Research Center, The University of Texas at Arlington, Arlington, Texas, USA
- Robert W. Murphy
  - Royal Ontario Museum, Centre for Biodiversity and Conservation Biology, Toronto, ON, Canada
  - State Key Laboratory of Genetic Resources and Evolution, Kunming Institute of Zoology, Kunming, China
- Hidetoshi Ota
  - Institute of Natural and Environmental Sciences, Museum of Nature and Human Activities, University of Hyogo, Sanda, Japan
- Ming Chung Tu
  - Department of Life Sciences, National Taiwan Normal University, Taipei City, Taiwan
- Yogesh Shouche
  - National Centre for Microbial Resource, National Center for Cell Science, Pune, India
- Nikolai L. Orlov
  - Zoological Institute, Russian Academy of Sciences, Saint Petersburg, Russia
- Eric N. Smith
  - Department of Biology, The Amphibian and Reptile Diversity Research Center, The University of Texas at Arlington, Arlington, Texas, USA
|
203
|
Han S, Williamson BD, Fong Y. Improving random forest predictions in small datasets from two-phase sampling designs. BMC Med Inform Decis Mak 2021; 21:322. [PMID: 34809631 PMCID: PMC8607560 DOI: 10.1186/s12911-021-01688-3] [Citation(s) in RCA: 22] [Impact Index Per Article: 5.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/14/2021] [Accepted: 11/10/2021] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: While random forests are one of the most successful machine learning methods, it is necessary to optimize their performance for use with datasets resulting from a two-phase sampling design with a small number of cases, a common situation in biomedical studies, which often have rare outcomes and covariates whose measurement is resource-intensive. METHODS: Using an immunologic marker dataset from a phase III HIV vaccine efficacy trial, we seek to optimize random forest prediction performance using combinations of variable screening, class balancing, weighting, and hyperparameter tuning. RESULTS: Our experiments show that while class balancing helps improve random forest prediction performance when variable screening is not applied, class balancing has a negative impact on performance in the presence of variable screening. The impact of the weighting similarly depends on whether variable screening is applied. Hyperparameter tuning is ineffective in situations with small sample sizes. We further show that random forests underperform generalized linear models for some subsets of markers, that prediction performance on this dataset can be improved by stacking random forests and generalized linear models trained on different subsets of predictors, and that the extent of improvement depends critically on the dissimilarities between candidate learner predictions. CONCLUSION: In small datasets from two-phase sampling designs, variable screening and inverse sampling probability weighting are important for achieving good prediction performance of random forests. In addition, stacking random forests and simple linear models can offer improvements over random forests.
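The inverse sampling probability weighting mentioned in the conclusion can be illustrated with a minimal sketch. It is not the authors' code: the function names, the toy labels, and the sampling probabilities below are all hypothetical, and the sketch only shows how weights of 1/p undo the over-representation of cases in a phase-two sample.

```python
def inverse_probability_weights(sampling_probs):
    """Return the weight 1/p for each phase-two sampling probability p."""
    if any(not 0.0 < p <= 1.0 for p in sampling_probs):
        raise ValueError("sampling probabilities must lie in (0, 1]")
    return [1.0 / p for p in sampling_probs]


def weighted_prevalence(labels, weights):
    """Weighted share of positive labels (label 1) in the sample."""
    total = sum(weights)
    return sum(w for y, w in zip(labels, weights) if y == 1) / total


# Hypothetical phase-two sample: every case is kept (p = 1.0),
# while controls are subsampled with probability 0.1.
labels = [1, 1, 0, 0, 0, 0]
probs = [1.0, 1.0, 0.1, 0.1, 0.1, 0.1]

weights = inverse_probability_weights(probs)
naive = sum(labels) / len(labels)                # prevalence in the subsample
weighted = weighted_prevalence(labels, weights)  # prevalence after reweighting

print(f"naive prevalence:    {naive:.3f}")
print(f"weighted prevalence: {weighted:.3f}")
```

In this toy sample the unweighted prevalence is inflated because controls were subsampled; the weighted estimate treats each retained control as standing in for ten. The same weights could be passed to a learner that accepts per-sample weights.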
Affiliation(s)
- Sunwoo Han
  - Vaccine and Infectious Disease Division, Fred Hutchinson Cancer Research Center, Seattle, USA
- Brian D. Williamson
  - Vaccine and Infectious Disease Division, Fred Hutchinson Cancer Research Center, Seattle, USA
- Youyi Fong
  - Vaccine and Infectious Disease Division, Fred Hutchinson Cancer Research Center, Seattle, USA
|
204
|
Guo J, Lee JHW. Development of Predictive Models for "Very Poor" Beach Water Quality Gradings Using Class-Imbalance Learning. ENVIRONMENTAL SCIENCE & TECHNOLOGY 2021; 55:14990-15000. [PMID: 34634206 DOI: 10.1021/acs.est.1c03350] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/13/2023]
Abstract
Statistical water quality forecast models are useful tools to assist with beach management. In particular, multiple linear regression (MLR) models have been successfully developed for prediction of fecal indicator bacteria concentrations for beaches in river, lake, and marine environments. Nevertheless, an unresolved challenge is the reliable prediction of infrequent events of high bacterial concentrations to inform beach closure decisions that protect public health. The amount of field data available for the infrequent events is typically an order of magnitude less than that for days when the water quality criterion is met, so MLR models often perform poorly in predicting bacterial concentrations on days when the beaches should be closed. For beach management in Hong Kong, MLR models have been developed to predict beach water quality indices in terms of four gradings (BWQI-1 to 4) based on Escherichia coli (E. coli) concentrations. In this study, we propose an artificial intelligence (AI)-based binary classification (EasyEnsemble) model using class-imbalance learning to predict "very poor" occasions (BWQI-4), when E. coli concentration exceeds 610 counts/100 mL. Models are developed for three marine beaches with different hydrographic and pollution characteristics using a 30-year data set spanning three periods with different water quality status. The model-data comparison over a wide range of conditions shows that the proposed method results in a significant improvement in the prediction of "very poor" water quality. The proposed class-imbalance method for predicting rare events has an F-score of 0.84, and it significantly outperforms MLR and classification tree (CT) models, which have corresponding F-scores of 0.39 and 0.69. A robust beach water quality forecast system can hence be developed using hybrid MLR-binary classification modeling.
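The idea behind EasyEnsemble-style learning, as named in the abstract, can be sketched as follows: draw several random subsets of the majority class, each the size of the minority class, fit one learner per balanced subset, and combine the learners by voting. This is a simplified illustration and not the model used in the study. The original EasyEnsemble algorithm trains a boosted classifier on each subset, whereas here a one-dimensional midpoint threshold stands in as a hypothetical base learner, and all data values are invented toy numbers.

```python
import random


def fit_threshold(negatives, positives):
    """Toy base learner: midpoint between the two class means (1-D feature)."""
    return (sum(negatives) / len(negatives) + sum(positives) / len(positives)) / 2


def fit_balanced_ensemble(majority, minority, n_subsets=5, seed=0):
    """Fit one threshold per balanced subset (undersampled majority + all minority)."""
    rng = random.Random(seed)
    thresholds = []
    for _ in range(n_subsets):
        subset = rng.sample(majority, len(minority))  # undersample without replacement
        thresholds.append(fit_threshold(subset, minority))
    return thresholds


def predict(thresholds, x):
    """Majority vote: class 1 if more than half of the learners say x is above threshold."""
    votes = sum(1 for t in thresholds if x > t)
    return 1 if 2 * votes > len(thresholds) else 0


# Invented toy data: 100 majority-class values and 3 minority-class values.
majority = [float(v) for v in range(100)]
minority = [150.0, 160.0, 170.0]

ensemble = fit_balanced_ensemble(majority, minority)
print(predict(ensemble, 200.0))  # far above every learned threshold
print(predict(ensemble, 10.0))   # far below every learned threshold
```

Because every subset is balanced, no single learner is dominated by the majority class, and using several subsets means that more of the majority data is seen than with one undersampled training set.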
Affiliation(s)
- Jiuhao Guo
  - Department of Civil and Environmental Engineering, The Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong, China
- Joseph H W Lee
  - Macao Environmental Research Institute, Macau University of Science and Technology, Taipa, Macao, China
  - Institute for Advanced Study, The Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong, China
|
205
|
Fu C, Zhan Q, Liu W. Evidential reasoning based ensemble classifier for uncertain imbalanced data. Inf Sci (N Y) 2021. [DOI: 10.1016/j.ins.2021.07.027] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/28/2022]
|
206
|
A Novel Intrusion Detection Approach Using Machine Learning Ensemble for IoT Environments. APPLIED SCIENCES-BASEL 2021. [DOI: 10.3390/app112110268] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/16/2022]
Abstract
The Internet of Things (IoT) has gained significant importance due to its applicability in diverse environments. Another reason for the influence of the IoT is its use of a flexible and scalable framework. The extensive and diversified use of the IoT in the past few years has attracted cyber-criminals. They exploit the vulnerabilities of the open-source IoT framework that arise from the absence of robust and standard security protocols, hence discouraging existing and potential stakeholders. The authors propose a binary classifier approach developed from a machine learning ensemble method to filter and dump malicious traffic to prevent malicious actors from accessing the IoT network and its peripherals. The gradient boosting machine (GBM) ensemble approach is used to train the binary classifier using pre-processed recorded data packets to detect anomalies and protect IoT networks from zero-day attacks. The positive class performance metrics of the model resulted in an accuracy of 98.27%, a precision of 96.40%, and a recall of 95.70%. The simulation results prove the effectiveness of the proposed model against cyber threats, thus making it suitable for critical applications for the IoT.
|
207
|
Rezvani S, Wang X. Class imbalance learning using fuzzy ART and intuitionistic fuzzy twin support vector machines. Inf Sci (N Y) 2021. [DOI: 10.1016/j.ins.2021.07.010] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/25/2022]
|
208
|
Paul MK, Islam MR, Sattar AS. An efficient perturbation approach for multivariate data in sensitive and reliable data mining. JOURNAL OF INFORMATION SECURITY AND APPLICATIONS 2021. [DOI: 10.1016/j.jisa.2021.102954] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/26/2022]
|
209
|
Impact of the learners diversity and combination method on the generation of heterogeneous classifier ensembles. Appl Soft Comput 2021. [DOI: 10.1016/j.asoc.2021.107689] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/20/2022]
|
210
|
Yang L, Heiselman C, Quirk JG, Djurić PM. CLASS-IMBALANCED CLASSIFIERS USING ENSEMBLES OF GAUSSIAN PROCESSES AND GAUSSIAN PROCESS LATENT VARIABLE MODELS. PROCEEDINGS OF THE ... IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING. ICASSP (CONFERENCE) 2021; 2021. [PMID: 34712104 DOI: 10.1109/icassp39728.2021.9414754] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/07/2022]
Abstract
Classification with imbalanced data is a common and challenging problem in many practical machine learning problems. Ensemble learning is a popular solution where the results from multiple base classifiers are synthesized to reduce the effect of a possibly skewed distribution of the training set. In this paper, binary classifiers based on Gaussian processes are chosen as bases for inferring the predictive distributions of test latent variables. We apply a Gaussian process latent variable model where the outputs of the Gaussian processes are used for making the final decision. The tests of the new method in both synthetic and real data sets show improved performance over standard approaches.
Affiliation(s)
- Liu Yang
  - Department of Electrical and Computer Engineering, Stony Brook University, Stony Brook, NY 11794, USA
- Cassandra Heiselman
  - Department of Obstetrics, Gynecology and Reproductive Medicine, Stony Brook, NY 11794, USA
- J Gerald Quirk
  - Department of Obstetrics, Gynecology and Reproductive Medicine, Stony Brook, NY 11794, USA
- Petar M Djurić
  - Department of Electrical and Computer Engineering, Stony Brook University, Stony Brook, NY 11794, USA
|
211
|
Melek M, Melek N. Roza: a new and comprehensive metric for evaluating classification systems. Comput Methods Biomech Biomed Engin 2021; 25:1015-1027. [PMID: 34693834 DOI: 10.1080/10255842.2021.1995721] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/20/2022]
Abstract
Many metrics such as accuracy rate (ACC), area under curve (AUC), Jaccard index (JI), and Cohen's kappa coefficient are available to measure the success of pattern recognition and machine/deep learning systems. However, the superiority of one system over another cannot be determined based on the mentioned metrics. This is because a system can be successful according to one metric but not the others. Moreover, such metrics are insufficient when the number of samples in the classes is unequal (imbalanced data). In this case, naturally, a sensible comparison cannot be made between two given systems by using these metrics. In the present study, the comprehensive, fair, and accurate Roza metric is introduced for evaluating classification systems. (Roza means rose in Persian; when different permutations of the metrics used are superimposed in a polygon format, the result looks like a flower, hence the name.) This metric, which facilitates the comparison of systems, expresses the summary of many metrics with a single value. To verify the stability and validity of the metric and to conduct a comprehensive, fair, and accurate comparison between the systems, the Roza metrics of systems tested under the same conditions are calculated and compared. For this, systems tested with three different strategies on three different datasets are considered. The results show that the performance of the system can be summarized by a single value and that the Roza metric can be used as a powerful metric in all systems that include classification processes.
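The abstract's point that a system can look successful under one metric but not under others can be made concrete with a small sketch. It does not implement the Roza metric, whose definition is not given here; it only computes three of the listed metrics (accuracy, the Jaccard index for the positive class, and Cohen's kappa) on an invented, heavily imbalanced example in which a classifier always predicts the majority class.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)


def jaccard_index(tp, fp, fn):
    """Jaccard index of the positive class: TP / (TP + FP + FN)."""
    denom = tp + fp + fn
    return tp / denom if denom else 0.0


def cohen_kappa(tp, fp, fn, tn):
    """Agreement beyond chance between predictions and true labels."""
    n = tp + fp + fn + tn
    observed = (tp + tn) / n
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    if expected == 1.0:
        return 0.0  # degenerate case: chance agreement is already perfect
    return (observed - expected) / (1.0 - expected)


# Invented example: 95 negatives, 5 positives, classifier always predicts 0.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
acc = accuracy(tp, fp, fn, tn)
ji = jaccard_index(tp, fp, fn)
kappa = cohen_kappa(tp, fp, fn, tn)
print(acc, ji, kappa)
```

Accuracy is high here only because the negative class dominates, while the Jaccard index and kappa show that no positive sample is ever detected.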
Affiliation(s)
- Mesut Melek
  - Department of Electronics and Automation, Gumushane University, Gumushane, Turkey
- Negin Melek
  - Faculty of Engineering, Department of Electrical and Electronics Engineering, Avrasya University, Trabzon, Turkey
|
212
|
Sue KL, Tsai CF, Chiu A. The data sampling effect on financial distress prediction by single and ensemble learning techniques. COMMUN STAT-THEOR M 2021. [DOI: 10.1080/03610926.2021.1992439] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/16/2022]
Affiliation(s)
- Kuen-Liang Sue
  - Department of Information Management, National Central University, Taoyuan, Taiwan
- Chih-Fong Tsai
  - Department of Information Management, National Central University, Taoyuan, Taiwan
- Andy Chiu
  - Department of Information Management, National Central University, Taoyuan, Taiwan
|
213
|
RB-CCR: Radial-Based Combined Cleaning and Resampling algorithm for imbalanced data classification. Mach Learn 2021. [DOI: 10.1007/s10994-021-06012-8] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/25/2022]
Abstract
Real-world classification domains, such as medicine, health and safety, and finance, often exhibit imbalanced class priors and have asynchronous misclassification costs. In such cases, the classification model must achieve a high recall without significantly impacting precision. Resampling the training data is the standard approach to improving classification performance on imbalanced binary data. However, the state-of-the-art methods ignore the local joint distribution of the data or correct it as a post-processing step. This can cause sub-optimal shifts in the training distribution, particularly when the target data distribution is complex. In this paper, we propose Radial-Based Combined Cleaning and Resampling (RB-CCR). RB-CCR utilizes the concept of class potential to refine the energy-based resampling approach of CCR. In particular, RB-CCR exploits the class potential to accurately locate sub-regions of the data-space for synthetic oversampling. The category sub-region for oversampling can be specified as an input parameter to meet domain-specific needs or be automatically selected via cross-validation. Our $5\times 2$ cross-validated results on 57 benchmark binary datasets with 9 classifiers show that RB-CCR achieves a better precision-recall trade-off than CCR and generally outperforms the state-of-the-art resampling methods in terms of AUC and G-mean.
|
214
|
Qiao S, Han N, Huang F, Yue K, Wu T, Yi Y, Mao R, Yuan CA. LMNNB: Two-in-One imbalanced classification approach by combining metric learning and ensemble learning. APPL INTELL 2021. [DOI: 10.1007/s10489-021-02901-6] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/29/2022]
|
215
|
A New Integrated Approach for Landslide Data Balancing and Spatial Prediction Based on Generative Adversarial Networks (GAN). REMOTE SENSING 2021. [DOI: 10.3390/rs13194011] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/16/2022]
Abstract
Landslide susceptibility mapping has significantly progressed with improvements in machine learning techniques. However, the inventory/data imbalance (DI) problem remains one of the challenges in this domain. This problem exists because a good quality landslide inventory map, including a complete record of historical data, is difficult or expensive to collect. As such, this can considerably affect one's ability to obtain a sufficient inventory or representative samples. This research developed a new approach based on generative adversarial networks (GAN) to correct imbalanced landslide datasets. The proposed method was tested at Chukha Dzongkhag, Bhutan, one of the most landslide-prone areas in the Himalayan region. The proposed approach was then compared with standard methods such as the synthetic minority oversampling technique (SMOTE), dense imbalanced sampling, and sparse sampling (i.e., producing as many non-landslide samples as landslide samples). The comparisons were based on five machine learning models, including artificial neural networks (ANN), random forests (RF), decision trees (DT), k-nearest neighbours (kNN), and the support vector machine (SVM). The model evaluation was carried out based on overall accuracy (OA), Kappa Index, F1-score, and area under receiver operating characteristic curves (AUROC). The spatial database was established with a total of 269 landslides and 10 conditioning factors, including altitude, slope, aspect, total curvature, slope length, lithology, distance from the road, distance from the stream, topographic wetness index (TWI), and sediment transport index (STI). The findings of this study have shown that both GAN and SMOTE data balancing approaches have helped to improve the accuracy of machine learning models. According to AUROC, with default parameters the GAN method boosted the models to maximum accuracies of 0.918 (ANN), 0.933 (RF), 0.927 (DT), 0.878 (kNN), and 0.907 (SVM). With the optimum parameters, all models performed best with GAN, at their highest accuracies of 0.927 (ANN), 0.943 (RF), 0.923 (DT), and 0.889 (kNN), except SVM, which obtained its highest accuracy (0.906) with SMOTE. Our findings suggest that RF balanced with GAN can provide the most reasonable criterion for landslide prediction. This research indicates that landslide data balancing may substantially affect the predictive capabilities of machine learning models. Therefore, the issue of DI in the spatial prediction of landslides should not be ignored. Future studies could explore other generative models for landslide data balancing. By using state-of-the-art GAN, the proposed model can be considered in areas where the data are limited or imbalanced.
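For readers unfamiliar with SMOTE, the baseline against which the GAN approach is compared, the core step is linear interpolation between a minority sample and one of its minority-class neighbours. The sketch below is a simplified illustration under stated assumptions, not the study's pipeline: it uses only the single nearest neighbour (standard SMOTE picks at random among k nearest neighbours), and the function names and toy points are hypothetical.

```python
import random


def nearest_neighbor(x, others):
    """Return the point in `others` closest to x (squared Euclidean distance)."""
    return min(others, key=lambda o: sum((a - b) ** 2 for a, b in zip(x, o)))


def interpolate(x, neighbor, rng):
    """SMOTE-style synthetic point: x + u * (neighbor - x), with u drawn from [0, 1)."""
    u = rng.random()
    return [xi + u * (ni - xi) for xi, ni in zip(x, neighbor)]


def oversample(minority, n_new, seed=0):
    """Generate n_new synthetic minority points by nearest-neighbour interpolation."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        others = [m for m in minority if m is not x]
        synthetic.append(interpolate(x, nearest_neighbor(x, others), rng))
    return synthetic


# Toy minority class with two features per sample (invented values).
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_points = oversample(minority, n_new=4)
for point in new_points:
    print(point)
```

Every synthetic point lies on a line segment between two existing minority samples, which is why SMOTE cannot create samples outside the region already covered by the minority class. Generative models such as GANs are not restricted in this way.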
|
216
|
Yao L, Lin TB. Evolutionary Mahalanobis Distance-Based Oversampling for Multi-Class Imbalanced Data Classification. SENSORS (BASEL, SWITZERLAND) 2021; 21:6616. [PMID: 34640936 PMCID: PMC8512012 DOI: 10.3390/s21196616] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 07/23/2021] [Revised: 09/14/2021] [Accepted: 09/29/2021] [Indexed: 11/18/2022]
Abstract
Sensing data are often imbalanced across data classes, for which oversampling of the minority class is an effective remedy. In this paper, an effective oversampling method called evolutionary Mahalanobis distance oversampling (EMDO) is proposed for multi-class imbalanced data classification. EMDO utilizes a set of ellipsoids to approximate the decision regions of the minority class. Furthermore, multi-objective particle swarm optimization (MOPSO) is integrated with the Gustafson-Kessel algorithm in EMDO to learn the size, center, and orientation of every ellipsoid. Synthetic minority samples are generated based on Mahalanobis distance within every ellipsoid. The number of synthetic minority samples generated by EMDO in every ellipsoid is determined based on the density of minority samples in that ellipsoid. The results of computer simulations conducted herein indicate that EMDO outperforms most of the widely used oversampling schemes.
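The Mahalanobis distance on which EMDO's ellipsoids rely can be shown with a short sketch. It covers only the distance itself for two-dimensional data, not the MOPSO or Gustafson-Kessel steps described above, and the helper names and sample points are assumptions made for illustration.

```python
import math


def mean_and_covariance_2d(points):
    """Sample mean and (unbiased) covariance matrix of 2-D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), ((sxx, sxy), (sxy, syy))


def mahalanobis_2d(x, mean, cov):
    """sqrt((x - mean)^T cov^{-1} (x - mean)) using the closed-form 2x2 inverse."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det <= 0:
        raise ValueError("covariance matrix must be positive definite")
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    quad = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.sqrt(quad)


# Invented minority cluster that is wider along x than along y.
cluster = [(0.0, 0.0), (2.0, 0.0), (0.0, 1.0), (2.0, 1.0)]
mean, cov = mean_and_covariance_2d(cluster)

d_along_x = mahalanobis_2d((2.0, 0.5), mean, cov)  # one unit from the mean along x
d_along_y = mahalanobis_2d((1.0, 1.5), mean, cov)  # one unit from the mean along y
print(d_along_x, d_along_y)
```

Both query points are one Euclidean unit from the mean, yet the one displaced along the narrow axis is farther in Mahalanobis terms. Points at a constant Mahalanobis distance form an ellipse aligned with the cluster, which is what makes the distance a natural fit for ellipsoidal regions.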
Affiliation(s)
- Leehter Yao
  - Department of Electrical Engineering, National Taipei University of Technology, Taipei 10618, Taiwan
|
217
|
Interval modelling in optimization of k‐NN classifiers for large number of attributes in data sets on an example of DNA microarrays. INT J INTELL SYST 2021. [DOI: 10.1002/int.22679] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/28/2022]
|
218
|
|
219
|
Abstract
Over the years, a plethora of cost-sensitive methods have been proposed for learning on data when different types of misclassification errors incur different costs. Our contribution is a unifying framework that provides a comprehensive and insightful overview on cost-sensitive ensemble methods, pinpointing their differences and similarities via a fine-grained categorization. Our framework contains natural extensions and generalisations of ideas across methods, be it AdaBoost, Bagging or Random Forest, and as a result not only yields all methods known to date but also some not previously considered.
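As a minimal illustration of what "different types of misclassification errors incur different costs" means in practice, the sketch below applies the standard cost-based decision threshold to a predicted probability. It is a generic textbook rule, not the framework described in the abstract; it assumes that correct decisions cost nothing, and the cost values are hypothetical.

```python
def cost_sensitive_threshold(cost_fp, cost_fn):
    """Probability above which predicting the positive class has lower expected cost.

    Predicting positive costs (1 - p) * cost_fp in expectation and predicting
    negative costs p * cost_fn, so positive is cheaper when
    p >= cost_fp / (cost_fp + cost_fn).
    """
    if cost_fp <= 0 or cost_fn <= 0:
        raise ValueError("costs must be positive")
    return cost_fp / (cost_fp + cost_fn)


def predict(prob_positive, cost_fp, cost_fn):
    """Return 1 if the positive prediction minimizes expected cost, else 0."""
    return 1 if prob_positive >= cost_sensitive_threshold(cost_fp, cost_fn) else 0


# Hypothetical costs: a missed positive is nine times as costly as a false alarm.
threshold = cost_sensitive_threshold(cost_fp=1.0, cost_fn=9.0)
with_costs = predict(0.2, cost_fp=1.0, cost_fn=9.0)
equal_costs = predict(0.2, cost_fp=1.0, cost_fn=1.0)
print(threshold, with_costs, equal_costs)
```

With equal costs the rule reduces to the usual 0.5 cut-off, so a sample with probability 0.2 is labelled negative. When false negatives are more expensive the threshold drops and the same sample is labelled positive.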
|
220
|
Mohammed SA, Mohammed AR, Cote D, Shirmohammadi S. A Machine-Learning-Based Action Recommender for Network Operation Centers. IEEE TRANSACTIONS ON NETWORK AND SERVICE MANAGEMENT 2021. [DOI: 10.1109/tnsm.2021.3095463] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/09/2022]
|
221
|
Juez-Gil M, Arnaiz-González Á, Rodríguez JJ, García-Osorio C. Experimental evaluation of ensemble classifiers for imbalance in Big Data. Appl Soft Comput 2021. [DOI: 10.1016/j.asoc.2021.107447] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/21/2022]
|
222
|
Zou JY, Sun MX, Liu KH, Wu QQ. The design of dynamic ensemble selection strategy for the error-correcting output codes family. Inf Sci (N Y) 2021. [DOI: 10.1016/j.ins.2021.04.038] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/26/2022]
|
223
|
Xu Z, Shen D, Nie T, Kou Y, Yin N, Han X. A cluster-based oversampling algorithm combining SMOTE and k-means for imbalanced medical data. Inf Sci (N Y) 2021. [DOI: 10.1016/j.ins.2021.02.056] [Citation(s) in RCA: 22] [Impact Index Per Article: 5.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/29/2022]
|
224
|
SecProCT: In Silico Prediction of Human Secretory Proteins Based on Capsule Network and Transformer. Int J Mol Sci 2021; 22:ijms22169054. [PMID: 34445760 PMCID: PMC8396571 DOI: 10.3390/ijms22169054] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/13/2021] [Revised: 08/12/2021] [Accepted: 08/20/2021] [Indexed: 12/23/2022] Open
Abstract
Identifying secretory proteins from blood, saliva or other body fluids has become an effective method of diagnosing diseases. Existing secretory protein prediction methods are mainly based on conventional machine learning algorithms and are highly dependent on the feature set from the protein. In this article, we propose a deep learning model based on the capsule network and transformer architecture, SecProCT, to predict secretory proteins using only amino acid sequences. The proposed model was validated using cross-validation and achieved 0.921 and 0.892 accuracy for predicting blood-secretory proteins and saliva-secretory proteins, respectively. Meanwhile, the proposed model was validated on an independent test set and achieved 0.917 and 0.905 accuracy for predicting blood-secretory proteins and saliva-secretory proteins, respectively, which are better than conventional machine learning methods and other deep learning methods for biological sequence analysis. The main contributions of this article are as follows: (1) a deep learning model based on a capsule network and transformer architecture is proposed for predicting secretory proteins. The results of this model are better than those of existing conventional machine learning methods and deep learning methods for biological sequence analysis; (2) only amino acid sequences are used in the proposed model, which overcomes the high dependence of existing methods on the annotated protein features; (3) the proposed model can accurately predict most experimentally verified secretory proteins and cancer protein biomarkers in blood and saliva.
|
225
|
Lo CL, Yang YH, Tseng HT. A Fact-Finding Procedure Integrating Machine Learning and AHP Technique to Predict Delayed Diagnosis of Bladder Patients with Hematuria. JOURNAL OF HEALTHCARE ENGINEERING 2021; 2021:3831453. [PMID: 34462648 PMCID: PMC8403036 DOI: 10.1155/2021/3831453] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 05/21/2021] [Revised: 07/23/2021] [Accepted: 08/12/2021] [Indexed: 01/04/2023]
Abstract
Bladder cancer, the ninth most common cancer worldwide, requires fast diagnosis and treatment to prevent disease progression and improve patient survival. However, patients with bladder cancer often experience considerable delays in diagnosis. One reason for such delays is that hematuria, a major symptom of bladder cancer, has a high probability of also being a warning sign for urinary tract diseases. Another reason is that the sensitivity of the body parts affected by bladder cancer deters patients from undergoing cystoscopy and influences patients' "physician shopping" behavior. In this study, the analytic hierarchy process was used to determine critical variables influencing delayed diagnosis; moreover, the variables were used to construct models for predicting delayed diagnosis in patients with hematuria by using several machine learning techniques. Furthermore, the critical variables associated with delayed diagnosis of bladder cancer in patients with hematuria were evaluated using GainRatio technology. The study sample was selected from a population-based database. The model evaluation results indicated that the prediction model established using decision tree algorithms outperformed the other models. The critical risk factors for delayed diagnosis of bladder cancer were as follows: (1) cystoscopy performed 6 months after hematuria diagnosis and (2) physician shopping.
Affiliation(s)
- Chia-Lun Lo
  - Department of Health-Business Administration, Fooyin University, Kaohsiung 83102, Taiwan
- Ya-Hui Yang
  - Department of Health-Business Administration, Fooyin University, Kaohsiung 83102, Taiwan
- Hsiao-Ting Tseng
  - Department of Information Management, National Central University, Taoyuan 32001, Taiwan
|
226
|
|
227
|
Handling Class-Imbalance with KNN (Neighbourhood) Under-Sampling for Software Defect Prediction. Artif Intell Rev 2021. [DOI: 10.1007/s10462-021-10044-w] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/22/2022]
|
228
|
Spatial Modeling of Asthma-Prone Areas Using Remote Sensing and Ensemble Machine Learning Algorithms. REMOTE SENSING 2021. [DOI: 10.3390/rs13163222] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/18/2022]
Abstract
In this study, asthma-prone area modeling of Tehran, Iran was provided by employing three ensemble machine learning algorithms (Bootstrap aggregating (Bagging), Adaptive Boosting (AdaBoost), and Stacking). First, a spatial database was created with 872 locations of asthma patients and affecting factors (particulate matter (PM10 and PM2.5), ozone (O3), sulfur dioxide (SO2), carbon monoxide (CO), nitrogen dioxide (NO2), rainfall, wind speed, humidity, temperature, distance to street, traffic volume, and a normalized difference vegetation index (NDVI)). We created four factors using remote sensing (RS) imagery, including air pollution (O3, SO2, CO, and NO2), altitude, and NDVI. All criteria were prepared using a geographic information system (GIS). For modeling and validation, 70% and 30% of the data were used, respectively. The weight of evidence (WOE) model was used to assess the spatial relationship between the dependent and independent data. Finally, three ensemble algorithms were used to perform asthma-prone areas mapping. According to the Gini index, the most influential factors on asthma occurrence were distance to the street, NDVI, and traffic volume. The area under the curve (AUC) of receiver operating characteristic (ROC) values for the AdaBoost, Bagging, and Stacking algorithms was 0.849, 0.82, and 0.785, respectively. According to the findings, the AdaBoost algorithm outperforms the Bagging and Stacking algorithms in spatial modeling of asthma-prone areas.
|
229
|
Mohammed M, Mwambi H, Mboya IB, Elbashir MK, Omolo B. A stacking ensemble deep learning approach to cancer type classification based on TCGA data. Sci Rep 2021; 11:15626. [PMID: 34341396 PMCID: PMC8329290 DOI: 10.1038/s41598-021-95128-x] [Citation(s) in RCA: 36] [Impact Index Per Article: 9.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/25/2020] [Accepted: 07/19/2021] [Indexed: 12/13/2022] Open
Abstract
Cancer tumor classification based on morphological characteristics alone has been shown to have serious limitations. Breast, lung, colorectal, thyroid, and ovarian are the most commonly diagnosed cancers among women. Precise classification of cancers into their types is considered a vital problem for cancer diagnosis and therapy. In this paper, we proposed a stacking ensemble deep learning model based on one-dimensional convolutional neural network (1D-CNN) to perform a multi-class classification on the five common cancers among women based on RNASeq data. The RNASeq gene expression data was downloaded from Pan-Cancer Atlas using the GDCquery function of the TCGAbiolinks package in the R software. We used least absolute shrinkage and selection operator (LASSO) as the feature selection method. We compared the results of the proposed model with and without LASSO with the results of the single 1D-CNN and machine learning methods which include support vector machines with radial basis function, linear, and polynomial kernels; artificial neural networks; k-nearest neighbors; and bagging trees. The results show that the proposed model with and without LASSO has a better performance compared to other classifiers. Also, the results show that the machine learning methods (SVM-R, SVM-L, SVM-P, ANN, KNN, and bagging trees) with under-sampling have better performance than with over-sampling techniques. This is supported by the statistical significance test of accuracy where the p-values for differences between the SVM-R and SVM-P, SVM-R and ANN, SVM-R and KNN are found to be p = 0.003, p < 0.001, and p < 0.001, respectively. Also, SVM-L had a significant difference compared to ANN (p = 0.009). Moreover, SVM-P and ANN, SVM-P and KNN are found to be significantly different with p-values p < 0.001 and p < 0.001, respectively. In addition, ANN and bagging trees, ANN and KNN were found to be significantly different with p-values p < 0.001 and p = 0.004, respectively.
Thus, the proposed model can help in the early detection and diagnosis of cancer in women, and hence aid in designing early treatment strategies to improve survival.
Affiliation(s)
- Mohanad Mohammed
- School of Mathematics, Statistics and Computer Science, University of KwaZulu-Natal, Pietermaritzburg, Private Bag X01, Scottsville, 3209, South Africa.
| | - Henry Mwambi
- School of Mathematics, Statistics and Computer Science, University of KwaZulu-Natal, Pietermaritzburg, Private Bag X01, Scottsville, 3209, South Africa
| | - Innocent B Mboya
- School of Mathematics, Statistics and Computer Science, University of KwaZulu-Natal, Pietermaritzburg, Private Bag X01, Scottsville, 3209, South Africa
- Department of Epidemiology and Biostatistics, Kilimanjaro Christian Medical University College (KCMUCo), P. O. Box 2240, Moshi, Tanzania
| | - Murtada K Elbashir
- College of Computer and Information Sciences, Jouf University, Sakaka, 72441, Saudi Arabia
- Faculty of Mathematical and Computer Sciences, University of Gezira, Wad Madani, 11123, Sudan
| | - Bernard Omolo
- School of Mathematics, Statistics and Computer Science, University of KwaZulu-Natal, Pietermaritzburg, Private Bag X01, Scottsville, 3209, South Africa
- Division of Mathematics and Computer Science, University of South Carolina-Upstate, 800 University Way, Spartanburg, USA
- School of Public Health, Faculty of Health Sciences, University of Witwatersrand, Johannesburg, South Africa
| |
|
230
|
Cost-Sensitive Classification for Evolving Data Streams with Concept Drift and Class Imbalance. COMPUTATIONAL INTELLIGENCE AND NEUROSCIENCE 2021; 2021:8813806. [PMID: 34381499 PMCID: PMC8352686 DOI: 10.1155/2021/8813806] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 08/07/2020] [Revised: 07/04/2021] [Accepted: 07/21/2021] [Indexed: 11/17/2022]
Abstract
Class imbalance and concept drift are two primary issues that exist concurrently in data stream classification. Although the two issues have each drawn considerable attention separately, their joint treatment remains largely unexplored. Moreover, the class imbalance issue is further complicated when data streams exhibit concept drift. A novel Cost-Sensitive based Data Stream (CSDS) classification is introduced to overcome the two issues simultaneously. The CSDS considers cost information during the procedures of data preprocessing and classification. During the data preprocessing, a cost-sensitive learning strategy is introduced into the ReliefF algorithm for alleviating the class imbalance at the data level. In the classification process, a cost-sensitive weighting schema is devised to enhance the overall performance of the ensemble. Besides, a change detection mechanism is embedded in our algorithm, which guarantees that an ensemble can capture and react to drift promptly. Experimental results validate that our method can obtain better classification results under different imbalanced concept drifting data stream scenarios.
|
231
|
A novel rhinitis prediction method for class imbalance. Biomed Signal Process Control 2021. [DOI: 10.1016/j.bspc.2021.102821] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/21/2022]
|
232
|
Diallo M, Xiong S, Emiru ED, Fesseha A, Abdulsalami AO, Elaziz MA. A Hybrid MultiLayer Perceptron Under-Sampling with Bagging Dealing with a Real-Life Imbalanced Rice Dataset. INFORMATION 2021; 12:291. [DOI: 10.3390/info12080291] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 09/02/2023] Open
Abstract
Classification algorithms have shown exceptional prediction results in the supervised learning area. These classification algorithms are not always efficient when it comes to real-life datasets due to class distributions. As a result, datasets for real-life applications are generally imbalanced. Several methods have been proposed to solve the problem of class imbalance. In this paper, we propose a hybrid method combining the preprocessing techniques and those of ensemble learning. The original training set is undersampled by evaluating the samples by stochastic measurement (SM) and then training these samples selected by Multilayer Perceptron to return a balanced training set. The MLPUS (Multilayer perceptron undersampling) balanced training set is aggregated using the bagging ensemble method. We applied our method to the real-life Niger_Rice dataset and forty-four other imbalanced datasets from the KEEL repository in this study. We also compared our method with six other existing methods in the literature, such as the MLP classifier on the original imbalance dataset, MLPUS, UnderBagging (combining random under-sampling and bagging), RUSBoost, SMOTEBagging (Synthetic Minority Oversampling Technique and bagging), SMOTEBoost. The results show that our method is competitive compared to other methods. The Niger_Rice real-life dataset results are 75.6, 0.73, 0.76, and 0.86, respectively, for accuracy, F-measure, G-mean, and ROC with our proposed method. In contrast, the MLP classifier on the original imbalance Niger_Rice dataset gives results 72.44, 0.82, 0.59, and 0.76 respectively for accuracy, F-measure, G-mean, and ROC.
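The abstract above lists UnderBagging (random under-sampling combined with bagging) among the compared methods. The following is a minimal sketch of the bag-construction step only, under stated assumptions: the function name `balanced_bags`, its parameters, and the toy labels are hypothetical and are not taken from the paper, and the sketch is not the authors' MLPUS procedure. Each bag draws the same number of indices from every class, so a base classifier trained on it would see a balanced sample.

```python
import random
from collections import Counter


def balanced_bags(labels, n_bags=5, seed=0):
    """Return n_bags lists of sample indices with equal counts per class.

    Hypothetical helper: each class is randomly under-sampled (without
    replacement) to the size of the smallest class, once per bag.
    """
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)

    minority_size = min(len(indices) for indices in by_class.values())

    bags = []
    for _ in range(n_bags):
        bag = []
        for indices in by_class.values():
            bag.extend(rng.sample(indices, minority_size))
        bags.append(sorted(bag))
    return bags


# Toy imbalanced label vector: 90 majority samples, 10 minority samples.
labels = [0] * 90 + [1] * 10
bags = balanced_bags(labels, n_bags=3)
counts = [Counter(labels[i] for i in bag) for bag in bags]
```

In a full ensemble, one base classifier would be trained per bag and the predictions combined by voting; that part is omitted here.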
|
233
|
Abstract
Class imbalance and high dimensionality are two major issues in several real-life applications, e.g., in the fields of bioinformatics, text mining and image classification. However, while both issues have been extensively studied in the machine learning community, they have mostly been treated separately, and little research has been thus far conducted on which approaches might be best suited to deal with datasets that are class-imbalanced and high-dimensional at the same time (i.e., with a large number of features). This work attempts to give a contribution to this challenging research area by studying the effectiveness of hybrid learning strategies that involve the integration of feature selection techniques, to reduce the data dimensionality, with proper methods that cope with the adverse effects of class imbalance (in particular, data balancing and cost-sensitive methods are considered). Extensive experiments have been carried out across datasets from different domains, leveraging a well-known classifier, the Random Forest, which has proven to be effective in high-dimensional spaces and has also been successfully applied to imbalanced tasks. Our results give evidence of the benefits of such a hybrid approach, when compared to using only feature selection or imbalance learning methods alone.
|
234
|
Applying random forest in a health administrative data context: a conceptual guide. HEALTH SERVICES AND OUTCOMES RESEARCH METHODOLOGY 2021. [DOI: 10.1007/s10742-021-00255-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/25/2022]
|
235
|
On Combining Feature Selection and Over-Sampling Techniques for Breast Cancer Prediction. APPLIED SCIENCES-BASEL 2021. [DOI: 10.3390/app11146574] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/16/2022]
Abstract
Breast cancer prediction datasets are usually class imbalanced, where the numbers of data samples in the malignant and benign patient classes are significantly different. Over-sampling techniques can be used to re-balance the datasets to construct more effective prediction models. Moreover, some related studies have considered feature selection to remove irrelevant features from the datasets for further performance improvement. However, since the order of combining feature selection and over-sampling can result in different training sets to construct the prediction model, it is unknown which order performs better. In this paper, the information gain (IG) and genetic algorithm (GA) feature selection methods and the synthetic minority over-sampling technique (SMOTE) are used for different combinations. The experimental results based on two breast cancer datasets show that the combination of feature selection and over-sampling outperforms the single usage of either feature selection or over-sampling for the highly class imbalanced datasets. In particular, performing IG first and SMOTE second is the better choice. For other datasets with a small class imbalance ratio and a smaller number of features, performing SMOTE is enough to construct an effective prediction model.
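For readers unfamiliar with SMOTE, the core idea is to create synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. The sketch below illustrates only that interpolation step; the function name `smote_like`, the parameter `k`, and the toy points are hypothetical, and this is a simplified illustration rather than the implementation used in the paper.

```python
import math
import random


def smote_like(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points by interpolating minority samples.

    Hypothetical, simplified variant of SMOTE: pick a random minority
    point, pick one of its k nearest minority neighbours, and place a
    new point at a random position on the segment between them.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point, excluding itself.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(p, base),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy minority class: the four corners of the unit square.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_like(minority, n_new=6)
```

Because the synthetic points depend on the feature space in which distances are computed, running feature selection before or after this step yields different training sets, which is the ordering question the abstract raises.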
|
236
|
Risky Driver Recognition with Class Imbalance Data and Automated Machine Learning Framework. INTERNATIONAL JOURNAL OF ENVIRONMENTAL RESEARCH AND PUBLIC HEALTH 2021; 18:ijerph18147534. [PMID: 34299986 PMCID: PMC8305749 DOI: 10.3390/ijerph18147534] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 05/12/2021] [Revised: 06/26/2021] [Accepted: 07/03/2021] [Indexed: 11/26/2022]
Abstract
Identifying high-risk drivers before an accident happens is necessary for traffic accident control and prevention. Due to the class-imbalance nature of driving data, high-risk samples as the minority class are usually ill-treated by standard classification algorithms. Instead of applying preset sampling or cost-sensitive learning, this paper proposes a novel automated machine learning framework that simultaneously and automatically searches for the optimal sampling, cost-sensitive loss function, and probability calibration to handle the class-imbalance problem in recognizing risky drivers. The hyperparameters that control sampling ratio and class weight, along with other hyperparameters, are optimized by Bayesian optimization. To demonstrate the performance of the proposed automated learning framework, we establish a risky driver recognition model as a case study, using video-extracted vehicle trajectory data of 2427 private cars on a German highway. Based on rear-end collision risk evaluation, only 4.29% of all drivers are labeled as risky drivers. The inputs of the recognition model are the discrete Fourier transform coefficients of the target vehicle’s longitudinal speed, lateral speed, and the gap between the target vehicle and its preceding vehicle. Among 12 sampling methods, 2 cost-sensitive loss functions, and 2 probability calibration methods, the result of automated machine learning is consistent with manual searching but much more computationally efficient. We find that the combination of Support Vector Machine-based Synthetic Minority Oversampling TEchnique (SVMSMOTE) sampling, cost-sensitive cross-entropy loss function, and isotonic regression can significantly improve the recognition ability and reduce the error of predicted probability.
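A cost-sensitive cross-entropy loss, as mentioned in the abstract above, can be illustrated by weighting the minority (positive) class term more heavily. The sketch below is one common form of such a loss, not necessarily the one used in the paper; the function name `weighted_cross_entropy`, the `pos_weight` parameter, and the toy values are assumptions made for illustration.

```python
import math


def weighted_cross_entropy(y_true, p_pred, pos_weight=1.0, eps=1e-12):
    """Mean binary cross-entropy with an extra weight on positive samples.

    Hypothetical helper: pos_weight > 1 makes errors on the positive
    (minority) class cost more than errors on the negative class.
    """
    total = 0.0
    for y, p in zip(y_true, p_pred):
        # Clip probabilities to avoid log(0).
        p = min(max(p, eps), 1.0 - eps)
        if y == 1:
            total += -pos_weight * math.log(p)
        else:
            total += -math.log(1.0 - p)
    return total / len(y_true)


# One risky driver (label 1) that the model scores too low,
# and three non-risky drivers (label 0) scored correctly low.
y_true = [1, 0, 0, 0]
p_pred = [0.2, 0.1, 0.1, 0.1]

plain = weighted_cross_entropy(y_true, p_pred)
weighted = weighted_cross_entropy(y_true, p_pred, pos_weight=10.0)
```

With `pos_weight=10.0` the single misjudged positive sample dominates the loss, which is the effect a class weight is meant to have; in the framework described above, such a weight would be one of the hyperparameters searched automatically.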
|
237
|
A Review of Fuzzy and Pattern-Based Approaches for Class Imbalance Problems. APPLIED SCIENCES-BASEL 2021. [DOI: 10.3390/app11146310] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/17/2022]
Abstract
The usage of imbalanced databases is a recurrent problem in real-world data such as medical diagnostic, fraud detection, and pattern recognition. Nevertheless, in class imbalance problems, the classifiers are commonly biased by the class with more objects (majority class) and ignore the class with fewer objects (minority class). There are different ways to solve the class imbalance problem, and there has been a trend towards the usage of patterns and fuzzy approaches due to the favorable results. In this paper, we provide an in-depth review of popular methods for imbalanced databases related to patterns and fuzzy approaches. The reviewed papers include classifiers, data preprocessing, and evaluation metrics. We identify different application domains and describe how the methods are used. Finally, we suggest further research directions according to the analysis of the reviewed papers and the trend of the state of the art.
|
238
|
Vallone F, Ottaviani MM, Dedola F, Cutrone A, Romeni S, Panarese AM, Bernini F, Cracchiolo M, Strauss I, Gabisonia K, Gorgodze N, Mazzoni A, Recchia FA, Micera S. Simultaneous decoding of cardiovascular and respiratory functional changes from pig intraneural vagus nerve signals. J Neural Eng 2021; 18. [PMID: 34153949 DOI: 10.1088/1741-2552/ac0d42] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/29/2020] [Accepted: 06/21/2021] [Indexed: 12/15/2022]
Abstract
Objective. Bioelectronic medicine is opening new perspectives for the treatment of some major chronic diseases through the physical modulation of autonomic nervous system activity. Being the main peripheral route for electrical signals between the central nervous system and visceral organs, the vagus nerve (VN) is one of the most promising targets. Closed-loop VN stimulation (VNS) would be crucial to increase the effectiveness of this approach. Therefore, the extrapolation of useful physiological information from VN electrical activity would represent an invaluable source for single-target applications. Here, we present an advanced decoding algorithm novel to VN studies and properly detecting different functional changes from VN signals. Approach. VN signals were recorded using intraneural electrodes in anaesthetized pigs during cardiovascular and respiratory challenges mimicking increases in arterial blood pressure, tidal volume and respiratory rate. We developed a decoding algorithm that combines discrete wavelet transformation, principal component analysis, and ensemble learning made of classification trees. Main results. The new decoding algorithm robustly achieved high accuracy levels in identifying different functional changes and discriminating among them. Interestingly, our findings suggest that electrode positioning plays an important role in decoding performance. We also introduced a new index for the characterization of recording and decoding performance of neural interfaces. Finally, by combining an anatomically validated hybrid neural model and discrimination analysis, we provided new evidence suggesting a functional topographical organization of VN fascicles. Significance. This study represents an important step towards the comprehension of VN signaling, paving the way for the development of effective closed-loop VNS systems.
Affiliation(s)
- Fabio Vallone
- The BioRobotics Institute and Department of Excellence in Robotics and Artificial Intelligence, Scuola Superiore Sant'Anna, Pisa, Italy
| | - Matteo Maria Ottaviani
- The BioRobotics Institute and Department of Excellence in Robotics and Artificial Intelligence, Scuola Superiore Sant'Anna, Pisa, Italy; Institute of Life Sciences, Scuola Superiore Sant'Anna, Pisa, Italy
| | - Francesca Dedola
- The BioRobotics Institute and Department of Excellence in Robotics and Artificial Intelligence, Scuola Superiore Sant'Anna, Pisa, Italy
| | - Annarita Cutrone
- The BioRobotics Institute and Department of Excellence in Robotics and Artificial Intelligence, Scuola Superiore Sant'Anna, Pisa, Italy
| | - Simone Romeni
- Bertarelli Foundation Chair in Translational Neural Engineering, Center for Neuroprosthetics and Institute of Bioengineering, Ecole Polytechnique Federale de Lausanne, Lausanne, Switzerland
| | - Adele Macrí Panarese
- The BioRobotics Institute and Department of Excellence in Robotics and Artificial Intelligence, Scuola Superiore Sant'Anna, Pisa, Italy
| | - Fabio Bernini
- Institute of Life Sciences, Scuola Superiore Sant'Anna, Pisa, Italy
| | - Marina Cracchiolo
- The BioRobotics Institute and Department of Excellence in Robotics and Artificial Intelligence, Scuola Superiore Sant'Anna, Pisa, Italy
| | - Ivo Strauss
- The BioRobotics Institute and Department of Excellence in Robotics and Artificial Intelligence, Scuola Superiore Sant'Anna, Pisa, Italy
| | - Khatia Gabisonia
- Institute of Life Sciences, Scuola Superiore Sant'Anna, Pisa, Italy; Fondazione Toscana Gabriele Monasterio, Pisa, Italy
| | - Nikoloz Gorgodze
- Institute of Life Sciences, Scuola Superiore Sant'Anna, Pisa, Italy; Fondazione Toscana Gabriele Monasterio, Pisa, Italy
| | - Alberto Mazzoni
- The BioRobotics Institute and Department of Excellence in Robotics and Artificial Intelligence, Scuola Superiore Sant'Anna, Pisa, Italy
| | - Fabio A Recchia
- Institute of Life Sciences, Scuola Superiore Sant'Anna, Pisa, Italy; Fondazione Toscana Gabriele Monasterio, Pisa, Italy; Department of Physiology, Cardiovascular Research Center, Lewis Katz School of Medicine at Temple University, Philadelphia, PA, United States of America
| | - Silvestro Micera
- The BioRobotics Institute and Department of Excellence in Robotics and Artificial Intelligence, Scuola Superiore Sant'Anna, Pisa, Italy; Bertarelli Foundation Chair in Translational Neural Engineering, Center for Neuroprosthetics and Institute of Bioengineering, Ecole Polytechnique Federale de Lausanne, Lausanne, Switzerland
| |
|
239
|
Dongdong L, Ziqiu C, Bolu W, Zhe W, Hai Y, Wenli D. Entropy‐based hybrid sampling ensemble learning for imbalanced data. INT J INTELL SYST 2021. [DOI: 10.1002/int.22388] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022]
Affiliation(s)
- Li Dongdong
- Key Laboratory of Advanced Control and Optimization for Chemical Processes, Ministry of Education East China University of Science and Technology Shanghai China
- Department of Computer Science and Engineering East China University of Science and Technology Shanghai China
- Provincial Key Laboratory for Computer Information Processing Technology Soochow University Suzhou China
| | - Chi Ziqiu
- Department of Computer Science and Engineering East China University of Science and Technology Shanghai China
| | - Wang Bolu
- Department of Computer Science and Engineering East China University of Science and Technology Shanghai China
| | - Wang Zhe
- Key Laboratory of Advanced Control and Optimization for Chemical Processes, Ministry of Education East China University of Science and Technology Shanghai China
- Department of Computer Science and Engineering East China University of Science and Technology Shanghai China
| | - Yang Hai
- Key Laboratory of Advanced Control and Optimization for Chemical Processes, Ministry of Education East China University of Science and Technology Shanghai China
- Department of Computer Science and Engineering East China University of Science and Technology Shanghai China
| | - Du Wenli
- Key Laboratory of Advanced Control and Optimization for Chemical Processes, Ministry of Education East China University of Science and Technology Shanghai China
| |
|
240
|
Kang Y, Jia N, Cui R, Deng J. A graph-based semi-supervised reject inference framework considering imbalanced data distribution for consumer credit scoring. Appl Soft Comput 2021. [DOI: 10.1016/j.asoc.2021.107259] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/30/2022]
|
241
|
Using Hybrid Artificial Intelligence and Evolutionary Optimization Algorithms for Estimating Soybean Yield and Fresh Biomass Using Hyperspectral Vegetation Indices. REMOTE SENSING 2021. [DOI: 10.3390/rs13132555] [Citation(s) in RCA: 26] [Impact Index Per Article: 6.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/21/2022]
Abstract
Recent advanced high-throughput field phenotyping combined with sophisticated big data analysis methods have provided plant breeders with unprecedented tools for a better prediction of important agronomic traits, such as yield and fresh biomass (FBIO), at early growth stages. This study aimed to demonstrate the potential use of 35 selected hyperspectral vegetation indices (HVI), collected at the R5 growth stage, for predicting soybean seed yield and FBIO. Two artificial intelligence algorithms, ensemble-bagging (EB) and deep neural network (DNN), were used to predict soybean seed yield and FBIO using HVI. Considering HVI as input variables, the coefficients of determination (R2) of 0.76 and 0.77 for yield and 0.91 and 0.89 for FBIO were obtained using DNN and EB, respectively. In this study, we also used hybrid DNN-SPEA2 to estimate the optimum HVI values in soybeans with maximized yield and FBIO productions. In addition, to identify the most informative HVI in predicting yield and FBIO, the feature recursive elimination wrapper method was used and the top ranking HVI were determined to be associated with red, 670 nm and near-infrared, 800 nm, regions. Overall, this study introduced hybrid DNN-SPEA2 as a robust mathematical tool for optimizing and using informative HVI for estimating soybean seed yield and FBIO at early growth stages, which can be employed by soybean breeders for discriminating superior genotypes in large breeding populations.
|
242
|
Scalable kernel-based SVM classification algorithm on imbalance air quality data for proficient healthcare. COMPLEX INTELL SYST 2021. [DOI: 10.1007/s40747-021-00435-5] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/25/2022]
Abstract
In the last decade, we have seen drastic changes in the air pollution level, which has become a critical environmental issue. It should be handled carefully towards making the solutions for proficient healthcare. Reducing the impact of air pollution on human health is possible only if the data is correctly classified. In numerous classification problems, we are facing the class imbalance issue. Learning from imbalanced data is always a challenging task for researchers, and from time to time, possible solutions have been developed by researchers. In this paper, we are focused on dealing with the imbalanced class distribution in a way that the classification algorithm will not compromise its performance. The proposed algorithm is based on the concept of the adjusting kernel scaling (AKS) method to deal with the multi-class imbalanced dataset. The kernel function's selection has been evaluated with the help of weighting criteria and the chi-square test. All the experimental evaluation has been performed on the sensor-based Indian Central Pollution Control Board (CPCB) dataset. The proposed algorithm achieves the highest accuracy, 99.66%, among all the classification algorithms compared, i.e., AdaBoost (59.72%), Multi-Layer Perceptron (95.71%), GaussianNB (80.87%), and SVM (96.92%). The results of the proposed algorithm are also better than the existing literature methods. It is also clear from these results that our proposed algorithm is efficient for dealing with class imbalance problems along with enhanced performance. Thus, accurate classification of air quality through our proposed algorithm will be useful for improving the existing preventive policies and will also help in enhancing the capabilities of effective emergency response in the worst pollution situation.
|
243
|
Zhao D, Wang X, Mu Y, Wang L. Experimental Study and Comparison of Imbalance Ensemble Classifiers with Dynamic Selection Strategy. ENTROPY (BASEL, SWITZERLAND) 2021; 23:822. [PMID: 34203274 PMCID: PMC8307085 DOI: 10.3390/e23070822] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 05/04/2021] [Revised: 06/18/2021] [Accepted: 06/24/2021] [Indexed: 12/12/2022]
Abstract
Imbalance ensemble classification is one of the most essential and practical strategies for improving decision performance in data analysis. There has been a growing body of literature about ensemble techniques for imbalance learning in recent years, and various extensions of imbalanced classification methods have been established from different points of view. The present study is initiated in an attempt to review the state-of-the-art ensemble classification algorithms for dealing with imbalanced datasets, offering a comprehensive analysis for incorporating the dynamic selection of base classifiers in classification. By running 14 existing ensemble algorithms incorporating dynamic selection on 56 datasets, the experimental results reveal that the classical algorithms with a dynamic selection strategy deliver a practical way to improve the classification performance for both binary and multi-class imbalanced datasets. In addition, by combining patch learning with a dynamic selection ensemble classification, a patch-ensemble classification method is designed, which utilizes the misclassified samples to train patch classifiers for increasing the diversity of base classifiers. The experimental results indicate that the designed method has a certain potential for the performance of multi-class imbalanced classification.
Affiliation(s)
- Dongxue Zhao
- School of Science, Dalian Maritime University, Dalian 116026, China
| | - Xin Wang
- School of Science, Dalian Maritime University, Dalian 116026, China
| | - Yashuang Mu
- School of Artificial Intelligence and Big Data, Henan University of Technology, Zhengzhou 450001, China
| | - Lidong Wang
- School of Science, Dalian Maritime University, Dalian 116026, China
| |
|
244
|
On the Improvement of the Isolation Forest Algorithm for Outlier Detection with Streaming Data. ELECTRONICS 2021. [DOI: 10.3390/electronics10131534] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/16/2022]
Abstract
In recent years, detecting anomalies in real-world computer networks has become a more and more challenging task due to the steady increase of high-volume, high-speed and high-dimensional streaming data, for which ground truth information is not available. Efficient detection schemes applied on networked embedded devices need to be fast and memory-constrained, and must be capable of dealing with concept drifts when they occur. Different approaches for unsupervised online outlier detection have been designed to deal with these circumstances in order to reliably detect malicious activity. In this paper, we introduce a novel framework called PCB-iForest, which, in its generalized form, is able to incorporate any ensemble-based online outlier detection (OD) method to function on streaming data. Carefully engineered requirements are compared to the most popular state-of-the-art online methods with an in-depth focus on variants based on the widely accepted isolation forest algorithm, thereby highlighting the lack of a flexible and efficient solution which is satisfied by PCB-iForest. Therefore, we integrate two variants into PCB-iForest—an isolation forest improvement called extended isolation forest and a classic isolation forest variant equipped with the functionality to score features according to their contributions to a sample’s anomalousness. Extensive experiments were performed on 23 different multi-disciplinary and security-related real-world datasets in order to comprehensively evaluate the performance of our implementation compared with off-the-shelf methods. The discussion of results, including AUC, F1 score and averaged execution time metric, shows that PCB-iForest clearly outperformed the state-of-the-art competitors in 61% of cases and even achieved more promising results in terms of the tradeoff between classification and computational costs.
|
245
|
Lin E, Kuo PH, Lin WY, Liu YL, Yang AC, Tsai SJ. Prediction of Probable Major Depressive Disorder in the Taiwan Biobank: An Integrated Machine Learning and Genome-Wide Analysis Approach. J Pers Med 2021; 11:597. [PMID: 34202750 PMCID: PMC8308113 DOI: 10.3390/jpm11070597] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/04/2021] [Revised: 06/14/2021] [Accepted: 06/22/2021] [Indexed: 12/16/2022] Open
Abstract
In light of recent advancements in machine learning, personalized medicine using predictive algorithms serves as an essential paradigmatic methodology. Our goal was to explore an integrated machine learning and genome-wide analysis approach which targets the prediction of probable major depressive disorder (MDD) using 9828 individuals in the Taiwan Biobank. In our analysis, we reported a genome-wide significant association with probable MDD that has not been previously identified: FBN1 on chromosome 15. Furthermore, we pinpointed 17 single nucleotide polymorphisms (SNPs) which show evidence of both associations with probable MDD and potential roles as expression quantitative trait loci (eQTLs). To predict the status of probable MDD, we established prediction models with random undersampling and synthetic minority oversampling using 17 eQTL SNPs and eight clinical variables. We utilized five state-of-the-art models: logistic ridge regression, support vector machine, C4.5 decision tree, LogitBoost, and random forests. Our data revealed that random forests had the highest performance (area under curve = 0.8905 ± 0.0088; repeated 10-fold cross-validation) among the predictive algorithms to infer complex correlations between biomarkers and probable MDD. Our study suggests that an integrated machine learning and genome-wide analysis approach may offer an advantageous method to establish bioinformatics tools for discriminating MDD patients from healthy controls.
Affiliation(s)
- Eugene Lin
  - Department of Biostatistics, University of Washington, Seattle, WA 98195, USA
  - Department of Electrical & Computer Engineering, University of Washington, Seattle, WA 98195, USA
  - Graduate Institute of Biomedical Sciences, China Medical University, Taichung 40402, Taiwan
- Po-Hsiu Kuo
  - Department of Public Health, Institute of Epidemiology and Preventive Medicine, National Taiwan University, Taipei 10617, Taiwan
- Wan-Yu Lin
  - Department of Public Health, Institute of Epidemiology and Preventive Medicine, National Taiwan University, Taipei 10617, Taiwan
- Yu-Li Liu
  - Center for Neuropsychiatric Research, National Health Research Institutes, Miaoli County 35053, Taiwan
- Albert C. Yang
  - Division of Interdisciplinary Medicine and Biotechnology, Beth Israel Deaconess Medical Center/Harvard Medical School, Boston, MA 02215, USA
  - Institute of Brain Science, National Yang Ming Chiao Tung University, Taipei 112304, Taiwan
- Shih-Jen Tsai
  - Department of Psychiatry, Taipei Veterans General Hospital, Taipei 11217, Taiwan
  - Division of Psychiatry, National Yang Ming Chiao Tung University, Taipei 112304, Taiwan
|
246
|
Gnip P, Vokorokos L, Drotár P. Selective oversampling approach for strongly imbalanced data. PeerJ Comput Sci 2021; 7:e604. [PMID: 34239981 PMCID: PMC8237317 DOI: 10.7717/peerj-cs.604] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Received: 03/05/2021] [Accepted: 05/31/2021] [Indexed: 06/03/2023]
Abstract
Challenges posed by imbalanced data are encountered in many real-world applications. One possible approach to improving classifier performance on imbalanced data is oversampling. In this paper, we propose a new selective oversampling approach (SOA) that first isolates the most representative samples from minority classes by using an outlier detection technique and then uses these samples for synthetic oversampling. We show that the proposed approach improves the performance of two state-of-the-art oversampling methods, namely, the synthetic minority oversampling technique and adaptive synthetic sampling. The prediction performance is evaluated on four synthetic datasets and four real-world datasets, and the proposed SOA methods always matched or outperformed the other oversampling methods considered.
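The two-stage idea described in this abstract (filter the minority class first, then synthesize from what remains) can be sketched as follows. This is not the authors' SOA implementation: the abstract does not name the outlier detector, so a simple distance-to-centroid z-score stands in for it, and the oversampling step is a simplified SMOTE-style interpolation toward the single nearest neighbour. All function names and the `z_max` cutoff are hypothetical.

```python
import math
import random


def centroid(points):
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def filter_outliers(points, z_max=1.5):
    """Keep points whose centroid distance is at most z_max std devs above the mean distance.

    Stand-in outlier detector (assumption), not the one used in the paper.
    """
    c = centroid(points)
    dists = [math.dist(p, c) for p in points]
    mean = sum(dists) / len(dists)
    std = math.sqrt(sum((d - mean) ** 2 for d in dists) / len(dists))
    if std == 0:
        return list(points)
    return [p for p, d in zip(points, dists) if (d - mean) / std <= z_max]


def interpolate_oversample(points, n_new, seed=0):
    """SMOTE-style step: each synthetic point lies between a sample and its nearest neighbour."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(points)
        neighbour = min(
            (p for p in points if p is not base),
            key=lambda p: math.dist(p, base),
        )
        t = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + t * (q - b) for b, q in zip(base, neighbour)])
    return synthetic


# Toy minority class; the last point is a deliberate outlier.
minority = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.1], [1.1, 1.2], [8.0, 8.0]]
representative = filter_outliers(minority)
new_points = interpolate_oversample(representative, n_new=6)
```

Because the outlier is removed before interpolation, no synthetic point is generated on a segment that leads toward it.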
Affiliation(s)
- Peter Gnip
  - Department of Computers and Informatics, Technical University of Košice, Slovak Republic
- Liberios Vokorokos
  - Department of Computers and Informatics, Technical University of Košice, Slovak Republic
- Peter Drotár
  - Department of Computers and Informatics, Technical University of Košice, Slovak Republic
|
247
|
Tang J, Li J, Xu W, Tian Y, Ju X, Zhang J. Robust cost-sensitive kernel method with Blinex loss and its applications in credit risk evaluation. Neural Netw 2021; 143:327-344. [PMID: 34182234 DOI: 10.1016/j.neunet.2021.06.016] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 01/06/2021] [Revised: 05/10/2021] [Accepted: 06/10/2021] [Indexed: 10/21/2022]
Abstract
Credit risk evaluation is a crucial yet challenging problem in financial analysis. It can not only help institutions reduce risk and ensure profitability, but also improve consumers' fair practices. Data-driven algorithms, such as artificial intelligence techniques, regard the evaluation as a classification problem and aim to classify transactions as default or non-default. Since non-default samples greatly outnumber default samples, it is a typical imbalanced learning problem and each class or each sample needs special treatment. Numerous data-level, algorithm-level and hybrid methods have been presented, and cost-sensitive support vector machines (CSSVMs) are representative algorithm-level methods. Based on the minimization of symmetric and unbounded loss functions, CSSVMs impose higher penalties on the misclassification costs of minority instances using domain-specific parameters. However, such loss functions, used as error measurements, do not have an obvious cost-sensitive generalization. In this paper, we propose a robust cost-sensitive kernel method with Blinex loss (CSKB), which can be applied in credit risk evaluation. By inheriting the elegant merits of the Blinex loss function, i.e., asymmetry and boundedness, CSKB not only flexibly controls distinct costs for both classes, but also enjoys noise robustness. As a data-driven decision-making paradigm of credit risk evaluation, CSKB can achieve a "win-win" situation for both the financial institutions and consumers. We solve linear and nonlinear CSKB by the Nesterov accelerated gradient algorithm and the Pegasos algorithm, respectively. Moreover, the generalization capability of CSKB is theoretically analyzed. Comprehensive experiments on synthetic, UCI and credit risk evaluation datasets demonstrate that CSKB compares favorably with other benchmark methods in terms of various measures.
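The two properties the abstract attributes to the Blinex loss, asymmetry and boundedness, can be checked numerically. The sketch below uses the commonly cited bounded-LINEX form, in which a LINEX value L is mapped to L / (1 + λL). The parameter names `a`, `b`, and `lam` and their default values are assumptions, and the abstract does not say how CSKB feeds margins or class costs into the loss, so this shows only the loss function itself.

```python
import math


def linex(u, a=1.0, b=1.0):
    """LINEX loss: asymmetric around u = 0 but unbounded."""
    return b * (math.exp(a * u) - a * u - 1.0)


def blinex(u, a=1.0, b=1.0, lam=1.0):
    """Bounded LINEX: maps the LINEX value L to L / (1 + lam * L), bounded above by 1 / lam."""
    base = linex(u, a, b)
    return base / (1.0 + lam * base)


loss_pos = blinex(1.0)    # error of +1
loss_neg = blinex(-1.0)   # error of -1, penalised less than +1 when a > 0
loss_far = blinex(50.0)   # very large error, still no larger than 1 / lam
```

The bound is what gives robustness to noise: a single badly mislabelled sample cannot contribute more than 1 / `lam` to the objective, whereas an unbounded loss lets it dominate.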
Affiliation(s)
- Jingjing Tang
  - School of Business Administration, Faculty of Business Administration, Southwestern University of Finance and Economics, Chengdu 611130, China
- Jiahui Li
  - School of Business Administration, Faculty of Business Administration, Southwestern University of Finance and Economics, Chengdu 611130, China
- Weiqi Xu
  - School of Business Administration, Faculty of Business Administration, Southwestern University of Finance and Economics, Chengdu 611130, China
- Yingjie Tian
  - School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100190, China
  - Research Center on Fictitious Economy and Data Science, Chinese Academy of Sciences, Beijing 100190, China
  - Key Laboratory of Big Data Mining and Knowledge Management, Chinese Academy of Sciences, Beijing 100190, China
- Xuchan Ju
  - College of Mathematics and Statistics, Shenzhen University, Shenzhen 518060, China
- Jie Zhang
  - Alibaba Group, Beijing 100102, China
|
248
|
Multi-Nyström Method Based on Multiple Kernel Learning for Large Scale Imbalanced Classification. COMPUTATIONAL INTELLIGENCE AND NEUROSCIENCE 2021; 2021:9911871. [PMID: 34234824 PMCID: PMC8216788 DOI: 10.1155/2021/9911871] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/09/2021] [Accepted: 05/27/2021] [Indexed: 11/17/2022]
Abstract
Extensions of kernel methods for class imbalance problems have been extensively studied. Although they work well in coping with nonlinear problems, the high computation and memory costs severely limit their application to real-world imbalanced tasks. The Nyström method is an effective technique to scale kernel methods. However, the standard Nyström method needs to sample a sufficiently large number of landmark points to ensure an accurate approximation, which seriously affects its efficiency. In this study, we propose a multi-Nyström method based on mixtures of Nyström approximations to avoid the explosion of the subkernel matrix, while the optimization of the mixture weights is embedded into the model training process by multiple kernel learning (MKL) algorithms to yield a more accurate low-rank approximation. Moreover, we select subsets of landmark points according to the imbalance distribution to reduce the model's sensitivity to skewness. We also provide a kernel stability analysis of our method and show that the model solution error is bounded by weighted approximation errors, which can help us improve the learning process. Extensive experiments on several large-scale datasets show that our method can achieve a higher classification accuracy and a dramatic speedup of MKL algorithms.
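For orientation, the standard Nyström approximation and one plausible reading of a "mixture of Nyström approximations" can be written as below. The first formula is the textbook low-rank form. The second is an assumed way to express the mixture: the weight symbols, the number of landmark subsets P, and the non-negativity constraint are illustrative choices, not details taken from the paper.

```latex
% Standard Nystrom approximation of an n x n kernel matrix K from m landmark points:
%   C is the n x m block of K (all points vs. landmarks),
%   W is the m x m block of K (landmarks vs. landmarks), W^{+} its pseudo-inverse.
\tilde{K} = C \, W^{+} C^{\top}

% Assumed mixture over P landmark subsets with weights \mu_p
% (to be learned jointly with the classifier, e.g. by MKL):
\tilde{K}_{\mathrm{mix}} = \sum_{p=1}^{P} \mu_p \, C_p W_p^{+} C_p^{\top},
\qquad \mu_p \ge 0
```

Under this reading, each term needs only an n × m_p block and a small m_p × m_p pseudo-inverse, so several small landmark subsets can stand in for one large one.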
|
249
|
Wani MA, Roy KK. Development and validation of consensus machine learning-based models for the prediction of novel small molecules as potential anti-tubercular agents. Mol Divers 2021; 26:1345-1356. [PMID: 34110578 DOI: 10.1007/s11030-021-10238-y] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 02/27/2021] [Accepted: 05/27/2021] [Indexed: 11/30/2022]
Abstract
Tuberculosis (TB) is an infectious disease and the leading cause of death globally. The rapidly emerging cases of drug resistance among pathogenic mycobacteria pose a global threat and underscore the need for new drug discovery and development. However, because drug discovery and development are commonly lengthy and costly processes, strategic use of cutting-edge machine learning (ML) algorithms may help reduce both the cost and time involved. Considering the urgency of new drugs for TB, we have attempted to develop predictive ML-based models useful in the selection of novel potential small molecules for subsequent in vitro validation. For this purpose, we used the GlaxoSmithKline (GSK) TCAMS TB dataset comprising a total of 776 hits that were made publicly available to the wider scientific community through the ChEMBL Neglected Tropical Diseases (ChEMBL-NTD) database. After exploring different ML classifiers, viz. decision trees (DT), support vector machine (SVM), random forest (RF), Bernoulli Naive Bayes (BNB), K-nearest neighbors (k-NN), and linear logistic regression (LLR), and ensemble learning models (bagging and Adaboost) for training the model using the GSK dataset, we arrived at the three best models, viz. the Adaboost decision tree (ABDT), RF classifier, and k-NN models, which gave the top prediction results for both the training and test sets. However, during the prediction of the external set of known anti-tubercular compounds/drugs, we found that each of these models had some limitations. The ABDT model correctly predicted 22 molecules as actives, while both the RF and k-NN models predicted 18 molecules correctly as actives; a number of molecules were predicted as actives by two of these models, while the third model predicted these compounds as inactives. Therefore, we concluded that when deciding the anti-tubercular potential of a new molecule, one should rely on consensus predictions from these three models; this may lessen the attrition rate during in vitro validation. We believe that this study may assist the wider anti-tuberculosis research community by providing a platform for predicting small molecules with subsequent validation for drug discovery and development.
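A consensus rule over the three models can be sketched as a simple vote. The abstract does not say whether "consensus" means a majority or full agreement, so the `min_votes` parameter below is an assumption that covers both readings; the function name and the predictions are made up for illustration.

```python
def consensus_predict(predictions_by_model, min_votes=2):
    """Label a molecule active (1) when at least min_votes models call it active.

    Hypothetical helper; min_votes=2 is a majority of three models,
    min_votes=3 requires all three to agree.
    """
    n_items = len(next(iter(predictions_by_model.values())))
    consensus = []
    for i in range(n_items):
        votes = sum(preds[i] for preds in predictions_by_model.values())
        consensus.append(1 if votes >= min_votes else 0)
    return consensus


# Made-up predictions for four molecules (1 = active, 0 = inactive).
predictions = {
    "ABDT": [1, 1, 0, 0],
    "RF": [1, 0, 1, 0],
    "kNN": [1, 1, 0, 0],
}
majority = consensus_predict(predictions, min_votes=2)
unanimous = consensus_predict(predictions, min_votes=3)
```

The second molecule shows the case the abstract describes: two models call it active and one does not, so it passes a majority rule but fails a unanimous one.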
Affiliation(s)
- Mushtaq Ahmad Wani
  - Department of Pharmacoinformatics, National Institute of Pharmaceutical Education and Research, Kolkata, West Bengal, 700054, India
- Kuldeep K Roy
  - Department of Pharmaceutical Technology, School of Medical Sciences, Adamas University, Kolkata, West Bengal, 700126, India
|
250
|
Esposito C, Landrum GA, Schneider N, Stiefl N, Riniker S. GHOST: Adjusting the Decision Threshold to Handle Imbalanced Data in Machine Learning. J Chem Inf Model 2021; 61:2623-2640. [PMID: 34100609 DOI: 10.1021/acs.jcim.1c00160] [Citation(s) in RCA: 46] [Impact Index Per Article: 11.5] [Indexed: 12/21/2022]
Abstract
Machine learning classifiers trained on class-imbalanced data are prone to overpredict the majority class. This leads to a larger misclassification rate for the minority class, which in many real-world applications is the class of interest. For binary data, the classification threshold is set by default to 0.5, which, however, is often not ideal for imbalanced data. Adjusting the decision threshold is a good strategy to deal with the class imbalance problem. In this work, we present two different automated procedures for the selection of the optimal decision threshold for imbalanced classification. A major advantage of our procedures is that they do not require retraining of the machine learning models or resampling of the training data. The first approach is specific to random forest (RF), while the second approach, named GHOST, can potentially be applied to any machine learning classifier. We tested these procedures on 138 public drug discovery data sets containing structure-activity data for a variety of pharmaceutical targets. We show that both thresholding methods significantly improve the performance of RF. We tested the use of GHOST with four different classifiers in combination with two molecular descriptors, and we found that most classifiers benefit from threshold optimization. GHOST also outperformed other strategies, including random undersampling and conformal prediction. Finally, we show that our thresholding procedures can be effectively applied to real-world drug discovery projects, where the imbalance and characteristics of the data vary greatly between the training and test sets.
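The general idea of moving the decision threshold away from 0.5 without retraining can be sketched as a small grid search over predicted probabilities. This is a generic illustration, not the GHOST procedure itself: the choice of Cohen's kappa as the selection metric, the candidate grid, the helper names, and the toy probabilities are all assumptions.

```python
def cohen_kappa(y_true, y_pred):
    """Cohen's kappa for binary labels (1 = minority/active, 0 = majority)."""
    n = len(y_true)
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    observed = (tp + tn) / n
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    if expected == 1.0:
        return 0.0
    return (observed - expected) / (1.0 - expected)


def pick_threshold(y_true, probs, candidates):
    """Return the candidate threshold with the highest kappa, and that kappa."""
    best_t, best_k = None, float("-inf")
    for t in candidates:
        preds = [1 if p >= t else 0 for p in probs]
        k = cohen_kappa(y_true, preds)
        if k > best_k:
            best_t, best_k = t, k
    return best_t, best_k


# Toy scores from an already-trained classifier: 8 majority and 2 minority samples.
y_true = [0] * 8 + [1] * 2
probs = [0.05, 0.10, 0.12, 0.15, 0.22, 0.24, 0.26, 0.33, 0.35, 0.45]
default_kappa = cohen_kappa(y_true, [1 if p >= 0.5 else 0 for p in probs])
best_threshold, best_kappa = pick_threshold(y_true, probs, [0.1, 0.2, 0.3, 0.4, 0.5])
```

In this toy case no minority sample scores above 0.5, so the default threshold predicts the majority class for everything, while a lower threshold recovers both minority samples at the cost of one false positive. In practice the threshold would be chosen on training or validation predictions, not on the test set.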
Affiliation(s)
- Carmen Esposito
  - Laboratory of Physical Chemistry, ETH Zurich, Vladimir-Prelog-Weg 2, 8093 Zurich, Switzerland
- Gregory A Landrum
  - Laboratory of Physical Chemistry, ETH Zurich, Vladimir-Prelog-Weg 2, 8093 Zurich, Switzerland
  - T5 Informatics GmbH, Spalenring 11, 4055 Basel, Switzerland
- Nadine Schneider
  - Novartis Institutes for BioMedical Research, Novartis Pharma AG, Novartis Campus, 4002 Basel, Switzerland
- Nikolaus Stiefl
  - Novartis Institutes for BioMedical Research, Novartis Pharma AG, Novartis Campus, 4002 Basel, Switzerland
- Sereina Riniker
  - Laboratory of Physical Chemistry, ETH Zurich, Vladimir-Prelog-Weg 2, 8093 Zurich, Switzerland
|