351
Yang J, Wu X, Liang J, Sun X, Cheng MM, Rosin PL, Wang L. Self-Paced Balance Learning for Clinical Skin Disease Recognition. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2020; 31:2832-2846. [PMID: 31199274 DOI: 10.1109/tnnls.2019.2917524] [Citation(s) in RCA: 31] [Impact Index Per Article: 6.2] [Indexed: 06/09/2023]
Abstract
Class imbalance is a challenging problem in many classification tasks. It induces biased classification results for minority classes that contain fewer training samples than others. Most existing approaches aim to remedy the imbalanced number of instances among categories by resampling the majority and minority classes accordingly. However, the imbalanced level of difficulty of recognizing different categories is also crucial, especially when distinguishing samples across many classes. For example, in the task of clinical skin disease recognition, several rare diseases have a small number of training samples, but they are easy to diagnose because of their distinct visual properties. On the other hand, some common skin diseases, e.g., eczema, are hard to recognize due to the lack of special symptoms. To address this problem, we propose a self-paced balance learning (SPBL) algorithm in this paper. Specifically, we introduce a comprehensive metric termed the complexity of image category that is a combination of both sample number and recognition difficulty. First, the complexity is initialized using the model of the first pace, where the pace indicates one iteration in the self-paced learning paradigm. We then assign each class a penalty weight that is larger for more complex categories and smaller for easier ones, after which the curriculum is reconstructed by rearranging the training samples. Consequently, the model can iteratively learn discriminative representations via balancing the complexity in each pace. Experimental results on the SD-198 and SD-260 benchmark data sets demonstrate that the proposed SPBL algorithm performs favorably against the state-of-the-art methods. We also demonstrate the effectiveness of the SPBL algorithm's generalization capacity on various tasks, such as indoor scene image recognition and object classification.
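The abstract describes a per-class complexity that combines sample number with recognition difficulty, and a penalty weight that grows with that complexity. It does not give the formula, so the following is only an illustrative sketch: the linear blend, the `alpha` parameter, and the function names `class_complexity` and `penalty_weights` are assumptions, not the SPBL definitions.

```python
def class_complexity(counts, error_rates, alpha=0.5):
    """Blend class scarcity and recognition difficulty into one score per class.

    Assumed, illustrative formula (not the SPBL paper's definition):
    complexity = alpha * scarcity + (1 - alpha) * difficulty.
    """
    max_count = max(counts.values())
    complexity = {}
    for label, n in counts.items():
        scarcity = 1.0 - n / max_count   # 0 for the largest class
        difficulty = error_rates[label]  # error rate of the current pace's model
        complexity[label] = alpha * scarcity + (1.0 - alpha) * difficulty
    return complexity


def penalty_weights(complexity):
    """Rescale scores so the weights average to 1; complex classes weigh more."""
    mean = sum(complexity.values()) / len(complexity)
    if mean == 0:
        return {label: 1.0 for label in complexity}
    return {label: score / mean for label, score in complexity.items()}


# Hypothetical numbers: a common but hard class and a rare but easy class.
counts = {"common_hard": 200, "rare_easy": 20}
error_rates = {"common_hard": 0.6, "rare_easy": 0.1}
print(penalty_weights(class_complexity(counts, error_rates)))
```

In a self-paced loop, the error rates would be re-estimated after each pace and the weights recomputed before the next one.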
352
Diaz-Vico D, Dorronsoro JR. Deep Least Squares Fisher Discriminant Analysis. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2020; 31:2752-2763. [PMID: 30990447 DOI: 10.1109/tnnls.2019.2906302] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Indexed: 06/09/2023]
Abstract
While being one of the first and most elegant tools for dimensionality reduction, Fisher linear discriminant analysis (FLDA) is not currently considered among the top methods for feature extraction or classification. In this paper, we review two recent approaches to FLDA, namely, least squares Fisher discriminant analysis (LSFDA) and regularized kernel FDA (RKFDA), and propose deep FDA (DFDA), a straightforward nonlinear extension of LSFDA that takes advantage of recent advances in deep neural networks. We compare the performance of RKFDA and DFDA on a large number of two-class and multiclass problems, many of them involving class-imbalanced data sets and some having quite large sample sizes, using the area under the receiver operating characteristic (ROC) curve of each classifier as the performance measure. The classification performance of both methods is often very similar and particularly good on imbalanced problems, but building DFDA models is considerably faster than building RKFDA models, particularly in problems with quite large sample sizes.
353
Bao F, Deng Y, Kong Y, Ren Z, Suo J, Dai Q. Learning Deep Landmarks for Imbalanced Classification. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2020; 31:2691-2704. [PMID: 31395564 DOI: 10.1109/tnnls.2019.2927647] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Indexed: 06/10/2023]
Abstract
We introduce a deep imbalanced learning framework called learning DEep Landmarks in laTent spAce (DELTA). Our work is inspired by shallow imbalanced learning approaches that rebalance imbalanced samples before using them to train a discriminative classifier. DELTA advances existing works by introducing the new concept of rebalancing samples in a deeply transformed latent space, where latent points exhibit several desired properties including compactness and separability. In general, DELTA simultaneously conducts feature learning, sample rebalancing, and discriminative learning in a joint, end-to-end framework. The framework is readily integrated with other sophisticated learning concepts including latent points oversampling and ensemble learning. More importantly, DELTA offers the possibility to conduct imbalanced learning with the assistance of a structured feature extractor. We verify the effectiveness of DELTA not only on several benchmark data sets but also on more challenging real-world tasks including click-through-rate (CTR) prediction, multi-class cell type classification, and sentiment analysis with sequential inputs.
354
Joint Learning of Temporal Models to Handle Imbalanced Data for Human Activity Recognition. APPLIED SCIENCES-BASEL 2020. [DOI: 10.3390/app10155293] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Indexed: 11/17/2022]
Abstract
Human activity recognition has become essential to a wide range of applications, such as smart home monitoring, health care, and surveillance. However, it is challenging to deliver a sufficiently robust human activity recognition system from noisy raw sensor data in a smart environment setting. Moreover, imbalanced human activity datasets with less frequent activities create extra challenges for accurate activity recognition. Deep learning algorithms have achieved promising results on balanced datasets, but their performance on imbalanced datasets without explicit algorithm design cannot be guaranteed. Therefore, we aim to realise an activity recognition system using multi-modal sensors to address the issue of class imbalance in deep learning and improve recognition accuracy. This paper proposes a joint diverse temporal learning framework using Long Short Term Memory and one-dimensional Convolutional Neural Network models to improve human activity recognition, especially for less represented activities. We extensively evaluate the proposed method for Activities of Daily Living recognition using binary sensor datasets. A comparative study on five smart home datasets demonstrates that our proposed approach outperforms the existing individual temporal models and their hybridization. This is particularly the case for minority classes, alongside a reasonable improvement on the majority classes of human activities.
355
Abstract
Class overlap and class imbalance are two data complexities that challenge the design of effective classifiers in Pattern Recognition and Data Mining as they may cause a significant loss in performance. Several solutions have been proposed to face both data difficulties, but most of these approaches tackle each problem separately. In this paper, we propose a two-stage under-sampling technique that combines the DBSCAN clustering algorithm to remove noisy samples and clean the decision boundary with a minimum spanning tree algorithm to face the class imbalance, thus handling class overlap and imbalance simultaneously with the aim of improving the performance of classifiers. An extensive experimental study shows a significantly better behavior of the new algorithm as compared to 12 state-of-the-art under-sampling methods using three standard classification models (nearest neighbor rule, J48 decision tree, and support vector machine with a linear kernel) on both real-life and synthetic databases.
356
Vuttipittayamongkol P, Elyan E. Improved Overlap-based Undersampling for Imbalanced Dataset Classification with Application to Epilepsy and Parkinson’s Disease. Int J Neural Syst 2020; 30:2050043. [DOI: 10.1142/s0129065720500434] [Citation(s) in RCA: 25] [Impact Index Per Article: 5.0] [Indexed: 11/18/2022]
Abstract
Classification of imbalanced datasets has attracted substantial research interest over the past decades. Imbalanced datasets are common in several domains such as health, finance, security and others. A wide range of solutions to handle imbalanced datasets focus mainly on the class distribution problem and aim at providing more balanced datasets by means of resampling. However, existing literature shows that class overlap has a higher negative impact on the learning process than class distribution. In this paper, we propose overlap-based undersampling methods for maximizing the visibility of the minority class instances in the overlapping region. This is achieved by the use of soft clustering and the elimination threshold that is adaptable to the overlap degree to identify and eliminate negative instances in the overlapping region. For more accurate clustering and detection of overlapped negative instances, the presence of the minority class at the borderline areas is emphasized by means of oversampling. Extensive experiments using simulated and real-world datasets covering a wide range of imbalance and overlap scenarios including extreme cases were carried out. Results show significant improvement in sensitivity and competitive performance with well-established and state-of-the-art methods.
Affiliation(s)
- Eyad Elyan
  - School of Computing Science and Digital Media, Robert Gordon University, Aberdeen, AB10 7GJ, UK
357
Kirtania R, Mitra S, Shankar BU. A novel adaptive k-NN classifier for handling imbalance: Application to brain MRI. INTELL DATA ANAL 2020. [DOI: 10.3233/ida-194647] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 11/15/2022]
358
Abstract
Terrorist attacks have become one of the most severe threats to national public security and world peace. Ascertaining whether a terrorist attack will threaten the lives of innocent people is vital in dealing with such attacks and has a profound impact on the optimal allocation of resources. For this purpose, we propose an XGBoost-based casualty prediction algorithm, namely RP-GA-XGBoost, to predict whether terrorist attacks will cause casualties among innocent civilians. In the proposed RP-GA-XGBoost algorithm, a novel method that incorporates random forest (RF) and principal component analysis (PCA) is devised for selecting features, and a genetic algorithm is used to tune the hyperparameters of XGBoost. The proposed method is evaluated on the public dataset (Global Terrorism Database, GTD) and the terrorist attack dataset in China. Experimental results demonstrate that the proposed algorithm achieves an area under curve (AUC) of 87.00% and an accuracy of 86.33% for the public dataset, and a sensitivity of 94.00% and an AUC of 94.90% for the terrorist attack dataset in China, which demonstrates the superiority and higher generalization ability of the proposed algorithm. Our study, to the best of our knowledge, is the first to apply machine learning in the management of terrorist attacks, which can provide early warning and decision support information for terrorist attack management.
359
Gado JE, Beckham GT, Payne CM. Improving Enzyme Optimum Temperature Prediction with Resampling Strategies and Ensemble Learning. J Chem Inf Model 2020; 60:4098-4107. [DOI: 10.1021/acs.jcim.0c00489] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Indexed: 01/24/2023]
Affiliation(s)
- Japheth E. Gado
  - Department of Chemical and Materials Engineering, University of Kentucky, Lexington, Kentucky 40506, United States
  - National Bioenergy Center, National Renewable Energy Laboratory, Golden, Colorado 80401, United States
- Gregg T. Beckham
  - National Bioenergy Center, National Renewable Energy Laboratory, Golden, Colorado 80401, United States
- Christina M. Payne
  - Department of Chemical and Materials Engineering, University of Kentucky, Lexington, Kentucky 40506, United States
360
Song Y, Wang Y, Ye X, Wang D, Yin Y, Wang Y. Multi-view ensemble learning based on distance-to-model and adaptive clustering for imbalanced credit risk assessment in P2P lending. Inf Sci (N Y) 2020. [DOI: 10.1016/j.ins.2020.03.027] [Citation(s) in RCA: 21] [Impact Index Per Article: 4.2] [Indexed: 10/24/2022]
361
Ilhan HO, Serbes G, Aydin N. Automated sperm morphology analysis approach using a directional masking technique. Comput Biol Med 2020; 122:103845. [DOI: 10.1016/j.compbiomed.2020.103845] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Received: 04/09/2020] [Revised: 06/03/2020] [Accepted: 06/03/2020] [Indexed: 11/16/2022]
362
Ye X, Li H, Imakura A, Sakurai T. An oversampling framework for imbalanced classification based on Laplacian eigenmaps. Neurocomputing 2020. [DOI: 10.1016/j.neucom.2020.02.081] [Citation(s) in RCA: 16] [Impact Index Per Article: 3.2] [Indexed: 11/27/2022]
363
Efficient matrixized classification learning with separated solution process. Neural Comput Appl 2020. [DOI: 10.1007/s00521-019-04595-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/25/2022]
364
Cao Z, Du W, Li G, Cao H. DEEPSMP: A deep learning model for predicting the ectodomain shedding events of membrane proteins. J Bioinform Comput Biol 2020; 18:2050017. [PMID: 32576054 DOI: 10.1142/s0219720020500171] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 11/18/2022]
Abstract
Membrane proteins play essential roles in modern medicine. In recent studies, some membrane proteins involved in ectodomain shedding events have been reported as potential drug targets and biomarkers of some serious diseases. However, there are few effective tools for identifying the shedding events of membrane proteins, so it is necessary to design an effective tool for predicting them. In this study, we design an end-to-end prediction model using deep neural networks with long short-term memory (LSTM) units and an attention mechanism to predict the ectodomain shedding events of membrane proteins from sequence information alone. First, the evolutionary profiles are encoded from the original sequences of these proteins by Position-Specific Iterated BLAST (PSI-BLAST) on the Uniref50 database. Then, the LSTM units, which contain memory cells, are used to hold information from past inputs to the network, and the attention mechanism is applied to detect sorting signals in proteins regardless of their position in the sequence. Finally, a fully connected dense layer and a softmax layer are used to obtain the final prediction results. Additionally, we try to reduce overfitting by using dropout, L2 regularization, and bagging ensemble learning in the model training process. To ensure a fair performance comparison, we first use cross-validation on the training dataset obtained from an existing paper. The average accuracy and area under the receiver operating characteristic curve (AUC) of five-fold cross-validation are 81.19% and 0.835 using our proposed model, compared to 75% and 0.78 by a previously published tool, respectively. To better validate the proposed model, we also evaluate its performance on an independent test dataset. The accuracy, sensitivity, and specificity are 83.14%, 84.08%, and 81.63% using our proposed model, compared to 70.20%, 71.97%, and 67.35% by the existing model. The experimental results validate that the proposed model can be regarded as a general tool for predicting ectodomain shedding events of membrane proteins. The pipeline of the model and prediction results can be accessed at the following URL: http://www.csbg-jlu.info/DeepSMP/.
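The accuracy, sensitivity, and specificity reported in this abstract follow the standard confusion-matrix definitions. A minimal sketch of those definitions is given here; the function name `binary_metrics` and the example counts are hypothetical and are not taken from the paper.

```python
def binary_metrics(tp, fn, tn, fp):
    """Standard binary-classification metrics from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fn + tn + fp),  # all correct / all samples
        "sensitivity": tp / (tp + fn),                # true positive rate
        "specificity": tn / (tn + fp),                # true negative rate
    }


# Hypothetical counts for illustration only.
print(binary_metrics(tp=8, fn=2, tn=9, fp=1))
```

On imbalanced test sets, sensitivity and specificity are worth reading alongside accuracy, because accuracy alone can be dominated by the majority class.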
Affiliation(s)
- Zhongbo Cao
  - Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun 130012, P. R. China
  - School of Management Science and Information Engineering, Jilin University of Finance and Economics, Changchun 130117, P. R. China
- Wei Du
  - Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun 130012, P. R. China
- Gaoyang Li
  - Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun 130012, P. R. China
- Huansheng Cao
  - Center for Fundamental and Applied Microbiomics, Biodesign Institute, Arizona State University, Tempe, AZ 85287, USA
365
Pei W, Xue B, Shang L, Zhang M. Genetic programming for high-dimensional imbalanced classification with a new fitness function and program reuse mechanism. Soft comput 2020. [DOI: 10.1007/s00500-020-05056-7] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Indexed: 11/25/2022]
366
Du W, Sun Y, Li G, Cao H, Pang R, Li Y. CapsNet-SSP: multilane capsule network for predicting human saliva-secretory proteins. BMC Bioinformatics 2020; 21:237. [PMID: 32517646 PMCID: PMC7285745 DOI: 10.1186/s12859-020-03579-2] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 03/10/2020] [Accepted: 06/01/2020] [Indexed: 01/24/2023] Open
Abstract
Background Compared with disease biomarkers in blood and urine, biomarkers in saliva have distinct advantages in clinical tests, as they can be conveniently examined through noninvasive sample collection. Therefore, identifying human saliva-secretory proteins and further detecting protein biomarkers in saliva have significant value in clinical medicine. There are only a few methods for predicting saliva-secretory proteins based on conventional machine learning algorithms, and all are highly dependent on annotated protein features. Unlike conventional machine learning algorithms, deep learning algorithms can automatically learn feature representations from input data and thus hold promise for predicting saliva-secretory proteins. Results We present a novel end-to-end deep learning model based on multilane capsule network (CapsNet) with differently sized convolution kernels to identify saliva-secretory proteins only from sequence information. The proposed model CapsNet-SSP outperforms existing methods based on conventional machine learning algorithms. Furthermore, the model performs better than other state-of-the-art deep learning architectures mostly used to analyze biological sequences. In addition, we further validate the effectiveness of CapsNet-SSP by comparison with human saliva-secretory proteins from existing studies and known salivary protein biomarkers of cancer. Conclusions The main contributions of this study are as follows: (1) an end-to-end model based on CapsNet is proposed to identify saliva-secretory proteins from the sequence information; (2) the proposed model achieves better performance and outperforms existing models; and (3) the saliva-secretory proteins predicted by our model are statistically significant compared with existing cancer biomarkers in saliva. In addition, a web server of CapsNet-SSP is developed for saliva-secretory protein identification, and it can be accessed at the following URL: http://www.csbg-jlu.info/CapsNet-SSP/. 
We believe that our model and web server will be useful for biomedical researchers who are interested in finding salivary protein biomarkers, especially when they have identified candidate proteins for analyzing diseased tissues near or distal to salivary glands using transcriptome or proteomics.
Affiliation(s)
- Wei Du
  - Key Laboratory of Symbol Computation and Knowledge Engineering of the Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun, 130012, China
- Yu Sun
  - Key Laboratory of Symbol Computation and Knowledge Engineering of the Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun, 130012, China
- Gaoyang Li
  - Key Laboratory of Symbol Computation and Knowledge Engineering of the Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun, 130012, China
- Huansheng Cao
  - Center for Fundamental and Applied Microbiomics, Biodesign Institute, Arizona State University, Tempe, AZ, 85287, USA
- Ran Pang
  - Key Laboratory of Symbol Computation and Knowledge Engineering of the Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun, 130012, China
- Ying Li
  - Key Laboratory of Symbol Computation and Knowledge Engineering of the Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun, 130012, China
367
Fujiwara K, Huang Y, Hori K, Nishioji K, Kobayashi M, Kamaguchi M, Kano M. Over- and Under-sampling Approach for Extremely Imbalanced and Small Minority Data Problem in Health Record Analysis. Front Public Health 2020; 8:178. [PMID: 32509717 PMCID: PMC7248318 DOI: 10.3389/fpubh.2020.00178] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Received: 01/01/2020] [Accepted: 04/22/2020] [Indexed: 11/23/2022] Open
Abstract
A considerable amount of health record (HR) data has been stored due to recent advances in the digitalization of medical systems. However, it is not always easy to analyze HR data, particularly when the number of persons with a target disease is too small in comparison with the population. This situation is called the imbalanced data problem. Over-sampling and under-sampling are two approaches for redressing an imbalance between minority and majority examples, which can be combined into ensemble algorithms. However, these approaches do not function when the absolute number of minority examples is small, which is called the extremely imbalanced and small minority (EISM) data problem. The present work proposes a new algorithm called boosting combined with heuristic under-sampling and distribution-based sampling (HUSDOS-Boost) to solve the EISM data problem. To make an artificially balanced dataset from the original imbalanced datasets, HUSDOS-Boost uses both under-sampling and over-sampling to eliminate redundant majority examples based on prior boosting results and to generate artificial minority examples by following the minority class distribution. The performance and characteristics of HUSDOS-Boost were evaluated through application to eight imbalanced datasets. In addition, the algorithm was applied to original clinical HR data to detect patients with stomach cancer. These results showed that HUSDOS-Boost outperformed current imbalanced data handling methods, particularly when the data are EISM. Thus, the proposed HUSDOS-Boost is a useful methodology of HR data analysis.
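The abstract says that artificial minority examples are generated by following the minority class distribution, but it does not say which distribution is fitted. The sketch below assumes independent per-feature Gaussians purely for illustration; the function name `distribution_based_oversample` and the sample values are hypothetical and this is not the HUSDOS-Boost implementation.

```python
import random
import statistics


def distribution_based_oversample(minority, n_new, seed=0):
    """Draw synthetic minority rows from per-feature Gaussians.

    Assumption: features are treated as independent and normally distributed,
    which the source abstract does not state.
    """
    rng = random.Random(seed)
    columns = list(zip(*minority))                         # one tuple per feature
    means = [statistics.mean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]     # population std dev
    return [
        [rng.gauss(mu, sigma) for mu, sigma in zip(means, stds)]
        for _ in range(n_new)
    ]


# Hypothetical minority examples with two features each.
minority_rows = [[1.0, 5.0], [3.0, 5.0], [2.0, 5.0]]
print(distribution_based_oversample(minority_rows, n_new=4, seed=1))
```

With very few minority examples the estimated means and standard deviations are themselves noisy, which is one reason the small-minority setting is hard.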
Affiliation(s)
- Koichi Fujiwara
  - Department of Material Process Engineering, Nagoya University, Nagoya, Japan
- Yukun Huang
  - Department of Systems Science, Kyoto University, Kyoto, Japan
- Kentaro Hori
  - Department of Systems Science, Kyoto University, Kyoto, Japan
- Kenichi Nishioji
  - Health Care Division, Japanese Red Cross Kyoto Daini Hospital, Kyoto, Japan
- Masao Kobayashi
  - Health Care Division, Japanese Red Cross Kyoto Daini Hospital, Kyoto, Japan
- Mai Kamaguchi
  - Health Care Division, Japanese Red Cross Kyoto Daini Hospital, Kyoto, Japan
- Manabu Kano
  - Department of Systems Science, Kyoto University, Kyoto, Japan
368
Lin E, Lin CH, Hung CC, Lane HY. An Ensemble Approach to Predict Schizophrenia Using Protein Data in the N-methyl-D-Aspartate Receptor (NMDAR) and Tryptophan Catabolic Pathways. Front Bioeng Biotechnol 2020; 8:569. [PMID: 32582679 PMCID: PMC7287032 DOI: 10.3389/fbioe.2020.00569] [Citation(s) in RCA: 21] [Impact Index Per Article: 4.2] [Received: 10/28/2019] [Accepted: 05/11/2020] [Indexed: 12/22/2022] Open
Abstract
In the wake of recent advances in artificial intelligence research, precision psychiatry using machine learning techniques represents a new paradigm. The D-amino acid oxidase (DAO) protein and its interaction partner, the D-amino acid oxidase activator (DAOA, also known as G72) protein, have been implicated as two key proteins in the N-methyl-D-aspartate receptor (NMDAR) pathway for schizophrenia. Another potential biomarker in regard to the etiology of schizophrenia is melatonin in the tryptophan catabolic pathway. To develop an ensemble boosting framework with random undersampling for determining disease status of schizophrenia, we established a prediction approach resulting from the analysis of genomic and demographic variables such as DAO levels, G72 levels, melatonin levels, age, and gender of 355 schizophrenia patients and 86 unrelated healthy individuals in the Taiwanese population. We compared our ensemble boosting framework with other state-of-the-art algorithms such as support vector machine, multilayer feedforward neural networks, logistic regression, random forests, naive Bayes, and C4.5 decision tree. The analysis revealed that the ensemble boosting model with random undersampling [area under the receiver operating characteristic curve (AUC) = 0.9242 ± 0.0652; sensitivity = 0.8580 ± 0.0770; specificity = 0.8594 ± 0.0760] performed maximally among predictive models to infer the complicated relationship between schizophrenia disease status and biomarkers. In addition, we identified a causal link between DAO and G72 protein levels in influencing schizophrenia disease status. The study indicates that the ensemble boosting framework with random undersampling may provide a suitable method to establish a tool for distinguishing schizophrenia patients from healthy controls using molecules in the NMDAR and tryptophan catabolic pathways.
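Random undersampling, as named in this abstract, discards examples from the larger class until the classes are the same size. The sketch below shows only that balancing step, under the assumption that it is applied once before training; the function name `random_undersample` and the record IDs are hypothetical, and only the group sizes (355 patients, 86 healthy individuals) come from the abstract.

```python
import random


def random_undersample(majority, minority, seed=0):
    """Return a balanced pair by sampling the majority class without replacement.

    Assumes len(majority) >= len(minority); random.sample raises ValueError otherwise.
    """
    rng = random.Random(seed)
    kept_majority = rng.sample(majority, k=len(minority))
    return kept_majority, list(minority)


# Hypothetical record IDs; only the group sizes come from the abstract.
patients = [f"patient_{i}" for i in range(355)]
controls = [f"control_{i}" for i in range(86)]
balanced_patients, balanced_controls = random_undersample(patients, controls)
print(len(balanced_patients), len(balanced_controls))
```

In a boosting ensemble, a fresh subset is often drawn for each round so that different base learners see different majority examples; whether the cited study does this is not stated in the abstract.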
Affiliation(s)
- Eugene Lin
  - Department of Biostatistics, University of Washington, Seattle, WA, United States
  - Department of Electrical & Computer Engineering, University of Washington, Seattle, WA, United States
  - Graduate Institute of Biomedical Sciences, China Medical University, Taichung, Taiwan
- Chieh-Hsin Lin
  - Graduate Institute of Biomedical Sciences, China Medical University, Taichung, Taiwan
  - Department of Psychiatry, Kaohsiung Chang Gung Memorial Hospital, Chang Gung University College of Medicine, Kaohsiung, Taiwan
  - School of Medicine, Chang Gung University, Taoyuan, Taiwan
- Chung-Chieh Hung
  - Department of Psychiatry, China Medical University Hospital, Taichung, Taiwan
- Hsien-Yuan Lane
  - Graduate Institute of Biomedical Sciences, China Medical University, Taichung, Taiwan
  - Department of Psychiatry, China Medical University Hospital, Taichung, Taiwan
  - Brain Disease Research Center, China Medical University Hospital, Taichung, Taiwan
  - Department of Psychology, College of Medical and Health Sciences, Asia University, Taichung, Taiwan
369
Nishijima M, Nieuwenhoff N, Pires R, Oliveira PR. Movie films consumption in Brazil: an analysis of support vector machine classification. AI & SOCIETY 2020. [DOI: 10.1007/s00146-019-00899-7] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 11/27/2022]
370
Yan J, Zhang Z, Lin K, Yang F, Luo X. A hybrid scheme-based one-vs-all decision trees for multi-class classification tasks. Knowl Based Syst 2020. [DOI: 10.1016/j.knosys.2020.105922] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Indexed: 10/24/2022]
371
372
Solihah B, Azhari A, Musdholifah A. Enhancement of conformational B-cell epitope prediction using CluSMOTE. PeerJ Comput Sci 2020; 6:e275. [PMID: 33816926 PMCID: PMC7924438 DOI: 10.7717/peerj-cs.275] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Received: 05/12/2019] [Accepted: 04/15/2020] [Indexed: 06/12/2023]
Abstract
BACKGROUND A conformational B-cell epitope is one of the main components of vaccine design. It contains separate segments in its sequence, which are spatially close in the antigen chain. The availability of Ag-Ab complex data in the Protein Data Bank allows for the development of predictive methods. Several epitope prediction models have also been developed, including learning-based methods. However, the performance of these models is still not optimal. The main problem in learning-based prediction models is class imbalance. METHODS This study proposes CluSMOTE, which is a combination of a cluster-based undersampling method and the Synthetic Minority Oversampling Technique. The approach is used to generate additional sample data to ensure that the dataset of the conformational epitope is balanced. The Hierarchical DBSCAN algorithm is performed to identify the clusters in the majority class. Some randomly selected data are taken from each cluster, considering the oversampling degree, and combined with the minority class data. The balanced data are utilized as the training dataset to develop a conformational epitope prediction model. Furthermore, two binary classification methods, Support Vector Machine and Decision Tree, are separately used to develop prediction models and to evaluate the performance of CluSMOTE in predicting conformational B-cell epitopes. The experiment is focused on determining the best parameters for optimal CluSMOTE performance. Two independent datasets are used to compare the proposed prediction model with state-of-the-art methods. The first and the second datasets represent the general protein and the glycoprotein antigens, respectively. RESULT The experimental results show that CluSMOTE Decision Tree outperformed the Support Vector Machine in terms of AUC and Gmean as performance measurements. The mean AUCs of CluSMOTE Decision Tree in the Kringelum and the SEPPA 3 test sets are 0.83 and 0.766, respectively. This shows that CluSMOTE Decision Tree is better than other methods in the general protein antigen, though comparable with SEPPA 3 in the glycoprotein antigen.
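The oversampling half of CluSMOTE is the Synthetic Minority Oversampling Technique, which creates a new minority point by interpolating between an existing minority point and one of its nearest minority neighbours. The sketch below illustrates only that interpolation step in plain Python; the function name `smote_samples`, the default `k`, and the example points are assumptions for illustration, and the cluster-based undersampling step is not shown.

```python
import math
import random


def smote_samples(minority, n_new, k=3, seed=0):
    """Create synthetic minority points by SMOTE-style interpolation.

    Requires at least two minority points. Neighbours are found by brute-force
    Euclidean distance, which is fine for a small illustration.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the chosen base point.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (q - b) for b, q in zip(base, neighbour)])
    return synthetic


# Hypothetical minority points on the corners of the unit square.
corners = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(smote_samples(corners, n_new=3))
```

Because every synthetic point lies on a segment between two real minority points, the new data stay inside the region already occupied by the minority class.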
Affiliation(s)
- Binti Solihah
- Department of Computer Science and Electronics, Faculty of Mathematics and Natural Sciences, Universitas Gadjah Mada, Yogyakarta, Indonesia
- Department of Informatics Engineering, Universitas Trisakti, Grogol, Jakarta Barat, Indonesia
| | - Azhari Azhari
- Department of Computer Science and Electronics, Faculty of Mathematics and Natural Sciences, Universitas Gadjah Mada, Yogyakarta, Indonesia
| | - Aina Musdholifah
- Department of Computer Science and Electronics, Faculty of Mathematics and Natural Sciences, Universitas Gadjah Mada, Yogyakarta, Indonesia
| |
|
373
|
El-Alfy ESM, Al-Azani S. Empirical study on imbalanced learning of Arabic sentiment polarity with neural word embedding. JOURNAL OF INTELLIGENT & FUZZY SYSTEMS 2020. [DOI: 10.3233/jifs-179703] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/15/2022]
Affiliation(s)
- El-Sayed M. El-Alfy
- Information and Computer Science Department, King Fahd University of Petroleum and Minerals, Dhahran, Saudi Arabia
| | - Sadam Al-Azani
- Information and Computer Science Department, King Fahd University of Petroleum and Minerals, Dhahran, Saudi Arabia
| |
|
374
|
Abstract
In the field of machine learning, an ensemble approach is often utilized as an effective means of improving on the accuracy of multiple weak base classifiers. A concern associated with these ensemble algorithms is that they can suffer from the Curse of Conflict, where a classifier’s true prediction is negated by another classifier’s false prediction during the consensus period. Another concern with the ensemble technique is that it cannot effectively mitigate the problem of Imbalanced Classification, where an ensemble classifier usually presents a similar magnitude of bias toward the same class as its imbalanced base classifiers. We propose an improved ensemble algorithm called “Sieve” that overcomes the aforementioned shortcomings through the establishment of the novel concept of Global Consensus. The proposed Sieve ensemble approach was benchmarked against various ensemble classifiers, and was trained using different ensemble algorithms with the same base classifiers. The results demonstrate that better accuracy and stability were achieved.
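The conflict described above can be made concrete with a small sketch of plain majority voting, in which one correct base classifier is outvoted by two incorrect ones. This shows only the problem; it is not the Sieve algorithm, whose Global Consensus mechanism is not specified in the abstract. The labels and votes are hypothetical.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most common label among the base classifiers' predictions."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical sample whose true class is the minority class.
true_label = "minority"
votes = ["minority", "majority", "majority"]  # one correct vote, two wrong

ensemble_label = majority_vote(votes)
# The single correct prediction is negated by the two incorrect ones.
conflict = ensemble_label != true_label
```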
|
375
|
Siddiqui MK, Morales-Menendez R, Huang X, Hussain N. A review of epileptic seizure detection using machine learning classifiers. Brain Inform 2020; 7:5. [PMID: 32451639 PMCID: PMC7248143 DOI: 10.1186/s40708-020-00105-1] [Citation(s) in RCA: 100] [Impact Index Per Article: 20.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/24/2019] [Accepted: 05/09/2020] [Indexed: 01/13/2023] Open
Abstract
Epilepsy, a serious chronic neurological disorder, can be detected by analyzing the brain signals produced by neurons. Neurons are connected to each other in a complex way to communicate with human organs and generate signals. The monitoring of these brain signals is commonly done using Electroencephalogram (EEG) and Electrocorticography (ECoG) media. These signals are complex, noisy, non-linear, and non-stationary, and they produce a high volume of data. Hence, the detection of seizures and the discovery of brain-related knowledge are challenging tasks. Machine learning classifiers are able to classify EEG data and detect seizures along with revealing relevant sensible patterns without compromising performance. As such, various researchers have developed a number of approaches to seizure detection using machine learning classifiers and statistical features. The main challenges are selecting appropriate classifiers and features. The aim of this paper is to present an overview of the wide variety of these techniques over the last few years, based on a taxonomy of statistical features and machine learning classifiers ('black-box' and 'non-black-box'). The presented state-of-the-art methods and ideas will give a detailed understanding of seizure detection and classification, and of future research directions.
Affiliation(s)
- Mohammad Khubeb Siddiqui
- School of Engineering and Sciences, Tecnologico de Monterrey, Av. E. Garza Sada 2501, Monterrey, Nuevo Leon Mexico
| | - Ruben Morales-Menendez
- School of Engineering and Sciences, Tecnologico de Monterrey, Av. E. Garza Sada 2501, Monterrey, Nuevo Leon Mexico
| | - Xiaodi Huang
- School of Computing and Mathematics, Charles Sturt University, 2640 Albury, NSW Australia
| | - Nasir Hussain
- College of Applied Studies and Community Service, King Saud University, Riyadh, Kingdom of Saudi Arabia
| |
|
376
|
A design of information granule-based under-sampling method in imbalanced data classification. Soft comput 2020. [DOI: 10.1007/s00500-020-05023-2] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/24/2022]
|
377
|
BPBSAM: Body part-specific burn severity assessment model. Burns 2020; 46:1407-1423. [PMID: 32376068 DOI: 10.1016/j.burns.2020.03.007] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/30/2019] [Revised: 02/23/2020] [Accepted: 03/20/2020] [Indexed: 11/23/2022]
Abstract
BACKGROUND AND OBJECTIVE Burns are a serious health problem leading to several thousand deaths annually, and despite the growth of science and technology, automated burn diagnosis remains a major challenge. Researchers have been exploring automated approaches to burn diagnosis based on visual images. Noting that the impact of a burn on a particular body part can be related to the skin thickness factor, we propose a deep convolutional neural network based body part-specific burn severity assessment model (BPBSAM). METHOD Considering skin anatomy, BPBSAM estimates burn severity using body part-specific support vector machines trained with CNN features extracted from burnt body part images. BPBSAM first identifies the body part in a burn image using a convolutional neural network; in training this network, the limited availability of burnt body part images is addressed by using larger available datasets of non-burn images of the body parts considered (face, hand, back, and inner forearm). We prepared rich labelled burn image datasets, BI and UBI, and trained several deep learning models, using existing models as pipelines for body part classification and for feature extraction in severity estimation. RESULTS The proposed BPBSAM method classified the severity of burns from color images of burn injuries with an overall average F1 score of 77.8% and accuracy of 84.85% for the test BI dataset, and 87.2% and 91.53% for the UBI dataset, respectively. For body part classification of burn images, an average accuracy of around 93% is achieved, and for burn severity assessment, the proposed BPBSAM outperformed the generic method in terms of overall average accuracy by 10.61%, 4.55%, and 3.03% with the ResNet50, VGG16, and VGG19 pipelines, respectively. CONCLUSIONS The main contribution of this work, along with the creation of labelled burn image datasets, is that the proposed customized body part-specific burn severity assessment model can significantly improve performance despite the small size of the burn image dataset. This highly innovative customized body part-specific approach could also be used to deal with the burn region segmentation problem. Moreover, fine-tuning a network pre-trained on non-burn body part images has proven to be robust and reliable.
|
378
|
Lázaro M, Herrera F, Figueiras-Vidal AR. Ensembles of cost-diverse Bayesian neural learners for imbalanced binary classification. Inf Sci (N Y) 2020. [DOI: 10.1016/j.ins.2019.12.050] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/16/2022]
|
379
|
Chen Z, Duan J, Yang C, Kang L, Qiu G. SMLBoost-adopting a soft-margin like strategy in boosting. Knowl Based Syst 2020. [DOI: 10.1016/j.knosys.2020.105705] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/30/2022]
|
381
|
Gautheron L, Habrard A, Morvant E, Sebban M. Metric Learning from Imbalanced Data with Generalization Guarantees. Pattern Recognit Lett 2020. [DOI: 10.1016/j.patrec.2020.03.008] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/24/2022]
|
382
|
Sun Z, Zhang J, Sun H, Zhu X. Collaborative filtering based recommendation of sampling methods for software defect prediction. Appl Soft Comput 2020. [DOI: 10.1016/j.asoc.2020.106163] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/25/2022]
|
383
|
Feng S, Zhao C, Fu P. A cluster-based hybrid sampling approach for imbalanced data classification. THE REVIEW OF SCIENTIFIC INSTRUMENTS 2020; 91:055101. [PMID: 32486749 DOI: 10.1063/5.0008935] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Received: 03/27/2020] [Accepted: 04/15/2020] [Indexed: 06/11/2023]
Abstract
When processing instrumental data by using classification approaches, the imbalanced dataset problem is usually challenging. As the minority class instances could be overwhelmed by the majority class instances, training a typical classifier with such a dataset directly might get poor results in classifying the minority class. We propose a cluster-based hybrid sampling approach CUSS (Cluster-based Under-sampling and SMOTE) for imbalanced dataset classification, which belongs to the type of data-level methods and is different from previously proposed hybrid methods. A new cluster-based under-sampling method is designed for CUSS, and a new strategy to set the expected instance number according to data distribution in the original training dataset is also proposed in this paper. The proposed method is compared with five other popular resampling methods on 15 datasets with different instance numbers and different imbalance ratios. The experimental results show that the CUSS method has good performance and outperforms other state-of-the-art methods.
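The under-sampling half of a cluster-based hybrid method can be sketched as follows: cluster the majority class, then keep only the instances closest to each cluster centre so that every region of the majority class stays represented. This is a minimal illustration using a plain k-means with deterministic initialisation; it is not the CUSS procedure itself, whose clustering and instance-number strategy are not detailed in the abstract. All names and data are hypothetical.

```python
def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(col) / len(points) for col in zip(*points))

def assign(points, centroids):
    """Group points by the index of their nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        idx = min(range(len(centroids)), key=lambda i: dist2(p, centroids[i]))
        clusters[idx].append(p)
    return clusters

def kmeans(points, k, iters=10):
    """Plain k-means; the first k points serve as initial centroids."""
    centroids = list(points[:k])
    for _ in range(iters):
        clusters = assign(points, centroids)
        centroids = [mean_point(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, assign(points, centroids)

def cluster_undersample(majority, k, per_cluster):
    """Keep the per_cluster majority points nearest to each cluster centre."""
    centroids, clusters = kmeans(majority, k)
    kept = []
    for centre, members in zip(centroids, clusters):
        kept.extend(sorted(members, key=lambda p: dist2(p, centre))[:per_cluster])
    return kept

# Hypothetical majority class with two well-separated groups.
majority = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
kept = cluster_undersample(majority, k=2, per_cluster=1)
```

The reduced majority set could then be combined with an oversampled minority class to form the balanced training set.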
Affiliation(s)
- Shou Feng
- College of Information and Communication Engineering, Harbin Engineering University, Harbin 150001, China
| | - Chunhui Zhao
- College of Information and Communication Engineering, Harbin Engineering University, Harbin 150001, China
| | - Ping Fu
- School of Electronics and Information Engineering, Harbin Institute of Technology, Harbin 150001, China
| |
|
384
|
Sangphukieo A, Laomettachit T, Ruengjitchatchawalya M. Photosynthetic protein classification using genome neighborhood-based machine learning feature. Sci Rep 2020; 10:7108. [PMID: 32346070 PMCID: PMC7189237 DOI: 10.1038/s41598-020-64053-w] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/17/2019] [Accepted: 04/07/2020] [Indexed: 11/08/2022] Open
Abstract
Identification of novel photosynthetic proteins is important for understanding and improving photosynthetic efficiency. Synergistically, genome neighborhood can provide additional useful information to identify photosynthetic proteins. We, therefore, expected that applying a computational approach, particularly machine learning (ML) with a genome neighborhood-based feature, should facilitate photosynthetic function assignment. Our results revealed a functional relationship between photosynthetic genes and their conserved neighboring genes, observed through the 'Phylo score', indicating that their functions could be inferred from the genome neighborhood profile. Therefore, we created a new method for extracting patterns based on the genome neighborhood network (GNN) and applied them to photosynthetic protein classification using ML algorithms. A random forest (RF) classifier using genome neighborhood-based features achieved the highest accuracy, up to 87%, in the classification of photosynthetic proteins and also showed better performance (Matthews correlation coefficient = 0.718) than other available tools, including the sequence similarity search (0.447) and an ML-based method (0.361). Furthermore, we demonstrated the ability of our model to identify novel photosynthetic proteins compared to the other methods. Our classifier is available at http://bicep2.kmutt.ac.th/photomod_standalone, https://bit.ly/2S0I2Ox and DockerHub: https://hub.docker.com/r/asangphukieo/photomod.
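The Matthews correlation coefficient quoted above is computed from the four cells of a binary confusion matrix as (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). A minimal sketch follows; the example counts are hypothetical and are not the study's results.

```python
import math

def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.
    Returns 0.0 when any marginal total is zero (undefined denominator)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom

# Hypothetical confusion-matrix counts, for illustration only.
perfect = matthews_corrcoef(tp=50, tn=50, fp=0, fn=0)
chance = matthews_corrcoef(tp=25, tn=25, fp=25, fn=25)
inverted = matthews_corrcoef(tp=0, tn=0, fp=50, fn=50)
```

Unlike accuracy, the coefficient uses all four cells, so a classifier that always predicts the majority class scores 0 rather than a misleadingly high value.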
Affiliation(s)
- Apiwat Sangphukieo
- Bioinformatics and Systems Biology Program, School of Bioresources and Technology, King Mongkut's University of Technology Thonburi (KMUTT), Bang Khun Thian, Bangkok, 10150, Thailand
- School of Information Technology, KMUTT, Bang Mod, Thung Khru, Bangkok, 10140, Thailand
| | - Teeraphan Laomettachit
- Bioinformatics and Systems Biology Program, School of Bioresources and Technology, King Mongkut's University of Technology Thonburi (KMUTT), Bang Khun Thian, Bangkok, 10150, Thailand
| | - Marasri Ruengjitchatchawalya
- Bioinformatics and Systems Biology Program, School of Bioresources and Technology, King Mongkut's University of Technology Thonburi (KMUTT), Bang Khun Thian, Bangkok, 10150, Thailand.
- Biotechnology program, School of Bioresources and Technology, KMUTT, Bang Khun Thian, Bangkok, 10150, Thailand.
- Algal Biotechnology Research Group, Pilot Plant Development and Training Institute (PDTI), KMUTT, Bang Khun Thian, Bangkok, 10150, Thailand.
| |
|
385
|
Wu Y, Zhang W, Zhang L, Qiao Y, Yang J, Cheng C. A Multi-Clustering Algorithm to Solve Driving Cycle Prediction Problems Based on Unbalanced Data Sets: A Chinese Case Study. SENSORS 2020; 20:s20092448. [PMID: 32344855 PMCID: PMC7248886 DOI: 10.3390/s20092448] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 02/07/2020] [Revised: 04/18/2020] [Accepted: 04/21/2020] [Indexed: 11/16/2022]
Abstract
Vehicle evaluation parameters, which are increasingly of concern for governments and consumers, quantify performance indicators, such as vehicle performance, emissions, and driving experience, to help guide consumers in purchasing cars. While past approaches for driving cycle prediction have been proven effective and used in many countries, these algorithms are difficult to use in China with its complex traffic environment and increasingly high frequency of traffic jams. Meanwhile, we found that the vehicle dataset used for the driving cycle prediction problem is usually unbalanced in real cases, which means that there are more medium and high speed samples and very few samples at low and ultra-high speeds. If an ordinary clustering algorithm is directly applied to the unbalanced data, it will have a huge impact on the performance of building driving cycle maps, and the parameters of the map will deviate considerably from the actual ones. In order to address these issues, this paper proposes a novel driving cycle map algorithm framework based on an ensemble learning method, named the multi-clustering algorithm, to improve the performance of traditional clustering algorithms on unbalanced data sets. It is noteworthy that our model framework can be easily extended to other complicated structure areas due to its flexible modular design and parameter configuration. Finally, we tested our method on actual traffic data generated in Fujian Province in China. The results show that the multi-clustering algorithm has excellent performance on our dataset.
Affiliation(s)
- Yuewei Wu
- Correspondence: ; Tel.: +86-135-2020-2168
| |
|
386
|
Li H, Sze K, Lu G, Ballester PJ. Machine‐learning scoring functions for structure‐based virtual screening. WILEY INTERDISCIPLINARY REVIEWS-COMPUTATIONAL MOLECULAR SCIENCE 2020. [DOI: 10.1002/wcms.1478] [Citation(s) in RCA: 37] [Impact Index Per Article: 7.4] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/14/2022]
Affiliation(s)
- Hongjian Li
- Cancer Research Center of Marseille (INSERM U1068, Institut Paoli‐Calmettes, Aix‐Marseille Université UM105, CNRS UMR7258) Marseille France
- CUHK‐SDU Joint Laboratory on Reproductive Genetics, School of Biomedical Sciences Chinese University of Hong Kong Shatin Hong Kong
| | - Kam‐Heung Sze
- CUHK‐SDU Joint Laboratory on Reproductive Genetics, School of Biomedical Sciences Chinese University of Hong Kong Shatin Hong Kong
| | - Gang Lu
- CUHK‐SDU Joint Laboratory on Reproductive Genetics, School of Biomedical Sciences Chinese University of Hong Kong Shatin Hong Kong
| | - Pedro J. Ballester
- Cancer Research Center of Marseille (INSERM U1068, Institut Paoli‐Calmettes, Aix‐Marseille Université UM105, CNRS UMR7258) Marseille France
| |
|
387
|
Avanzo M, Pirrone G, Vinante L, Caroli A, Stancanello J, Drigo A, Massarut S, Mileto M, Urbani M, Trovo M, El Naqa I, De Paoli A, Sartor G. Electron Density and Biologically Effective Dose (BED) Radiomics-Based Machine Learning Models to Predict Late Radiation-Induced Subcutaneous Fibrosis. Front Oncol 2020; 10:490. [PMID: 32373520 PMCID: PMC7186445 DOI: 10.3389/fonc.2020.00490] [Citation(s) in RCA: 20] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/03/2020] [Accepted: 03/18/2020] [Indexed: 12/24/2022] Open
Abstract
Purpose: To predict the occurrence of late subcutaneous radiation-induced fibrosis (RIF) after partial breast irradiation (PBI) for breast carcinoma by using machine learning (ML) models and radiomic features from 3D Biologically Effective Dose (3D-BED) and Relative Electron Density (3D-RED). Methods: 165 patients underwent external PBI following a hypo-fractionation protocol consisting of 40 Gy/10 fractions, 35 Gy/7 fractions, and 28 Gy/4 fractions, for 73, 60, and 32 patients, respectively. Physicians evaluated toxicity at regular intervals by the Common Terminology Criteria for Adverse Events (CTCAE) version 4.0. RIF was assessed every 3 months after the completion of the radiation course and scored prospectively. RIF was experienced by 41 (24.8%) patients after an average of 5 years of follow-up. The Hounsfield Units (HU) of the CT images were converted into relative electron density (3D-RED) and the dose maps into Biologically Effective Dose (3D-BED). Shape, first-order and textural features of 3D-RED and 3D-BED were calculated in the planning target volume (PTV) and breast. Clinical and demographic variables were also considered (954 features in total). Imbalance of the dataset was addressed by data augmentation using the ADASYN technique. A subset of non-redundant features that best predict the data was identified by sequential feature selection. Support Vector Machines (SVM), ensemble machine learning (EML) using various aggregation algorithms, and Naive Bayes (NB) classifiers were trained on the patient dataset to predict RIF occurrence. Models were assessed using the sensitivity and specificity of the ML classifiers and the area under the receiver operating characteristic curve (AUC) of the score functions in repeated 5-fold cross-validation on the augmented dataset. Results: The SVM model with seven features was preferred for RIF prediction and scored sensitivity 0.83 (95% CI 0.80-0.86), specificity 0.75 (95% CI 0.71-0.77) and AUC of the score function 0.86 (0.85-0.88) on cross-validation. The selected features included cluster shade and Run Length Non-uniformity of breast 3D-BED, kurtosis and cluster shade from PTV 3D-RED, and 10th percentile of PTV 3D-BED. Conclusion: Textures extracted from 3D-BED and 3D-RED in the breast and PTV can predict late RIF and may help better select patient candidates for exclusive PBI.
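The three evaluation measures reported above can be computed as in the following minimal sketch. Sensitivity and specificity come from thresholded predictions, while the AUC is obtained here through its rank interpretation: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one, with ties counted as one half. The labels and scores are hypothetical, not data from the study.

```python
def sensitivity_specificity(y_true, y_pred):
    """Sensitivity (true positive rate) and specificity (true negative rate)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)

def auc(y_true, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels (1 = event observed) and classifier scores.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
predicted = [1 if s >= 0.45 else 0 for s in scores]

sens, spec = sensitivity_specificity(labels, predicted)
area = auc(labels, scores)
```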
Affiliation(s)
- Michele Avanzo
- Department of Medical Physics, Centro di Riferimento Oncologico di Aviano (CRO) IRCCS, Aviano, Italy
| | - Giovanni Pirrone
- Department of Medical Physics, Centro di Riferimento Oncologico di Aviano (CRO) IRCCS, Aviano, Italy
| | - Lorenzo Vinante
- Department of Radiation Oncology, Centro di Riferimento Oncologico di Aviano (CRO) IRCCS, Aviano, Italy
| | - Angela Caroli
- Department of Radiation Oncology, Centro di Riferimento Oncologico di Aviano (CRO) IRCCS, Aviano, Italy
| | - Annalisa Drigo
- Department of Medical Physics, Centro di Riferimento Oncologico di Aviano (CRO) IRCCS, Aviano, Italy
| | - Samuele Massarut
- Breast Surgery Unit, Centro di Riferimento Oncologico di Aviano (CRO) IRCCS, Aviano, Italy
| | - Mario Mileto
- Breast Surgery Unit, Centro di Riferimento Oncologico di Aviano (CRO) IRCCS, Aviano, Italy
| | - Martina Urbani
- Department of Radiology, Centro di Riferimento Oncologico di Aviano (CRO) IRCCS, Aviano, Italy
| | - Marco Trovo
- Department of Radiation Oncology, Udine General Hospital, Udine, Italy
| | - Issam El Naqa
- Department of Radiation Oncology, University of Michigan, Ann Arbor, MI, United States
| | - Antonino De Paoli
- Department of Radiation Oncology, Centro di Riferimento Oncologico di Aviano (CRO) IRCCS, Aviano, Italy
| | - Giovanna Sartor
- Department of Medical Physics, Centro di Riferimento Oncologico di Aviano (CRO) IRCCS, Aviano, Italy
| |
|
388
|
Kim H, Na SH. Uniformly Interpolated Balancing for Robust Prediction in Translation Quality Estimation. ACM T ASIAN LOW-RESO 2020. [DOI: 10.1145/3365916] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/24/2022]
Abstract
There has been growing interest among researchers in quality estimation (QE), which attempts to automatically predict the quality of machine translation (MT) outputs. Most existing works on QE are based on supervised approaches using quality-annotated training data. However, QE training data quality scores readily become imbalanced or skewed: QE data are mostly composed of high translation quality sentence pairs but the data lack low translation quality sentence pairs. The use of imbalanced data with an induced quality estimator tends to produce biased translation quality scores with “high” translation quality scores assigned even to poorly translated sentences. To address the data imbalance, this article proposes a simple, efficient procedure called uniformly interpolated balancing to construct more balanced QE training data by inserting greater uniformness to training data. The proposed uniformly interpolated balancing procedure is based on the preparation of two different types of manually annotated QE data: (1) default skewed data and (2) near-uniform data. First, we obtain default skewed data in a naive manner without considering the imbalance by manually annotating qualities on MT outputs. Second, we obtain near-uniform data in a selective manner by manually annotating a subset only, which is selected from the automatically quality-estimated sentence pairs. Finally, we create uniformly interpolated balanced data by combining these two types of data, where one half originates from the default skewed data and the other half originates from the near-uniform data. We expect that uniformly interpolated balancing reflects the intrinsic skewness of the true quality distribution and manages the imbalance problem. Experimental results on an English-Korean quality estimation task show that the proposed uniformly interpolated balancing leads to robustness on both skewed and uniformly distributed quality test sets when compared to the test sets of other non-balanced datasets.
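The final combination step, in which half of the training set comes from the default skewed data and half from the near-uniform data, can be sketched as below. The tuple layout, the function name, and the use of uniform random sampling within each source are assumptions made for illustration; the abstract specifies only the half-and-half composition.

```python
import random

def uniformly_interpolated(skewed, near_uniform, size, seed=0):
    """Build a training set with one half drawn from each annotated source."""
    rng = random.Random(seed)
    half = size // 2
    return rng.sample(skewed, half) + rng.sample(near_uniform, size - half)

# Hypothetical annotated pairs: (source, sentence id, quality score).
skewed = [("skewed", i, 0.9) for i in range(20)]              # mostly high quality
near_uniform = [("uniform", i, i / 19) for i in range(20)]    # scores spread over [0, 1]

balanced = uniformly_interpolated(skewed, near_uniform, size=10)
from_skewed = sum(1 for source, _, _ in balanced if source == "skewed")
```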
Affiliation(s)
- Hyun Kim
- Electronics and Telecommunications Research Institute (ETRI), Yuseong-gu, Daejeon, Republic of Korea
| | - Seung-Hoon Na
- Jeonbuk National University, Baekje-daero, deokjin-gu, Jeonju, Republic of Korea
| |
|
389
|
Machine Learning in Python: Main Developments and Technology Trends in Data Science, Machine Learning, and Artificial Intelligence. INFORMATION 2020. [DOI: 10.3390/info11040193] [Citation(s) in RCA: 48] [Impact Index Per Article: 9.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/31/2022] Open
Abstract
Smarter applications are making better use of the insights gleaned from data, having an impact on every industry and research discipline. At the core of this revolution lie the tools and the methods that are driving it, from processing the massive piles of data generated each day to learning from and taking useful action. Deep neural networks, along with advancements in classical machine learning and scalable general-purpose graphics processing unit (GPU) computing, have become critical components of artificial intelligence, enabling many of these astounding breakthroughs and lowering the barrier to adoption. Python continues to be the most preferred language for scientific computing, data science, and machine learning, boosting both performance and productivity by enabling the use of low-level libraries and clean high-level APIs. This survey offers insight into the field of machine learning with Python, taking a tour through important topics to identify some of the core hardware and software paradigms that have enabled it. We cover widely used libraries and concepts, collected together for holistic comparison, with the goal of educating the reader and driving the field of Python machine learning forward.
|
390
|
Yang K, Yu Z, Wen X, Cao W, Chen CLP, Wong HS, You J. Hybrid Classifier Ensemble for Imbalanced Data. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2020; 31:1387-1400. [PMID: 31265410 DOI: 10.1109/tnnls.2019.2920246] [Citation(s) in RCA: 20] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/09/2023]
Abstract
The class imbalance problem has become a leading challenge. Although conventional imbalance learning methods have been proposed to tackle this problem, they have some limitations: 1) undersampling methods suffer from losing important information and 2) cost-sensitive methods are sensitive to outliers and noise. To address these issues, we propose a hybrid optimal ensemble classifier framework that combines density-based undersampling and cost-effective methods by exploring state-of-the-art solutions using a multi-objective optimization algorithm. Specifically, we first develop a density-based undersampling method to select informative samples from the original training data with probability-based data transformation, which makes it possible to obtain multiple subsets following a balanced distribution across classes. Second, we exploit the cost-sensitive classification method to address the incompleteness of information problem by modifying the weights of misclassified minority samples rather than the majority ones. Finally, we introduce a multi-objective optimization procedure and utilize connections between samples to self-modify the classification result using an ensemble classifier framework. Extensive comparative experiments conducted on real-world data sets demonstrate that our method outperforms the majority of imbalance and ensemble classification approaches.
|
392
|
Zhu Z, Wang Z, Li D, Zhu Y, Du W. Geometric Structural Ensemble Learning for Imbalanced Problems. IEEE TRANSACTIONS ON CYBERNETICS 2020; 50:1617-1629. [PMID: 30418931 DOI: 10.1109/tcyb.2018.2877663] [Citation(s) in RCA: 22] [Impact Index Per Article: 4.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/09/2023]
Abstract
Classification on imbalanced data sets is a great challenge in machine learning. In this paper, a geometric structural ensemble (GSE) learning framework is proposed to address the issue. It is known that the traditional ensemble methods train and combine a series of basic classifiers according to various weights, which might lack geometric meaning. In contrast, the GSE partitions and eliminates redundant majority samples by generating hyperspheres through the Euclidean metric and learns basic classifiers to enclose the minority samples, which achieves higher efficiency in the training process and seems easier to understand. In detail, the current weak classifier builds boundaries between the majority and the minority samples and removes the former. Then, the remaining samples are used to train the next. When the training process is done, all of the majority samples could be cleaned and the combination of all basic classifiers is obtained. To further improve the generalization, two relaxation techniques are proposed. Theoretically, the computational complexity of GSE could approach $O(nd \log(n_{\min}) \log(n_{\mathrm{maj}}))$. The comprehensive experiments validate both the effectiveness and efficiency of GSE.
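The geometric idea of enclosing minority samples in a hypersphere and discarding the majority samples that fall outside it can be illustrated with one simplified step. In this sketch the sphere is centred on the minority centroid with a radius reaching the farthest minority sample; that construction is an assumption chosen for illustration and is not the GSE procedure, whose boundary-building and relaxation rules are not given in the abstract.

```python
import math

def enclosing_sphere(minority):
    """Centre (minority centroid) and radius (distance to the farthest
    minority sample) of a sphere that contains every minority sample."""
    centre = tuple(sum(col) / len(minority) for col in zip(*minority))
    radius = max(math.dist(centre, p) for p in minority)
    return centre, radius

def split_majority(majority, centre, radius):
    """Separate majority samples inside the sphere from those outside it."""
    inside = [p for p in majority if math.dist(centre, p) <= radius]
    outside = [p for p in majority if math.dist(centre, p) > radius]
    return inside, outside

# Hypothetical 2-D samples, for illustration only.
minority = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
majority = [(1.0, 1.5), (5.0, 5.0), (-4.0, 0.0)]

centre, radius = enclosing_sphere(minority)
remaining, removed = split_majority(majority, centre, radius)
```

Majority samples outside the sphere are separated by this single boundary and can be dropped; the ones left inside are what a subsequent classifier in the sequence would still have to handle.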
|
393
|
Raval PD, Pandya A. A Novel Fault Classification Technique in Series Compensated Transmission Line using Ensemble Method. INT J PATTERN RECOGN 2020. [DOI: 10.1142/s0218001420500093] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022]
Abstract
The paper presents a novel approach for the classification of faults in Extra High Voltage (EHV) transmission lines with series compensation. The proposed algorithm utilizes single-end currents extracted from the three phases of a transmission line, which are given to a fault detection and classification system. A detailed model is constructed to analyze fault patterns occurring on a dual-feed system with multiple series compensation provided on the EHV transmission line. The algorithm uses a full cycle of post-fault currents. A Multiresolution Analysis Wavelet-based decomposition technique is used to provide a joint time-frequency analysis. An extensive study is done by designing different fault classifiers using Support Vector Machines (SVM) with different feature vector groups. The SVMs with different parameters are compared to show the performance of SVMs in the given feature space of faults. A new algorithm is proposed using an Ensemble method with a group of different classifiers, namely, Artificial Neural Network (ANN), K-Nearest Neighborhood (KNN) and SVM. A combination of feature selection with Ensemble Classifiers is trained and tested on a wide range of fault patterns. A significant improvement in performance is obtained with different combinations of features in the Ensemble Classifier using the subspace partition method. The proposed Ensemble Classifier provides an accuracy of 99.5%.
|
394
|
Predicting Short-term Survival after Liver Transplantation using Machine Learning. Sci Rep 2020; 10:5654. [PMID: 32221367 PMCID: PMC7101323 DOI: 10.1038/s41598-020-62387-z] [Citation(s) in RCA: 17] [Impact Index Per Article: 3.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/07/2018] [Accepted: 03/10/2020] [Indexed: 02/07/2023] Open
Abstract
Liver transplantation is one of the most effective treatments for end-stage liver disease, but the demand for livers is much higher than the supply of donor livers. The Model for End-stage Liver Disease (MELD) score is commonly used to prioritize patients, but previous studies have indicated that the MELD score may fail to predict outcomes well for postoperative patients. This work proposes a data-driven approach to devise a model that predicts postoperative survival within 30 days based on patients’ preoperative physiological measurement values. We use random forest (RF) to select important features, including clinically used features and new features discovered from physiological measurement values. Moreover, we propose a new imputation method to deal with the problem of missing values, and the results show that it outperforms the other alternatives. The predictive model is constructed from patients’ blood test data collected 1–9 days before surgery. The experimental results on a real data set indicate that RF outperforms the other alternatives. The experimental results on the temporal validation set show that our proposed model achieves an area under the curve (AUC) of 0.771 and a specificity of 0.815, showing superior discrimination power in predicting postoperative survival.
|
395
|
Felix EA, Lee SP. Predicting the number of defects in a new software version. PLoS One 2020; 15:e0229131. [PMID: 32187181 PMCID: PMC7080245 DOI: 10.1371/journal.pone.0229131] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/13/2019] [Accepted: 01/30/2020] [Indexed: 11/25/2022] Open
Abstract
Predicting the number of defects in software at the method level is important. However, little or no research has focused on method-level defect prediction. Therefore, considerable efforts are still required to demonstrate how method-level defect prediction can be achieved for a new software version. In the current study, we present an analysis of the relevant information obtained from the current version of a software product to construct regression models to predict the estimated number of defects in a new version using the variables of defect density, defect velocity and defect introduction time, which show considerable correlation with the number of method-level defects. These variables also show a mathematical relationship between defect density and defect acceleration at the method level, further indicating that the increase in the number of defects and the defect density are functions of the defect acceleration. We report an experiment conducted on the Finding Faults Using Ensemble Learners (ELFF) open-source Java projects, which contain 289,132 methods. The results show correlation coefficients of 60% for the defect density, -4% for the defect introduction time, and 93% for the defect velocity. These findings indicate that the average defect velocity shows a firm and considerable correlation with the number of defects at the method level. The proposed approach also motivates an investigation and comparison of the average performances of classifiers before and after method-level data preprocessing and of the level of entropy in the datasets.
Affiliation(s)
| | - Sai Peck Lee
- Faculty of Computer Science and Information Technology, University of Malaya, Kuala Lumpur, Malaysia
| |
|
396
|
Schmid I, Rudolph KE, Nguyen TQ, Hong H, Seamans MJ, Ackerman B, Stuart EA. Comparing the performance of statistical methods that generalize effect estimates from randomized controlled trials to much larger target populations. COMMUN STAT-SIMUL C 2020; 51:4326-4348. [PMID: 36419543 PMCID: PMC9678349 DOI: 10.1080/03610918.2020.1741621] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/14/2019] [Revised: 09/20/2019] [Accepted: 03/06/2020] [Indexed: 01/03/2023]
Abstract
Policymakers use results from randomized controlled trials to inform decisions about whether to implement treatments in target populations. Various methods - including inverse probability weighting, outcome modeling, and Targeted Maximum Likelihood Estimation - that use baseline data available in both the trial and target population have been proposed to generalize the trial treatment effect estimate to the target population. Often the target population is significantly larger than the trial sample, which can cause estimation challenges. We conduct simulations to compare the performance of these methods in this setting. We vary the size of the target population, the proportion of the target population selected into the trial, and the complexity of the true selection and outcome models. All methods performed poorly when the trial size was only 2% of the target population size or the target population included only 1,000 units. When the target population or the proportion of units selected into the trial was larger, some methods, such as outcome modeling using Bayesian Additive Regression Trees, performed well. We caution against generalizing using these existing approaches when the target population is much larger than the trial sample and advocate that future research strive to improve methods for generalizing to large target populations.
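The weighting idea named in this abstract can be illustrated with a minimal sketch. It assumes a single discrete baseline covariate (a hypothetical "stratum" label) observed in both the trial and the target population, and reweights each trial unit by the ratio of the stratum's share in the target to its share in the trial. The function name, the data layout and the toy numbers are all illustrative assumptions; this is not one of the estimators evaluated in the paper.

```python
from collections import Counter

def generalized_effect(trial, target_strata):
    """Weighted difference in mean outcomes, reweighting the trial to the target.

    trial: list of (stratum, treated, outcome) tuples from the trial sample.
    target_strata: list of stratum labels, one per unit in the target population.
    """
    trial_count = Counter(stratum for stratum, _, _ in trial)
    target_count = Counter(target_strata)
    n_trial, n_target = len(trial), len(target_strata)

    sums = {True: 0.0, False: 0.0}
    weights = {True: 0.0, False: 0.0}
    for stratum, treated, outcome in trial:
        # Weight = share of the stratum in the target / share in the trial.
        w = (target_count[stratum] / n_target) / (trial_count[stratum] / n_trial)
        sums[treated] += w * outcome
        weights[treated] += w
    # Normalised (weighted-mean) contrast between treated and control units.
    return sums[True] / weights[True] - sums[False] / weights[False]

# Hypothetical toy data: "young" units are over-represented in the trial.
trial = (
    [("young", True, 1.0)] * 3 + [("young", False, 0.0)] * 3
    + [("old", True, 3.0)] + [("old", False, 0.0)]
)
target = ["young"] * 50 + ["old"] * 50

effect = generalized_effect(trial, target)
```

In this toy example the unweighted trial contrast is 1.5, while the reweighted contrast moves toward the equal mix of strata in the target. With only two "old" units carrying large weights, the estimate rests on very little data, which is the kind of instability the abstract warns about.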
Affiliation(s)
- Ian Schmid
- Department of Mental Health, Johns Hopkins Bloomberg School of Public Health, Baltimore, Maryland, U.S.A
| | - Kara E Rudolph
- Department of Emergency Medicine, University of California, Davis, Sacramento, California, U.S.A
| | - Trang Quynh Nguyen
- Department of Mental Health, Johns Hopkins Bloomberg School of Public Health, Baltimore, Maryland, U.S.A
- Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, Maryland, U.S.A
| | - Hwanhee Hong
- Department of Mental Health, Johns Hopkins Bloomberg School of Public Health, Baltimore, Maryland, U.S.A
- Department of Biostatistics and Bioinformatics, Duke University School of Medicine, Durham, North Carolina, U.S.A
| | - Marissa J Seamans
- Department of Mental Health, Johns Hopkins Bloomberg School of Public Health, Baltimore, Maryland, U.S.A
| | - Benjamin Ackerman
- Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, Maryland, U.S.A
| | - Elizabeth A Stuart
- Department of Mental Health, Johns Hopkins Bloomberg School of Public Health, Baltimore, Maryland, U.S.A
- Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, Maryland, U.S.A
- Department of Health Policy and Management, Johns Hopkins Bloomberg School of Public Health, Baltimore, Maryland, U.S.A
| |
|
397
|
Clark NJ, Owada K, Ruberanziza E, Ortu G, Umulisa I, Bayisenge U, Mbonigaba JB, Mucaca JB, Lancaster W, Fenwick A, Soares Magalhães RJ, Mbituyumuremyi A. Parasite associations predict infection risk: incorporating co-infections in predictive models for neglected tropical diseases. Parasit Vectors 2020; 13:138. [PMID: 32178706 PMCID: PMC7077138 DOI: 10.1186/s13071-020-04016-2] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/20/2019] [Accepted: 03/10/2020] [Indexed: 12/19/2022] Open
Abstract
BACKGROUND: Schistosomiasis and infection by soil-transmitted helminths are some of the world's most prevalent neglected tropical diseases. Infection by more than one parasite (co-infection) is common and can contribute to clinical morbidity in children. Geostatistical analyses of parasite infection data are key for developing mass drug administration strategies, yet most methods ignore co-infections when estimating risk. Infection status for multiple parasites can act as a useful proxy for data-poor individual-level or environmental risk factors while avoiding regression dilution bias. Conditional random fields (CRF) is a multivariate graphical network method that opens new doors in parasite risk mapping by (i) predicting co-infections with high accuracy; (ii) isolating associations among parasites; and (iii) quantifying how these associations change across landscapes. METHODS: We built a spatial CRF to estimate infection risks for Ascaris lumbricoides, Trichuris trichiura, hookworms (Ancylostoma duodenale and Necator americanus) and Schistosoma mansoni using data from a national survey of Rwandan schoolchildren. We used an ensemble learning approach to generate spatial predictions by simulating from the CRF's posterior distribution with a multivariate boosted regression tree that captured non-linear relationships between predictors and covariance in infection risks. This CRF ensemble was compared against single parasite gradient boosted machines to assess each model's performance and prediction uncertainty. RESULTS: Parasite co-infections were common, with 19.57% of children infected with at least two parasites. The CRF ensemble achieved higher predictive power than single-parasite models by improving estimates of co-infection prevalence at the individual level and classifying schools into World Health Organization treatment categories with greater accuracy. The CRF uncovered important environmental and demographic predictors of parasite infection probabilities. Yet even after capturing demographic and environmental risk factors, the presences or absences of other parasites were strong predictors of individual-level infection risk. Spatial predictions delineated high-risk regions in need of anthelminthic treatment interventions, including areas with higher than expected co-infection prevalence. CONCLUSIONS: Monitoring studies routinely screen for multiple parasites, yet statistical models generally ignore this multivariate data when assessing risk factors and designing treatment guidelines. Multivariate approaches can be instrumental in the global effort to reduce and eventually eliminate neglected helminth infections in developing countries.
Affiliation(s)
- Nicholas J. Clark
- UQ Spatial Epidemiology Laboratory, School of Veterinary Science, The University of Queensland, Gatton, QLD 4343 Australia
| | - Kei Owada
- UQ Spatial Epidemiology Laboratory, School of Veterinary Science, The University of Queensland, Gatton, QLD 4343 Australia
- Children Health and Environment Program, Child Health Research Centre, The University of Queensland, South Brisbane, QLD 4101 Australia
| | - Eugene Ruberanziza
- Neglected Tropical Diseases and Other Parasitic Diseases Unit, Malaria and Other Parasitic Diseases Division, Rwanda Biomedical Center, Kigali, Rwanda
| | - Giuseppina Ortu
- Schistosomiasis Control Initiative (SCI), Department of Infectious Diseases Epidemiology, Imperial College, London, UK
| | - Irenee Umulisa
- Neglected Tropical Diseases and Other Parasitic Diseases Unit, Malaria and Other Parasitic Diseases Division, Rwanda Biomedical Center, Kigali, Rwanda
| | - Ursin Bayisenge
- Neglected Tropical Diseases and Other Parasitic Diseases Unit, Malaria and Other Parasitic Diseases Division, Rwanda Biomedical Center, Kigali, Rwanda
| | - Jean Bosco Mbonigaba
- Neglected Tropical Diseases and Other Parasitic Diseases Unit, Malaria and Other Parasitic Diseases Division, Rwanda Biomedical Center, Kigali, Rwanda
| | - Jean Bosco Mucaca
- Microbiology Unit, National Reference Laboratory (NRL) Division, Rwanda Biomedical Center, Ministry of Health, Kigali, Rwanda
| | | | - Alan Fenwick
- Schistosomiasis Control Initiative (SCI), Department of Infectious Diseases Epidemiology, Imperial College, London, UK
| | - Ricardo J. Soares Magalhães
- UQ Spatial Epidemiology Laboratory, School of Veterinary Science, The University of Queensland, Gatton, QLD 4343 Australia
- Children Health and Environment Program, Child Health Research Centre, The University of Queensland, South Brisbane, QLD 4101 Australia
| | - Aimable Mbituyumuremyi
- Malaria and Other Parasitic Diseases Division, Rwanda Biomedical Center, Ministry of Health, Kigali, Rwanda
| |
|
398
|
Zhu F, Li X, Tang H, He Z, Zhang C, Hung GU, Chiu PY, Zhou W. Machine Learning for the Preliminary Diagnosis of Dementia. SCIENTIFIC PROGRAMMING 2020; 2020:5629090. [PMID: 38486686 PMCID: PMC10938949 DOI: 10.1155/2020/5629090] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Indexed: 03/17/2024]
Abstract
Objective: Reliable diagnosis remains challenging in the early stages of dementia. We aimed to develop and validate a new method based on machine learning to help the preliminary diagnosis of normal, mild cognitive impairment (MCI), very mild dementia (VMD), and dementia using an informant-based questionnaire. Methods: We enrolled 5,272 individuals who filled out a 37-item questionnaire. To select the most important features, three different feature selection techniques were tested. Then, the top features combined with six classification algorithms were used to develop the diagnostic models. Results: Information Gain was the most effective among the three feature selection methods. The Naive Bayes algorithm performed the best (accuracy = 0.81, precision = 0.82, recall = 0.81, and F-measure = 0.81) among the six classification models. Conclusion: The diagnostic model proposed in this paper provides a powerful tool for clinicians to diagnose the early stages of dementia.
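As a rough illustration of how Information Gain can rank questionnaire items, the sketch below computes the reduction in label entropy obtained by splitting on each item. The item names, answers and labels are hypothetical toy data, and the code is a generic textbook formulation rather than the authors' pipeline.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(feature_values, labels):
    """Entropy of the labels minus their expected entropy after splitting on a feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional

# Hypothetical toy data: two yes/no questionnaire items and a diagnosis label.
labels = ["normal", "normal", "dementia", "dementia"]
items = {
    "item_a": ["no", "no", "yes", "yes"],  # separates the two classes perfectly
    "item_b": ["no", "yes", "no", "yes"],  # carries no information about the class
}

# Rank items from most to least informative.
ranked = sorted(items, key=lambda name: information_gain(items[name], labels), reverse=True)
```

Items at the top of `ranked` would be the ones passed on to a classifier; how many to keep is a separate choice that the abstract does not specify.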
Affiliation(s)
- Fubao Zhu
- School of Computer and Communication Engineering, Zhengzhou University of Light Industry, Zhengzhou, Henan, China
| | - Xiaonan Li
- School of Computer and Communication Engineering, Zhengzhou University of Light Industry, Zhengzhou, Henan, China
| | - Haipeng Tang
- School of Computing Sciences and Computer Engineering, University of Southern Mississippi, Hattiesburg, MS, USA
| | - Zhuo He
- College of Computing, Michigan Technological University, Houghton, MI, USA
| | - Chaoyang Zhang
- School of Computing Sciences and Computer Engineering, University of Southern Mississippi, Hattiesburg, MS, USA
| | - Guang-Uei Hung
- Department of Nuclear Medicine, Chang Bing Show Chwan Memorial Hospital, Changhua, Taiwan
| | - Pai-Yi Chiu
- Department of Neurology, Show Chwan Memorial Hospital, Changhua, Taiwan
| | - Weihua Zhou
- College of Computing, Michigan Technological University, Houghton, MI, USA
| |
|
399
|
Class Imbalance Reduction (CIR): A Novel Approach to Software Defect Prediction in the Presence of Class Imbalance. Symmetry (Basel) 2020. [DOI: 10.3390/sym12030407] [Citation(s) in RCA: 17] [Impact Index Per Article: 3.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/16/2022] Open
Abstract
Software defect prediction (SDP) is the technique used to predict the occurrence of defects in the early stages of the software development process. Early prediction of defects will reduce the overall cost of software and also increase its reliability. Most of the defect prediction methods proposed in the literature suffer from the class imbalance problem. In this paper, a novel class imbalance reduction (CIR) algorithm is proposed to create symmetry between the defect and non-defect records in imbalanced datasets by considering distribution properties of the datasets. It is compared with SMOTE (synthetic minority oversampling technique), a built-in package of many machine learning tools that is considered a benchmark in handling class imbalance problems, and with K-Means SMOTE. We conducted the experiment on forty open source software defect datasets from the PRedictOr Models In Software Engineering (PROMISE) repository using eight different classifiers and evaluated the results with six performance measures. The results show that the proposed CIR method improves performance over SMOTE and K-Means SMOTE.
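For readers unfamiliar with the SMOTE baseline mentioned here, the following minimal sketch shows its core interpolation step: a synthetic minority record is placed at a random position on the segment between a minority record and one of its nearest minority neighbours. The function name, parameters and toy records are illustrative assumptions; this is neither the CIR algorithm nor a full SMOTE implementation.

```python
import random

def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest other minority points, by squared Euclidean distance.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic

# Hypothetical toy data: three "defective" records with two metrics each.
defective = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0)]
new_points = smote_like(defective, n_new=4)
```

Because each synthetic record lies between two existing minority records, oversampling this way adds points inside the region the minority class already occupies rather than duplicating records exactly.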
|
400
|
|