651
Kotsampasakou E, Brenner S, Jäger W, Ecker GF. Identification of Novel Inhibitors of Organic Anion Transporting Polypeptides 1B1 and 1B3 (OATP1B1 and OATP1B3) Using a Consensus Vote of Six Classification Models. Mol Pharm 2015; 12:4395-404. [PMID: 26469880 PMCID: PMC4674819 DOI: 10.1021/acs.molpharmaceut.5b00583] [Citation(s) in RCA: 34] [Impact Index Per Article: 3.4] [Indexed: 12/16/2022]
Abstract
Organic anion transporting polypeptides 1B1 and 1B3 are transporters selectively expressed on the basolateral membrane of the hepatocyte. Several studies reveal that they are involved in drug–drug interactions, cancer, and hyperbilirubinemia. In this study, we developed a set of classification models for OATP1B1 and 1B3 inhibition based on more than 1700 carefully curated compounds from literature, which were validated via cross-validation and by use of an external test set. After combining several sets of descriptors and classifiers, the 6 best models were selected according to their statistical performance and were used for virtual screening of DrugBank. Consensus scoring of the screened compounds resulted in the selection and purchase of nine compounds as potential dual inhibitors and of one compound as potential selective OATP1B3 inhibitor. Biological testing of the compounds confirmed the validity of the models, yielding an accuracy of 90% for OATP1B1 and 80% for OATP1B3, respectively. Moreover, at least half of the newly identified inhibitors are associated with hyperbilirubinemia or hepatotoxicity, implying a relationship between OATP inhibition and these severe side effects.
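The consensus scoring step mentioned in this abstract can be illustrated with a minimal sketch. The abstract does not state the voting threshold or how ties were handled, so the `min_votes` parameter, the compound names, and the vote vectors below are hypothetical, chosen only to show the general idea of counting agreeing models.

```python
from typing import Dict, List, Sequence


def consensus_vote(predictions: Sequence[int], min_votes: int) -> bool:
    """Return True when at least `min_votes` models predict 'inhibitor' (1)."""
    if not all(p in (0, 1) for p in predictions):
        raise ValueError("each prediction must be 0 or 1")
    return sum(predictions) >= min_votes


def screen(compounds: Dict[str, Sequence[int]], min_votes: int) -> List[str]:
    """Keep compounds that reach the vote threshold, highest score first."""
    hits = [(name, sum(votes)) for name, votes in compounds.items()
            if consensus_vote(votes, min_votes)]
    # Sort by descending consensus score, then by name for a stable order.
    hits.sort(key=lambda item: (-item[1], item[0]))
    return [name for name, _ in hits]


# Hypothetical votes from six classifiers (1 = predicted inhibitor).
library = {
    "compound_A": [1, 1, 1, 1, 1, 1],
    "compound_B": [1, 0, 1, 0, 0, 1],
    "compound_C": [1, 1, 1, 1, 0, 1],
}
selected = screen(library, min_votes=5)
```

With a threshold of five agreeing models, only `compound_A` and `compound_C` would be carried forward; a lower threshold trades precision for a larger candidate list.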
Affiliation(s)
- Eleni Kotsampasakou
  - Department of Pharmaceutical Chemistry, University of Vienna, Althanstrasse 14, 1090 Vienna, Austria
- Stefan Brenner
  - Department of Pharmaceutical Chemistry, University of Vienna, Althanstrasse 14, 1090 Vienna, Austria
- Walter Jäger
  - Department of Pharmaceutical Chemistry, University of Vienna, Althanstrasse 14, 1090 Vienna, Austria
- Gerhard F Ecker
  - Department of Pharmaceutical Chemistry, University of Vienna, Althanstrasse 14, 1090 Vienna, Austria

652
Bae SH, Yoon KJ. Polyp Detection via Imbalanced Learning and Discriminative Feature Learning. IEEE Transactions on Medical Imaging 2015; 34:2379-2393. [PMID: 26011864 DOI: 10.1109/tmi.2015.2434398] [Citation(s) in RCA: 45] [Impact Index Per Article: 4.5] [Indexed: 06/04/2023]
Abstract
Recent advances in learning-based classification have led to noticeable performance improvements in automatic polyp detection. Building large, high-quality datasets is crucial for learning a reliable detector. However, this is challenging in practice because of the diversity of polyp types, expensive inspection, and labor-intensive labeling. For this reason, polyp datasets tend to be imbalanced, i.e., the number of non-polyp samples is much larger than that of polyp samples, and learning from such imbalanced datasets results in a detector biased toward the non-polyp class. In this paper, we propose a data sampling-based boosting framework to learn an unbiased polyp detector from imbalanced datasets. In our learning scheme, we learn multiple weak classifiers on datasets rebalanced by up/down sampling, and generate a polyp detector by combining them. In addition, to enhance discriminability between polyps and non-polyps that have similar appearances, we propose an effective feature learning method using partial least squares analysis, and use it to learn compact and discriminative features. Experimental results on challenging datasets show clear performance improvement over other detectors. We further demonstrate the effectiveness and usefulness of the proposed methods with extensive evaluation.
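The rebalancing step described in this abstract (up-sampling the minority class and down-sampling the majority class before training each weak classifier) might look roughly like the following. This is a generic sketch, not the authors' procedure: the function name `rebalance`, the common `target` size, and the toy data are all assumptions made for illustration.

```python
import random
from typing import List, Sequence, Tuple


def rebalance(minority: Sequence, majority: Sequence, target: int,
              rng: random.Random) -> Tuple[List, List]:
    """Resample both classes to `target` items each (hypothetical helper)."""
    if not minority or not majority or target <= 0:
        raise ValueError("both classes must be non-empty and target positive")

    def resample(items: Sequence) -> List:
        pool = list(items)
        if len(pool) >= target:
            # Down-sampling: draw without replacement.
            return rng.sample(pool, target)
        # Up-sampling: keep every original and add random duplicates.
        extra = [rng.choice(pool) for _ in range(target - len(pool))]
        return pool + extra

    return resample(minority), resample(majority)


# Toy data: 5 "polyp" samples versus 100 "non-polyp" samples.
polyp = list(range(5))
non_polyp = list(range(100, 200))
pos, neg = rebalance(polyp, non_polyp, target=20, rng=random.Random(0))
```

Calling `rebalance` with a different random seed for each boosting round would give each weak classifier a different balanced view of the data, which is one plausible way to realise the scheme the abstract outlines.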

653
Charte F, Rivera AJ, del Jesus MJ, Herrera F. MLSMOTE: Approaching imbalanced multilabel learning through synthetic instance generation. Knowl Based Syst 2015. [DOI: 10.1016/j.knosys.2015.07.019] [Citation(s) in RCA: 100] [Impact Index Per Article: 10.0] [Indexed: 11/24/2022]

654
Ng WWY, Hu J, Yeung DS, Yin S, Roli F. Diversified Sensitivity-Based Undersampling for Imbalance Classification Problems. IEEE Transactions on Cybernetics 2015; 45:2402-2412. [PMID: 25474818 DOI: 10.1109/tcyb.2014.2372060] [Citation(s) in RCA: 66] [Impact Index Per Article: 6.6] [Indexed: 06/04/2023]
Abstract
Undersampling is a widely adopted method to deal with imbalance pattern classification problems. Current methods mainly depend on either random resampling on the majority class or resampling at the decision boundary. Random-based undersampling fails to take into consideration informative samples in the data while resampling at the decision boundary is sensitive to class overlapping. Both techniques ignore the distribution information of the training dataset. In this paper, we propose a diversified sensitivity-based undersampling method. Samples of the majority class are clustered to capture the distribution information and enhance the diversity of the resampling. A stochastic sensitivity measure is applied to select samples from both clusters of the majority class and the minority class. By iteratively clustering and sampling, a balanced set of samples yielding high classifier sensitivity is selected. The proposed method yields a good generalization capability for 14 UCI datasets.

655
Fazlollahi A, Meriaudeau F, Giancardo L, Villemagne VL, Rowe CC, Yates P, Salvado O, Bourgeat P. Computer-aided detection of cerebral microbleeds in susceptibility-weighted imaging. Comput Med Imaging Graph 2015; 46 Pt 3:269-76. [PMID: 26560677 DOI: 10.1016/j.compmedimag.2015.10.001] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.0] [Received: 12/30/2014] [Revised: 08/03/2015] [Accepted: 10/02/2015] [Indexed: 10/22/2022]
Abstract
Susceptibility-weighted imaging (SWI) is recognized as the preferred MRI technique for visualizing cerebral vasculature and related pathologies such as cerebral microbleeds (CMBs). Manual identification of CMBs is time-consuming, has limited reliability and reproducibility, and is prone to misinterpretation. In this paper, a novel computer-aided microbleed detection technique based on machine learning is presented: First, spherical-like objects (potential CMB candidates) with their corresponding bounding boxes were detected using a novel multi-scale Laplacian of Gaussian technique. A set of robust 3-dimensional Radon- and Hessian-based shape descriptors within each bounding box were then extracted to train a cascade of binary random forests (RF). The cascade consists of consecutive independent RF classifiers with low to high posterior probability constraints to handle imbalanced training sets (CMBs and non-CMBs), and to progressively improve detection rates. The proposed method was validated on 66 subjects whose CMBs were manually stratified into "possible" and "definite" by two medical experts. The proposed technique achieved a sensitivity of 87% and an average false detection rate of 27.1 CMBs per subject on the "possible and definite" set. A sensitivity of 93% and false detection rate of 10 CMBs per subject was also achieved on the "definite" set. The proposed automated approach outperforms state of the art methods, and promises to enhance manual expert screening. Benefits include improved reliability, minimization of intra-rater variability and a reduction in assessment time.
Affiliation(s)
- Amir Fazlollahi
  - CSIRO Digital Productivity Flagship, The Australian e-Health Research Centre, Herston, QLD, Australia; Le2I, University of Burgundy, Le Creusot, France
- Victor L Villemagne
  - Department of Nuclear Medicine and Centre for PET, Austin Hospital, Melbourne, VIC, Australia
- Christopher C Rowe
  - Department of Nuclear Medicine and Centre for PET, Austin Hospital, Melbourne, VIC, Australia
- Paul Yates
  - Department of Nuclear Medicine and Centre for PET, Austin Hospital, Melbourne, VIC, Australia
- Olivier Salvado
  - CSIRO Digital Productivity Flagship, The Australian e-Health Research Centre, Herston, QLD, Australia
- Pierrick Bourgeat
  - CSIRO Digital Productivity Flagship, The Australian e-Health Research Centre, Herston, QLD, Australia

656
ROSEFW-RF: The winner algorithm for the ECBDL’14 big data competition: An extremely imbalanced big data bioinformatics problem. Knowl Based Syst 2015. [DOI: 10.1016/j.knosys.2015.05.027] [Citation(s) in RCA: 105] [Impact Index Per Article: 10.5] [Indexed: 11/21/2022]

657
Near-Bayesian Support Vector Machines for imbalanced data classification with equal or unequal misclassification costs. Neural Netw 2015. [DOI: 10.1016/j.neunet.2015.06.005] [Citation(s) in RCA: 107] [Impact Index Per Article: 10.7] [Indexed: 11/24/2022]

658
Li GZ, He Z, Shao FF, Ou AH, Lin XZ. Patient classification of hypertension in Traditional Chinese Medicine using multi-label learning techniques. BMC Med Genomics 2015; 8 Suppl 3:S4. [PMID: 26399893 PMCID: PMC4582323 DOI: 10.1186/1755-8794-8-s3-s4] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.9] [Indexed: 11/22/2022]
Abstract
Background Hypertension is one of the major risk factors for cardiovascular diseases. Research on the patient classification of hypertension has become an important topic because Traditional Chinese Medicine is based primarily on "treatment based on syndrome differentiation of the patients". Methods Clinical data on hypertension were collected, covering 12 syndromes and 129 symptoms including inspection, tongue, inquiry, and palpation symptoms. Syndrome differentiation was modeled as a patient classification problem in the field of data mining, and a new multi-label learning model, BrSmoteSvm, was built to deal with the class imbalance of the dataset. Results The experiments showed that BrSmoteSvm achieved better results than other multi-label classifiers on the evaluation criteria of Average precision, Coverage, One-error, and Ranking loss. Conclusions BrSmoteSvm can better model syndrome differentiation in hypertension when the imbalance problem is taken into account.

659
Blagus R, Lusa L. Boosting for high-dimensional two-class prediction. BMC Bioinformatics 2015; 16:300. [PMID: 26390865 PMCID: PMC4578758 DOI: 10.1186/s12859-015-0723-9] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.0] [Received: 03/20/2015] [Accepted: 08/26/2015] [Indexed: 11/30/2022]
Abstract
Background In clinical research, prediction models are used to accurately predict the outcome of the patients based on some of their characteristics. For high-dimensional prediction models (the number of variables greatly exceeds the number of samples) the choice of an appropriate classifier is crucial, as it was observed that no single classification algorithm performs optimally for all types of data. Boosting was proposed as a method that combines the classification results obtained using base classifiers, where the sample weights are sequentially adjusted based on the performance in previous iterations. Generally boosting outperforms any individual classifier, but studies with high-dimensional data showed that the most standard boosting algorithm, AdaBoost.M1, cannot significantly improve the performance of its base classifier. Recently other boosting algorithms were proposed (Gradient boosting, Stochastic Gradient boosting, LogitBoost); they were shown to perform better than AdaBoost.M1 but their performance was not evaluated for high-dimensional data. Results In this paper we use simulation studies and real gene-expression data sets to evaluate the performance of boosting algorithms when data are high-dimensional. Our results confirm that AdaBoost.M1 can perform poorly in this setting, often failing to improve the performance of its base classifier. We provide the explanation for this and propose a modification, AdaBoost.M1.ICV, which uses cross-validated estimates of the prediction errors and outperforms the original algorithm when data are high-dimensional. The use of AdaBoost.M1.ICV is advisable when the base classifier overfits the training data: the number of variables is large, the number of samples is small, and/or the difference between the classes is large. To a lesser extent, Gradient boosting also suffers from similar problems. Contrary to the findings for low-dimensional data, shrinkage does not improve the performance of Gradient boosting when data are high-dimensional; however, it is beneficial for Stochastic Gradient boosting, which outperformed the other boosting algorithms in our analyses. LogitBoost suffers from overfitting and generally performs poorly. Conclusions The results show that boosting can substantially improve the performance of its base classifier also when data are high-dimensional. However, not all boosting algorithms perform equally well. LogitBoost, AdaBoost.M1 and Gradient boosting seem less useful for this type of data. Overall, Stochastic Gradient boosting with shrinkage and AdaBoost.M1.ICV seem to be the preferable choices for high-dimensional class-prediction.
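The sequential weight adjustment that this abstract attributes to boosting can be sketched with the textbook AdaBoost.M1 update for a single round. The code below is not taken from the paper: the function name and the toy inputs are illustrative, and the sketch assumes the incoming weights already sum to one. The `correct` flags could come from training-set predictions or, in the spirit of the cross-validated variant the abstract proposes, from held-out predictions; the sketch does not distinguish between the two.

```python
import math
from typing import List, Sequence, Tuple


def adaboost_m1_update(weights: Sequence[float],
                       correct: Sequence[bool]) -> Tuple[List[float], float]:
    """One AdaBoost.M1 round: return (new sample weights, classifier vote)."""
    if len(weights) != len(correct):
        raise ValueError("weights and correct must have the same length")
    # Weighted error of the base classifier (weights assumed to sum to 1).
    error = sum(w for w, ok in zip(weights, correct) if not ok)
    if not 0.0 < error < 0.5:
        # error == 0 (e.g. an overfitted base classifier) or error >= 0.5
        # leaves the standard update undefined or unhelpful.
        raise ValueError("weighted error must lie strictly between 0 and 0.5")
    beta = error / (1.0 - error)
    # Down-weight correctly classified samples, then renormalise.
    updated = [w * beta if ok else w for w, ok in zip(weights, correct)]
    total = sum(updated)
    return [w / total for w in updated], math.log(1.0 / beta)


# Toy round: four equally weighted samples, the last one misclassified.
new_weights, vote = adaboost_m1_update([0.25] * 4, [True, True, True, False])
```

The guard on `error` hints at why a base classifier that fits the training data perfectly is a problem here: a training error of zero gives the update nothing to work with, which is consistent with the abstract's recommendation to estimate errors by cross-validation in that situation.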
Affiliation(s)
- Rok Blagus
  - Institute for Biostatistics and Medical Informatics, University of Ljubljana, Vrazov trg 2, Ljubljana, Slovenia
- Lara Lusa
  - Institute for Biostatistics and Medical Informatics, University of Ljubljana, Vrazov trg 2, Ljubljana, Slovenia

660
Díez-Pastor JF, Rodríguez JJ, García-Osorio C, Kuncheva LI. Random Balance: Ensembles of variable priors classifiers for imbalanced data. Knowl Based Syst 2015. [DOI: 10.1016/j.knosys.2015.04.022] [Citation(s) in RCA: 155] [Impact Index Per Article: 15.5] [Indexed: 11/27/2022]

661
Kaya A, Can AB. A weighted rule based method for predicting malignancy of pulmonary nodules by nodule characteristics. J Biomed Inform 2015; 56:69-79. [DOI: 10.1016/j.jbi.2015.05.011] [Citation(s) in RCA: 35] [Impact Index Per Article: 3.5] [Received: 12/28/2014] [Revised: 04/18/2015] [Accepted: 05/15/2015] [Indexed: 01/15/2023]

662
Krawczyk B, Schaefer G, Woźniak M. A hybrid cost-sensitive ensemble for imbalanced breast thermogram classification. Artif Intell Med 2015; 65:219-27. [PMID: 26319694 DOI: 10.1016/j.artmed.2015.07.005] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.1] [Received: 05/11/2013] [Revised: 07/15/2015] [Accepted: 07/23/2015] [Indexed: 12/19/2022]
Abstract
OBJECTIVES Early recognition of breast cancer, the most commonly diagnosed form of cancer in women, is of crucial importance, given that it leads to significantly improved chances of survival. Medical thermography, which uses an infrared camera for thermal imaging, has been demonstrated as a particularly useful technique for early diagnosis, because it detects smaller tumors than the standard modality of mammography. METHODS AND MATERIAL In this paper, we analyse breast thermograms by extracting features describing bilateral symmetries between the two breast areas, and present a classification system for decision making. Clearly, the costs associated with missing a cancer case are much higher than those for mislabelling a benign case. At the same time, datasets contain significantly fewer malignant cases than benign ones. Standard classification approaches fail to consider either of these aspects. In this paper, we introduce a hybrid cost-sensitive classifier ensemble to address this challenging problem. Our approach entails a pool of cost-sensitive decision trees which assign a higher misclassification cost to the malignant class, thereby boosting its recognition rate. A genetic algorithm is employed for simultaneous feature selection and classifier fusion. As an optimisation criterion, we use a combination of misclassification cost and diversity to achieve both a high sensitivity and a heterogeneous ensemble. Furthermore, we prune our ensemble by discarding classifiers that contribute minimally to the decision making. RESULTS For a challenging dataset of about 150 thermograms, our approach achieves an excellent sensitivity of 83.10%, while maintaining a high specificity of 89.44%. This not only signifies improved recognition of malignant cases, it also statistically outperforms other state-of-the-art algorithms designed for imbalanced classification, and hence provides an effective approach for analysing breast thermograms. 
CONCLUSIONS Our proposed hybrid cost-sensitive ensemble can facilitate a highly accurate early diagnosis of breast cancer based on thermogram features. It overcomes the difficulties posed by the imbalanced distribution of patients in the two analysed groups.
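The idea of assigning a higher misclassification cost to the malignant class can be illustrated with the basic minimum-expected-cost decision rule. This is a generic sketch rather than the ensemble described in the abstract (which uses cost-sensitive decision trees and a genetic algorithm); the cost values and function names are hypothetical.

```python
def cost_threshold(cost_fn: float, cost_fp: float) -> float:
    """Probability above which predicting 'malignant' has lower expected cost.

    Predicting benign costs p * cost_fn in expectation; predicting malignant
    costs (1 - p) * cost_fp. The two are equal at p = cost_fp / (cost_fp + cost_fn).
    """
    if cost_fn <= 0 or cost_fp <= 0:
        raise ValueError("costs must be positive")
    return cost_fp / (cost_fp + cost_fn)


def classify(p_malignant: float, cost_fn: float, cost_fp: float) -> str:
    """Pick the label with the lower expected misclassification cost."""
    if p_malignant > cost_threshold(cost_fn, cost_fp):
        return "malignant"
    return "benign"


# Hypothetical costs: missing a cancer is nine times worse than a false alarm.
label_cost_sensitive = classify(0.2, cost_fn=9.0, cost_fp=1.0)
label_equal_costs = classify(0.2, cost_fn=1.0, cost_fp=1.0)
```

With equal costs the threshold is 0.5, so a case with estimated probability 0.2 is called benign; raising the cost of a missed cancer lowers the threshold and the same case is flagged as malignant, which raises sensitivity at the expense of specificity.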
Affiliation(s)
- Bartosz Krawczyk
  - Department of Systems and Computer Networks, Wrocław University of Technology, Wyb. Wyspianskiego 27, 50-370 Wrocław, Poland
- Gerald Schaefer
  - Department of Computer Science, Loughborough University, Loughborough LE11 3TU, UK
- Michał Woźniak
  - Department of Systems and Computer Networks, Wrocław University of Technology, Wyb. Wyspianskiego 27, 50-370 Wrocław, Poland

663
Matrixized Learning Machine with Feature-Clustering Interpolation. Neural Process Lett 2015. [DOI: 10.1007/s11063-015-9458-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/26/2022]

664
Guo P, Zeng F, Hu X, Zhang D, Zhu S, Deng Y, Hao Y. Improved Variable Selection Algorithm Using a LASSO-Type Penalty, with an Application to Assessing Hepatitis B Infection Relevant Factors in Community Residents. PLoS One 2015. [PMID: 26214802 PMCID: PMC4516242 DOI: 10.1371/journal.pone.0134151] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.4] [Indexed: 11/18/2022]
Abstract
OBJECTIVES In epidemiological studies, it is important to identify independent associations between collective exposures and a health outcome. The current stepwise selection technique ignores stochastic errors and suffers from a lack of stability. The alternative LASSO-penalized regression model can be applied to detect significant predictors from a pool of candidate variables. However, this technique is prone to false positives and tends to create excessive biases. It remains challenging to develop robust variable selection methods and enhance predictability. MATERIAL AND METHODS Two improved algorithms denoted the two-stage hybrid and bootstrap ranking procedures, both using a LASSO-type penalty, were developed for epidemiological association analysis. The performance of the proposed procedures and other methods including conventional LASSO, Bolasso, stepwise and stability selection models were evaluated using intensive simulation. In addition, methods were compared by using an empirical analysis based on large-scale survey data of hepatitis B infection-relevant factors among Guangdong residents. RESULTS The proposed procedures produced comparable or less biased selection results when compared to conventional variable selection models. In total, the two newly proposed procedures were stable with respect to various scenarios of simulation, demonstrating a higher power and a lower false positive rate during variable selection than the compared methods. In empirical analysis, the proposed procedures yielding a sparse set of hepatitis B infection-relevant factors gave the best predictive performance and showed that the procedures were able to select a more stringent set of factors. The individual history of hepatitis B vaccination, family and individual history of hepatitis B infection were associated with hepatitis B infection in the studied residents according to the proposed procedures. 
CONCLUSIONS The newly proposed procedures improve the identification of significant variables and enable us to derive a new insight into epidemiological association analysis.
Affiliation(s)
- Pi Guo
  - Department of Medical Statistics and Epidemiology and Health Information Research Center, School of Public Health, Sun Yat-sen University, Guangzhou, Guangdong, 510080, China
  - Laboratory of Health Informatics, Guangdong Key Laboratory of Medicine, Sun Yat-sen University, Guangzhou, Guangdong, 510080, China
- Fangfang Zeng
  - Department of Medical Statistics and Epidemiology and Health Information Research Center, School of Public Health, Sun Yat-sen University, Guangzhou, Guangdong, 510080, China
  - Laboratory of Health Informatics, Guangdong Key Laboratory of Medicine, Sun Yat-sen University, Guangzhou, Guangdong, 510080, China
- Xiaomin Hu
  - Department of Medical Statistics and Epidemiology and Health Information Research Center, School of Public Health, Sun Yat-sen University, Guangzhou, Guangdong, 510080, China
  - Laboratory of Health Informatics, Guangdong Key Laboratory of Medicine, Sun Yat-sen University, Guangzhou, Guangdong, 510080, China
- Dingmei Zhang
  - Department of Medical Statistics and Epidemiology and Health Information Research Center, School of Public Health, Sun Yat-sen University, Guangzhou, Guangdong, 510080, China
  - Laboratory of Health Informatics, Guangdong Key Laboratory of Medicine, Sun Yat-sen University, Guangzhou, Guangdong, 510080, China
- Shuming Zhu
  - Department of Medical Statistics and Epidemiology and Health Information Research Center, School of Public Health, Sun Yat-sen University, Guangzhou, Guangdong, 510080, China
  - Laboratory of Health Informatics, Guangdong Key Laboratory of Medicine, Sun Yat-sen University, Guangzhou, Guangdong, 510080, China
- Yu Deng
  - Department of Medical Statistics and Epidemiology and Health Information Research Center, School of Public Health, Sun Yat-sen University, Guangzhou, Guangdong, 510080, China
  - Laboratory of Health Informatics, Guangdong Key Laboratory of Medicine, Sun Yat-sen University, Guangzhou, Guangdong, 510080, China
- Yuantao Hao
  - Department of Medical Statistics and Epidemiology and Health Information Research Center, School of Public Health, Sun Yat-sen University, Guangzhou, Guangdong, 510080, China
  - Laboratory of Health Informatics, Guangdong Key Laboratory of Medicine, Sun Yat-sen University, Guangzhou, Guangdong, 510080, China

665
Napierala K, Stefanowski J. Types of minority class examples and their influence on learning classifiers from imbalanced data. J Intell Inf Syst 2015. [DOI: 10.1007/s10844-015-0368-1] [Citation(s) in RCA: 69] [Impact Index Per Article: 6.9] [Indexed: 10/23/2022]

666
Tomczak JM, Zięba M. Probabilistic combination of classification rules and its application to medical diagnosis. Mach Learn 2015. [DOI: 10.1007/s10994-015-5508-x] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.8] [Indexed: 11/25/2022]

667
A survey of fingerprint classification Part I: Taxonomies on feature extraction methods and learning models. Knowl Based Syst 2015. [DOI: 10.1016/j.knosys.2015.02.008] [Citation(s) in RCA: 49] [Impact Index Per Article: 4.9] [Indexed: 11/19/2022]

668
Dai HL. Imbalanced Protein Data Classification Using Ensemble FTM-SVM. IEEE Trans Nanobioscience 2015; 14:350-359. [DOI: 10.1109/tnb.2015.2431292] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.3] [Indexed: 11/08/2022]

669
Xu J, Yao L, Li L. Argumentation based joint learning: a novel ensemble learning approach. PLoS One 2015; 10:e0127281. [PMID: 25966359 PMCID: PMC4428879 DOI: 10.1371/journal.pone.0127281] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.7] [Received: 11/20/2014] [Accepted: 04/13/2015] [Indexed: 11/18/2022]
Abstract
Recently, ensemble learning methods have been widely used to improve classification performance in machine learning. In this paper, we present a novel ensemble learning method: argumentation based multi-agent joint learning (AMAJL), which integrates ideas from multi-agent argumentation, ensemble learning, and association rule mining. In AMAJL, argumentation technology is introduced as an ensemble strategy to integrate multiple base classifiers and generate a high performance ensemble classifier. We design an argumentation framework named Arena as a communication platform for knowledge integration. Through argumentation based joint learning, high quality individual knowledge can be extracted, and thus a refined global knowledge base can be generated and used independently for classification. We perform numerous experiments on multiple public datasets using AMAJL and other benchmark methods. The results demonstrate that our method can effectively extract high quality knowledge for ensemble classifier and improve the performance of classification.
Affiliation(s)
- Junyi Xu
  - Science and Technology on Information System and Engineering Laboratory, National University of Defense Technology, Changsha, Hunan, P.R. China
- Li Yao
  - Science and Technology on Information System and Engineering Laboratory, National University of Defense Technology, Changsha, Hunan, P.R. China
- Le Li
  - Science and Technology on Information System and Engineering Laboratory, National University of Defense Technology, Changsha, Hunan, P.R. China

670
Tan SC, Watada J, Ibrahim Z, Khalid M. Evolutionary fuzzy ARTMAP neural networks for classification of semiconductor defects. IEEE Transactions on Neural Networks and Learning Systems 2015; 26:933-950. [PMID: 25014967 DOI: 10.1109/tnnls.2014.2329097] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.0] [Indexed: 06/03/2023]
Abstract
Wafer defect detection using an intelligent system is an approach to quality improvement in semiconductor manufacturing that aims to enhance process stability, increase production capacity, and improve yields. Occasionally, only a few records that indicate defective units are available, and they form a minority group in a large database. Such a situation leads to an imbalanced data set problem, which poses a great challenge for machine-learning techniques seeking an effective solution. In addition, the database may comprise overlapping samples of different classes. This paper introduces two models of evolutionary fuzzy ARTMAP (FAM) neural networks to deal with imbalanced data set problems in semiconductor manufacturing operations. In particular, both the FAM models and hybrid genetic algorithms are integrated in the proposed evolutionary artificial neural networks (EANNs) to classify an imbalanced data set. In addition, one of the proposed EANNs incorporates a facility to learn overlapping samples of different classes from the imbalanced data environment. The classification results of the proposed evolutionary FAM neural networks are presented, compared, and analyzed using several classification metrics. The outcomes positively indicate the effectiveness of the proposed networks in handling classification problems with imbalanced data sets.

672
Pan S, Wu J, Zhu X, Zhang C. Graph ensemble boosting for imbalanced noisy graph stream classification. IEEE Transactions on Cybernetics 2015; 45:940-954. [PMID: 25167562 DOI: 10.1109/tcyb.2014.2341031] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 06/03/2023]
Abstract
Many applications involve stream data with structural dependency, graph representations, and continuously increasing volumes. For these applications, it is very common that their class distributions are imbalanced, with minority (or positive) samples being only a small portion of the population, which imposes significant challenges for learning models to accurately identify minority samples. This problem is further complicated by the presence of noise, because noisy samples are similar to minority samples, and any treatment for the class imbalance may falsely focus on the noise and result in deterioration of accuracy. In this paper, we propose a classification model to tackle imbalanced graph streams with noise. Our method, graph ensemble boosting, employs an ensemble-based framework to partition the graph stream into chunks, each containing a number of noisy graphs with imbalanced class distributions. For each individual chunk, we propose a boosting algorithm to combine discriminative subgraph pattern selection and model learning as a unified framework for graph classification. To tackle concept drift in graph streams, an instance-level weighting mechanism is used to dynamically adjust the instance weights, through which the boosting framework can emphasize difficult graph samples. The classifiers built from different graph chunks form an ensemble for graph stream classification. Experiments on real-life imbalanced graph streams demonstrate clear benefits of our boosting design for handling imbalanced noisy graph streams.

673
Negative correlation learning for customer churn prediction: a comparison study. ScientificWorldJournal 2015; 2015:473283. [PMID: 25879060 PMCID: PMC4386545 DOI: 10.1155/2015/473283] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.1] [Received: 06/17/2014] [Revised: 08/23/2014] [Accepted: 09/07/2014] [Indexed: 11/28/2022]
Abstract
Recently, telecommunication companies have been paying more attention to the problem of identifying customer churn behavior. In business, it is well known among service providers that attracting new customers is much more expensive than retaining existing ones. Therefore, adopting accurate models that are able to predict customer churn can effectively help in customer retention campaigns and in maximizing profit. In this paper we utilize an ensemble of multilayer perceptrons (MLP) trained using negative correlation learning (NCL) to predict customer churn in a telecommunication company. Experimental results confirm that the NCL-based MLP ensemble can achieve better generalization performance (high churn rate) compared with an ensemble of MLPs without NCL (flat ensemble) and other common data mining techniques used for churn analysis.
|
675
|
Support vector machine-based optimized decision threshold adjustment strategy for classifying imbalanced data. Knowl Based Syst 2015. [DOI: 10.1016/j.knosys.2014.12.007] [Citation(s) in RCA: 65] [Impact Index Per Article: 6.5] [Indexed: 11/23/2022]
|
676
|
Wang H, Xu Q, Zhou L. Large unbalanced credit scoring using Lasso-logistic regression ensemble. PLoS One 2015; 10:e0117844. [PMID: 25706988 PMCID: PMC4338292 DOI: 10.1371/journal.pone.0117844] [Citation(s) in RCA: 41] [Impact Index Per Article: 4.1] [Received: 08/05/2014] [Accepted: 12/31/2014] [Indexed: 11/19/2022] Open
Abstract
Recently, various ensemble learning methods with different base classifiers have been proposed for credit scoring problems. However, for various reasons, there has been little research using logistic regression as the base classifier. In this paper, given large unbalanced data, we consider the plausibility of ensemble learning using regularized logistic regression as the base classifier to deal with credit scoring problems. In this research, the data is first balanced and diversified by clustering and bagging algorithms. Then we apply a Lasso-logistic regression learning ensemble to evaluate the credit risks. We show that the proposed algorithm outperforms popular credit scoring models such as decision tree, Lasso-logistic regression and random forests in terms of AUC and F-measure. We also provide two importance measures for the proposed model to identify important variables in the data.
Affiliation(s)
- Hong Wang
- School of Mathematics & Statistics, Central South University, Changsha, Hunan, China
- Qingsong Xu
- School of Mathematics & Statistics, Central South University, Changsha, Hunan, China
- Lifeng Zhou
- School of Mathematics & Statistics, Central South University, Changsha, Hunan, China
|
677
|
Yu Z, Li L, Liu J, Han G. Hybrid adaptive classifier ensemble. IEEE TRANSACTIONS ON CYBERNETICS 2015; 45:177-190. [PMID: 24860045 DOI: 10.1109/tcyb.2014.2322195] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.8] [Indexed: 06/03/2023]
Abstract
Traditional random subspace-based classifier ensemble approaches (RSCE) have several limitations, such as assigning the same importance to all base classifiers trained in different subspaces and not considering how to find the optimal random subspace set. In this paper, we design a general hybrid adaptive ensemble learning framework (HAEL) and apply it to address the limitations of RSCE. Compared with RSCE, HAEL consists of two adaptive processes, i.e., base classifier competition and classifier ensemble interaction, so as to adjust the weights of the base classifiers in each ensemble and to explore the optimal random subspace set simultaneously. The experiments on the real-world datasets from the KEEL dataset repository for the classification task and on the cancer gene expression profiles show that: 1) HAEL works well on both the real-world KEEL datasets and the cancer gene expression profiles and 2) it outperforms most of the state-of-the-art classifier ensemble approaches on 28 out of 36 KEEL datasets and 6 out of 6 cancer datasets.
|
680
|
Krawczyk B, Woźniak M. Hypertension Type Classification Using Hierarchical Ensemble of One-Class Classifiers for Imbalanced Data. ICT INNOVATIONS 2014 2015. [DOI: 10.1007/978-3-319-09879-1_34] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.7] [Indexed: 01/24/2023]
|
681
|
Krawczyk B, Woźniak M. Cost-Sensitive Neural Network with ROC-Based Moving Threshold for Imbalanced Classification. INTELLIGENT DATA ENGINEERING AND AUTOMATED LEARNING – IDEAL 2015 2015. [DOI: 10.1007/978-3-319-24834-9_6] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.4] [Indexed: 12/04/2022]
|
682
|
Piir G, Sild S, Maran U. Classifying bio-concentration factor with random forest algorithm, influence of the bio-accumulative vs. non-bio-accumulative compound ratio to modelling result, and applicability domain for random forest model. SAR AND QSAR IN ENVIRONMENTAL RESEARCH 2014; 25:967-81. [PMID: 25482723 DOI: 10.1080/1062936x.2014.969310] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 06/10/2014] [Accepted: 08/03/2014] [Indexed: 05/27/2023]
Abstract
In environmental risk assessment, the bio-concentration factor (BCF) is a widely used parameter in the estimation of the bio-accumulation potential of chemicals. BCF data often have an uneven distribution of classes (bio-accumulative vs. non-bio-accumulative), which could severely bias the classification results towards the prevailing class. The present study focuses on the influence of the uneven distribution of the classes in the training phase of Random Forest (RF) classification models. Three different training set designs were used, and descriptors were selected for the models based on their occurrence frequency in RF trees and on the mechanistic aspects they reflect. Models were compared and their classification performance was analysed, indicating good predictive characteristics (sensitivity = 0.90 and specificity = 0.83) for the balanced set; imbalanced sets also have their strengths in certain application scenarios. The confidence of classifications was assessed with a new schema for the applicability domain that makes use of the RF proximity matrix by analysing the similarity between the predicted compound and the training set of the model. All developed models were made available in a transparent, accessible and reproducible way in the QsarDB repository (http://dx.doi.org/10.15152/QDB.116).
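For readers unfamiliar with the two metrics quoted in this abstract, the following is a minimal sketch of how sensitivity and specificity are computed from predicted and true class labels. The function name, the label encoding (1 = bio-accumulative, 0 = non-bio-accumulative) and the toy labels are assumptions made for illustration; they are not taken from the study.

```python
def sensitivity_specificity(y_true, y_pred, positive=1):
    """Return (sensitivity, specificity) for binary labels.

    sensitivity = TP / (TP + FN): share of true positives that are recovered.
    specificity = TN / (TN + FP): share of true negatives that are recovered.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    return sens, spec


# Toy labels (hypothetical): 4 positives followed by 6 negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
sens, spec = sensitivity_specificity(y_true, y_pred)
print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
```

Reporting both values matters on uneven class distributions, because a model that always predicts the prevailing class can reach high accuracy while one of the two metrics collapses to zero.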
Affiliation(s)
- G Piir
- Institute of Chemistry, University of Tartu, Tartu, Estonia
|
683
|
Antonelli M, Ducange P, Marcelloni F. An experimental study on evolutionary fuzzy classifiers designed for managing imbalanced datasets. Neurocomputing 2014. [DOI: 10.1016/j.neucom.2014.04.070] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.7] [Indexed: 11/29/2022]
|
684
|
Peng L, Zhang H, Yang B, Chen Y. A new approach for imbalanced data classification based on data gravitation. Inf Sci (N Y) 2014. [DOI: 10.1016/j.ins.2014.04.046] [Citation(s) in RCA: 55] [Impact Index Per Article: 5.0] [Indexed: 11/28/2022]
|
685
|
del Río S, López V, Benítez JM, Herrera F. On the use of MapReduce for imbalanced big data using Random Forest. Inf Sci (N Y) 2014. [DOI: 10.1016/j.ins.2014.03.043] [Citation(s) in RCA: 195] [Impact Index Per Article: 17.7] [Indexed: 11/24/2022]
|
686
|
Qian Y, Liang Y, Li M, Feng G, Shi X. A resampling ensemble algorithm for classification of imbalance problems. Neurocomputing 2014. [DOI: 10.1016/j.neucom.2014.06.021] [Citation(s) in RCA: 35] [Impact Index Per Article: 3.2] [Indexed: 10/25/2022]
|
687
|
Cano A, Nguyen DT, Ventura S, Cios KJ. ur-CAIM: improved CAIM discretization for unbalanced and balanced data. Soft comput 2014. [DOI: 10.1007/s00500-014-1488-1] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.0] [Indexed: 10/24/2022]
|
688
|
Prati RC, Batista GEAPA, Silva DF. Class imbalance revisited: a new experimental setup to assess the performance of treatment methods. Knowl Inf Syst 2014. [DOI: 10.1007/s10115-014-0794-3] [Citation(s) in RCA: 73] [Impact Index Per Article: 6.6] [Indexed: 10/24/2022]
|
689
|
Bolón-Canedo V, Sánchez-Maroño N, Alonso-Betanzos A, Benítez J, Herrera F. A review of microarray datasets and applied feature selection methods. Inf Sci (N Y) 2014. [DOI: 10.1016/j.ins.2014.05.042] [Citation(s) in RCA: 386] [Impact Index Per Article: 35.1] [Indexed: 11/28/2022]
|
690
|
Lee PH. Resampling methods improve the predictive power of modeling in class-imbalanced datasets. INTERNATIONAL JOURNAL OF ENVIRONMENTAL RESEARCH AND PUBLIC HEALTH 2014; 11:9776-89. [PMID: 25238271 PMCID: PMC4199049 DOI: 10.3390/ijerph110909776] [Citation(s) in RCA: 40] [Impact Index Per Article: 3.6] [Received: 06/20/2014] [Revised: 09/04/2014] [Accepted: 09/12/2014] [Indexed: 11/20/2022]
Abstract
In the medical field, many outcome variables are dichotomized, and the two possible values of a dichotomized variable are referred to as classes. A dichotomized dataset is class-imbalanced if it consists mostly of one class, and the performance of common classification models on this type of dataset tends to be suboptimal. To tackle such a problem, resampling methods, including oversampling and undersampling, can be used. This paper aims at illustrating the effect of resampling methods using the National Health and Nutrition Examination Survey (NHANES) wave 2009–2010 dataset. A total of 4677 participants aged ≥20 without self-reported diabetes and with valid blood test results were analyzed. The Classification and Regression Tree (CART) procedure was used to build a classification model on undiagnosed diabetes. A participant demonstrated evidence of diabetes according to WHO diabetes criteria. Exposure variables included demographics and socio-economic status. CART models were fitted using a randomly selected 70% of the data (training dataset), and the area under the receiver operating characteristic curve (AUC) was computed using the remaining 30% of the sample for evaluation (testing dataset). CART models were fitted using the training dataset, the oversampled training dataset, the weighted training dataset, and the undersampled training dataset. In addition, resampling case-to-control ratios of 1:1, 1:2, and 1:4 were examined. The effects of resampling methods on the performance of other extensions of CART (random forests and generalized boosted trees) were also examined. CARTs fitted on the oversampled (AUC = 0.70) and undersampled training data (AUC = 0.74) yielded a better classification power than that on the training data (AUC = 0.65). Resampling could also improve the classification power of random forests and generalized boosted trees. To conclude, applying resampling methods in a class-imbalanced dataset improved the classification power of CART, random forests, and generalized boosted trees.
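The resampling step described in this abstract can be sketched as follows. This is a minimal illustration of random undersampling and random oversampling to a chosen case-to-control ratio, not the paper's implementation: the function names, the fixed seed, and the toy data are assumptions, and the sketch covers only the resampling of a training set, not the CART fitting or the AUC evaluation.

```python
import random


def undersample(cases, controls, controls_per_case, seed=0):
    """Randomly drop controls (majority class) so that at most
    controls_per_case controls remain per case. Cases are unchanged."""
    rng = random.Random(seed)
    target = min(len(controls), controls_per_case * len(cases))
    return list(cases), rng.sample(controls, target)


def oversample(cases, controls, controls_per_case, seed=0):
    """Randomly duplicate cases (minority class), sampling with replacement,
    until the case-to-control ratio reaches 1:controls_per_case."""
    rng = random.Random(seed)
    needed = -(-len(controls) // controls_per_case)  # ceiling division
    target = max(len(cases), needed)
    extra = [rng.choice(cases) for _ in range(target - len(cases))]
    return list(cases) + extra, list(controls)


# Toy training set (hypothetical, not the NHANES sample): 10 cases, 90 controls.
cases = [("case", i) for i in range(10)]
controls = [("control", i) for i in range(90)]

for k in (1, 2, 4):  # case-to-control ratios 1:1, 1:2 and 1:4
    _, kept = undersample(cases, controls, k)
    grown, _ = oversample(cases, controls, k)
    print(f"1:{k} undersampled controls={len(kept)} oversampled cases={len(grown)}")
```

In a setup like the one described, resampling would be applied to the training split only, so that the held-out testing split keeps the original class distribution.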
Affiliation(s)
- Paul H Lee
- School of Nursing, Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong.
|
694
|
Zhang CX, Wang GW, Zhang JS, Guo G, Ying QY. IRUSRT: A Novel Imbalanced Learning Technique by Combining Inverse Random Under Sampling and Random Tree. COMMUN STAT-SIMUL C 2014. [DOI: 10.1080/03610918.2013.765467] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 10/26/2022]
|
695
|
Xia SX, Meng FR, Liu B, Zhou Y. A Kernel Clustering-Based Possibilistic Fuzzy Extreme Learning Machine for Class Imbalance Learning. Cognit Comput 2014. [DOI: 10.1007/s12559-014-9256-1] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.5] [Indexed: 11/29/2022]
|
696
|
Valentini G. Hierarchical ensemble methods for protein function prediction. ISRN BIOINFORMATICS 2014; 2014:901419. [PMID: 25937954 PMCID: PMC4393075 DOI: 10.1155/2014/901419] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.6] [Received: 02/02/2014] [Accepted: 02/25/2014] [Indexed: 12/11/2022]
Abstract
Protein function prediction is a complex multiclass multilabel classification problem, characterized by multiple issues such as the incompleteness of the available annotations, the integration of multiple sources of high dimensional biomolecular data, the imbalance of several functional classes, and the difficulty of univocally determining negative examples. Moreover, the hierarchical relationships between functional classes that characterize both the Gene Ontology and FunCat taxonomies motivate the development of hierarchy-aware prediction methods, which have shown significantly better performance than hierarchy-unaware "flat" prediction methods. In this paper, we provide a comprehensive review of hierarchical methods for protein function prediction based on ensembles of learning machines. According to this general approach, a separate learning machine is trained to learn a specific functional term, and then the resulting predictions are assembled in a "consensus" ensemble decision, taking into account the hierarchical relationships between classes. The main hierarchical ensemble methods proposed in the literature are discussed in the context of existing computational methods for protein function prediction, highlighting their characteristics, advantages, and limitations. Open problems of this exciting research area of computational biology are finally considered, outlining novel perspectives for future research.
Affiliation(s)
- Giorgio Valentini
- AnacletoLab-Dipartimento di Informatica, Università degli Studi di Milano, Via Comelico 39, 20135 Milano, Italy
|
697
|
Kumar NS, Rao KN, Govardhan A, Reddy KS, Mahmood AM. Undersampled K-means approach for handling imbalanced distributed data. PROGRESS IN ARTIFICIAL INTELLIGENCE 2014. [DOI: 10.1007/s13748-014-0045-6] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.9] [Indexed: 10/25/2022]
|
698
|
Zhang Y, Fu P, Liu W, Chen G. Imbalanced data classification based on scaling kernel-based support vector machine. Neural Comput Appl 2014. [DOI: 10.1007/s00521-014-1584-2] [Citation(s) in RCA: 46] [Impact Index Per Article: 4.2] [Indexed: 11/28/2022]
|
699
|
Cao P, Yang J, Li W, Zhao D, Zaiane O. Ensemble-based hybrid probabilistic sampling for imbalanced data learning in lung nodule CAD. Comput Med Imaging Graph 2014; 38:137-50. [DOI: 10.1016/j.compmedimag.2013.12.003] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.5] [Received: 05/13/2013] [Revised: 10/19/2013] [Accepted: 12/02/2013] [Indexed: 01/15/2023]
|
700
|
Galar M, Fernández A, Barrenechea E, Herrera F. Empowering difficult classes with a similarity-based aggregation in multi-class classification problems. Inf Sci (N Y) 2014. [DOI: 10.1016/j.ins.2013.12.053] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.6] [Indexed: 11/17/2022]
|