701
|
Yang P, Yoo PD, Fernando J, Zhou BB, Zhang Z, Zomaya AY. Sample Subset Optimization Techniques for Imbalanced and Ensemble Learning Problems in Bioinformatics Applications. IEEE Transactions on Cybernetics 2014; 44:445-55. [PMID: 24108722 DOI: 10.1109/tcyb.2013.2257480] [Citation(s) in RCA: 35] [Impact Index Per Article: 3.2] [Indexed: 05/20/2023]
Abstract
Data sampling is a widely used technique in a broad range of machine learning problems. Traditional sampling approaches generally rely on random resampling from a given dataset. However, these approaches do not take into consideration additional information, such as sample quality and usefulness. We recently proposed a data sampling technique, called sample subset optimization (SSO). The SSO technique relies on a cross-validation procedure for identifying and selecting the most useful samples as subsets. In this paper, we describe the application of SSO techniques to imbalanced and ensemble learning problems, respectively. For imbalanced learning, the SSO technique is employed as an under-sampling technique for identifying a subset of highly discriminative samples in the majority class. In ensemble learning, the SSO technique is utilized as a generic ensemble technique where multiple optimized subsets of samples from each class are selected for building an ensemble classifier. We demonstrate the utilities and advantages of the proposed techniques on a variety of bioinformatics applications where class imbalance, small sample size, and noisy data are prevalent.
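For contrast with SSO, the sketch below shows the traditional baseline the abstract refers to: random under-sampling of the majority class, which ignores sample quality entirely. It is not the SSO procedure itself; the function name `random_undersample` and the toy data are assumptions made for illustration.

```python
import random
from collections import Counter


def random_undersample(samples, labels, seed=0):
    """Randomly keep as many samples per class as the smallest class has."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    target = min(len(xs) for xs in by_class.values())
    out_x, out_y = [], []
    for y, xs in by_class.items():
        # Random subset: no information about sample usefulness is used here,
        # which is the limitation the abstract attributes to such approaches.
        kept = rng.sample(xs, target)
        out_x.extend(kept)
        out_y.extend([y] * target)
    return out_x, out_y


# Toy imbalanced dataset (hypothetical): 2 minority vs 10 majority samples.
X = list(range(12))
y = [1] * 2 + [0] * 10
Xb, yb = random_undersample(X, y)
print(Counter(yb))
```

An SSO-style method would replace the `rng.sample` step with a selection driven by cross-validation performance, as the abstract describes.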
|
702
|
Bouktif S, Hanna EM, Zaki N, Khousa EA. Ant colony optimization algorithm for interpretable Bayesian classifiers combination: application to medical predictions. PLoS One 2014; 9:e86456. [PMID: 24498276 PMCID: PMC3911928 DOI: 10.1371/journal.pone.0086456] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Received: 06/07/2013] [Accepted: 12/14/2013] [Indexed: 11/30/2022] Open
Abstract
Prediction and classification techniques have been well studied by machine learning researchers and developed for several real-world problems. However, the level of acceptance and success of prediction models is still below expectation due to some difficulties, such as the low performance of prediction models when they are applied in different environments. Such a problem has been addressed by many researchers, mainly from the machine learning community. A second problem, principally raised by model users in different communities, such as managers, economists, engineers, biologists, and medical practitioners, is the prediction models' interpretability. The latter is the ability of a model to explain its predictions and exhibit the causal relationships between the inputs and the outputs. In the case of classification, a successful way to alleviate the low performance is to use ensemble classifiers. It is an intuitive strategy to activate collaboration between different classifiers towards a better performance than that of an individual classifier. Unfortunately, ensemble classifier methods do not take into account the interpretability of the final classification outcome. They even worsen the original interpretability of the individual classifiers. In this paper we propose a novel implementation of the classifier combination approach that not only promotes the overall performance but also preserves the interpretability of the resulting model. We propose a solution based on Ant Colony Optimization and tailored for the case of Bayesian classifiers. We validate our proposed solution with case studies from the medical domain, namely heart disease and cardiotocography-based predictions, problems where interpretability is critical to making appropriate clinical decisions. Availability: The datasets, prediction models, and software tool, together with supplementary materials, are available at http://faculty.uaeu.ac.ae/salahb/ACO4BC.htm.
Affiliation(s)
- Salah Bouktif
  - Software Development, College of Information Technology, United Arab Emirates University (UAEU), Al-Ain, UAE
- Eileen Marie Hanna
  - Intelligent Systems, College of Information Technology, United Arab Emirates University (UAEU), Al-Ain, UAE
- Nazar Zaki
  - Intelligent Systems, College of Information Technology, United Arab Emirates University (UAEU), Al-Ain, UAE
- Eman Abu Khousa
  - Enterprise Systems, College of Information Technology, United Arab Emirates University (UAEU), Al-Ain, UAE
|
703
|
Maratea A, Petrosino A, Manzo M. Adjusted F-measure and kernel scaling for imbalanced data learning. Inf Sci (N Y) 2014. [DOI: 10.1016/j.ins.2013.04.016] [Citation(s) in RCA: 58] [Impact Index Per Article: 5.3] [Indexed: 10/26/2022]
|
704
|
Bria A, Karssemeijer N, Tortorella F. Learning from unbalanced data: A cascade-based approach for detecting clustered microcalcifications. Med Image Anal 2014; 18:241-52. [DOI: 10.1016/j.media.2013.10.014] [Citation(s) in RCA: 57] [Impact Index Per Article: 5.2] [Received: 06/26/2013] [Revised: 10/18/2013] [Accepted: 10/31/2013] [Indexed: 11/29/2022]
|
705
|
Automatic denoising of functional MRI data: combining independent component analysis and hierarchical fusion of classifiers. Neuroimage 2014; 90:449-68. [PMID: 24389422 DOI: 10.1016/j.neuroimage.2013.11.046] [Citation(s) in RCA: 1237] [Impact Index Per Article: 112.5] [Received: 08/09/2013] [Revised: 11/05/2013] [Accepted: 11/27/2013] [Indexed: 01/24/2023] Open
Abstract
Many sources of fluctuation contribute to the fMRI signal, and this makes identifying the effects that are truly related to the underlying neuronal activity difficult. Independent component analysis (ICA) - one of the most widely used techniques for the exploratory analysis of fMRI data - has been shown to be a powerful technique in identifying various sources of neuronally-related and artefactual fluctuation in fMRI data (both with the application of external stimuli and with the subject "at rest"). ICA decomposes fMRI data into patterns of activity (a set of spatial maps and their corresponding time series) that are statistically independent and add linearly to explain voxel-wise time series. Given the set of ICA components, if the components representing "signal" (brain activity) can be distinguished from the "noise" components (effects of motion, non-neuronal physiology, scanner artefacts and other nuisance sources), the latter can then be removed from the data, providing an effective cleanup of structured noise. Manual classification of components is labour intensive and requires expertise; hence, a fully automatic noise detection algorithm that can reliably detect various types of noise sources (in both task and resting fMRI) is desirable. In this paper, we introduce FIX ("FMRIB's ICA-based X-noiseifier"), which provides an automatic solution for denoising fMRI data via accurate classification of ICA components. For each ICA component, FIX generates a large number of distinct spatial and temporal features, each describing a different aspect of the data (e.g., what proportion of temporal fluctuations are at high frequencies). The set of features is then fed into a multi-level classifier (built around several different classifiers). Once trained through the hand-classification of a sufficient number of training datasets, the classifier can then automatically classify new datasets. The noise components can then be subtracted from (or regressed out of) the original data, to provide automated cleanup. On conventional resting-state fMRI (rfMRI) single-run datasets, FIX achieved about 95% overall accuracy. On high-quality rfMRI data from the Human Connectome Project, FIX achieves over 99% classification accuracy, and as a result is being used in the default rfMRI processing pipeline for generating HCP connectomes. FIX is publicly available as a plugin for FSL.
|
706
|
Boosted SVM for extracting rules from imbalanced data in application to prediction of the post-operative life expectancy in the lung cancer patients. Appl Soft Comput 2014. [DOI: 10.1016/j.asoc.2013.07.016] [Citation(s) in RCA: 89] [Impact Index Per Article: 8.1] [Indexed: 11/19/2022]
|
707
|
Fernández-Navarro F, Campoy-Muñoz P, de la Paz-Marín M. Addressing the EU sovereign ratings using an ordinal regression approach. IEEE Transactions on Cybernetics 2013; 43:2228-2240. [PMID: 24235262 DOI: 10.1109/tsmcc.2013.2247595] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 06/02/2023]
Abstract
The current European debt crisis has drawn considerable attention to credit-rating agencies' news about sovereign ratings. From a technical point of view, credit rating constitutes a typical ordinal regression problem because credit-rating agencies generally present a scale of risk composed of several categories. This fact motivated the use of an ordinal regression approach to address the problem of sovereign credit rating in this paper. Therefore, the ranking of different classes will be taken into account for the design of the classifier. To do so, a novel model is introduced in order to replicate sovereign rating, based on the negative correlation learning framework. The methodology is fully described in this paper and applied to the classification of the 27 European countries' sovereign rating during the 2007-2010 period based on Standard and Poor's reports. The proposed technique seems to be competitive and robust enough to classify the sovereign ratings reported by this agency when compared with other existing well-known ordinal and nominal methods.
|
708
|
López V, Fernández A, García S, Palade V, Herrera F. An insight into classification with imbalanced data: Empirical results and current trends on using data intrinsic characteristics. Inf Sci (N Y) 2013. [DOI: 10.1016/j.ins.2013.07.007] [Citation(s) in RCA: 932] [Impact Index Per Article: 77.7] [Indexed: 11/15/2022]
|
709
|
|
710
|
Abstract
Extreme learning machine (ELM) is an efficient and practical learning algorithm used for training single hidden layer feed-forward neural networks (SLFNs). ELM can provide good generalization performance at extremely fast learning speed. However, ELM suffers from instability and over-fitting, especially on relatively large datasets. Based on probabilistic SLFNs, an approach of fusion of extreme learning machine (F-ELM) with fuzzy integral is proposed in this paper. The proposed algorithm consists of three stages. Firstly, the bootstrap technique is employed to generate several subsets of original dataset. Secondly, probabilistic SLFNs are trained with ELM algorithm on each subset. Finally, the trained probabilistic SLFNs are fused with fuzzy integral. The experimental results show that the proposed approach can alleviate to some extent the problems mentioned above, and can increase the prediction accuracy.
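The three stages named in the abstract (bootstrap subsets, one ELM-trained SLFN per subset, fusion) can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the F-ELM method: the fusion step uses a plain average as a stand-in for the fuzzy integral, the networks are ordinary rather than probabilistic SLFNs, and the names `train_elm`, `predict_elm`, `bagged_elm` and `fuse_mean` as well as the toy data are hypothetical. It assumes NumPy is installed.

```python
import numpy as np


def train_elm(X, y, n_hidden, rng):
    """ELM training: random, fixed hidden layer; output weights by least squares."""
    W = rng.normal(size=(X.shape[1], n_hidden))  # random input weights (never trained)
    b = rng.normal(size=n_hidden)                # random hidden biases (never trained)
    H = np.tanh(X @ W + b)                       # hidden-layer output matrix
    beta = np.linalg.pinv(H) @ y                 # least-squares output weights
    return W, b, beta


def predict_elm(model, X):
    W, b, beta = model
    return np.tanh(X @ W + b) @ beta


def bagged_elm(X, y, n_models=5, n_hidden=10, seed=0):
    """Stages 1 and 2: draw bootstrap subsets and train one ELM on each."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, len(X), size=len(X))  # sample with replacement
        models.append(train_elm(X[idx], y[idx], n_hidden, rng))
    return models


def fuse_mean(models, X):
    """Stage 3 stand-in: simple averaging instead of a fuzzy integral."""
    return np.mean([predict_elm(m, X) for m in models], axis=0)


# Toy two-class problem (hypothetical data): two well-separated Gaussian blobs.
data_rng = np.random.default_rng(1)
X = np.vstack([data_rng.normal(-2, 0.5, size=(40, 2)),
               data_rng.normal(2, 0.5, size=(40, 2))])
y = np.array([0.0] * 40 + [1.0] * 40)

models = bagged_elm(X, y)
scores = fuse_mean(models, X)
accuracy = np.mean((scores > 0.5) == (y == 1.0))
print(f"training accuracy of the averaged ensemble: {accuracy:.2f}")
```

Averaging over several bootstrap-trained networks is one common way to reduce the run-to-run instability caused by the random hidden layer; a fuzzy integral would additionally weight the networks and their interactions.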
Affiliation(s)
- Junhai Zhai
  - College of Mathematics and Computer Science, Hebei University, Baoding, 071002, Hebei, China
  - Laboratory of Machine Learning and Computational Intelligence, Hebei University, Baoding 071002, China
- Hongyu Xu
  - College of Mathematics and Computer Science, Hebei University, Baoding, 071002, Hebei, China
- Yan Li
  - College of Mathematics and Computer Science, Hebei University, Baoding, 071002, Hebei, China
  - Laboratory of Machine Learning and Computational Intelligence, Hebei University, Baoding 071002, China
|
711
|
Voisin S, Pinto F, Morin‐Ducote G, Hudson KB, Tourassi GD. Predicting diagnostic error in radiology via eye‐tracking and image analytics: Preliminary investigation in mammography. Med Phys 2013; 40:101906. [DOI: 10.1118/1.4820536] [Citation(s) in RCA: 31] [Impact Index Per Article: 2.6] [Indexed: 11/07/2022] Open
Affiliation(s)
- Sophie Voisin
  - Biomedical Science and Engineering Center, Oak Ridge National Laboratory, Oak Ridge, Tennessee 37831
- Frank Pinto
  - School of Engineering, Science, and Technology, Virginia State University, Petersburg, Virginia 23806
- Garnetta Morin‐Ducote
  - Department of Radiology, University of Tennessee Medical Center at Knoxville, Knoxville, Tennessee 37920
- Kathleen B. Hudson
  - Department of Radiology, University of Tennessee Medical Center at Knoxville, Knoxville, Tennessee 37920
- Georgia D. Tourassi
  - Biomedical Science and Engineering Center, Oak Ridge National Laboratory, Oak Ridge, Tennessee 37831
|
712
|
|
713
|
Abstract
Bagging and boosting are two of the best-known ensemble learning methods due to their theoretical performance guarantees and strong experimental results. Since bagging and boosting are an effective and open framework, several researchers have proposed variants of them, some of which have turned out to have lower classification error than the original versions. This paper summarizes these variants and categorizes them into groups. We hope that the references cited cover the major theoretical issues and provide access to the main branches of the literature dealing with such methods, guiding the researcher in interesting research directions.
|
714
|
Castro CL, Braga AP. Novel cost-sensitive approach to improve the multilayer perceptron performance on imbalanced data. IEEE Transactions on Neural Networks and Learning Systems 2013; 24:888-899. [PMID: 24808471 DOI: 10.1109/tnnls.2013.2246188] [Citation(s) in RCA: 69] [Impact Index Per Article: 5.8] [Indexed: 06/03/2023]
Abstract
Traditional learning algorithms applied to complex and highly imbalanced training sets may not give satisfactory results when distinguishing between examples of the classes. The tendency is to yield classification models that are biased towards the overrepresented (majority) class. This paper investigates this class imbalance problem in the context of multilayer perceptron (MLP) neural networks. The consequences of the equal cost (loss) assumption on imbalanced data are formally discussed from a statistical learning theory point of view. A new cost-sensitive algorithm (CSMLP) is presented to improve the discrimination ability of (two-class) MLPs. The CSMLP formulation is based on a joint objective function that uses a single cost parameter to distinguish the importance of class errors. The learning rule extends the Levenberg-Marquardt rule, ensuring the computational efficiency of the algorithm. In addition, it is theoretically demonstrated that the incorporation of prior information via the cost parameter may lead to balanced decision boundaries in the feature space. Based on the statistical analysis of results on real data, our approach shows a significant improvement of the area under the receiver operating characteristic curve and G-mean measures of regular MLPs.
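The G-mean measure mentioned in the abstract is commonly defined as the geometric mean of sensitivity and specificity. The sketch below assumes that usual definition (the abstract itself does not spell it out) and uses hypothetical toy labels to show why it penalizes a model biased towards the majority class.

```python
import math


def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity and specificity for a two-class problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return math.sqrt(sensitivity * specificity)


# Hypothetical imbalanced labels: 2 positives (minority) and 8 negatives.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
always_majority = [0] * 10                       # 80% accuracy, finds no positives
mixed = [1, 0, 0, 0, 0, 0, 0, 0, 1, 1]           # 70% accuracy, finds one positive

print(g_mean(y_true, always_majority))  # 0.0, since sensitivity is 0
print(round(g_mean(y_true, mixed), 4))  # sqrt(0.5 * 0.75), about 0.6124
```

The always-majority predictor has the higher accuracy but a G-mean of zero, which is the kind of bias a cost-sensitive objective is meant to counteract.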
|
715
|
Lopez-Molina C, De Baets B, Bustince H, Sanz J, Barrenechea E. Multiscale edge detection based on Gaussian smoothing and edge tracking. Knowl Based Syst 2013. [DOI: 10.1016/j.knosys.2013.01.026] [Citation(s) in RCA: 64] [Impact Index Per Article: 5.3] [Indexed: 11/28/2022]
|
716
|
Using the Choquet Integral in the Fuzzy Reasoning Method of Fuzzy Rule-Based Classification Systems. Axioms 2013. [DOI: 10.3390/axioms2020208] [Citation(s) in RCA: 52] [Impact Index Per Article: 4.3] [Indexed: 11/16/2022]
|
717
|
Fernández A, López V, Galar M, del Jesus MJ, Herrera F. Analysing the classification of imbalanced data-sets with multiple classes: Binarization techniques and ad-hoc approaches. Knowl Based Syst 2013. [DOI: 10.1016/j.knosys.2013.01.018] [Citation(s) in RCA: 236] [Impact Index Per Article: 19.7] [Indexed: 11/25/2022]
|
718
|
Strempel S, Nendza M, Scheringer M, Hungerbühler K. Using conditional inference trees and random forests to predict the bioaccumulation potential of organic chemicals. Environmental Toxicology and Chemistry 2013; 32:1187-1195. [PMID: 23382013 DOI: 10.1002/etc.2150] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Received: 08/22/2012] [Revised: 09/23/2012] [Accepted: 12/07/2012] [Indexed: 06/01/2023]
Abstract
The present study presents a data-oriented, tiered approach to assessing the bioaccumulation potential of chemicals according to the European chemicals regulation on Registration, Evaluation, Authorisation and Restriction of Chemicals (REACH). The authors compiled data for eight physicochemical descriptors (partition coefficients, degradation half-lives, polarity, and so forth) for a set of 713 organic chemicals for which experimental values of the bioconcentration factor (BCF) are available. The authors employed supervised machine learning methods (conditional inference trees and random forests) to derive relationships between the physicochemical descriptors and the BCF values. In a first tier, the authors established rules for classifying a chemical as bioaccumulative (B) or nonbioaccumulative (non-B). In a second tier, the authors developed a new tool for estimating numerical BCF values. For both cases the optimal set of relevant descriptors was determined; these are biotransformation half-life and octanol-water distribution coefficient (log D) for the classification rules and log D, biotransformation half-life, and topological polar surface area for the BCF estimation tool. The uncertainty of the BCF estimates obtained with the new estimation tool was quantified by comparing the estimated and experimental BCF values of the 713 chemicals. Comparison with existing BCF estimation methods indicates that the performance of this new BCF estimation tool is at least as high as that of existing methods. The authors recommend the present study's classification rules and BCF estimation tool for a consensus application in combination with existing BCF estimation methods.
Affiliation(s)
- Sebastian Strempel
  - Institute for Chemical and Bioengineering, Swiss Federal Institute of Technology (ETH) Zürich, Zürich, Switzerland
|
719
|
|
720
|
Błaszczyński J, Stefanowski J, Idkowiak Ł. Extending Bagging for Imbalanced Data. Proceedings of the 8th International Conference on Computer Recognition Systems CORES 2013. 2013. [DOI: 10.1007/978-3-319-00969-8_26] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.8] [Indexed: 02/03/2023]
|
721
|
Analyzing the presence of noise in multi-class problems: alleviating its influence with the One-vs-One decomposition. Knowl Inf Syst 2012. [DOI: 10.1007/s10115-012-0570-1] [Citation(s) in RCA: 46] [Impact Index Per Article: 3.5] [Indexed: 10/27/2022]
|
722
|
Sun Z, Song Q, Zhu X. Using Coding-Based Ensemble Learning to Improve Software Defect Prediction. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 2012. [DOI: 10.1109/tsmcc.2012.2226152] [Citation(s) in RCA: 122] [Impact Index Per Article: 9.4] [Indexed: 11/07/2022]
|
723
|
Boosting for class-imbalanced datasets using genetically evolved supervised non-linear projections. Progress in Artificial Intelligence 2012. [DOI: 10.1007/s13748-012-0028-4] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Indexed: 10/27/2022]
|
724
|
Lertampaiporn S, Thammarongtham C, Nukoolkit C, Kaewkamnerdpong B, Ruengjitchatchawalya M. Heterogeneous ensemble approach with discriminative features and modified-SMOTEbagging for pre-miRNA classification. Nucleic Acids Res 2012; 41:e21. [PMID: 23012261 PMCID: PMC3592496 DOI: 10.1093/nar/gks878] [Citation(s) in RCA: 41] [Impact Index Per Article: 3.2] [Indexed: 11/14/2022] Open
Abstract
An ensemble classifier approach for microRNA precursor (pre-miRNA) classification was proposed based upon combining a set of heterogeneous algorithms, including support vector machine (SVM), k-nearest neighbors (kNN) and random forest (RF), then aggregating their predictions through a voting system. In addition to the proposed algorithm, the classification performance was also improved using discriminative features, self-containment and its derivatives, which have shown unique structural robustness characteristics of pre-miRNAs. These are applicable across different species. By applying preprocessing methods (both a correlation-based feature selection (CFS) with genetic algorithm (GA) search method and a modified Synthetic Minority Oversampling Technique (SMOTE) bagging rebalancing method), improvement in the performance of this ensemble was observed. The overall prediction accuracy obtained via 10 runs of 5-fold cross validation (CV) was 96.54%, with sensitivity of 94.8% and specificity of 98.3%; this is a better trade-off between sensitivity and specificity than those of other state-of-the-art methods. The ensemble model was applied to animal, plant and virus pre-miRNA and achieved high accuracy, >93%. Exploiting the discriminative set of selected features also suggests that pre-miRNAs possess high intrinsic structural robustness as compared with other stem loops. Our heterogeneous ensemble method gave a relatively more reliable prediction than those using single classifiers. Our program is available at http://ncrna-pred.com/premiRNA.html.
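The voting step described in the abstract can be illustrated with a minimal majority-vote sketch. The abstract does not say whether the authors' vote is weighted or how ties are handled, so plain unweighted voting is an assumption here; the function name `majority_vote` and the three prediction lists (standing in for the SVM, kNN and RF outputs) are hypothetical.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-classifier label lists (all the same length) by unweighted vote."""
    n_samples = len(predictions[0])
    voted = []
    for i in range(n_samples):
        votes = Counter(p[i] for p in predictions)
        voted.append(votes.most_common(1)[0][0])  # label with the most votes
    return voted


# Hypothetical predictions from three heterogeneous classifiers on four samples.
svm_like = [1, 0, 1, 1]
knn_like = [1, 1, 0, 1]
rf_like = [0, 0, 1, 1]

print(majority_vote([svm_like, knn_like, rf_like]))  # [1, 0, 1, 1]
```

With an odd number of classifiers and two classes, a tie cannot occur, which is one practical reason to combine three base learners.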
Affiliation(s)
- Supatcha Lertampaiporn
  - Biological Engineering Program, King Mongkut's University of Technology Thonburi, Bang Mod, Thung Khru, Bangkok 10140, Thailand
|