1
|
Ortega Vázquez C, vanden Broucke S, De Weerdt J. A two-step anomaly detection based method for PU classification in imbalanced data sets. Data Min Knowl Discov 2023. [DOI: 10.1007/s10618-023-00925-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/07/2023]
|
2
|
Xia S, Chen B, Wang G, Zheng Y, Gao X, Giem E, Chen Z. mCRF and mRD: Two Classification Methods Based on a Novel Multiclass Label Noise Filtering Learning Framework. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2022; 33:2916-2930. [PMID: 33428577 DOI: 10.1109/tnnls.2020.3047046] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Indexed: 06/12/2023]
Abstract
Mitigating label noise is a crucial problem in classification. Noise filtering is an effective method of dealing with label noise which does not need to estimate the noise rate or rely on any loss function. However, most filtering methods focus mainly on binary classification, leaving the more difficult counterpart problem of multiclass classification relatively unexplored. To remedy this deficit, we present a definition for label noise in a multiclass setting and propose a general framework for a novel label noise filtering learning method for multiclass classification. Two examples of noise filtering methods for multiclass classification, multiclass complete random forest (mCRF) and multiclass relative density, are derived from their binary counterparts using our proposed framework. In addition, to optimize the NI_threshold hyperparameter in mCRF, we propose two new optimization methods: a new voting cross-validation method and an adaptive method that employs a 2-means clustering algorithm. Furthermore, we incorporate SMOTE into our label noise filtering learning framework to handle the ubiquitous problem of imbalanced data in multiclass classification. We report experiments on both synthetic data sets and UCI benchmarks to demonstrate our proposed methods are highly robust to label noise in comparison with state-of-the-art baselines. All code and data results are available at https://github.com/syxiaa/Multiclass-Label-Noise-Filtering-Learning.
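The adaptive threshold idea mentioned in this abstract can be illustrated with a small sketch. The abstract does not say how 2-means clustering is applied to NI_threshold, so everything below is an assumption: the function name `two_means_threshold` is hypothetical, the inputs are assumed to be one-dimensional per-sample noise scores, and the threshold is assumed to be the midpoint between the two cluster centroids. It is not the authors' implementation.

```python
def two_means_threshold(scores, iters=100):
    """Split 1-D scores into two clusters (Lloyd's algorithm with k=2)
    and return the midpoint between the two centroids.

    Assumption: higher scores indicate more suspicious samples.
    """
    lo, hi = min(scores), max(scores)
    if lo == hi:
        return lo  # all scores identical, nothing to separate
    for _ in range(iters):
        cut = (lo + hi) / 2.0
        low = [s for s in scores if s <= cut]   # cluster nearer to lo
        high = [s for s in scores if s > cut]   # cluster nearer to hi
        new_lo = sum(low) / len(low)
        new_hi = sum(high) / len(high)
        if (new_lo, new_hi) == (lo, hi):
            break  # centroids stopped moving
        lo, hi = new_lo, new_hi
    return (lo + hi) / 2.0


# Hypothetical noise scores: four clean-looking samples, three suspicious ones.
scores = [0.05, 0.10, 0.12, 0.08, 0.90, 0.85, 0.95]
threshold = two_means_threshold(scores)
flagged = [s for s in scores if s > threshold]
print(threshold, flagged)
```

The appeal of such a data-driven cut is that it needs no held-out validation set, at the cost of assuming the scores are roughly bimodal.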
|
3
|
Lee D, Kim K. Improved noise-filtering algorithm for AdaBoost using the inter- and intra-class variability of imbalanced datasets. JOURNAL OF INTELLIGENT & FUZZY SYSTEMS 2022. [DOI: 10.3233/jifs-213244] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/15/2022]
Abstract
Boosting methods are known to increase performance outcomes by using multiple learners connected sequentially. In particular, Adaptive boosting (AdaBoost) has been widely used owing to its comparatively improved predictive results for hard-to-learn samples based on misclassification costs. Each weak learner minimizes the expected risk by assigning high misclassification costs to suspect samples. The performance of AdaBoost depends on the distribution of noise samples because the algorithm tends to overfit noisy samples. Various studies have been conducted to address the noise sensitivity issue. Noise-filtering methods used in AdaBoost remove samples defined as noise based on the degree of misclassification to prevent overfitting to noisy samples. However, if the difference in the classification difficulty between classes is considerable, it is easy for samples from classes that are difficult to classify to be defined as noise. This situation is common with imbalanced datasets and can adversely affect performance outcomes. To solve this problem, this study proposes a new noise detection algorithm for AdaBoost that considers differences in the classification difficulty of classes and the characteristics of iteratively recalculated sample weight distributions. Experimental results on ten imbalanced datasets with various degrees of imbalanced ratios demonstrate that the proposed method defines noisy samples properly and improves the overall performance of AdaBoost.
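The noise sensitivity described in this abstract follows from how AdaBoost reweights samples. Below is a minimal sketch of one round of the standard discrete AdaBoost weight update, assuming labels in {-1, +1}; the function name `adaboost_reweight` is illustrative, and this is the textbook update, not the noise-detection algorithm the paper proposes.

```python
import math


def adaboost_reweight(weights, y_true, y_pred):
    """One round of the discrete AdaBoost weight update.

    weights: current sample weights (summing to 1)
    y_true, y_pred: labels and weak-learner predictions in {-1, +1}
    Returns (new_weights, alpha).
    """
    # Weighted error of the weak learner.
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    err = min(max(err, 1e-10), 1 - 1e-10)  # guard against log(0)
    alpha = 0.5 * math.log((1 - err) / err)
    # Misclassified samples (t * p == -1) are scaled up, correct ones down.
    scaled = [w * math.exp(-alpha * t * p)
              for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(scaled)
    return [w / total for w in scaled], alpha


# Five samples with uniform weights; only the last one is misclassified.
weights = [0.2] * 5
y_true = [1, 1, -1, -1, 1]
y_pred = [1, 1, -1, -1, -1]
new_weights, alpha = adaboost_reweight(weights, y_true, y_pred)
print(new_weights, alpha)
```

After a single round the one misclassified sample already carries half of the total weight, which is why a mislabeled or hard minority sample can come to dominate later weak learners.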
Affiliation(s)
- Dohyun Lee
  - Department of Data Science, Seoul National University of Science & Technology (SeoulTech), Seoul, Republic of Korea
- Kyoungok Kim
  - Department of Industrial Engineering, Seoul National University of Science & Technology (SeoulTech), Seoul, Republic of Korea
|
4
|
Moradi K, Aldarraji Z, Luthra M, Madison GP, Ascoli GA. Normalized unitary synaptic signaling of the hippocampus and entorhinal cortex predicted by deep learning of experimental recordings. Commun Biol 2022; 5:418. [PMID: 35513471 PMCID: PMC9072429 DOI: 10.1038/s42003-022-03329-5] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 06/18/2021] [Accepted: 03/30/2022] [Indexed: 11/21/2022] Open
Abstract
Biologically realistic computer simulations of neuronal circuits require systematic data-driven modeling of neuron type-specific synaptic activity. However, limited experimental yield, heterogeneous recording conditions, and ambiguous neuronal identification have so far prevented the consistent characterization of synaptic signals for all connections of any neural system. We introduce a strategy to overcome these challenges and report a comprehensive synaptic quantification among all known neuron types of the hippocampal-entorhinal network. First, we reconstructed >2600 synaptic traces from ∼1200 publications into a unified computational representation of synaptic dynamics. We then trained a deep learning architecture with the resulting parameters, each annotated with detailed metadata such as recording method, solutions, and temperature. The model learned to predict the synaptic properties of all 3,120 circuit connections in arbitrary conditions with accuracy approaching the intrinsic experimental variability. Analysis of data normalized and completed with the deep learning model revealed that synaptic signals are controlled by a few latent variables associated with specific molecular markers and interrelating conductance, decay time constant, and short-term plasticity. We freely release the tools and full dataset of unitary synaptic values in 32 covariate settings. Normalized synaptic data can be used in brain simulations, and to predict and test experimental hypotheses. A deep learning model trained on roughly 2,600 synaptic traces from hippocampal electrophysiology datasets demonstrates how specific covariates influence synaptic signals.
Affiliation(s)
- Keivan Moradi
  - Interdisciplinary Neuroscience Program and Krasnow Institute for Advanced Study, George Mason University, Fairfax, VA, USA
  - Department of Neurobiology, David Geffen School of Medicine, University of California, Los Angeles, CA, USA
- Zainab Aldarraji
  - Bioengineering Department and Volgenau School of Engineering, George Mason University, Fairfax, VA, USA
- Megha Luthra
  - Bioengineering Department and Volgenau School of Engineering, George Mason University, Fairfax, VA, USA
- Grey P Madison
  - Chemistry and Biochemistry Department, College of Science, George Mason University, Fairfax, VA, USA
- Giorgio A Ascoli
  - Interdisciplinary Neuroscience Program and Krasnow Institute for Advanced Study, George Mason University, Fairfax, VA, USA
  - Bioengineering Department and Volgenau School of Engineering, George Mason University, Fairfax, VA, USA
|
5
|
Chou EP, Yang SP. A virtual multi-label approach to imbalanced data classification. COMMUN STAT-SIMUL C 2022. [DOI: 10.1080/03610918.2022.2049820] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/03/2022]
Affiliation(s)
- Elizabeth P. Chou
  - Department of Statistics, National Chengchi University, Taipei, Taiwan
- Shan-Ping Yang
  - Department of Statistics, National Chengchi University, Taipei, Taiwan
|
6
|
Bi W, Zhang Q. Forecasting mergers and acquisitions failure based on partial-sigmoid neural network and feature selection. PLoS One 2021; 16:e0259575. [PMID: 34788332 PMCID: PMC8598039 DOI: 10.1371/journal.pone.0259575] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/25/2021] [Accepted: 10/22/2021] [Indexed: 11/19/2022] Open
Abstract
Traditional forecasting methods in mergers and acquisitions (M&A) data have two limitations that significantly reduce forecasting accuracy: (1) the imbalance of data, that is, the failure cases of M&A are far fewer than the successful cases (82%/18% of our sample), and (2) both the bidder and the target of the merger have numerous descriptive features, making it difficult to choose which ones to use for forecasting. This study proposes a neural network using partial-sigmoid (i.e., partial-sigmoid neural network [PSNN]) as the activation function of the output layer and compares three feature selection methods, namely, chi-square (chi2) test, information gain and gradient boosting decision tree (GBDT). Experimental results show that our PSNN (improved up to 0.37 precision, 0.49 recall, 0.41 G-Mean and 0.23 F1-measure) and feature selection (improved 1.83%-13.16% accuracy) methods can effectively mitigate the adverse effects of these two limitations of merger data on forecasting. Scholars who studied the forecast of merger failure have overlooked three important features: assets of the previous year, market value and capital expenditure. The chi2 test feature selection method is the best among the three feature selection methods.
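The metrics named in this abstract can be computed from a binary confusion matrix. The sketch below is illustrative only: the helper name `imbalance_metrics` is hypothetical, G-Mean is assumed to follow the common definition (square root of recall times specificity, which the paper may or may not use), and the example counts are invented to mirror the 82%/18% class split, with failure treated as the positive class.

```python
import math


def imbalance_metrics(tp, fp, fn, tn):
    """Precision, recall, specificity, F1, G-mean and accuracy from counts.

    Ratios with a zero denominator are reported as 0.0.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
        "g_mean": math.sqrt(recall * specificity),
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }


# A classifier that predicts "success" for all 100 deals (18 real failures).
always_success = imbalance_metrics(tp=0, fp=0, fn=18, tn=82)
print(always_success["accuracy"], always_success["g_mean"])
```

On these invented counts the trivial classifier reaches 0.82 accuracy while its recall, F1 and G-mean are all zero, which is why imbalance-aware metrics are reported alongside accuracy.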
Affiliation(s)
- Wenbin Bi
  - School of Economics and Management, Beijing Jiaotong University, Beijing, China
- Qiusheng Zhang
  - School of Economics and Management, Beijing Jiaotong University, Beijing, China
|
7
|
Dudjak M, Martinović G. An empirical study of data intrinsic characteristics that make learning from imbalanced data difficult. EXPERT SYSTEMS WITH APPLICATIONS 2021; 182:115297. [DOI: 10.1016/j.eswa.2021.115297] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 09/02/2023]
|
8
|
Shatnawi R. Software fault prediction using machine learning techniques with metric thresholds. INTERNATIONAL JOURNAL OF KNOWLEDGE-BASED AND INTELLIGENT ENGINEERING SYSTEMS 2021. [DOI: 10.3233/kes-210061] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/15/2022]
Abstract
BACKGROUND: Fault data is vital to predicting the fault-proneness in large systems. Predicting faulty classes helps in allocating the appropriate testing resources for future releases. However, current fault data face challenges such as unlabeled instances and data imbalance. These challenges degrade the performance of the prediction models. Data imbalance happens because the majority of classes are labeled as not faulty whereas the minority of classes are labeled as faulty. AIM: The research proposes to improve fault prediction using software metrics in combination with threshold values. Statistical techniques are proposed to improve the quality of the datasets and therefore the quality of the fault prediction. METHOD: Threshold values of object-oriented metrics are used to label classes as faulty to improve the fault prediction models. The resulting datasets are used to build prediction models using five machine learning techniques. The use of threshold values is validated on ten large object-oriented systems. RESULTS: The models are built for the datasets with and without the use of thresholds. The combination of thresholds with machine learning has improved the fault prediction models significantly for the five classifiers. CONCLUSION: Threshold values can be used to label software classes as fault-prone and can be used to improve machine learners in predicting the fault-prone classes.
|
9
|
Sabeti E, Drews J, Reamaroon N, Warner E, Sjoding MW, Gryak J, Najarian K. Learning Using Partially Available Privileged Information and Label Uncertainty: Application in Detection of Acute Respiratory Distress Syndrome. IEEE J Biomed Health Inform 2021; 25:784-796. [PMID: 32750956 DOI: 10.1109/jbhi.2020.3008601] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Indexed: 11/09/2022]
Abstract
Acute respiratory distress syndrome (ARDS) is a fulminant inflammatory lung injury that develops in patients with critical illnesses, affecting 200,000 patients in the United States annually. However, a recent study suggests that most patients with ARDS are diagnosed late or missed completely and fail to receive life-saving treatments. This is primarily due to the dependency of current diagnosis criteria on chest x-ray, which is not necessarily available at the time of diagnosis. In machine learning, such information is known as Privileged Information - information that is available at training but not at testing. However, in diagnosing ARDS, privileged information (chest x-rays) is sometimes available for only a portion of the training data. To address this issue, the Learning Using Partially Available Privileged Information (LUPAPI) paradigm is proposed. As there are multiple ways to incorporate partially available privileged information, three models built on classical SVM are described. Another complexity of diagnosing ARDS is the uncertainty in clinical interpretation of chest x-rays. To address this, the LUPAPI framework is then extended to incorporate label uncertainty, resulting in a novel and comprehensive machine learning paradigm - Learning Using Label Uncertainty and Partially Available Privileged Information (LULUPAPI). The proposed frameworks use Electronic Health Record (EHR) data as regular information, chest x-rays as partially available privileged information, and clinicians' confidence levels in ARDS diagnosis as a measure of label uncertainty. Experiments on an ARDS dataset demonstrate that both the LUPAPI and LULUPAPI models outperform SVM, with LULUPAPI performing better than LUPAPI.
|
10
|
Chongomweru H, Kasem A. A novel ensemble method for classification in imbalanced datasets using split balancing technique based on instance hardness (sBal_IH). Neural Comput Appl 2021. [DOI: 10.1007/s00521-020-05570-7] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 10/22/2022]
|
11
|
Abstract
Class overlap and class imbalance are two data complexities that challenge the design of effective classifiers in Pattern Recognition and Data Mining as they may cause a significant loss in performance. Several solutions have been proposed to face both data difficulties, but most of these approaches tackle each problem separately. In this paper, we propose a two-stage under-sampling technique that combines the DBSCAN clustering algorithm to remove noisy samples and clean the decision boundary with a minimum spanning tree algorithm to face the class imbalance, thus handling class overlap and imbalance simultaneously with the aim of improving the performance of classifiers. An extensive experimental study shows a significantly better behavior of the new algorithm as compared to 12 state-of-the-art under-sampling methods using three standard classification models (nearest neighbor rule, J48 decision tree, and support vector machine with a linear kernel) on both real-life and synthetic databases.
|
12
|
Nematzadeh Z, Ibrahim R, Selamat A. A hybrid model for class noise detection using k-means and classification filtering algorithms. SN APPLIED SCIENCES 2020. [DOI: 10.1007/s42452-020-3129-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/30/2022] Open
|
13
|
Abdulrauf Sharifai G, Zainol Z. Feature Selection for High-Dimensional and Imbalanced Biomedical Data Based on Robust Correlation Based Redundancy and Binary Grasshopper Optimization Algorithm. Genes (Basel) 2020; 11:genes11070717. [PMID: 32605144 PMCID: PMC7397300 DOI: 10.3390/genes11070717] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 09/26/2019] [Revised: 12/19/2019] [Accepted: 01/07/2020] [Indexed: 11/16/2022] Open
Abstract
Training a machine learning algorithm on an imbalanced data set is an inherently challenging task. It becomes more demanding with limited samples but with a massive number of features (high dimensionality). High dimensional and imbalanced data sets have posed severe challenges in many real-world applications, such as biomedical data sets. Numerous researchers have investigated either imbalanced class or high dimensional data sets and come up with various methods. Nonetheless, few approaches reported in the literature have addressed the intersection of the high dimensional and imbalanced class problems due to their complicated interactions. Lately, feature selection has become a well-known technique that has been used to overcome this problem by selecting discriminative features that represent the minority and majority classes. This paper proposes a new method called Robust Correlation Based Redundancy and Binary Grasshopper Optimization Algorithm (rCBR-BGOA); rCBR-BGOA employs an ensemble of multi-filters coupled with the Correlation-Based Redundancy method to select optimal feature subsets. A binary Grasshopper optimisation algorithm (BGOA) is used to construct the feature selection process as an optimisation problem to select the best (near-optimal) combination of features from the majority and minority class. The obtained results, supported by the proper statistical analysis, indicate that rCBR-BGOA can improve the classification performance for high dimensional and imbalanced datasets in terms of G-mean and the Area Under the Curve (AUC) performance metrics.
Affiliation(s)
- Garba Abdulrauf Sharifai
  - Department of Computer Sciences, Yusuf Maitama Sule University, 700222 Kofar Nassarawa, Kano, Nigeria
  - School of Computer Sciences, Universiti Sains Malaysia, 11800 Gelugor, Malaysia
  - Correspondence: ; Tel.: +60-111-317-0481 or +60-194-004-327
- Zurinahni Zainol
  - School of Computer Sciences, Universiti Sains Malaysia, 11800 Gelugor, Malaysia
|
14
|
Bauder RA, Khoshgoftaar TM. A study on rare fraud predictions with big Medicare claims fraud data. INTELL DATA ANAL 2020. [DOI: 10.3233/ida-184415] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Indexed: 11/15/2022]
|
15
|
Yang X, Wang Y, Byrne R, Schneider G, Yang S. Concepts of Artificial Intelligence for Computer-Assisted Drug Discovery. Chem Rev 2019; 119:10520-10594. [PMID: 31294972 DOI: 10.1021/acs.chemrev.8b00728] [Citation(s) in RCA: 421] [Impact Index Per Article: 70.2] [Indexed: 02/05/2023]
Abstract
Artificial intelligence (AI), and, in particular, deep learning as a subcategory of AI, provides opportunities for the discovery and development of innovative drugs. Various machine learning approaches have recently (re)emerged, some of which may be considered instances of domain-specific AI which have been successfully employed for drug discovery and design. This review provides a comprehensive portrayal of these machine learning techniques and of their applications in medicinal chemistry. After introducing the basic principles, alongside some application notes, of the various machine learning algorithms, the current state of the art of AI-assisted pharmaceutical discovery is discussed, including applications in structure- and ligand-based virtual screening, de novo drug design, physicochemical and pharmacokinetic property prediction, drug repurposing, and related aspects. Finally, several challenges and limitations of the current methods are summarized, with a view to potential future directions for AI-assisted drug discovery and design.
Affiliation(s)
- Xin Yang
  - State Key Laboratory of Biotherapy and Cancer Center, West China Hospital, Sichuan University, Chengdu, Sichuan 610041, China
- Yifei Wang
  - State Key Laboratory of Biotherapy and Cancer Center, West China Hospital, Sichuan University, Chengdu, Sichuan 610041, China
- Ryan Byrne
  - ETH Zurich, Department of Chemistry and Applied Biosciences, Vladimir-Prelog-Weg 4, CH-8093 Zurich, Switzerland
- Gisbert Schneider
  - ETH Zurich, Department of Chemistry and Applied Biosciences, Vladimir-Prelog-Weg 4, CH-8093 Zurich, Switzerland
- Shengyong Yang
  - State Key Laboratory of Biotherapy and Cancer Center, West China Hospital, Sichuan University, Chengdu, Sichuan 610041, China
|
16
|
Abstract
A large variety of issues influence the success of data mining on a given problem. Two primary and important issues are the representation and the quality of the dataset. Specifically, if much redundant and unrelated or noisy and unreliable information is presented, then knowledge discovery becomes a very difficult problem. It is well-known that data preparation steps require significant processing time in machine learning tasks. It would be very helpful and quite useful if there were various preprocessing algorithms with the same reliable and effective performance across all datasets, but this is impossible. To this end, we present the most well-known and widely used up-to-date algorithms for each step of data preprocessing in the framework of predictive data mining.
|
17
|
Emerging topics and challenges of learning from noisy data in nonstandard classification: a survey beyond binary class noise. Knowl Inf Syst 2018. [DOI: 10.1007/s10115-018-1244-4] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Indexed: 10/28/2022]
|
18
|
Predicting reference soil groups using legacy data: A data pruning and Random Forest approach for tropical environment (Dano catchment, Burkina Faso). Sci Rep 2018; 8:9959. [PMID: 29967391 PMCID: PMC6028482 DOI: 10.1038/s41598-018-28244-w] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.9] [Received: 07/28/2017] [Accepted: 06/18/2018] [Indexed: 12/02/2022] Open
Abstract
Predicting taxonomic classes can be challenging with datasets subject to substantial irregularities due to the involvement of many surveyors. A data pruning approach was used in the present study to reduce such source errors by exploring whether different data pruning methods, which result in different subsets of a major reference soil group (RSG), the Plinthosols, would lead to an increase in prediction accuracy of the minor soil groups by using Random Forest (RF). This method was compared to the random oversampling approach. Four datasets were used: the entire dataset and pruned datasets consisting of the 80%, 90%, and standard deviation core ranges of the Plinthosols data, with all data points belonging to the outer range cut off. The best prediction was achieved when RF was used with recursive feature elimination along with the non-oversampled 90% core range dataset. This model provided a substantial agreement to observation, with a kappa value of 0.57 along with a 7% to 35% increase in prediction accuracy for smaller RSG. The reference soil groups in the Dano catchment appeared to be mainly influenced by the wetness index, a proxy for soil moisture distribution.
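For readers unfamiliar with the kappa statistic reported in this abstract, the following is a minimal sketch of Cohen's kappa computed from a confusion matrix (rows as observed classes, columns as predicted classes). The function name `cohens_kappa` and the example counts are illustrative and are not taken from the study.

```python
def cohens_kappa(confusion):
    """Cohen's kappa: agreement beyond what chance alone would produce.

    confusion[i][j] = number of samples of observed class i predicted as j.
    """
    n = sum(sum(row) for row in confusion)
    k = len(confusion)
    # Observed agreement: fraction of samples on the diagonal.
    observed = sum(confusion[i][i] for i in range(k)) / n
    # Chance agreement: product of row and column marginals per class.
    expected = sum(
        sum(confusion[i]) * sum(row[i] for row in confusion)
        for i in range(k)
    ) / (n * n)
    return (observed - expected) / (1 - expected)


# Invented 2-class example: 70% raw agreement, 50% expected by chance.
print(cohens_kappa([[20, 5], [10, 15]]))
```

Because the chance term depends on the class marginals, kappa penalizes a model that merely predicts the dominant class, which makes it more informative than raw accuracy when one soil group dominates the data.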
|
19
|
|
20
|
A bi-objective hybrid algorithm for the classification of imbalanced noisy and borderline data sets. Pattern Anal Appl 2018. [DOI: 10.1007/s10044-018-0693-4] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Indexed: 10/17/2022]
|
21
|
Luengo J, Shim SO, Alshomrani S, Altalhi A, Herrera F. CNC-NOS: Class noise cleaning by ensemble filtering and noise scoring. Knowl Based Syst 2018. [DOI: 10.1016/j.knosys.2017.10.026] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.0] [Indexed: 10/18/2022]
|
22
|
|
23
|
Xiao G, Wu F, Zhou X, Li K. Probabilistic top-k range query processing for uncertain databases. JOURNAL OF INTELLIGENT & FUZZY SYSTEMS 2016. [DOI: 10.3233/jifs-169040] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.3] [Indexed: 11/15/2022]
Affiliation(s)
- Guoqing Xiao
  - College of Information Science and Engineering, Hunan University, Changsha, Hunan, China
- Fan Wu
  - College of Information Science and Engineering, Hunan University, Changsha, Hunan, China
- Xu Zhou
  - College of Information Science and Engineering, Hunan University, Changsha, Hunan, China
- Keqin Li
  - College of Information Science and Engineering, Hunan University, Changsha, Hunan, China
  - Department of Computer Science, State University of New York, New Paltz, NY, USA
|
24
|
Dealing with Data Difficulty Factors While Learning from Imbalanced Data. STUDIES IN COMPUTATIONAL INTELLIGENCE 2016. [DOI: 10.1007/978-3-319-18781-5_17] [Citation(s) in RCA: 47] [Impact Index Per Article: 5.2] [Indexed: 12/12/2022]
|
25
|
Chen KH, Wang KJ, Adrian AM, Wang KM, Teng NC. Diagnosis of Brain Metastases from Lung Cancer Using a Modified Electromagnetism like Mechanism Algorithm. J Med Syst 2015; 40:35. [PMID: 26573656 DOI: 10.1007/s10916-015-0367-3] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Received: 03/14/2015] [Accepted: 10/06/2015] [Indexed: 11/26/2022]
Abstract
Brain metastases are commonly found in patients diagnosed with a primary malignancy of the lung. Lung cancer patients with brain metastasis tend to have poor survivability, with a median of less than 6 months. Therefore, an early and effective detection system for this disease is needed to help prolong patients' survival and improve their quality of life. A modified electromagnetism-like mechanism (EM) algorithm, MEM-SVM, is proposed by combining the EM algorithm with a support vector machine (SVM) as the classifier and the opposite sign test (OST) as the local search technique. The proposed method is applied to 44 UCI and IDA datasets and 5 cancer microarray datasets as a preliminary experiment. In addition, the method is tested on 4 public lung cancer microarray datasets. Further, we tested our method on a nationwide dataset of brain metastasis from lung cancer (BMLC) in Taiwan. Because real medical datasets tend to be highly imbalanced, the synthetic minority over-sampling technique (SMOTE) is used to handle this problem. The proposed method is compared against 8 other popular benchmark classifiers and feature selection methods. The performance evaluation is based on accuracy and the Kappa index. For the 44 UCI and IDA datasets and 5 cancer microarray datasets, a non-parametric statistical test confirmed that MEM-SVM outperformed the other methods. For the 4 public lung cancer microarray datasets, MEM-SVM still achieved the highest mean value for accuracy and Kappa index. Owing to the imbalance of the real BMLC dataset, all methods achieve good accuracy with no significant difference among them. However, on the balanced BMLC dataset, MEM-SVM appears to be the best method, with higher accuracy and Kappa index. We successfully developed MEM-SVM to predict the occurrence of brain metastasis from lung cancer, combined with SMOTE to handle the class imbalance. The results confirmed that MEM-SVM has good diagnostic power and can be applied as an alternative diagnostic tool alongside other medical tests for the early detection of brain metastasis from lung cancer.
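The SMOTE step mentioned in this abstract generates synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. The sketch below shows that core idea in plain Python; the function name `smote`, its parameters, and the sample points are illustrative assumptions, and it omits the refinements of production implementations.

```python
import random


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points from a list of minority-class tuples.

    Each synthetic point lies on the segment between a randomly chosen
    minority point and one of its k nearest minority neighbours.
    Requires at least two minority points.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        others = [p for j, p in enumerate(minority) if j != i]
        # Sort the remaining points by squared Euclidean distance to base.
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(base, neighbour))
        )
    return synthetic


# Hypothetical 2-D minority samples.
minority = [(1.0, 1.0), (1.5, 1.2), (1.2, 1.8), (2.0, 1.5)]
for point in smote(minority, n_new=3, k=2, seed=42):
    print(point)
```

Because new points are interpolated rather than duplicated, the classifier sees a denser minority region instead of repeated copies, although noisy minority points are interpolated as readily as clean ones.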
Affiliation(s)
- Kun-Huang Chen
  - Department of Industrial Management, National Taiwan University of Science and Technology, Daan District, Taipei 106, Taiwan, Republic of China
- Kung-Jeng Wang
  - Department of Industrial Management, National Taiwan University of Science and Technology, Daan District, Taipei 106, Taiwan, Republic of China
- Angelia Melani Adrian
  - Department of Industrial Management, National Taiwan University of Science and Technology, Daan District, Taipei 106, Taiwan, Republic of China
- Kung-Min Wang
  - Department of Surgery, Shin-Kong Wu Ho-Su Memorial Hospital, Shilin District, Taipei 111, Taiwan, Republic of China
- Nai-Chia Teng
  - School of Dentistry, College of Oral Medicine, Taipei Medical University, Taipei 110, Taiwan, Republic of China
|
26
|
Bekhuis T, Tseytlin E, Mitchell KJ. A Prototype for a Hybrid System to Support Systematic Review Teams: A Case Study of Organ Transplantation. PROCEEDINGS. IEEE INTERNATIONAL CONFERENCE ON BIOINFORMATICS AND BIOMEDICINE 2015; 2015:940-947. [PMID: 26855824 PMCID: PMC4742277 DOI: 10.1109/bibm.2015.7359810] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/05/2023]
Abstract
We describe a prototype for a hybrid system designed to reduce the number of citations needed to re-screen (NNRS) by systematic reviewers, where citations include titles, abstracts, and metadata. The system obviates the need for screening the entire set of citations a second time, which is typically done to control human error. The reference set is based on a complex review about organ transplantation (N=10,796 citations). Data were split into 50% training and test sets, randomly stratified for percentage eligible citations. The system consists of a rule-based module and a machine-learning (ML) module. The former substantially reduces the number of negative citations passed to the ML module and improves imbalance. Relative to the baseline, the system reduces classification error (5.6% vs 2.9%) thereby reducing NNRS by 47.3% (300 vs 158). We discuss the implications of de-emphasizing sensitivity (recall) in favor of specificity and negative predictive value to reduce screening burden.
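The 47.3% figure in this abstract is a relative reduction, and the arithmetic can be checked directly from the two NNRS counts given (300 for the baseline, 158 for the hybrid system). The helper name `relative_reduction` is illustrative.

```python
def relative_reduction(baseline, improved):
    """Fraction by which `improved` is smaller than `baseline`."""
    return (baseline - improved) / baseline


# NNRS counts quoted in the abstract: 300 (baseline) vs 158 (hybrid system).
print(round(100 * relative_reduction(300, 158), 1))  # 47.3
```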
Affiliation(s)
- Tanja Bekhuis
  - Department of Biomedical Informatics, School of Medicine, University of Pittsburgh, USA
  - Department of Dental Public Health, School of Dental Medicine, University of Pittsburgh, USA
- Eugene Tseytlin
  - Department of Biomedical Informatics, School of Medicine, University of Pittsburgh, USA
- Kevin J. Mitchell
  - Department of Biomedical Informatics, School of Medicine, University of Pittsburgh, USA
|
27
|
Napierala K, Stefanowski J. Types of minority class examples and their influence on learning classifiers from imbalanced data. J Intell Inf Syst 2015. [DOI: 10.1007/s10844-015-0368-1] [Citation(s) in RCA: 69] [Impact Index Per Article: 6.9] [Indexed: 10/23/2022]
|
28
|
Xue JH, Hall P. Why Does Rebalancing Class-Unbalanced Data Improve AUC for Linear Discriminant Analysis? IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 2015; 37:1109-1112. [PMID: 26353332 DOI: 10.1109/tpami.2014.2359660] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.7] [Indexed: 06/05/2023]
Abstract
Many established classifiers fail to identify the minority class when it is much smaller than the majority class. To tackle this problem, researchers often first rebalance the class sizes in the training dataset, through oversampling the minority class or undersampling the majority class, and then use the rebalanced data to train the classifiers. This leads to interesting empirical patterns. In particular, using the rebalanced training data can often improve the area under the receiver operating characteristic curve (AUC) for the original, unbalanced test data. The AUC is a widely-used quantitative measure of classification performance, but the property that it increases with rebalancing has, as yet, no theoretical explanation. In this note, using Gaussian-based linear discriminant analysis (LDA) as the classifier, we demonstrate that, at least for LDA, there is an intrinsic, positive relationship between the rebalancing of class sizes and the improvement of AUC. We show that the largest improvement of AUC is achieved, asymptotically, when the two classes are fully rebalanced to be of equal sizes.
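The rebalancing step described in the abstract above, oversampling the minority class or undersampling the majority class, can be sketched in a few lines of Python. This is an illustrative sketch of plain random resampling, not the authors' code; the function name `rebalance` and its parameters are assumptions made for the example, and it does not fit LDA or compute AUC.

```python
import random
from collections import Counter


def rebalance(labels, mode="undersample", seed=0):
    """Return indices into `labels` that give two classes of equal size.

    mode="undersample": randomly drop majority-class examples.
    mode="oversample": randomly duplicate minority-class examples.
    """
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    if len(by_class) != 2:
        raise ValueError("expected exactly two classes")
    # Sort the two index lists by length: smaller class first.
    minority, majority = sorted(by_class.values(), key=len)
    if mode == "undersample":
        majority = rng.sample(majority, len(minority))
    elif mode == "oversample":
        extra = rng.choices(minority, k=len(majority) - len(minority))
        minority = minority + extra
    else:
        raise ValueError(f"unknown mode: {mode}")
    return sorted(minority + majority)


# Toy data with a 90/10 class split (hypothetical, for illustration only).
labels = [0] * 90 + [1] * 10
under = rebalance(labels, "undersample")  # 10 examples per class
over = rebalance(labels, "oversample")    # 90 examples per class
print(Counter(labels[i] for i in under))
print(Counter(labels[i] for i in over))
```

The function returns indices rather than copies of the data, so the same selection can be applied to a feature matrix and its label vector together.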
|
29
|
Pan S, Wu J, Zhu X, Zhang C. Graph ensemble boosting for imbalanced noisy graph stream classification. IEEE TRANSACTIONS ON CYBERNETICS 2015; 45:940-954. [PMID: 25167562 DOI: 10.1109/tcyb.2014.2341031] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 06/03/2023]
Abstract
Many applications involve stream data with structural dependency, graph representations, and continuously increasing volumes. For these applications, it is very common that their class distributions are imbalanced with minority (or positive) samples being only a small portion of the population, which imposes significant challenges for learning models to accurately identify minority samples. This problem is further complicated with the presence of noise, because they are similar to minority samples and any treatment for the class imbalance may falsely focus on the noise and result in deterioration of accuracy. In this paper, we propose a classification model to tackle imbalanced graph streams with noise. Our method, graph ensemble boosting, employs an ensemble-based framework to partition graph stream into chunks each containing a number of noisy graphs with imbalanced class distributions. For each individual chunk, we propose a boosting algorithm to combine discriminative subgraph pattern selection and model learning as a unified framework for graph classification. To tackle concept drifting in graph streams, an instance level weighting mechanism is used to dynamically adjust the instance weight, through which the boosting framework can emphasize on difficult graph samples. The classifiers built from different graph chunks form an ensemble for graph stream classification. Experiments on real-life imbalanced graph streams demonstrate clear benefits of our boosting design for handling imbalanced noisy graph stream.
|
30
|
Kanj S, Abdallah F, Denœux T, Tout K. Editing training data for multi-label classification with the k-nearest neighbor rule. Pattern Anal Appl 2015. [DOI: 10.1007/s10044-015-0452-8] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.8] [Indexed: 10/24/2022]
|
31
|
Young WA, Nykl SL, Weckman GR, Chelberg DM. Using Voronoi diagrams to improve classification performances when modeling imbalanced datasets. Neural Comput Appl 2014. [DOI: 10.1007/s00521-014-1780-0] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.3] [Indexed: 11/24/2022]
|
32
|
Frénay B, Verleysen M. Classification in the presence of label noise: a survey. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2014; 25:845-869. [PMID: 24808033 DOI: 10.1109/tnnls.2013.2292894] [Citation(s) in RCA: 360] [Impact Index Per Article: 32.7] [Indexed: 06/03/2023]
Abstract
Label noise is an important issue in classification, with many potential negative consequences. For example, the accuracy of predictions may decrease, whereas the complexity of inferred models and the number of necessary training samples may increase. Many works in the literature have been devoted to the study of label noise and the development of techniques to deal with label noise. However, the field lacks a comprehensive survey on the different types of label noise, their consequences and the algorithms that consider label noise. This paper proposes to fill this gap. First, the definitions and sources of label noise are considered and a taxonomy of the types of label noise is proposed. Second, the potential consequences of label noise are discussed. Third, label noise-robust, label noise cleansing, and label noise-tolerant algorithms are reviewed. For each category of approaches, a short discussion is proposed to help the practitioner to choose the most suitable technique in its own particular field of application. Eventually, the design of experiments is also discussed, what may interest the researchers who would like to test their own algorithms. In this paper, label noise consists of mislabeled instances: no additional information is assumed to be available like e.g., confidences on labels.
|
33
|
Wald R, Khoshgoftaar TM, Sloan JC. Feature Selection for Optimization of Wavelet Packet Decomposition in Reliability Analysis of Systems. Int J Artif Intell T 2013. [DOI: 10.1142/s0218213013600117] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Indexed: 11/18/2022]
Abstract
One of the most important types of signal found in the area of machine condition monitoring/prognostic health monitoring (MCM/PHM) is the vibration signal, a type of waveform. Many time-frequency domain techniques have been proposed to interpret such signals, including wavelet packet decomposition (WPD). Previous work has shown how to extend the WPD algorithm to operate on streaming signals, but the number of output variables becomes exponential in the number of levels of decomposition, hindering data mining in limited-memory environments. Feature selection techniques, well understood in other areas of data mining, can be used to greatly reduce the number of output variables and speed up the machine learning algorithms. This paper presents a case study comparing two versions of WPD both with and without feature selection, demonstrating that removing most of the features produced by the WPD does not impair its performance within the context of MCM/PHM.
Affiliation(s)
- Randall Wald
- Department of Computer Science and Engineering, Florida Atlantic University, Boca Raton, Florida, USA
- Taghi M. Khoshgoftaar
- Department of Computer Science and Engineering, Florida Atlantic University, Boca Raton, Florida, USA
- John C. Sloan
- Department of Computer Science and Engineering, Florida Atlantic University, Boca Raton, Florida, USA
|
34
|
|
35
|
Janitza S, Strobl C, Boulesteix AL. An AUC-based permutation variable importance measure for random forests. BMC Bioinformatics 2013; 14:119. [PMID: 23560875 PMCID: PMC3626572 DOI: 10.1186/1471-2105-14-119] [Citation(s) in RCA: 99] [Impact Index Per Article: 8.3] [Received: 11/23/2012] [Accepted: 03/21/2013] [Indexed: 11/30/2022] Open
Abstract
Background: The random forest (RF) method is a commonly used tool for classification with high dimensional data as well as for ranking candidate predictors based on the so-called random forest variable importance measures (VIMs). However the classification performance of RF is known to be suboptimal in case of strongly unbalanced data, i.e. data where response class sizes differ considerably. Suggestions were made to obtain better classification performance based either on sampling procedures or on cost sensitivity analyses. However to our knowledge the performance of the VIMs has not yet been examined in the case of unbalanced response classes. In this paper we explore the performance of the permutation VIM for unbalanced data settings and introduce an alternative permutation VIM based on the area under the curve (AUC) that is expected to be more robust towards class imbalance.
Results: We investigated the performance of the standard permutation VIM and of our novel AUC-based permutation VIM for different class imbalance levels using simulated data and real data. The results suggest that the new AUC-based permutation VIM outperforms the standard permutation VIM for unbalanced data settings while both permutation VIMs have equal performance for balanced data settings.
Conclusions: The standard permutation VIM loses its ability to discriminate between associated predictors and predictors not associated with the response for increasing class imbalance. It is outperformed by our new AUC-based permutation VIM for unbalanced data settings, while the performance of both VIMs is very similar in the case of balanced classes. The new AUC-based VIM is implemented in the R package party for the unbiased RF variant based on conditional inference trees. The codes implementing our study are available from the companion website: http://www.ibe.med.uni-muenchen.de/organisation/mitarbeiter/070_drittmittel/janitza/index.html.
Affiliation(s)
- Silke Janitza
- Department of Medical Informatics, Biometry and Epidemiology, University of Munich, Marchioninistr. 15, D-81377, Munich, Germany.
|
36
|
Zhou L. Performance of corporate bankruptcy prediction models on imbalanced dataset: The effect of sampling methods. Knowl Based Syst 2013. [DOI: 10.1016/j.knosys.2012.12.007] [Citation(s) in RCA: 84] [Impact Index Per Article: 7.0] [Indexed: 10/27/2022]
|
37
|
Newby D, Freitas AA, Ghafourian T. Coping with Unbalanced Class Data Sets in Oral Absorption Models. J Chem Inf Model 2013; 53:461-74. [DOI: 10.1021/ci300348u] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.1] [Indexed: 01/19/2023]
Affiliation(s)
- Danielle Newby
- Medway School of Pharmacy, Universities of Kent and Greenwich, Chatham, Kent, ME4 4TB, U.K.
- Alex A. Freitas
- School of Computing, University of Kent, Canterbury, Kent, CT2 7NZ, U.K.
- Taravat Ghafourian
- Medway School of Pharmacy, Universities of Kent and Greenwich, Chatham, Kent, ME4 4TB, U.K.
- Drug Applied Research Center and Faculty of Pharmacy, Tabriz University of Medical Sciences, Tabriz, Iran
|
38
|
Tomašev N, Mladenić D. Hubness-aware shared neighbor distances for high-dimensional k-nearest neighbor classification. Knowl Inf Syst 2013. [DOI: 10.1007/s10115-012-0607-5] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.3] [Indexed: 11/28/2022]
|
39
|
Overlapping, Rare Examples and Class Decomposition in Learning Classifiers from Imbalanced Data. EMERGING PARADIGMS IN MACHINE LEARNING 2013. [DOI: 10.1007/978-3-642-28699-5_11] [Citation(s) in RCA: 43] [Impact Index Per Article: 3.6] [Indexed: 12/07/2022]
|
40
|
Classifying highly imbalanced ICU data. Health Care Manag Sci 2012; 16:119-28. [DOI: 10.1007/s10729-012-9216-9] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.8] [Received: 08/14/2012] [Accepted: 10/08/2012] [Indexed: 11/27/2022]
|
41
|
DBFS: An effective Density Based Feature Selection scheme for small sample size and high dimensional imbalanced data sets. DATA KNOWL ENG 2012. [DOI: 10.1016/j.datak.2012.08.001] [Citation(s) in RCA: 49] [Impact Index Per Article: 3.8] [Indexed: 11/24/2022]
|
42
|
Van Hulse J, Khoshgoftaar TM, Napolitano A. Evaluating the Impact of Data Quality on Sampling. JOURNAL OF INFORMATION & KNOWLEDGE MANAGEMENT 2012. [DOI: 10.1142/s021964921100295x] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 11/18/2022]
Abstract
Learning from imbalanced training data can be a difficult endeavour, and the task is made even more challenging if the data is of low quality or the size of the training dataset is small. Data sampling is a commonly used method for improving learner performance when data is imbalanced. However, little effort has been put forth to investigate the performance of data sampling techniques when data is both noisy and imbalanced. In this work, we present a comprehensive empirical investigation of the impact of changes in four training dataset characteristics (dataset size, class distribution, noise level and noise distribution) on data sampling techniques. We present the performance of four common data sampling techniques using 11 learning algorithms. The results, which are based on an extensive suite of experiments for which over 15 million models were trained and evaluated, show that: (1) even for relatively clean datasets, class imbalance can still hurt learner performance, (2) data sampling, however, may not improve performance for relatively clean but imbalanced datasets, (3) data sampling can be very effective at dealing with the combined problems of noise and imbalance, (4) both the level and distribution of class noise among the classes are important, as either factor alone does not cause a significant impact, (5) when sampling does improve the learners (i.e. for noisy and imbalanced datasets), RUS and SMOTE are the most effective at improving the AUC, while SMOTE performed well relative to the F-measure, (6) there are significant differences in the empirical results depending on the performance measure used, and hence it is important to consider multiple metrics in this type of analysis, and (7) data sampling rarely hurt the AUC, but only significantly improved performance when data was at least moderately skewed or noisy, while for the F-measure, data sampling often resulted in significantly worse performance when applied to slightly skewed or noisy datasets, but did improve performance when data was either severely noisy or skewed, or contained moderate levels of both noise and imbalance.
Affiliation(s)
- Jason Van Hulse
- Department of Computer and Electrical Engineering and Computer Science, Florida Atlantic University, Boca Raton, FL, USA
- Taghi M. Khoshgoftaar
- Department of Computer and Electrical Engineering and Computer Science, Florida Atlantic University, Boca Raton, FL, USA
- Amri Napolitano
- Department of Computer and Electrical Engineering and Computer Science, Florida Atlantic University, Boca Raton, FL, USA
|
43
|
Marques I, Graña M. Face recognition with lattice independent component analysis and extreme learning machines. Soft comput 2012. [DOI: 10.1007/s00500-012-0826-4] [Citation(s) in RCA: 38] [Impact Index Per Article: 2.9] [Indexed: 11/29/2022]
|
44
|
García V, Sánchez J, Mollineda R. On the effectiveness of preprocessing methods when dealing with different levels of class imbalance. Knowl Based Syst 2012. [DOI: 10.1016/j.knosys.2011.06.013] [Citation(s) in RCA: 113] [Impact Index Per Article: 8.7] [Indexed: 10/18/2022]
|
45
|
|
46
|
Segura-Bedmar I, Martínez P, de Pablo-Sánchez C. Using a shallow linguistic kernel for drug–drug interaction extraction. J Biomed Inform 2011; 44:789-804. [DOI: 10.1016/j.jbi.2011.04.005] [Citation(s) in RCA: 89] [Impact Index Per Article: 6.4] [Received: 10/08/2010] [Revised: 04/14/2011] [Accepted: 04/19/2011] [Indexed: 11/26/2022]
|
47
|
|
48
|
Khoshgoftaar TM, Van Hulse J, Napolitano A. Comparing Boosting and Bagging Techniques With Noisy and Imbalanced Data. IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS - PART A: SYSTEMS AND HUMANS 2011. [DOI: 10.1109/tsmca.2010.2084081] [Citation(s) in RCA: 185] [Impact Index Per Article: 13.2] [Indexed: 11/06/2022]
|
49
|
|
50
|
|