1
|
Fu Y, Xiang L, Zahid Y, Ding G, Mei T, Shen Q, Han J. Long-tailed visual recognition with deep models: A methodological survey and evaluation. Neurocomputing 2022. [DOI: 10.1016/j.neucom.2022.08.031] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/15/2022]
|
2
|
Tahir M, Khan F, Hayat M, Alshehri MD. An effective machine learning-based model for the prediction of protein–protein interaction sites in health systems. Neural Comput Appl 2022. [DOI: 10.1007/s00521-022-07024-8] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 12/13/2022]
|
3
|
Aurelio YS, de Almeida GM, de Castro CL, Braga AP. Cost-Sensitive Learning based on Performance Metric for Imbalanced Data. Neural Process Lett 2022. [DOI: 10.1007/s11063-022-10756-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/19/2022]
|
4
|
Lan ZC, Huang GY, Li YP, Rho S, Vimal S, Chen BW. Conquering insufficient/imbalanced data learning for the Internet of Medical Things. Neural Comput Appl 2022. [DOI: 10.1007/s00521-022-06897-z] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 10/19/2022]
|
5
|
A whole-slide image grading benchmark and tissue classification for cervical cancer precursor lesions with inter-observer variability. Med Biol Eng Comput 2021; 59:1545-1561. [PMID: 34245400 DOI: 10.1007/s11517-021-02388-w] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 03/25/2020] [Accepted: 06/03/2021] [Indexed: 10/20/2022]
Abstract
Cervical cancer, which develops from precancerous lesions caused by the human papillomavirus (HPV), is one of the cancers preventable with the help of periodic screening. Cervical intraepithelial neoplasia (CIN) and squamous intraepithelial lesion (SIL) are two grading conventions widely accepted by pathologists. However, inter-observer variability is an important issue for the final diagnosis. In this paper, a whole-slide image grading benchmark for cervical cancer precursor lesions is created, and the "Uterine Cervical Cancer Database" introduced in this article is the first publicly available cervical tissue microscopy image dataset. In addition, a morphological feature representing the angle between the basal membrane (BM) and the major axis of each nucleus in the tissue is proposed. The presence of papillae of the cervical epithelium and overlapping cell problems are also discussed. Finally, the inter-observer variability is evaluated by thorough comparisons among the decisions of pathologists, as well as the final diagnosis.
|
6
|
Multi-Nyström Method Based on Multiple Kernel Learning for Large Scale Imbalanced Classification. Computational Intelligence and Neuroscience 2021; 2021:9911871. [PMID: 34234824 PMCID: PMC8216788 DOI: 10.1155/2021/9911871] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/09/2021] [Accepted: 05/27/2021] [Indexed: 11/17/2022]
Abstract
Extensions of kernel methods for class imbalance problems have been extensively studied. Although they work well in coping with nonlinear problems, the high computation and memory costs severely limit their application to real-world imbalanced tasks. The Nyström method is an effective technique to scale kernel methods. However, the standard Nyström method needs to sample a sufficiently large number of landmark points to ensure an accurate approximation, which seriously affects its efficiency. In this study, we propose a multi-Nyström method based on mixtures of Nyström approximations to avoid the explosion of the subkernel matrix, whereas the optimization of the mixture weights is embedded into the model training process by multiple kernel learning (MKL) algorithms to yield a more accurate low-rank approximation. Moreover, we select subsets of landmark points according to the imbalance distribution to reduce the model's sensitivity to skewness. We also provide a kernel stability analysis of our method and show that the model solution error is bounded by weighted approximation errors, which can help us improve the learning process. Extensive experiments on several large scale datasets show that our method can achieve a higher classification accuracy and a dramatic speedup of MKL algorithms.
|
7
|
Lu Y, Cheung YM, Tang YY. Self-Adaptive Multiprototype-Based Competitive Learning Approach: A k-Means-Type Algorithm for Imbalanced Data Clustering. IEEE Transactions on Cybernetics 2021; 51:1598-1612. [PMID: 31150353 DOI: 10.1109/tcyb.2019.2916196] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Indexed: 06/09/2023]
Abstract
The class imbalance problem has been extensively studied in recent years, but imbalanced data clustering in an unsupervised environment, that is, where the number of samples among clusters is imbalanced, has yet to be well studied. This paper, therefore, studies the imbalanced data clustering problem within the framework of k-means-type competitive learning. We introduce a new method called self-adaptive multiprototype-based competitive learning (SMCL) for imbalanced clusters. It uses multiple subclusters to represent each cluster with an automatic adjustment of the number of subclusters. Then, the subclusters are merged into the final clusters based on a novel separation measure. We also propose a new internal clustering validation measure to determine the number of final clusters during the merging process for imbalanced clusters. The advantages of SMCL are threefold: 1) it inherits the advantages of competitive learning while being applicable to imbalanced data clustering; 2) the self-adaptive multiprototype mechanism uses a proper number of subclusters to represent each cluster of arbitrary shape; and 3) it automatically determines the number of clusters for imbalanced clusters. SMCL is compared with the existing counterparts for imbalanced clustering on synthetic and real datasets. The experimental results show the efficacy of SMCL for imbalanced clusters.
|
8
|
Gultekin S, Saha A, Ratnaparkhi A, Paisley J. MBA: Mini-Batch AUC Optimization. IEEE Transactions on Neural Networks and Learning Systems 2020; 31:5561-5574. [PMID: 32142457 DOI: 10.1109/tnnls.2020.2969527] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 06/10/2023]
Abstract
Area under the receiver operating characteristics curve (AUC) is an important metric for a wide range of machine-learning problems, and scalable methods for optimizing AUC have recently been proposed. However, handling very large data sets remains an open challenge for this problem. This article proposes a novel approach to AUC maximization based on sampling mini-batches of positive/negative instance pairs and computing U-statistics to approximate a global risk minimization problem. The resulting algorithm is simple, fast, and learning-rate free. We show that the number of samples required for good performance is independent of the number of pairs available, which is a quadratic function of the positive and negative instances. Extensive experiments show the practical utility of the proposed method.
|
9
|
Liu Y, Yu Z, Chen C, Han Y, Yu B. Prediction of protein crotonylation sites through LightGBM classifier based on SMOTE and elastic net. Anal Biochem 2020; 609:113903. [DOI: 10.1016/j.ab.2020.113903] [Citation(s) in RCA: 21] [Impact Index Per Article: 4.2] [Received: 01/11/2020] [Revised: 07/27/2020] [Accepted: 08/05/2020] [Indexed: 12/18/2022]
|
10
|
GT2FS-SMOTE: An Intelligent Oversampling Approach Based Upon General Type-2 Fuzzy Sets to Detect Web Spam. Arabian Journal for Science and Engineering 2020. [DOI: 10.1007/s13369-020-04995-5] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Indexed: 11/29/2022]
|
11
|
Lu Y, Cheung YM, Tang YY. Bayes Imbalance Impact Index: A Measure of Class Imbalanced Data Set for Classification Problem. IEEE Transactions on Neural Networks and Learning Systems 2020; 31:3525-3539. [PMID: 31689217 DOI: 10.1109/tnnls.2019.2944962] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Indexed: 06/10/2023]
Abstract
Recent studies of imbalanced data classification have shown that the imbalance ratio (IR) is not the only cause of performance loss in a classifier, as other data factors, such as small disjuncts, noise, and overlapping, can also make the problem difficult. The relationship between the IR and other data factors has been demonstrated, but to the best of our knowledge, there is no measurement of the extent to which class imbalance influences the classification performance of imbalanced data. In addition, it is also unknown which data factor serves as the main barrier for classification in a data set. In this article, we focus on the Bayes optimal classifier and examine the influence of class imbalance from a theoretical perspective. We propose an instance measure called the Individual Bayes Imbalance Impact Index (IBI3) and a data measure called the Bayes Imbalance Impact Index (BI3). IBI3 and BI3 reflect the extent of influence using only the imbalance factor, in terms of each minority class sample and the whole data set, respectively. Therefore, IBI3 can be used as an instance complexity measure of imbalance and BI3 as a criterion to demonstrate the degree to which imbalance deteriorates the classification of a data set. We can, therefore, use BI3 to assess whether it is worth using imbalance recovery methods, such as sampling or cost-sensitive methods, to recover the performance loss of a classifier. The experiments show that IBI3 is highly consistent with the increase of the prediction score obtained by the imbalance recovery methods and that BI3 is highly consistent with the improvement in the F1 score obtained by the imbalance recovery methods on both synthetic and real benchmark data sets.
|
12
|
Zhu YH, Hu J, Qi Y, Song XN, Yu DJ. Boosting Granular Support Vector Machines for the Accurate Prediction of Protein-Nucleotide Binding Sites. Comb Chem High Throughput Screen 2020; 22:455-469. [PMID: 31553288 DOI: 10.2174/1386207322666190925125524] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 05/04/2019] [Revised: 06/21/2019] [Accepted: 08/23/2019] [Indexed: 11/22/2022]
Abstract
AIM AND OBJECTIVE: The accurate identification of protein-ligand binding sites helps elucidate protein function and facilitate the design of new drugs. Machine-learning-based methods have been widely used for the prediction of protein-ligand binding sites. Nevertheless, the severe class imbalance phenomenon, where the number of nonbinding (majority) residues is far greater than that of binding (minority) residues, has a negative impact on the performance of such machine-learning-based predictors. MATERIALS AND METHODS: In this study, we aim to relieve the negative impact of class imbalance by Boosting Multiple Granular Support Vector Machines (BGSVM). In BGSVM, each base SVM is trained on a granular training subset consisting of all minority samples and some reasonably selected majority samples. The efficacy of BGSVM for dealing with class imbalance was validated by benchmarking it with several typical imbalance learning algorithms. We further implemented a protein-nucleotide binding site predictor, called BGSVM-NUC, with the BGSVM algorithm. RESULTS: Rigorous cross-validation and independent validation tests for five types of protein-nucleotide interactions demonstrated that the proposed BGSVM-NUC achieves promising prediction performance and outperforms several popular sequence-based protein-nucleotide binding site predictors. The BGSVM-NUC web server is freely available at http://csbio.njust.edu.cn/bioinf/BGSVM-NUC/ for academic use.
Affiliation(s)
- Yi-Heng Zhu: School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
- Jun Hu: College of Information Engineering, Zhejiang University of Technology, Hangzhou 310023, China
- Yong Qi: School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
- Xiao-Ning Song: School of Internet of Things, Jiangnan University, Wuxi 214122, China
- Dong-Jun Yu: School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
|
13
|
Adnan A, Muhammed A, Abd Ghani AA, Abdullah A, Hakim F. Hyper-Heuristic Framework for Sequential Semi-Supervised Classification Based on Core Clustering. Symmetry (Basel) 2020; 12:1292. [DOI: 10.3390/sym12081292] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 09/01/2023] Open
Abstract
Existing stream data learning models with limited labeling have many limitations, most importantly a limited capability to adapt to the evolving nature of data, which is called concept drift. Hence, the algorithm must overcome the problem of dynamically updating its internal parameters or countering the concept drift. However, using neural network-based semi-supervised stream data learning is not adequate due to the need to quickly capture the changes in the distribution and characteristics of various classes of the data whilst avoiding the effect of the outdated stored knowledge in neural networks (NN). This article presents a prominent framework that integrates an NN, a meta-heuristic based on an evolutionary genetic algorithm (GA), and a core online-offline clustering (Core). The framework trains the NN on previously labeled data and its knowledge is used to calculate the error of the core online-offline clustering block. The genetic optimization is responsible for selecting the best parameters of the core model to minimize the error. This integration aims to handle the concept drift. We designated this model as hyper-heuristic framework for semi-supervised classification or HH-F. Experimental results of the application of HH-F on real datasets prove the superiority of the proposed framework over the existing state-of-the-art approaches used in the literature for sequential classification of data with an evolving nature.
|
14
|
Pei W, Xue B, Shang L, Zhang M. Genetic programming for high-dimensional imbalanced classification with a new fitness function and program reuse mechanism. Soft Comput 2020. [DOI: 10.1007/s00500-020-05056-7] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Indexed: 11/25/2022]
|
15
|
Kim H, Na SH. Uniformly Interpolated Balancing for Robust Prediction in Translation Quality Estimation. ACM T ASIAN LOW-RESO 2020. [DOI: 10.1145/3365916] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/24/2022]
Abstract
There has been growing interest among researchers in quality estimation (QE), which attempts to automatically predict the quality of machine translation (MT) outputs. Most existing works on QE are based on supervised approaches using quality-annotated training data. However, QE training data quality scores readily become imbalanced or skewed: QE data are mostly composed of high translation quality sentence pairs but the data lack low translation quality sentence pairs. The use of imbalanced data with an induced quality estimator tends to produce biased translation quality scores with “high” translation quality scores assigned even to poorly translated sentences. To address the data imbalance, this article proposes a simple, efficient procedure called uniformly interpolated balancing to construct more balanced QE training data by inserting greater uniformness to training data. The proposed uniformly interpolated balancing procedure is based on the preparation of two different types of manually annotated QE data: (1) default skewed data and (2) near-uniform data. First, we obtain default skewed data in a naive manner without considering the imbalance by manually annotating qualities on MT outputs. Second, we obtain near-uniform data in a selective manner by manually annotating a subset only, which is selected from the automatically quality-estimated sentence pairs. Finally, we create uniformly interpolated balanced data by combining these two types of data, where one half originates from the default skewed data and the other half originates from the near-uniform data. We expect that uniformly interpolated balancing reflects the intrinsic skewness of the true quality distribution and manages the imbalance problem. Experimental results on an English-Korean quality estimation task show that the proposed uniformly interpolated balancing leads to robustness on both skewed and uniformly distributed quality test sets when compared to the test sets of other non-balanced datasets.
Affiliation(s)
- Hyun Kim: Electronics and Telecommunications Research Institute (ETRI), Yuseong-gu, Daejeon, Republic of Korea
- Seung-Hoon Na: Jeonbuk National University, Baekje-daero, Deokjin-gu, Jeonju, Republic of Korea
|
16
|
Pang Y, Peng L, Chen Z, Yang B, Zhang H. Imbalanced learning based on adaptive weighting and Gaussian function synthesizing with an application on Android malware detection. Inf Sci (N Y) 2019. [DOI: 10.1016/j.ins.2019.01.065] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.2] [Indexed: 11/28/2022]
|
17
|
Fan Z, Lu J, Wei C, Huang H, Cai X, Chen X. A Hierarchical Image Matting Model for Blood Vessel Segmentation in Fundus Images. IEEE Transactions on Image Processing: A Publication of the IEEE Signal Processing Society 2018; 28:2367-2377. [PMID: 30571623 DOI: 10.1109/tip.2018.2885495] [Citation(s) in RCA: 37] [Impact Index Per Article: 5.3] [Indexed: 06/09/2023]
Abstract
In this paper, a hierarchical image matting model is proposed to extract blood vessels from fundus images. More specifically, a hierarchical strategy is integrated into the image matting model for blood vessel segmentation. Normally, matting models require a user-specified trimap, which separates the input image into three regions: the foreground, background and unknown regions. However, creating a user-specified trimap is laborious for vessel segmentation tasks. In this paper, we propose a method that first generates a trimap automatically by utilizing region features of blood vessels, then applies a hierarchical image matting model to extract the vessel pixels from the unknown regions. The proposed method has low calculation time and outperforms many other state-of-the-art supervised and unsupervised methods. It achieves a vessel segmentation accuracy of 96.0%, 95.7% and 95.1% in an average time of 10.72s, 15.74s and 50.71s on images from three publicly available fundus image datasets DRIVE, STARE, and CHASE DB1, respectively.
|
18
|
Retinal Blood Vessel Segmentation by Using Matched Filtering and Fuzzy C-means Clustering with Integrated Level Set Method for Diabetic Retinopathy Assessment. J Med Biol Eng 2018. [DOI: 10.1007/s40846-018-0454-2] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.3] [Indexed: 10/28/2022]
|
20
|
Mathew J, Pang CK, Luo M, Leong WH. Classification of Imbalanced Data by Oversampling in Kernel Space of Support Vector Machines. IEEE Transactions on Neural Networks and Learning Systems 2018; 29:4065-4076. [PMID: 29028213 DOI: 10.1109/tnnls.2017.2751612] [Citation(s) in RCA: 69] [Impact Index Per Article: 9.9] [Indexed: 06/07/2023]
Abstract
Historical data sets for fault stage diagnosis in industrial machines are often imbalanced and consist of multiple categories or classes. Learning discriminative models from such data sets is challenging due to the lack of representative data and the bias of traditional classifiers toward the majority class. Sampling methods like the synthetic minority oversampling technique (SMOTE) have traditionally been used for such problems to artificially balance the data set before it is used to train a classifier. This paper proposes a weighted kernel-based SMOTE (WK-SMOTE) that overcomes the limitation of SMOTE for nonlinear problems by oversampling in the feature space of the support vector machine (SVM) classifier. The proposed oversampling algorithm along with a cost-sensitive SVM formulation is shown to improve performance when compared to other baseline methods on multiple benchmark imbalanced data sets. In addition, a hierarchical framework is developed for multiclass imbalanced problems that have a progressive class order. The proposed WK-SMOTE and hierarchical framework are validated on a real-world industrial fault detection problem to identify deterioration in the insulation of high-voltage equipment.
|
21
|
Aydogan EK, Ozmen M, Delice Y. CBR-PSO: cost-based rough particle swarm optimization approach for high-dimensional imbalanced problems. Neural Comput Appl 2018. [DOI: 10.1007/s00521-018-3469-2] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.6] [Indexed: 11/28/2022]
|
22
|
Wang L, Zhu D, Dong M. Clustering over‐dispersed data with mixed feature types. Stat Anal Data Min 2018. [DOI: 10.1002/sam.11369] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Indexed: 11/11/2022]
Affiliation(s)
- Lu Wang: Department of Computer Science, Wayne State University, Detroit, Michigan
- Dongxiao Zhu: Department of Computer Science, Wayne State University, Detroit, Michigan
- Ming Dong: Department of Computer Science, Wayne State University, Detroit, Michigan
|
23
|
A compactness based saliency approach for leakages detection in fluorescein angiogram. Int J Mach Learn Cyb 2017. [DOI: 10.1007/s13042-016-0573-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/21/2022]
|
24
|
Du J, Vong CM, Pun CM, Wong PK, Ip WF. Post-boosting of classification boundary for imbalanced data using geometric mean. Neural Netw 2017; 96:101-114. [PMID: 28987974 DOI: 10.1016/j.neunet.2017.09.004] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.5] [Received: 04/27/2016] [Revised: 06/21/2017] [Accepted: 09/05/2017] [Indexed: 11/24/2022]
Abstract
In this paper, a novel imbalance learning method for binary classes is proposed, named Post-Boosting of classification boundary for Imbalanced data (PBI), which can significantly improve the performance of any trained neural network (NN) classification boundary. The procedure of PBI simply consists of two steps: an (imbalanced) NN learning method is first applied to produce a classification boundary, which is then adjusted by PBI under the geometric mean (G-mean). For data imbalance, the geometric mean of the accuracies of both minority and majority classes is considered, which is statistically more suitable than the common accuracy metric. PBI also has the following advantages over traditional imbalance methods: (i) PBI can significantly improve the classification accuracy on the minority class while improving or maintaining that on the majority class; (ii) PBI is suitable for large data even with a high imbalance ratio (up to 0.001). For evaluation of (i), a new metric called Majority loss/Minority advance ratio (MMR) is proposed that evaluates the loss ratio of majority class to minority class. Experiments have been conducted for PBI and several imbalance learning methods over benchmark datasets of different sizes, different imbalance ratios, and different dimensionalities. By analyzing the experimental results, PBI is shown to outperform other imbalance learning methods on almost all datasets.
Affiliation(s)
- Jie Du: Department of Computer and Information Science, University of Macau, Macau.
- Chi-Man Vong: Department of Computer and Information Science, University of Macau, Macau.
- Chi-Man Pun: Department of Computer and Information Science, University of Macau, Macau.
- Pak-Kin Wong: Department of Electromechanical Engineering, University of Macau, Macau.
- Weng-Fai Ip: Faculty of Science and Technology, University of Macau, Macau.
|
25
|
Retinal Vessel Segmentation via Structure Tensor Coloring and Anisotropy Enhancement. Symmetry (Basel) 2017. [DOI: 10.3390/sym9110276] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.0] [Indexed: 11/16/2022] Open
|
26
|
Detecting representative data and generating synthetic samples to improve learning accuracy with imbalanced data sets. PLoS One 2017; 12:e0181853. [PMID: 28771522 PMCID: PMC5542532 DOI: 10.1371/journal.pone.0181853] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Received: 10/05/2016] [Accepted: 07/07/2017] [Indexed: 11/19/2022] Open
Abstract
It is difficult for learning models to achieve high classification performance with imbalanced data sets because, when one of the classes is much larger than the others, most machine learning and data mining classifiers are overly influenced by the larger classes and ignore the smaller ones. As a result, the classification algorithms often have poor learning performance due to slow convergence in the smaller classes. To balance such data sets, this paper presents a strategy that involves reducing the size of the majority data and generating synthetic samples for the minority data. In the reducing operation, we use the box-and-whisker plot approach to exclude outliers and the Mega-Trend-Diffusion method to find representative data from the majority data. To generate the synthetic samples, we propose a counterintuitive hypothesis to find the distributed shape of the minority data, and then produce samples according to this distribution. Four real datasets were used to examine the performance of the proposed approach. We used paired t-tests to compare the Accuracy, G-mean, and F-measure scores of the proposed data pre-processing (PPDP) method merged with the D3C method (PPDP+D3C) with those of one-sided selection (OSS), the well-known SMOTEBoost (SB), the normal distribution-based oversampling (NDO) approach, and the PPDP method. The results indicate that the classification performance of the proposed approach is better than that of the above-mentioned methods.
|
27
|
Retinal Image Denoising via Bilateral Filter with a Spatial Kernel of Optimally Oriented Line Spread Function. Computational and Mathematical Methods in Medicine 2017; 2017:1769834. [PMID: 28261320 PMCID: PMC5316463 DOI: 10.1155/2017/1769834] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 08/25/2016] [Revised: 11/30/2016] [Accepted: 12/13/2016] [Indexed: 11/18/2022]
Abstract
Filtering is among the most fundamental operations of retinal image processing, in which the value of the filtered image at a given location is a function of the values in a local window centered at this location. However, preserving thin retinal vessels during the filtering process is challenging due to the vessels' small area and weak contrast compared to the background, caused by the limited resolution of imaging and lower blood flow in the vessel. In this paper, we present a novel retinal image denoising approach which is able to preserve the details of retinal vessels while effectively eliminating image noise. Specifically, our approach is carried out by determining an optimal spatial kernel for the bilateral filter, which is represented by a line spread function with an orientation and scale adjusted adaptively to the local vessel structure. Moreover, this approach can also serve as a preprocessing tool for improving the accuracy of the vessel detection technique. Experimental results show the superiority of our approach over state-of-the-art image denoising techniques such as the bilateral filter.
|
29
|
A novel ensemble decision tree based on under-sampling and clonal selection for web spam detection. Pattern Anal Appl 2017. [DOI: 10.1007/s10044-017-0602-2] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Indexed: 12/17/2022]
|
30
|
Zhao Y, Zheng Y, Liu Y, Yang J, Zhao Y, Chen D, Wang Y. Intensity and Compactness Enabled Saliency Estimation for Leakage Detection in Diabetic and Malarial Retinopathy. IEEE Transactions on Medical Imaging 2017; 36:51-63. [PMID: 27455519 DOI: 10.1109/tmi.2016.2593725] [Citation(s) in RCA: 29] [Impact Index Per Article: 3.6] [Indexed: 05/12/2023]
Abstract
Leakage in retinal angiography currently is a key feature for confirming the activities of lesions in the management of a wide range of retinal diseases, such as diabetic maculopathy and paediatric malarial retinopathy. This paper proposes a new saliency-based method for the detection of leakage in fluorescein angiography. A superpixel approach is firstly employed to divide the image into meaningful patches (or superpixels) at different levels. Two saliency cues, intensity and compactness, are then proposed for the estimation of the saliency map of each individual superpixel at each level. The saliency maps at different levels over the same cues are fused using an averaging operator. The two saliency maps over different cues are fused using a pixel-wise multiplication operator. Leaking regions are finally detected by thresholding the saliency map followed by a graph-cut segmentation. The proposed method has been validated using the only two publicly available datasets: one for malarial retinopathy and the other for diabetic retinopathy. The experimental results show that it outperforms one of the latest competitors and performs as well as a human expert for leakage detection and outperforms several state-of-the-art methods for saliency detection.
|
31
|
|
32
|
Perez-Ortiz M, Gutierrez PA, Tino P, Hervas-Martinez C. Oversampling the Minority Class in the Feature Space. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2016; 27:1947-1961. [PMID: 26316222 DOI: 10.1109/tnnls.2015.2461436] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/04/2023]
Abstract
The imbalanced nature of some real-world data is one of the current challenges for machine learning researchers. One common approach oversamples the minority class through convex combination of its patterns. We explore the general idea of synthetic oversampling in the feature space induced by a kernel function (as opposed to input space). If the kernel function matches the underlying problem, the classes will be linearly separable and synthetically generated patterns will lie on the minority class region. Since the feature space is not directly accessible, we use the empirical feature space (EFS) (a Euclidean space isomorphic to the feature space) for oversampling purposes. The proposed method is framed in the context of support vector machines, where the imbalanced data sets can pose a serious hindrance. The idea is investigated in three scenarios: 1) oversampling in the full and reduced-rank EFSs; 2) a kernel learning technique maximizing the data class separation to study the influence of the feature space structure (implicitly defined by the kernel function); and 3) a unified framework for preferential oversampling that spans some of the previous approaches in the literature. We support our investigation with extensive experiments over 50 imbalanced data sets.
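The "convex combination of its patterns" approach mentioned in this abstract can be sketched as follows. This is a minimal input-space illustration only, not the empirical-feature-space method the paper proposes; the function name, parameters, and toy data are assumptions.

```python
import random

def oversample_convex(minority, n_new, seed=0):
    """Return n_new synthetic points, each a convex combination
    of two distinct minority-class patterns."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)  # two distinct minority patterns
        lam = rng.random()              # mixing weight in [0, 1)
        synthetic.append([ai + lam * (bi - ai) for ai, bi in zip(a, b)])
    return synthetic

# Hypothetical minority class with three 2-D patterns.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_points = oversample_convex(minority, n_new=5)
```

Each synthetic point lies on the segment between two existing minority patterns, so it stays inside their convex hull.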
|
33
|
Wei ZS, Han K, Yang JY, Shen HB, Yu DJ. Protein–protein interaction sites prediction by ensembling SVM and sample-weighted random forests. Neurocomputing 2016. [DOI: 10.1016/j.neucom.2016.02.022] [Citation(s) in RCA: 34] [Impact Index Per Article: 3.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/04/2023]
|
34
|
Wang L, Zhang H, He K, Chang Y, Yang X. Active Contours Driven by Multi-Feature Gaussian Distribution Fitting Energy with Application to Vessel Segmentation. PLoS One 2015; 10:e0143105. [PMID: 26571031 PMCID: PMC4646657 DOI: 10.1371/journal.pone.0143105] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/20/2015] [Accepted: 10/30/2015] [Indexed: 12/03/2022] Open
Abstract
Active contour models are of great importance for image segmentation and can extract smooth and closed boundary contours of the desired objects with promising results. However, they cannot work well in the presence of intensity inhomogeneity. Hence, a novel region-based active contour model is proposed by taking image intensities and ‘vesselness values’ from local phase-based vesselness enhancement into account simultaneously to define a novel multi-feature Gaussian distribution fitting energy in this paper. This energy is then incorporated into a level set formulation with a regularization term for accurate segmentations. Experimental results based on publicly available STructured Analysis of the Retina (STARE) demonstrate our model is more accurate than some existing typical methods and can successfully segment most small vessels with varying width.
Affiliation(s)
- Lei Wang
- Medical Imaging Department, Suzhou Institute of Biomedical Engineering and Technology, Chinese Academy of Sciences, Suzhou, Jiangsu, China
| | - Huimao Zhang
- Radiology Department, The First Hospital of JiLin University, Changchun, JiLin, China
| | - Kan He
- Radiology Department, The First Hospital of JiLin University, Changchun, JiLin, China
| | - Yan Chang
- Medical Imaging Department, Suzhou Institute of Biomedical Engineering and Technology, Chinese Academy of Sciences, Suzhou, Jiangsu, China
| | - Xiaodong Yang
- Medical Imaging Department, Suzhou Institute of Biomedical Engineering and Technology, Chinese Academy of Sciences, Suzhou, Jiangsu, China
|
35
|
Prediction of Protein–Protein Interaction Sites with Machine-Learning-Based Data-Cleaning and Post-Filtering Procedures. J Membr Biol 2015; 249:141-53. [DOI: 10.1007/s00232-015-9856-z] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.9] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/07/2015] [Accepted: 11/03/2015] [Indexed: 12/12/2022]
|
36
|
Zhao Y, Rada L, Chen K, Harding SP, Zheng Y. Automated Vessel Segmentation Using Infinite Perimeter Active Contour Model with Hybrid Region Information with Application to Retinal Images. IEEE TRANSACTIONS ON MEDICAL IMAGING 2015; 34:1797-807. [PMID: 25769147 DOI: 10.1109/tmi.2015.2409024] [Citation(s) in RCA: 151] [Impact Index Per Article: 15.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/15/2023]
Abstract
Automated detection of blood vessel structures is becoming of crucial interest for better management of vascular disease. In this paper, we propose a new infinite active contour model that uses hybrid region information of the image to approach this problem. More specifically, an infinite perimeter regularizer, provided by using L² Lebesgue measure of the γ-neighborhood of boundaries, allows for better detection of small oscillatory (branching) structures than the traditional models based on the length of a feature's boundaries (i.e., H¹ Hausdorff measure). Moreover, for better general segmentation performance, the proposed model takes the advantage of using different types of region information, such as the combination of intensity information and local phase based enhancement map. The local phase based enhancement map is used for its superiority in preserving vessel edges while the given image intensity information will guarantee a correct feature's segmentation. We evaluate the performance of the proposed model by applying it to three public retinal image datasets (two datasets of color fundus photography and one fluorescein angiography dataset). The proposed model outperforms its competitors when compared with other widely used unsupervised and supervised methods. For example, the sensitivity (0.742), specificity (0.982) and accuracy (0.954) achieved on the DRIVE dataset are very close to those of the second observer's annotations.
|
37
|
Zhao Y, MacCormick IJC, Parry DG, Beare NAV, Harding SP, Zheng Y. Automated Detection of Vessel Abnormalities on Fluorescein Angiogram in Malarial Retinopathy. Sci Rep 2015; 5:11154. [PMID: 26053690 PMCID: PMC4459173 DOI: 10.1038/srep11154] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/19/2014] [Accepted: 05/18/2015] [Indexed: 11/08/2022] Open
Abstract
The detection and assessment of intravascular filling defects is important, because they may represent a process central to cerebral malaria pathogenesis: neurovascular sequestration. We have developed and validated a framework that can automatically detect intravascular filling defects in fluorescein angiogram images. It first employs a state-of-the-art segmentation approach to extract the vessels from images and then divide them into individual segments by geometrical analysis. A feature vector based on the intensity and shape of saliency maps is generated to represent the level of abnormality of each vessel segment. An AdaBoost classifier with weighted cost coefficient is trained to classify the vessel segments into normal and abnormal categories. To demonstrate its effectiveness, we apply this framework to 6,358 vessel segments in images from 10 patients with malarial retinopathy. The test sensitivity, specificity, accuracy, and area under curve (AUC) are 74.7%, 73.5%, 74.1% and 74.2% respectively when compared to the reference standard of human expert manual annotations. This performance is comparable to the agreement that we find between human observers of intravascular filling defects. Our method will be a powerful new tool for studying malarial retinopathy.
Affiliation(s)
- Yitian Zhao
- School of Mechatronical Engineering, Beijing Institute of Technology, Beijing, China
- Department of Eye and Vision Science, University of Liverpool, Liverpool, UK
| | - Ian J. C. MacCormick
- Department of Eye and Vision Science, University of Liverpool, Liverpool, UK
- Malawi-Liverpool-Wellcome Trust Clinical Research Programme, Blantyre, Malawi
| | - David G. Parry
- St. Pauls Eye Unit, Royal Liverpool University Hospital, Liverpool, UK
| | - Nicholas A. V. Beare
- Department of Eye and Vision Science, University of Liverpool, Liverpool, UK
- St. Pauls Eye Unit, Royal Liverpool University Hospital, Liverpool, UK
| | - Simon P. Harding
- Department of Eye and Vision Science, University of Liverpool, Liverpool, UK
- St. Pauls Eye Unit, Royal Liverpool University Hospital, Liverpool, UK
| | - Yalin Zheng
- Department of Eye and Vision Science, University of Liverpool, Liverpool, UK
- St. Pauls Eye Unit, Royal Liverpool University Hospital, Liverpool, UK
|
38
|
Tan SC, Watada J, Ibrahim Z, Khalid M. Evolutionary fuzzy ARTMAP neural networks for classification of semiconductor defects. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2015; 26:933-950. [PMID: 25014967 DOI: 10.1109/tnnls.2014.2329097] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/03/2023]
Abstract
Wafer defect detection using an intelligent system is an approach to quality improvement in semiconductor manufacturing that aims to enhance process stability, increase production capacity, and improve yields. Occasionally, only a few records that indicate defective units are available, and they form a minority group in a large database. Such a situation leads to an imbalanced data set problem, which poses a great challenge for machine-learning techniques seeking an effective solution. In addition, the database may comprise overlapping samples of different classes. This paper introduces two models of evolutionary fuzzy ARTMAP (FAM) neural networks to deal with imbalanced data set problems in semiconductor manufacturing operations. In particular, both the FAM models and hybrid genetic algorithms are integrated in the proposed evolutionary artificial neural networks (EANNs) to classify an imbalanced data set. In addition, one of the proposed EANNs incorporates a facility to learn overlapping samples of different classes from the imbalanced data environment. The classification results of the proposed evolutionary FAM neural networks are presented, compared, and analyzed using several classification metrics. The outcomes positively indicate the effectiveness of the proposed networks in handling classification problems with imbalanced data sets.
|
39
|
Zhao Y, Liu Y, Wu X, Harding SP, Zheng Y. Retinal vessel segmentation: an efficient graph cut approach with retinex and local phase. PLoS One 2015; 10:e0122332. [PMID: 25830353 PMCID: PMC4382050 DOI: 10.1371/journal.pone.0122332] [Citation(s) in RCA: 64] [Impact Index Per Article: 6.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/26/2014] [Accepted: 02/10/2015] [Indexed: 11/18/2022] Open
Abstract
Our application concerns the automated detection of vessels in retinal images to improve understanding of the disease mechanism, diagnosis and treatment of retinal and a number of systemic diseases. We propose a new framework for segmenting retinal vasculatures with much improved accuracy and efficiency. The proposed framework consists of three technical components: Retinex-based image inhomogeneity correction, local phase-based vessel enhancement and graph cut-based active contour segmentation. These procedures are applied in the following order. Underpinned by the Retinex theory, the inhomogeneity correction step aims to address challenges presented by the image intensity inhomogeneities, and the relatively low contrast of thin vessels compared to the background. The local phase enhancement technique is employed to enhance vessels for its superiority in preserving the vessel edges. The graph cut-based active contour method is used for its efficiency and effectiveness in segmenting the vessels from the enhanced images using the local phase filter. We have demonstrated its performance by applying it to four public retinal image datasets (3 datasets of color fundus photography and 1 of fluorescein angiography). Statistical analysis demonstrates that each component of the framework can provide the level of performance expected. The proposed framework is compared with widely used unsupervised and supervised methods, showing that the overall framework outperforms its competitors. For example, the achieved sensitivity (0.744), specificity (0.978) and accuracy (0.953) for the DRIVE dataset are very close to those of the manual annotations obtained by the second observer.
Affiliation(s)
- Yitian Zhao
- Department of Eye and Vision Science, University of Liverpool, Liverpool, United Kingdom
| | - Yonghuai Liu
- Department of Computer Science, Aberystwyth University, Aberystwyth, United Kingdom
| | - Xiangqian Wu
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
| | - Simon P. Harding
- Department of Eye and Vision Science, University of Liverpool, Liverpool, United Kingdom
- St Paul’s Eye Unit, Royal Liverpool University Hospital, Liverpool, United Kingdom
| | - Yalin Zheng
- Department of Eye and Vision Science, University of Liverpool, Liverpool, United Kingdom
- St Paul’s Eye Unit, Royal Liverpool University Hospital, Liverpool, United Kingdom
|
40
|
Saito T, Rehmsmeier M. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS One 2015; 10:e0118432. [PMID: 25738806 PMCID: PMC4349800 DOI: 10.1371/journal.pone.0118432] [Citation(s) in RCA: 1606] [Impact Index Per Article: 160.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/23/2014] [Accepted: 01/16/2015] [Indexed: 11/18/2022] Open
Abstract
Binary classifiers are routinely evaluated with performance measures such as sensitivity and specificity, and performance is frequently illustrated with Receiver Operating Characteristics (ROC) plots. Alternative measures such as positive predictive value (PPV) and the associated Precision/Recall (PRC) plots are used less frequently. Many bioinformatics studies develop and evaluate classifiers that are to be applied to strongly imbalanced datasets in which the number of negatives outweighs the number of positives significantly. While ROC plots are visually appealing and provide an overview of a classifier's performance across a wide range of specificities, one can ask whether ROC plots could be misleading when applied in imbalanced classification scenarios. We show here that the visual interpretability of ROC plots in the context of imbalanced datasets can be deceptive with respect to conclusions about the reliability of classification performance, owing to an intuitive but wrong interpretation of specificity. PRC plots, on the other hand, can provide the viewer with an accurate prediction of future classification performance due to the fact that they evaluate the fraction of true positives among positive predictions. Our findings have potential implications for the interpretation of a large number of studies that use ROC plots on imbalanced datasets.
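The abstract's central point can be checked with simple arithmetic: a classifier's ROC operating point does not change with class ratio, while precision (PPV) does. The sketch below assumes a hypothetical classifier with fixed true-positive and false-positive rates; the function name and the counts are illustrative, not taken from the paper.

```python
def roc_point_and_precision(n_pos, n_neg, tpr, fpr):
    """Return the ROC operating point (fpr, tpr) and the precision
    obtained at that point for the given class counts."""
    tp = tpr * n_pos            # expected true positives
    fp = fpr * n_neg            # expected false positives
    precision = tp / (tp + fp)  # fraction of positive predictions that are correct
    return (fpr, tpr), precision

# Same hypothetical classifier, two class ratios.
balanced = roc_point_and_precision(1000, 1000, tpr=0.8, fpr=0.1)
imbalanced = roc_point_and_precision(1000, 100000, tpr=0.8, fpr=0.1)
```

With 1,000 positives and 1,000 negatives the precision is 800/900, about 0.89; with 100,000 negatives it falls to 800/10,800, about 0.07, although the ROC point is identical in both cases.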
Affiliation(s)
- Takaya Saito
- Computational Biology Unit, Department of Informatics, University of Bergen, P. O. Box 7803, N-5020, Bergen, Norway
- * E-mail: (TS); (MR)
| | - Marc Rehmsmeier
- Computational Biology Unit, Department of Informatics, University of Bergen, P. O. Box 7803, N-5020, Bergen, Norway
- * E-mail: (TS); (MR)
|
41
|
Hong X, Chen S, Qatawneh A, Daqrouq K, Sheikh M, Morfeq A. A radial basis function network classifier to maximise leave-one-out mutual information. Appl Soft Comput 2014. [DOI: 10.1016/j.asoc.2014.06.003] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/25/2022]
|
42
|
Hu J, He X, Yu DJ, Yang XB, Yang JY, Shen HB. A new supervised over-sampling algorithm with application to protein-nucleotide binding residue prediction. PLoS One 2014; 9:e107676. [PMID: 25229688 PMCID: PMC4168127 DOI: 10.1371/journal.pone.0107676] [Citation(s) in RCA: 33] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/09/2014] [Accepted: 08/09/2014] [Indexed: 12/21/2022] Open
Abstract
Protein-nucleotide interactions are ubiquitous in a wide variety of biological processes. Accurately identifying interaction residues solely from protein sequences is useful for both protein function annotation and drug design, especially in the post-genomic era, as large volumes of protein data have not been functionally annotated. Protein-nucleotide binding residue prediction is a typical imbalanced learning problem, where binding residues are extremely fewer in number than non-binding residues. Alleviating the severity of class imbalance has been demonstrated to be a promising means of improving the prediction performance of a machine-learning-based predictor for class imbalance problems. However, little attention has been paid to the negative impact of class imbalance on protein-nucleotide binding residue prediction. In this study, we propose a new supervised over-sampling algorithm that synthesizes additional minority class samples to address class imbalance. The experimental results from protein-nucleotide interaction datasets demonstrate that the proposed supervised over-sampling algorithm can relieve the severity of class imbalance and help to improve prediction performance. Based on the proposed over-sampling algorithm, a predictor, called TargetSOS, is implemented for protein-nucleotide binding residue prediction. Cross-validation tests and independent validation tests demonstrate the effectiveness of TargetSOS. The web-server and datasets used in this study are freely available at http://www.csbio.sjtu.edu.cn/bioinf/TargetSOS/.
Affiliation(s)
- Jun Hu
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, Jiangsu, China
| | - Xue He
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, Jiangsu, China
| | - Dong-Jun Yu
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, Jiangsu, China
- Changshu Institute, Nanjing University of Science and Technology, Changshu, Jiangsu, China
- * E-mail: (DJY); (HBS)
| | - Xi-Bei Yang
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, Jiangsu, China
- School of Computer Science and Engineering, Jiangsu University of Science and Technology, Zhenjiang, Jiangsu, China
| | - Jing-Yu Yang
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, Jiangsu, China
| | - Hong-Bin Shen
- Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, Shanghai, China
- * E-mail: (DJY); (HBS)
|
43
|
Gao M, Hong X, Chen S, Harris CJ, Khalaf E. PDFOS: PDF estimation based over-sampling for imbalanced two-class problems. Neurocomputing 2014. [DOI: 10.1016/j.neucom.2014.02.006] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/17/2022]
|
44
|
Yu H, Ni J. An Improved Ensemble Learning Method for Classifying High-Dimensional and Imbalanced Biomedicine Data. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2014; 11:657-666. [PMID: 26356336 DOI: 10.1109/tcbb.2014.2306838] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/05/2023]
Abstract
Training classifiers on skewed data can be technically challenging, and the task becomes more difficult when the data are also high-dimensional. Skewed data often appear in the biomedical field. In this study, we try to deal with this problem by combining the asymmetric bagging ensemble classifier (asBagging) presented in previous work with an improved random subspace (RS) generation strategy called feature subspace (FSS). Specifically, FSS is a novel method to promote the balance between accuracy and diversity of base classifiers in asBagging. In view of the strong generalization capability of the support vector machine (SVM), we adopt it as the base classifier. Extensive experiments on four benchmark biomedicine data sets indicate that the proposed ensemble learning method outperforms many baseline approaches in terms of the Accuracy, F-measure, G-mean and AUC evaluation criteria; thus it can be regarded as an effective and efficient tool to deal with high-dimensional and imbalanced biomedical data.
|
45
|
|
46
|
|
47
|
Iterative nearest neighborhood oversampling in semisupervised learning from imbalanced data. ScientificWorldJournal 2013; 2013:875450. [PMID: 23935439 PMCID: PMC3725769 DOI: 10.1155/2013/875450] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/12/2013] [Accepted: 06/04/2013] [Indexed: 12/04/2022] Open
Abstract
Transductive graph-based semisupervised learning methods usually build an undirected graph utilizing both labeled and unlabeled samples as vertices. Those methods propagate label information of labeled samples to neighbors through their edges in order to get the predicted labels of unlabeled samples. Most popular semi-supervised learning approaches are sensitive to the initial label distribution, which is a problem for imbalanced labeled datasets. The class boundary will be severely skewed by the majority classes in an imbalanced classification. In this paper, we propose a simple and effective approach to alleviate the unfavorable influence of the imbalance problem by iteratively selecting a few unlabeled samples and adding them into the minority classes to form a balanced labeled dataset for the learning methods afterwards. The experiments on UCI datasets and the MNIST handwritten digits dataset show that the proposed approach outperforms other existing state-of-the-art methods.
|
48
|
Castro CL, Braga AP. Novel cost-sensitive approach to improve the multilayer perceptron performance on imbalanced data. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2013; 24:888-899. [PMID: 24808471 DOI: 10.1109/tnnls.2013.2246188] [Citation(s) in RCA: 69] [Impact Index Per Article: 5.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/03/2023]
Abstract
Traditional learning algorithms applied to complex and highly imbalanced training sets may not give satisfactory results when distinguishing between examples of the classes. The tendency is to yield classification models that are biased towards the overrepresented (majority) class. This paper investigates this class imbalance problem in the context of multilayer perceptron (MLP) neural networks. The consequences of the equal cost (loss) assumption on imbalanced data are formally discussed from a statistical learning theory point of view. A new cost-sensitive algorithm (CSMLP) is presented to improve the discrimination ability of (two-class) MLPs. The CSMLP formulation is based on a joint objective function that uses a single cost parameter to distinguish the importance of class errors. The learning rule extends the Levenberg-Marquardt rule, ensuring the computational efficiency of the algorithm. In addition, it is theoretically demonstrated that the incorporation of prior information via the cost parameter may lead to balanced decision boundaries in the feature space. Based on the statistical analysis of results on real data, our approach shows a significant improvement of the area under the receiver operating characteristic curve and G-mean measures of regular MLPs.
|
49
|
Park BJ, Oh SK, Pedrycz W. The design of polynomial function-based neural network predictors for detection of software defects. Inf Sci (N Y) 2013. [DOI: 10.1016/j.ins.2011.01.026] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/18/2022]
|
50
|
Pang S, Zhu L, Chen G, Sarrafzadeh A, Ban T, Inoue D. Dynamic class imbalance learning for incremental LPSVM. Neural Netw 2013; 44:87-100. [PMID: 23584135 DOI: 10.1016/j.neunet.2013.02.007] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/07/2012] [Revised: 02/27/2013] [Accepted: 02/28/2013] [Indexed: 11/18/2022]
Abstract
Linear Proximal Support Vector Machines (LPSVMs), like decision trees, classic SVM, etc., are originally not equipped to handle drifting data streams that exhibit high and varying degrees of class imbalance. For online classification of data streams with imbalanced class distribution, we propose a dynamic class imbalance learning (DCIL) approach to incremental LPSVM (IncLPSVM) modeling. In doing so, we simplify a computationally non-renewable weighted LPSVM to several core matrices multiplying two simple weight coefficients. When data addition and/or retirement occurs, the proposed DCIL-IncLPSVM accommodates newly presented class imbalance by a simple matrix and coefficient updating, while ensuring that no discriminative information is lost throughout the learning process. Experiments on benchmark datasets indicate that the proposed DCIL-IncLPSVM outperforms classic IncSVM and IncLPSVM in terms of F-measure and G-mean metrics. Moreover, our application to online face membership authentication shows that the proposed DCIL-IncLPSVM remains effective in the presence of highly dynamic class imbalance, which usually poses serious problems to previous approaches.
Affiliation(s)
- Shaoning Pang
- Department of Computing, Unitec Institute of Technology, Private Bag 92025, Auckland 1025, New Zealand.
|