1. Xu Y, Yu Z, Chen CLP, Liu Z. Adaptive Subspace Optimization Ensemble Method for High-Dimensional Imbalanced Data Classification. IEEE Transactions on Neural Networks and Learning Systems 2023; 34:2284-2297. [PMID: 34469316] [DOI: 10.1109/tnnls.2021.3106306]
Abstract
It is hard to construct an optimal classifier for high-dimensional imbalanced data, on which the performance of conventional classifiers degrades seriously. Although many approaches, such as resampling, cost-sensitive learning, and ensemble learning methods, have been proposed to deal with skewed data, they are constrained by high-dimensional data with noise and redundancy. In this study, we propose an adaptive subspace optimization ensemble method (ASOEM) for high-dimensional imbalanced data classification to overcome these limitations. To construct accurate and diverse base classifiers, a novel adaptive subspace optimization (ASO) method based on an adaptive subspace generation (ASG) process and a rotated subspace optimization (RSO) process is designed to generate multiple robust and discriminative subspaces. A resampling scheme is then applied to each optimized subspace to build class-balanced data for each base classifier. To verify its effectiveness, ASOEM is implemented with different resampling strategies on 24 real-world high-dimensional imbalanced datasets. Experimental results demonstrate that the proposed methods outperform other mainstream imbalance learning approaches and classifier ensemble methods.
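The overall recipe the abstract describes (a diverse feature subspace per base learner, class balancing inside each subspace, then a vote) can be sketched in a few lines. This is a loose illustration, not ASOEM itself: subspaces are drawn uniformly at random instead of being optimized by ASG/RSO, the base learner is a 1-NN classifier, the balancing step is plain random under-sampling, and all names are hypothetical.

```python
import random
from collections import Counter

def subspace_resample_ensemble(X, y, n_members=5, subspace_frac=0.5, seed=0):
    """Train one 1-NN 'base learner' per random feature subspace, after
    balancing classes in that subspace by random under-sampling."""
    rng = random.Random(seed)
    n_features = len(X[0])
    k = max(1, int(subspace_frac * n_features))
    members = []
    for _ in range(n_members):
        feats = rng.sample(range(n_features), k)
        by_class = {}
        for xi, yi in zip(X, y):
            by_class.setdefault(yi, []).append([xi[f] for f in feats])
        m = min(len(pts) for pts in by_class.values())  # minority-class size
        train = [(p, c) for c, pts in by_class.items() for p in rng.sample(pts, m)]
        members.append((feats, train))
    return members

def predict(members, x):
    """Majority vote over the per-subspace 1-NN decisions."""
    votes = []
    for feats, train in members:
        xp = [x[f] for f in feats]
        _, label = min((sum((a - b) ** 2 for a, b in zip(xp, p)), c) for p, c in train)
        votes.append(label)
    return Counter(votes).most_common(1)[0][0]
```

Per-member under-sampling, rather than one global resampling pass, is what lets each base learner see a balanced view while the ensemble as a whole still uses most of the majority data.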
2.
Abstract
A huge amount of data is generated daily, leading to big data challenges. One of them relates to text mining, especially text classification. To perform this task we usually need a large set of labeled data, which can be expensive, time-consuming, or difficult to obtain. In this scenario, semi-supervised learning (SSL), the branch of machine learning concerned with using both labeled and unlabeled data, has expanded in volume and scope. Since no recent survey exists that overviews how SSL has been used in text classification, we aim to fill this gap and present an up-to-date review of SSL for text classification. We retrieved 1794 works from the last 5 years from IEEE Xplore, the ACM Digital Library, Science Direct, and Springer, from which 157 articles were selected for inclusion in this review. We present the application domains, datasets, and languages employed in the works, as well as the text representations and machine learning algorithms used. We also summarize and organize the works following a recent taxonomy of SSL, and we analyze the percentage of labeled data used, the evaluation metrics, and the obtained results. Lastly, we present some limitations and future trends in the area. We aim to provide researchers and practitioners with an outline of the area as well as useful information for their current research.
Affiliation(s)
- José Marcio Duarte
- Science and Technology Department, Federal University of São Paulo, Cesare Mansueto Giulio Lattes Ave, 1201, São José dos Campos, SP 12247-014 Brazil
- Lilian Berton
- Science and Technology Department, Federal University of São Paulo, Cesare Mansueto Giulio Lattes Ave, 1201, São José dos Campos, SP 12247-014 Brazil
3. Din SU, Kumar J, Shao J, Mawuli CB, Ndiaye WD. Learning High-Dimensional Evolving Data Streams With Limited Labels. IEEE Transactions on Cybernetics 2022; 52:11373-11384. [PMID: 34033560] [DOI: 10.1109/tcyb.2021.3070420]
Abstract
In the context of streaming data, learning algorithms often need to confront several unique challenges, such as concept drift, label scarcity, and high dimensionality. Several concept drift-aware data stream learning algorithms have been proposed to tackle these issues over the past decades. However, most existing algorithms adopt a supervised learning framework and require all true class labels to update their models. Unfortunately, in the streaming environment, requiring all labels is infeasible and unrealistic in many real-world applications, so learning data streams with minimal labels is a more practical scenario. Considering the problems of the curse of dimensionality and label scarcity, in this article, we present a new semisupervised learning technique for streaming data. To address the curse of dimensionality, we employ a denoising autoencoder to transform the high-dimensional feature space into a reduced, compact, and more informative feature representation. Furthermore, we use a cluster-and-label technique to reduce the dependency on true class labels. We employ a synchronization-based dynamic clustering technique to summarize the streaming data into a set of dynamic microclusters that are further used for classification. In addition, we employ a disagreement-based learning method to cope with concept drift. Extensive experiments performed on many real-world datasets demonstrate the superior performance of the proposed method compared to several state-of-the-art methods.
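The cluster-and-label step described above can be illustrated with a much-simplified microcluster structure. The sketch below assumes plain Euclidean distance and static clusters (no decay, synchronization, or drift handling, and no autoencoder front end); the class and function names are hypothetical.

```python
import math

class MicroCluster:
    """Tiny incremental summary of a region of the stream: keeps the linear
    sum of absorbed points plus per-label counts, so the centroid and the
    majority label are cheap to maintain as the stream flows."""
    def __init__(self, dim):
        self.n = 0
        self.ls = [0.0] * dim          # linear sum of absorbed points
        self.label_counts = {}

    def absorb(self, x, label=None):
        self.n += 1
        self.ls = [s + v for s, v in zip(self.ls, x)]
        if label is not None:          # labels are scarce: most points arrive without one
            self.label_counts[label] = self.label_counts.get(label, 0) + 1

    def centroid(self):
        return [s / self.n for s in self.ls]

    def majority_label(self):
        return max(self.label_counts, key=self.label_counts.get) if self.label_counts else None

def classify(clusters, x):
    """Cluster-and-label: an unlabeled point takes the majority label of its
    nearest microcluster."""
    best = min(clusters, key=lambda c: math.dist(c.centroid(), x))
    return best.majority_label()
```

The point of the summary structure is that only a handful of labeled points are needed per microcluster; every subsequent unlabeled point is classified from cluster membership alone.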
4. Construction of English and American Literature Corpus Based on Machine Learning Algorithm. Computational Intelligence and Neuroscience 2022; 2022:9773452. [PMID: 35694598] [PMCID: PMC9184167] [DOI: 10.1155/2022/9773452] [Received: 05/07/2022] [Accepted: 05/20/2022]
Abstract
In China, the application of corpora in language teaching, especially in English and American literature teaching, is still at a preliminary research stage and has various shortcomings that have not received due attention from front-line educators. Constructing an English and American literature corpus according to certain principles can effectively promote the teaching of English and American literature. This paper is devoted to the automatic construction of such a corpus. In the keyword extraction process, key phrases and keywords are effectively combined. The similarity between atomic events is calculated with the TextRank algorithm, and the top N sentences with the highest similarity are then selected and ranked. Based on a machine learning (ML) text classification method, a combined classifier using a support vector machine (SVM) and naive Bayes (NB) is proposed. The experimental results show that, in terms of accuracy and recall, the combined algorithm proposed in this paper performs best among the three methods, with best accuracy, recall, and F values of 0.87, 0.9, and 0.89, respectively. The results also show that this method can quickly, accurately, and persistently obtain high-quality bilingual mixed web pages.
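The classifier-combination step can be illustrated as a simple soft vote between two text classifiers. In the sketch below, a hand-rolled multinomial naive Bayes stands in for the paper's NB component, and a nearest-centroid bag-of-words scorer stands in for the SVM (to keep the example dependency-free); the paper does not specify its combination rule, so averaging normalized scores is an assumption, as are all function names.

```python
import math
from collections import Counter

def train_nb(docs, labels):
    """Multinomial naive Bayes with add-one smoothing over whitespace tokens."""
    vocab = {w for d in docs for w in d.split()}
    counts = {c: Counter() for c in set(labels)}
    priors = Counter(labels)
    for d, c in zip(docs, labels):
        counts[c].update(d.split())
    def log_probs(doc):
        out = {}
        for c in counts:
            total = sum(counts[c].values()) + len(vocab)
            lp = math.log(priors[c] / len(docs))
            for w in doc.split():
                lp += math.log((counts[c][w] + 1) / total)
            out[c] = lp
        return out
    return log_probs

def train_centroid(docs, labels):
    """Score classes by bag-of-words dot product with a per-class centroid."""
    cents = {c: Counter() for c in set(labels)}
    for d, c in zip(docs, labels):
        cents[c].update(d.split())
    def scores(doc):
        words = Counter(doc.split())
        return {c: sum(words[w] * cnt[w] for w in words) /
                   (math.sqrt(sum(v * v for v in cnt.values())) + 1e-9)
                for c, cnt in cents.items()}
    return scores

def soft_vote(scorers, doc):
    """Normalize each scorer's output to a distribution, then average."""
    avg = Counter()
    for s in scorers:
        raw = s(doc)
        mx = max(raw.values())
        exp = {c: math.exp(v - mx) for c, v in raw.items()}
        z = sum(exp.values())
        for c in exp:
            avg[c] += exp[c] / z / len(scorers)
    return max(avg, key=avg.get)
```

Averaging normalized scores rather than hard labels lets a confident member outvote an uncertain one, which is the usual motivation for combining heterogeneous classifiers this way.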
5. Yu Z, Ye F, Yang K, Cao W, Chen CLP, Cheng L, You J, Wong HS. Semisupervised Classification With Novel Graph Construction for High-Dimensional Data. IEEE Transactions on Neural Networks and Learning Systems 2022; 33:75-88. [PMID: 33048763] [DOI: 10.1109/tnnls.2020.3027526]
Abstract
Graph-based methods have achieved impressive performance on semisupervised classification (SSC). Traditional graph-based methods have two main drawbacks. First, the graph is predefined before training a classifier, which does not leverage the interactions between classifier training and similarity matrix learning. Second, when handling high-dimensional data with noisy or redundant features, the graph constructed in the original input space is unsuitable and may lead to poor performance. In this article, we propose an SSC method with novel graph construction (SSC-NGC), in which the similarity matrix is optimized in both the label space and an additional subspace to obtain a better and more robust result than in the original data space. Furthermore, to obtain a high-quality subspace, we learn the projection matrix of the additional subspace by preserving the local and global structure of the data. Finally, we integrate the classifier training, the graph construction, and the subspace learning into a unified framework. With this framework, the classifier parameters, similarity matrix, and projection matrix of the subspace are adaptively learned in an iterative scheme to obtain an optimal joint result. We conduct extensive comparative experiments against state-of-the-art methods over multiple real-world data sets. Experimental results demonstrate the superiority of the proposed method over other state-of-the-art algorithms.
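For contrast with the joint optimization proposed above, the "traditional" baseline the authors improve on (predefine a similarity graph, then spread labels over it) can be sketched as follows. This is a generic label-propagation illustration under assumed choices (kNN graph, Gaussian weights, clamped labels), not SSC-NGC itself.

```python
import math

def knn_graph(X, k=2):
    """Symmetric k-nearest-neighbour affinity graph with Gaussian weights."""
    n = len(X)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        order = sorted(range(n), key=lambda j: math.dist(X[i], X[j]))
        for j in order[1:k + 1]:                 # skip self at position 0
            w = math.exp(-math.dist(X[i], X[j]) ** 2)
            W[i][j] = max(W[i][j], w)
            W[j][i] = max(W[j][i], w)
    return W

def propagate(W, labels, n_classes, iters=50):
    """Iteratively average neighbour label distributions; labeled nodes stay clamped."""
    n = len(W)
    F = [[0.0] * n_classes for _ in range(n)]
    for i, y in enumerate(labels):
        if y is not None:
            F[i][y] = 1.0
    for _ in range(iters):
        G = [row[:] for row in F]
        for i in range(n):
            if labels[i] is not None:
                continue                         # clamp the labeled nodes
            deg = sum(W[i])
            if deg > 0:
                G[i] = [sum(W[i][j] * F[j][c] for j in range(n)) / deg
                        for c in range(n_classes)]
        F = G
    return [max(range(n_classes), key=lambda c: F[i][c]) for i in range(n)]
```

The drawback the abstract points out is visible here: `knn_graph` is built once in the raw input space, so noisy or redundant features corrupt every subsequent propagation step.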
6. Weak-label-based global and local multi-view multi-label learning with three-way clustering. International Journal of Machine Learning and Cybernetics 2021. [DOI: 10.1007/s13042-021-01450-1]
7. Kumar B, Gupta D. Universum based Lagrangian twin bounded support vector machine to classify EEG signals. Computer Methods and Programs in Biomedicine 2021; 208:106244. [PMID: 34216880] [DOI: 10.1016/j.cmpb.2021.106244] [Received: 09/23/2020] [Accepted: 06/15/2021]
Abstract
BACKGROUND AND OBJECTIVE Brain-related problems and neurological disorders such as epilepsy and sleep disorders are detected using electroencephalogram (EEG) signals, which contain noise and outliers. Universum data consist of samples that do not belong to any of the concerned classes and serve as prior knowledge about the data distribution; such prior information has been used effectively to improve classification performance. Recently, a universum support vector machine (USVM) was proposed for EEG signal classification, and a universum twin support vector machine (UTWSVM) based on USVM was subsequently proposed to improve its performance. Inspired by USVM and UTWSVM, this paper suggests a novel method called the universum based Lagrangian twin bounded support vector machine (ULTBSVM), in which universum data incorporate prior information about the data distribution to classify healthy and seizure EEG signals. METHODS In the proposed ULTBSVM, the square of the 2-norm of the slack variables is used to make the objective function strongly convex, so it always yields a unique solution. Unlike the twin support vector machine (TWSVM) and UTWSVM, the proposed ULTBSVM has regularization terms that follow the structural risk minimization (SRM) principle, enhance the stability of the dual formulations, make the model well-posed, and prevent overfitting. Here, interictal EEG data are treated as universum data to classify healthy and seizure signals. Several feature extraction techniques have been implemented to obtain important noiseless features. RESULTS Several EEG datasets, as well as publicly available UCI datasets, are used to assess the performance of the proposed method. An analytical comparison of the proposed method with USVM and UTWSVM was performed for detecting seizure and healthy signals; for real-world data, ULTBSVM was compared with the universum-based models as well as TWSVM, and the proposed method gives better results in most cases. CONCLUSION The results clearly show that ULTBSVM is a promising method for the classification of EEG signals as well as real-world datasets when interictal data are used as universum data. Here universum points are used for the binary classification problem, but the approach can be extended to multi-class classification problems as well.
Affiliation(s)
- Bikram Kumar
- Department of Computer Science and Engineering, National Institute of Technology, Arunachal Pradesh 791112, India
- Deepak Gupta
- Department of Computer Science and Engineering, National Institute of Technology, Arunachal Pradesh 791112, India
8. Robust and sparse label propagation for graph-based semi-supervised classification. Applied Intelligence 2021. [DOI: 10.1007/s10489-021-02360-z]
10. An automatic sampling ratio detection method based on genetic algorithm for imbalanced data classification. Knowledge-Based Systems 2021. [DOI: 10.1016/j.knosys.2021.106800]
11.
Abstract
Class overlap and class imbalance are two data complexities that challenge the design of effective classifiers in pattern recognition and data mining, as they may cause a significant loss in performance. Several solutions have been proposed to face both difficulties, but most of these approaches tackle each problem separately. In this paper, we propose a two-stage under-sampling technique that combines the DBSCAN clustering algorithm, used to remove noisy samples and clean the decision boundary, with a minimum spanning tree algorithm, used to address the class imbalance, thus handling class overlap and imbalance simultaneously with the aim of improving classifier performance. An extensive experimental study shows a significantly better behavior of the new algorithm compared to 12 state-of-the-art under-sampling methods using three standard classification models (the nearest neighbor rule, the J48 decision tree, and a support vector machine with a linear kernel) on both real-life and synthetic databases.
12. Deng S, Xie X, Yuan C, Yang L, Wu X. Numerical sensitive data recognition based on hybrid gene expression programming for active distribution networks. Applied Soft Computing 2020. [DOI: 10.1016/j.asoc.2020.106213]
13. Distributed semi-supervised learning algorithms for random vector functional-link networks with distributed data splitting across samples and features. Knowledge-Based Systems 2020. [DOI: 10.1016/j.knosys.2020.105577]
14.
Abstract
The application of word associations has become increasingly widespread. However, the association norms produced by traditional free association tests tend not to exceed 10,000 stimulus words, making the number of associated words too small to be representative of the overall language. In this study we used text corpora totaling over 400 million Chinese words, along with a multitude of association measures, to automatically construct a Chinese Lexical Association Database (CLAD) comprising the lexical association of over 80,000 words. Comparison of the CLAD with a database of traditional Chinese word association norms shows that word associations extracted from large text corpora are similar in strength to those elicited from free association tests but contain a much greater number of associative word pairs. Additionally, the relatively small numbers of participants involved in the creation of traditional norms result in relatively coarse scales of association measurement, whereas the differentiation of association strengths is greatly enhanced in the CLAD. The CLAD provides researchers with a great supplement to traditional word association norms. A query website at www.chinesereadability.net/LexicalAssociation/CLAD/ affords access to the database.
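As a toy illustration of the corpus-based approach described above, one standard association measure such a database can be built from is pointwise mutual information (PMI) over sentence-level co-occurrence: PMI(x, y) = log2(p(x, y) / (p(x) p(y))). The sketch below is an assumption-laden miniature (whitespace tokenization, whole sentences as co-occurrence windows), not the CLAD's actual measure set.

```python
import math
from collections import Counter
from itertools import combinations

def pmi_scores(sentences, min_count=1):
    """PMI for word pairs that co-occur in at least one sentence.
    Probabilities are estimated as sentence-level document frequencies."""
    word_counts = Counter()
    pair_counts = Counter()
    for s in sentences:
        words = set(s.split())                       # presence, not frequency
        word_counts.update(words)
        pair_counts.update(frozenset(p) for p in combinations(sorted(words), 2))
    n = len(sentences)
    scores = {}
    for pair, c in pair_counts.items():
        if c < min_count:
            continue
        x, y = tuple(pair)
        p_xy = c / n
        scores[pair] = math.log2(p_xy / ((word_counts[x] / n) * (word_counts[y] / n)))
    return scores
```

Because the counts come straight from the corpus, the measure scales to any vocabulary size, which is exactly the advantage over elicited norms that the abstract emphasizes; in practice a `min_count` threshold is needed to damp PMI's known bias toward rare pairs.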
16. Semi-Supervised Convolutional Neural Network for Law Advice Online. Applied Sciences 2019. [DOI: 10.3390/app9173617]
Abstract
With the rapid development of Internet technology, a mass of law cases constantly arises and needs to be dealt with in time. Automatic classification of law text is the most basic and critical process in an online law advice platform. Deep neural network-based natural language processing (DNN-NLP) is one of the most promising approaches to text classification, and convolutional neural network-based (CNN-based) text classification has already achieved impressive results. However, previous work relied on large amounts of manually annotated data, which increases labor cost and reduces the adaptability of the approach. Hence, we present a new semi-supervised model to address the data annotation problem. Our method learns the embedding of small text regions from unlabeled data and then integrates the learned embedding into supervised training; specifically, the regions learned with the two-view-embedding model are used as an additional input to the CNN's convolution layer. In addition, to support multi-task learning, we propose a multi-label classification algorithm that assigns multiple labels to an instance. The proposed method is evaluated on a law case description dataset and the English standard dataset RCV1. On the Chinese data, the results demonstrate that, compared with existing methods such as linear SVM, our scheme improves precision, recall, F-1, and Hamming loss by 7.76%, 7.86%, 9.19%, and 2.96%, respectively. Likewise, compared to CNN, our scheme improves precision, recall, F-1, and Hamming loss by 4.46%, 5.76%, 5.14%, and 0.87%, respectively. The robustness of this method makes it suitable and effective for automatic classification of law text. Furthermore, the proposed design concept is promising and can be applied to other real-world tasks such as news classification and public opinion monitoring.
17. Yu Z, Zhang Y, Chen CLP, You J, Wong HS, Dai D, Wu S, Zhang J. Multiobjective Semisupervised Classifier Ensemble. IEEE Transactions on Cybernetics 2019; 49:2280-2293. [PMID: 29993923] [DOI: 10.1109/tcyb.2018.2824299]
Abstract
Classification of high-dimensional data with very limited labels is a challenging task in the field of data mining and machine learning. In this paper, we propose the multiobjective semisupervised classifier ensemble (MOSSCE) approach to address this challenge. Specifically, a multiobjective subspace selection process (MOSSP) in MOSSCE is first designed to generate the optimal combination of feature subspaces. Three objective functions are then proposed for MOSSP, which include the relevance of features, the redundancy between features, and the data reconstruction error. Then, MOSSCE generates an auxiliary training set based on the sample confidence to improve the performance of the classifier ensemble. Finally, the training set, combined with the auxiliary training set, is used to select the optimal combination of basic classifiers in the ensemble, train the classifier ensemble, and generate the final result. In addition, diversity analysis of the ensemble learning process is applied, and a set of nonparametric statistical tests is adopted for the comparison of semisupervised classification approaches on multiple datasets. The experiments on 12 gene expression datasets and two large image datasets show that MOSSCE has a better performance than other state-of-the-art semisupervised classifiers on high-dimensional data.
19. Yu Z, Zhang Y, You J, Chen CLP, Wong HS, Han G, Zhang J. Adaptive Semi-Supervised Classifier Ensemble for High Dimensional Data Classification. IEEE Transactions on Cybernetics 2019; 49:366-379. [PMID: 29989979] [DOI: 10.1109/tcyb.2017.2761908]
Abstract
High dimensional data classification with very limited labeled training data is a challenging task in the area of data mining. To tackle this task, we first propose a feature selection-based semi-supervised classifier ensemble framework (FSCE) to perform high dimensional data classification. We then design an adaptive semi-supervised classifier ensemble framework (ASCE) to improve the performance of FSCE. Compared with FSCE, ASCE is characterized by an adaptive feature selection process, an adaptive weighting process (AWP), and an auxiliary training set generation process (ATSGP). The adaptive feature selection process generates a set of compact subspaces based on the attributes selected by the feature selection algorithms, while the AWP associates each basic semi-supervised classifier in the ensemble with a weight value. The ATSGP enlarges the training set with unlabeled samples. In addition, a set of nonparametric tests is adopted to compare multiple semi-supervised classifier ensemble (SSCE) approaches over different datasets. The experiments on 20 high dimensional real-world datasets show that: 1) the two adaptive processes in ASCE are useful for improving the performance of the SSCE approach and 2) ASCE works well on high dimensional datasets with very limited labeled training data, and outperforms most state-of-the-art SSCE approaches.
22. Ranjan NM, Prasad RS. LFNN: Lion fuzzy neural network-based evolutionary model for text classification using context and sense based features. Applied Soft Computing 2018. [DOI: 10.1016/j.asoc.2018.07.016]
25. Structure regularized self-paced learning for robust semi-supervised pattern classification. Neural Computing and Applications 2018. [DOI: 10.1007/s00521-018-3478-1]
26. Yu Z, Lu Y, Zhang J, You J, Wong HS, Wang Y, Han G. Progressive Semisupervised Learning of Multiple Classifiers. IEEE Transactions on Cybernetics 2018; 48:689-702. [PMID: 28113355] [DOI: 10.1109/tcyb.2017.2651114]
Abstract
Semisupervised learning methods are often adopted to handle datasets with a very small number of labeled samples. However, conventional semisupervised ensemble learning approaches have two limitations: 1) most of them cannot obtain satisfactory results on high dimensional datasets with limited labels and 2) they usually do not consider how to use an optimization process to enlarge the training set. In this paper, we propose the progressive semisupervised ensemble learning approach (PSEMISEL) to address the above limitations and handle datasets with a very small number of labeled samples. When compared with traditional semisupervised ensemble learning approaches, PSEMISEL is characterized by two properties: 1) it adopts the random subspace technique to investigate the structure of the dataset in the subspaces and 2) a progressive training set generation process and a self-evolutionary sample selection process are proposed to enlarge the training set. We also use a set of nonparametric tests to compare different semisupervised ensemble learning methods over multiple datasets. The experimental results on 18 real-world datasets from the University of California, Irvine machine learning repository show that PSEMISEL works well on most of the real-world datasets, and outperforms other state-of-the-art approaches on 10 out of 18 datasets.
27. Kang Q, Chen X, Li S, Zhou M. A Noise-Filtered Under-Sampling Scheme for Imbalanced Classification. IEEE Transactions on Cybernetics 2017; 47:4263-4274. [PMID: 28113413] [DOI: 10.1109/tcyb.2016.2606104]
Abstract
Under-sampling is a popular data preprocessing method for dealing with class imbalance problems, with the purposes of balancing datasets to achieve a high classification rate and avoiding the bias toward majority class examples. It typically uses the full minority data in a training dataset. However, some noisy minority examples may reduce the performance of classifiers. In this paper, a new under-sampling scheme is proposed that incorporates a noise filter before executing resampling. To verify its efficiency, this scheme is implemented on top of four popular under-sampling methods, i.e., Undersampling + Adaboost, RUSBoost, UnderBagging, and EasyEnsemble, through benchmarks and significance analysis. Furthermore, this paper also summarizes the relationship between algorithm performance and imbalance ratio. Experimental results indicate that the proposed scheme can significantly improve the original under-sampling-based methods in terms of three popular metrics for imbalanced classification, i.e., the area under the curve (AUC), F-measure, and G-mean.
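The two-step idea (filter minority noise first, resample second) can be sketched in a simplified form. Below, a k-nearest-neighbour disagreement filter stands in for the paper's noise filter, and plain random under-sampling stands in for the four boosted/bagged variants; all names are illustrative.

```python
import math
import random

def knn_labels(X, y, i, k):
    """Labels of the k nearest neighbours of point i (excluding i itself)."""
    order = sorted((j for j in range(len(X)) if j != i),
                   key=lambda j: math.dist(X[i], X[j]))
    return [y[j] for j in order[:k]]

def noise_filtered_undersample(X, y, minority, k=3, seed=0):
    """Step 1: drop minority points whose k nearest neighbours all disagree
    with them (a kNN-style noise filter). Step 2: randomly under-sample the
    majority class down to the cleaned minority size."""
    keep_min = [i for i in range(len(X))
                if y[i] == minority and minority in knn_labels(X, y, i, k)]
    maj = [i for i in range(len(X)) if y[i] != minority]
    rng = random.Random(seed)
    keep_maj = rng.sample(maj, min(len(maj), len(keep_min)))
    idx = sorted(keep_min + keep_maj)
    return [X[i] for i in idx], [y[i] for i in idx]
```

Filtering before balancing matters because a noisy minority point that survives into the balanced training set carries far more weight there than it did in the original skewed data.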
28. Deng S, Yue D, Zhou A, Fu X, Yang L, Xue Y. Distributed content filtering algorithm based on data label and policy expression in active distribution networks. Neurocomputing 2017. [DOI: 10.1016/j.neucom.2017.03.087]
29. Dai J, Hu Q, Zhang J, Hu H, Zheng N. Attribute Selection for Partially Labeled Categorical Data By Rough Set Approach. IEEE Transactions on Cybernetics 2017; 47:2460-2471. [PMID: 28029637] [DOI: 10.1109/tcyb.2016.2636339]
Abstract
Attribute selection is often considered the most characteristic result of rough set theory, distinguishing it from other theories. However, existing attribute selection approaches cannot handle partially labeled data, and so far few studies on attribute selection for partially labeled data have been conducted. In this paper, the concept of the discernibility pair, based on rough set theory, is introduced to construct a uniform measure for attributes in both the supervised and the unsupervised framework. Based on discernibility pairs, two semisupervised attribute selection algorithms grounded in rough set theory are developed to handle partially labeled categorical data. Experiments demonstrate the effectiveness of the proposed attribute selection algorithms.
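To make the discernibility-pair idea concrete, the sketch below counts, for each attribute, the object pairs it can tell apart, and greedily selects attributes until no new pairs are covered. This is a hypothetical, fully unsupervised simplification (labels are ignored and no rough-set reduct machinery is used), not either of the paper's two algorithms.

```python
def discernibility_pairs(objects, a):
    """Object pairs (i, j) that attribute a can tell apart."""
    return {(i, j)
            for i in range(len(objects))
            for j in range(i + 1, len(objects))
            if objects[i][a] != objects[j][a]}

def greedy_attribute_selection(objects, n_attributes):
    """Greedily pick the attribute that discerns the most not-yet-covered pairs,
    stopping when no attribute adds a new pair."""
    all_pairs = {(i, j) for i in range(len(objects)) for j in range(i + 1, len(objects))}
    covered, selected = set(), []
    remaining = set(range(n_attributes))
    while covered != all_pairs and remaining:
        best = max(remaining, key=lambda a: len(discernibility_pairs(objects, a) - covered))
        gain = discernibility_pairs(objects, best) - covered
        if not gain:
            break  # no remaining attribute discerns any further pair
        selected.append(best)
        covered |= gain
        remaining.discard(best)
    return selected
```

The measure needs no class labels at all, which is why a discernibility-style criterion extends naturally from the supervised setting to the partially labeled one the paper targets.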