1
Zhou S, Xiang Y, Huang H, Huang P, Peng C, Yang X, Song P. Unsupervised feature selection with evolutionary sparsity. Neural Netw 2025; 189:107512. [PMID: 40349430 DOI: 10.1016/j.neunet.2025.107512] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/10/2024] [Revised: 03/21/2025] [Accepted: 04/21/2025] [Indexed: 05/14/2025]
Abstract
The ℓ2,0-norm is playing an increasingly important role in unsupervised feature selection. However, existing algorithms for optimization problems with an ℓ2,0-norm constraint have two problems. First, they cannot automatically determine the sparsity, that is, the number of key features. Second, they risk converging to local optima and therefore selecting trivial (less informative) features. To address these problems, this paper proposes an unsupervised feature selection method with evolutionary sparsity (EVSP), which integrates the feature selection process with a sparse projection matrix and population search mechanisms into a unified unsupervised feature selection framework. Specifically, the level of sparsity is encoded as population individuals, and a multi-objective evolutionary algorithm based on binary encoding is then introduced to recursively determine the optimal level of sparsity, thus guiding, without supervision, the learning of an optimal row-sparse projection matrix. Moreover, by utilizing the feature weights learned through sparse projection, a two-stage strategy called the mutation-repair operator is designed to steer the evolution of the population, aiming to generate high-quality candidate solutions. Comprehensive experiments on eleven benchmark datasets, with a maximum dimensionality of 10304 features and a maximum size of 9298 samples, demonstrate that the proposed EVSP method can effectively determine the optimal sparsity level, significantly outperforming several state-of-the-art methods.
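To make the notion of row sparsity in this abstract concrete, the following is a minimal sketch, not the EVSP algorithm itself. It assumes a projection matrix stored as a list of rows (one row per feature) and shows the standard definition of the ℓ2,0-norm as the number of nonzero rows, plus a simple top-k row truncation. The names `row_norms`, `l20_norm`, `keep_top_k_rows`, and the example matrix `W` are hypothetical and do not come from the paper.

```python
import math


def row_norms(W):
    """Return the Euclidean (l2) norm of every row of W."""
    return [math.sqrt(sum(v * v for v in row)) for row in W]


def l20_norm(W, tol=1e-12):
    """l2,0-norm: the number of rows whose l2 norm is nonzero (above tol)."""
    return sum(1 for n in row_norms(W) if n > tol)


def keep_top_k_rows(W, k):
    """Zero out all rows except the k rows with the largest l2 norm.

    Returns the truncated matrix and the sorted indices of the kept rows,
    which play the role of the selected features.
    """
    norms = row_norms(W)
    ranked = sorted(range(len(W)), key=lambda i: norms[i], reverse=True)
    keep = set(ranked[:k])
    truncated = [
        list(row) if i in keep else [0.0] * len(row)
        for i, row in enumerate(W)
    ]
    return truncated, sorted(keep)


# Hypothetical 4-feature, 2-dimensional projection matrix.
W = [
    [0.9, 0.1],
    [0.0, 0.0],
    [0.2, -0.3],
    [1.5, 0.4],
]

W_sparse, selected_features = keep_top_k_rows(W, k=2)
```

Here `k` is the sparsity level that, according to the abstract, existing ℓ2,0-constrained methods cannot determine automatically; in this sketch it is simply passed in by hand.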
Affiliation(s)
- Shixuan Zhou: School of Software Engineering, South China University of Technology, Guangzhou 510006, China.
- Yi Xiang: School of Software Engineering, South China University of Technology, Guangzhou 510006, China; Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, China.
- Han Huang: School of Software Engineering, South China University of Technology, Guangzhou 510006, China; Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, China; Key Laboratory of Big Data and Intelligent Robot (SCUT), Ministry of Education, Guangzhou, 510006, China; Guangdong Engineering Center for Large Model and GenAI Technology, Guangzhou 510006, China.
- Pei Huang: School of Information Science, Guangdong University of Finance and Economics, Guangzhou, 510006, China.
- Chaoda Peng: School of Mathematics and Informatics, South China Agricultural University, Guangzhou, 510006, China.
- Xiaowei Yang: School of Software Engineering, South China University of Technology, Guangzhou 510006, China.
- Peng Song: School of Computer and Control Engineering, Yantai University, Yantai 264005, China.
2
Shang R, Zhong J, Zhang W, Xu S, Li Y. Multilabel Feature Selection via Shared Latent Sublabel Structure and Simultaneous Orthogonal Basis Clustering. IEEE Transactions on Neural Networks and Learning Systems 2025; 36:5288-5303. [PMID: 38656846 DOI: 10.1109/tnnls.2024.3382911] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/26/2024]
Abstract
Multilabel feature selection addresses the curse of dimensionality in high-dimensional multilabel data by selecting the optimal subset of features. Noisy and incomplete labels in raw multilabel data hinder the acquisition of label-guided information. In existing approaches, mapping the label space to a low-dimensional latent space by semantic decomposition to mitigate label noise is considered an effective strategy. However, the decomposed latent label space contains redundant label information, which misleads the capture of potential label relevance. To eliminate the effect of redundant information on the extraction of latent label correlations, a novel multilabel feature selection method named SLOFS, based on shared latent sublabel structure and simultaneous orthogonal basis clustering, is proposed. First, a latent orthogonal base structure shared (LOBSS) term is engineered to guide the construction of a redundancy-free latent sublabel space via the separated latent clustering center structure. The LOBSS term simultaneously retains latent sublabel information and latent clustering center structure. Moreover, the structure and relevance information of nonredundant latent sublabels are fully explored. The introduction of graph regularization ensures structural consistency between the data space and the latent sublabels, thus helping the feature selection process. SLOFS employs a dynamic sublabel graph to obtain a high-quality sublabel space and uses regularization to constrain label correlations on dynamic sublabel projections. Finally, an effective, provably convergent optimization scheme is proposed to solve the SLOFS model. Experimental studies on 18 datasets demonstrate that the presented method performs consistently better than previous feature selection methods.
3
Qian W, Tu Y, Huang J, Shu W, Cheung YM. Partial Multilabel Learning Using Noise-Tolerant Broad Learning System With Label Enhancement and Dimensionality Reduction. IEEE Transactions on Neural Networks and Learning Systems 2025; 36:3758-3772. [PMID: 38289837 DOI: 10.1109/tnnls.2024.3352285] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/01/2024]
Abstract
Partial multilabel learning (PML) addresses the issue of noisy supervision, which contains an overcomplete set of candidate labels for each instance with only a valid subset of training data. Using label enhancement techniques, researchers have computed the probability of a label being ground truth. However, enhancing labels in the noisy label space makes it impossible for the existing partial multilabel label enhancement methods to achieve satisfactory results. Besides, few methods simultaneously address the ambiguity problem, the redundancy of the feature space, and the efficiency of the model in PML. To address these issues, this article presents a novel joint partial multilabel framework using broad learning systems (namely BLS-PML) with three innovative mechanisms: 1) a trustworthy label space is reconstructed through a novel label enhancement method to avoid the bias caused by noisy labels; 2) a low-dimensional feature space is obtained by a confidence-based dimensionality reduction method to reduce the effect of redundancy in the feature space; and 3) a noise-tolerant BLS is proposed by adding a dimensionality reduction layer and a trustworthy label layer to deal with the PML problem. We evaluated it on six real-world and seven synthetic datasets, using eight state-of-the-art partial multilabel algorithms as baselines and six evaluation metrics. Out of 144 experimental scenarios, our method significantly outperforms the baselines by about 80%, demonstrating its robustness and effectiveness in handling partial multilabel tasks.
4
Xu T, Xu Y, Yang S, Li B, Zhang W. Learning Accurate Label-Specific Features From Partially Multilabeled Data. IEEE Transactions on Neural Networks and Learning Systems 2024; 35:10436-10450. [PMID: 37022887 DOI: 10.1109/tnnls.2023.3241921] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/19/2023]
Abstract
Feature selection is an effective dimensionality reduction technique, which can speed up an algorithm and improve model performance, such as predictive accuracy and result comprehensibility. The study of selecting label-specific features for each class label has attracted considerable attention, since each class label might be determined by some inherent characteristics, and precise label information is required to guide label-specific feature selection. However, obtaining noise-free labels is quite difficult and impractical. In reality, each instance is often annotated with a candidate label set that comprises multiple ground-truth labels and other false-positive labels, termed the partial multilabel (PML) learning scenario. Here, false-positive labels concealed in a candidate label set might induce the selection of false label-specific features while masking the intrinsic label correlations, which misleads the selection of relevant features and compromises the selection performance. To address this issue, a novel two-stage partial multilabel feature selection (PMLFS) approach is proposed, which elicits credible labels to guide accurate label-specific feature selection. First, a label confidence matrix, each element of which indicates how likely a class label is ground truth, is learned to help elicit ground-truth labels from the candidate label set via the label structure reconstruction strategy. After that, based on the distilled credible labels, a joint selection model, including a label-specific feature learner and a common feature learner, is designed to learn accurate label-specific features for each class label and common features for all class labels. Besides, label correlations are fused into the feature selection process to facilitate the generation of an optimal feature subset. Extensive experimental results clearly validate the superiority of the proposed approach.
5
Yang X, Che H, Leung MF, Wen S. Self-paced regularized adaptive multi-view unsupervised feature selection. Neural Netw 2024; 175:106295. [PMID: 38614023 DOI: 10.1016/j.neunet.2024.106295] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/18/2023] [Revised: 03/14/2024] [Accepted: 04/05/2024] [Indexed: 04/15/2024]
Abstract
Multi-view unsupervised feature selection (MUFS) is an efficient approach for dimensionality reduction of heterogeneous data. However, existing MUFS approaches mostly assign the same weight to all samples, so the diversity of samples is not utilized efficiently. Additionally, due to the presence of various regularizations, the resulting MUFS problems are often non-convex, making it difficult to find the optimal solutions. To address these issues, a novel MUFS method named Self-paced Regularized Adaptive Multi-view Unsupervised Feature Selection (SPAMUFS) is proposed. Specifically, the proposed approach first trains the MUFS model with simple samples, and gradually learns complex samples by using a self-paced regularizer. l2,p-norm (0
Affiliation(s)
- Xuanhao Yang: College of Electronic and Information Engineering, Southwest University, Chongqing, 400715, China.
- Hangjun Che: College of Electronic and Information Engineering, Southwest University, Chongqing, 400715, China; Chongqing Key Laboratory of Nonlinear Circuits and Intelligent Information Processing, Chongqing, 400715, China.
- Man-Fai Leung: School of Computing and Information Science, Faculty of Science and Engineering, Anglia Ruskin University, Cambridge, UK.
- Shiping Wen: Faculty of Engineering and Information Technology, Australian Artificial Intelligence Institute, University of Technology Sydney, Sydney, NSW 2007, Australia.
6
Chen Z, Liu Y, Zhang Y, Zhu J, Li Q, Wu X. Shared Manifold Regularized Joint Feature Selection for Joint Classification and Regression in Alzheimer's Disease Diagnosis. IEEE Transactions on Image Processing: A Publication of the IEEE Signal Processing Society 2024; 33:2730-2745. [PMID: 38578858 DOI: 10.1109/tip.2024.3382600] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/07/2024]
Abstract
In Alzheimer's disease (AD) diagnosis, joint feature selection for predicting disease labels (classification) and estimating cognitive scores (regression) with neuroimaging data has received increasing attention. In this paper, we propose a model named Shared Manifold regularized Joint Feature Selection (SMJFS) that performs classification and regression in a unified framework for AD diagnosis. For classification, unlike existing works that build least squares regression models, which are limited in their ability to extract discriminative information for classification, we design an objective function that integrates linear discriminant analysis and subspace sparsity regularization to acquire an informative feature subset. Furthermore, the local data relationships are learned according to the samples' transformed distances to exploit the local data structure adaptively. For regression, in contrast to previous works that overlook the correlations among cognitive scores, we learn a latent score space to capture the correlations and employ the latent space to design a regression model with l2,1-norm regularization, facilitating feature selection in the regression task. Moreover, the missing cognitive scores can be recovered in the latent space to increase the number of available training samples. Meanwhile, to capture the correlations between the two tasks and describe the local relationships between samples, we construct an adaptive shared graph to guide the subspace learning in classification and the latent cognitive score learning in regression simultaneously. An efficient iterative optimization algorithm is proposed to solve the optimization problem. Extensive experiments on three datasets validate the discriminability of the features selected by SMJFS.
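For readers unfamiliar with the l2,1-norm mentioned in this abstract, its standard definition for a weight matrix with one row per feature is given below, together with a generic l2,1-regularized least squares problem. The symbols W, X, Y, d, c, and λ are notation assumed here for illustration; the second expression is a textbook form, not the SMJFS objective, which the abstract does not state.

```latex
% l2,1-norm of W \in \mathbb{R}^{d \times c}: the sum of the l2 norms of its rows
\|W\|_{2,1} = \sum_{i=1}^{d} \sqrt{\sum_{j=1}^{c} w_{ij}^{2}}

% Generic (assumed) regularized regression; rows of W driven to zero mark discarded features
\min_{W} \; \|XW - Y\|_{F}^{2} + \lambda \, \|W\|_{2,1}, \qquad \lambda > 0
```

Because the penalty acts on whole rows, it tends to zero out entire rows of W, which is why it is commonly used for feature selection.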
7
An in-depth and contrasting survey of meta-heuristic approaches with classical feature selection techniques specific to cervical cancer. Knowl Inf Syst 2023. [DOI: 10.1007/s10115-022-01825-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/22/2023]
8
Robust multi-label feature selection with shared label enhancement. Knowl Inf Syst 2022. [DOI: 10.1007/s10115-022-01747-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/14/2022]
9
Discriminatory Label-specific Weights for Multi-label Learning with Missing Labels. Neural Process Lett 2022. [DOI: 10.1007/s11063-022-10945-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/25/2022]
10
Taguchi YH, Turki T. Novel feature selection method via kernel tensor decomposition for improved multi-omics data analysis. BMC Med Genomics 2022; 15:37. [PMID: 35209912 PMCID: PMC8876179 DOI: 10.1186/s12920-022-01181-4] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 10/07/2021] [Accepted: 02/11/2022] [Indexed: 11/27/2022] [Open access]
Abstract
BACKGROUND: Feature selection in multi-omics data analysis remains challenging owing to the size of omics datasets, comprising approximately [Formula: see text]-[Formula: see text] features. In particular, appropriate methods to weight individual omics datasets are unclear, and the approach adopted has substantial consequences for feature selection. In this study, we extended a recently proposed kernel tensor decomposition (KTD)-based unsupervised feature extraction (FE) method to integrate multi-omics datasets obtained from common samples in a weight-free manner. METHOD: KTD-based unsupervised FE was reformulated as a collection of kernelized tensors sharing common samples, which was applied to synthetic and real datasets. RESULTS: The proposed advanced KTD-based unsupervised FE method showed performance comparable to that of the previously proposed KTD method, as well as tensor decomposition-based unsupervised FE, but required less memory and central processing unit time. Moreover, this advanced KTD method, specifically designed for multi-omics analysis, attributes P values to features, which is rare for existing multi-omics-oriented methods. CONCLUSIONS: The sample R code is available at https://github.com/tagtag/MultiR/ .
Affiliation(s)
- Y-h. Taguchi: Department of Physics, Chuo University, 1-13-27 Kasuga, Bunkyo-ku, Tokyo, 112-8551 Japan
- Turki Turki: Department of Computer Science, King Abdulaziz University, Jeddah, 21589 Saudi Arabia
11
LPI-HyADBS: a hybrid framework for lncRNA-protein interaction prediction integrating feature selection and classification. BMC Bioinformatics 2021; 22:568. [PMID: 34836494 PMCID: PMC8620196 DOI: 10.1186/s12859-021-04485-x] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 08/24/2021] [Accepted: 11/09/2021] [Indexed: 12/03/2022] [Open access]
Abstract
Background: Long noncoding RNAs (lncRNAs) have dense linkages with a plethora of important cellular activities. lncRNAs exert functions by linking with corresponding RNA-binding proteins. Since experimental techniques to detect lncRNA-protein interactions (LPIs) are laborious and time-consuming, a few computational methods have been reported for LPI prediction. However, computation-based LPI identification methods have the following limitations: (1) Most methods were evaluated on a single dataset, and researchers may thus fail to measure their generalization ability. (2) The majority of methods were validated under cross validation on lncRNA-protein pairs and did not investigate performance under other cross validations, especially cross validation on independent lncRNAs and independent proteins. (3) lncRNAs and proteins carry abundant biological information, and how to select informative features needs further investigation. Results: Under a hybrid framework (LPI-HyADBS) integrating feature selection based on AdaBoost and classification models including a deep neural network (DNN), extreme gradient boosting (XGBoost), and an SVM with a penalty coefficient of misclassification (C-SVM), this work focuses on finding new LPIs. First, five datasets are arranged. Each dataset contains lncRNA sequences, protein sequences, and an LPI network. Second, biological features of lncRNAs and proteins are acquired based on Pyfeat. Third, the obtained features of lncRNAs and proteins are selected based on AdaBoost and concatenated to depict each LPI sample. Fourth, DNN, XGBoost, and C-SVM are used to classify lncRNA-protein pairs based on the concatenated features. Finally, a hybrid framework is developed to integrate the classification results from the above three classifiers.
LPI-HyADBS is compared to six classical LPI prediction approaches (LPI-SKF, LPI-NRLMF, Capsule-LPI, LPI-CNNCP, LPLNP, and LPBNI) on five datasets under 5-fold cross validations on lncRNAs, proteins, lncRNA-protein pairs, and independent lncRNAs and independent proteins. The results show that LPI-HyADBS has the best LPI prediction performance under the four different cross validations. In particular, LPI-HyADBS obtains better classification ability than the other six approaches on the constructed independent dataset. Case analyses suggest that there is an association between ZNF667-AS1 and Q15717. Conclusions: By integrating a feature selection approach based on AdaBoost with three classification techniques (DNN, XGBoost, and C-SVM), this work develops a hybrid framework to identify new linkages between lncRNAs and proteins. Supplementary Information: The online version contains supplementary material available at 10.1186/s12859-021-04485-x.
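The final step of the pipeline in this abstract, integrating the outputs of three classifiers, can be illustrated with a small sketch. The abstract does not say how LPI-HyADBS combines the results, so the example below assumes plain soft voting (averaging each classifier's positive-class probability and thresholding). The function `soft_vote`, the threshold, and the probability values are hypothetical and are not taken from the paper.

```python
def soft_vote(prob_lists, threshold=0.5):
    """Average positive-class probabilities across classifiers.

    prob_lists: one list of per-sample probabilities per classifier.
    Returns the averaged probabilities and the thresholded 0/1 labels.
    Soft voting is an assumption here, not the paper's stated rule.
    """
    n_samples = len(prob_lists[0])
    if any(len(p) != n_samples for p in prob_lists):
        raise ValueError("all classifiers must score the same samples")
    averaged = [sum(col) / len(prob_lists) for col in zip(*prob_lists)]
    labels = [1 if p >= threshold else 0 for p in averaged]
    return averaged, labels


# Made-up scores for three lncRNA-protein pairs from three classifiers.
dnn_probs = [0.9, 0.2, 0.6]
xgb_probs = [0.8, 0.4, 0.3]
svm_probs = [0.7, 0.3, 0.45]

averaged, labels = soft_vote([dnn_probs, xgb_probs, svm_probs])
```

With these made-up scores, the third pair is rejected even though one classifier scores it above 0.5, which shows how averaging can temper a single model's overconfidence.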