1. Hamraz M, Abbas T, Ali F, Khan DM, Aamir M. Modified Robust Proportional Overlapping Score for feature selection in high-dimensional micro-array data. Comput Biol Med 2025; 191:110165. [PMID: 40233674 DOI: 10.1016/j.compbiomed.2025.110165] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/12/2025] [Revised: 03/09/2025] [Accepted: 04/04/2025] [Indexed: 04/17/2025]
Abstract
High-dimensional microarray datasets often contain tens of thousands of genes but only a small number of samples, typically ranging from tens to a few hundred. This imbalance, known as the curse of dimensionality or the n ≪ p problem, hampers the learning process. To address this issue, this study introduces the Modified Robust Proportional Overlapping Score (MRPOS), an enhanced feature selection method based on robust measures of dispersion, specifically the Sn and Qn statistics of Rousseeuw and Croux. MRPOS identifies discriminative genes in binary class problems by evaluating gene expression overlap: it eliminates genes with high inter-class similarity while retaining those that differentiate the classes. The study uses four gene expression datasets, each divided into a training subset covering 70% of the data and a testing subset holding the remaining 30%. The method's performance is assessed against four established feature selection techniques using classification error rates on these datasets. Three classifiers, random forest, k-nearest neighbor (k-NN), and support vector machine (SVM), are employed, with results visualized through bar plots of classification errors. The findings highlight the distinctiveness and effectiveness of the proposed method.
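The abstract names the Sn and Qn scale estimators but does not give the MRPOS formula itself, so the following is only a minimal sketch of the two estimators, not of MRPOS. It assumes the naive O(n²) definitions with the usual asymptotic consistency constants (1.1926 and 2.2219) and omits finite-sample correction factors; the function names `sn_scale` and `qn_scale` are illustrative.

```python
from itertools import combinations
from statistics import stdev


def sn_scale(x):
    """Naive O(n^2) Sn: c * lomed_i( himed_j |x_i - x_j| ), no small-sample correction."""
    n = len(x)
    inner = []
    for xi in x:
        diffs = sorted(abs(xi - xj) for xj in x)
        inner.append(diffs[n // 2])            # high median over j
    inner.sort()
    return 1.1926 * inner[(n + 1) // 2 - 1]    # low median over i


def qn_scale(x):
    """Naive O(n^2) Qn: d * k-th smallest pairwise distance, k = C(h, 2), h = n//2 + 1."""
    n = len(x)
    h = n // 2 + 1
    k = h * (h - 1) // 2
    diffs = sorted(abs(a - b) for a, b in combinations(x, 2))
    return 2.2219 * diffs[k - 1]


# Toy expression values for one gene, with and without a single outlier.
clean = [1, 2, 3, 4, 5]
contaminated = [1, 2, 3, 4, 100]

print("stdev:", stdev(clean), stdev(contaminated))
print("Sn:   ", sn_scale(clean), sn_scale(contaminated))
print("Qn:   ", qn_scale(clean), qn_scale(contaminated))
```

On this toy input the standard deviation is dominated by the single outlier, while Sn and Qn change little or not at all, which is the property that motivates using them as dispersion measures for noisy expression data.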
Affiliation(s)
- Muhammad Hamraz: Department of Statistics, Abdul Wali Khan University, Mardan, 23200, Pakistan
- Tahir Abbas: Department of Mathematics, College of Sciences, University of Sharjah, 27272, Sharjah, United Arab Emirates
- Fawad Ali: Department of Statistics, Abdul Wali Khan University, Mardan, 23200, Pakistan
- Dost Muhammad Khan: Department of Statistics, Abdul Wali Khan University, Mardan, 23200, Pakistan
- Muhammad Aamir: Department of Statistics, Abdul Wali Khan University, Mardan, 23200, Pakistan
2. Li Q, Zhang Z, Ma Z. Raman spectral pattern recognition of breast cancer: A machine learning strategy based on feature fusion and adaptive hyperparameter optimization. Heliyon 2023; 9:e18148. [PMID: 37501962 PMCID: PMC10368853 DOI: 10.1016/j.heliyon.2023.e18148] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 02/22/2023] [Revised: 07/08/2023] [Accepted: 07/10/2023] [Indexed: 07/29/2023] [Open Access]
Abstract
Raman spectroscopy, a form of molecular vibration spectroscopy, provides abundant information on composition and molecular structure for the early detection and diagnosis of breast cancer. Portable Raman spectrometers have made the equipment simpler and more affordable to use, albeit at the cost of a lower signal-to-noise ratio (SNR). This in turn demands a higher recognition rate from pattern recognition algorithms. Our study employs a feature fusion strategy to reduce the dimensionality of high-dimensional Raman spectra and enhance the discriminative information between normal tissues and tumors. In the random experiment conducted, the classifier achieved over 96% on all three average metrics: accuracy, sensitivity, and specificity. Additionally, we propose a multi-parameter serial encoding evolutionary algorithm (MSEA) and integrate it into the Adaptive Local Hyperplane K-nearest Neighbor classification algorithm (ALHK) for adaptive hyperparameter optimization. Serial encoding addresses the difficulty of parallel optimization in multi-hyperparameter vector problems. To strengthen the convergence of the optimization algorithm towards a global optimum, an exponential viability function is devised for nonlinear processing. Moreover, an improved elitist strategy is employed for individual selection, effectively eliminating the influence of probability factors on the robustness of the optimization algorithm. The study further refines the hyperparameter space through sensitivity analysis of the hyperparameters and cross-validation experiments, leading to superior performance compared to the ALHK algorithm with manual hyperparameter configuration.
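The abstract does not specify MSEA's serial encoding or its exponential viability function, so the sketch below illustrates only the general idea of elitist (deterministic, rank-based) selection in an evolutionary hyperparameter search. Everything in it is an assumption for illustration: the function `evolve`, the Gaussian mutation, and the toy objective standing in for a cross-validated classifier score.

```python
import random


def evolve(fitness, bounds, pop_size=20, n_elite=4, generations=30, seed=0):
    """Elitist evolutionary search: the top n_elite individuals always survive unchanged."""
    rng = random.Random(seed)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        elite = pop[:n_elite]                  # deterministic selection, no roulette wheel
        history.append(fitness(elite[0]))
        children = []
        while len(children) < pop_size - n_elite:
            parent = rng.choice(elite)
            # Gaussian mutation, clipped to the allowed range of each hyperparameter.
            child = [min(hi, max(lo, g + rng.gauss(0, 0.1 * (hi - lo))))
                     for g, (lo, hi) in zip(parent, bounds)]
            children.append(child)
        pop = elite + children
    pop.sort(key=fitness, reverse=True)
    history.append(fitness(pop[0]))
    return pop[0], history


def toy_score(params):
    """Hypothetical stand-in for a cross-validated accuracy, peaked at (7, 0.3)."""
    k, c = params
    return -((k - 7.0) ** 2 + (c - 0.3) ** 2)


best, history = evolve(toy_score, bounds=[(1.0, 15.0), (0.0, 1.0)])
print(best, history[-1])
```

Because the elite are copied into the next generation unchanged, the best score recorded in `history` can never decrease, which is the robustness property that elitism is meant to provide.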
Affiliation(s)
- Qingbo Li: School of Instrumentation and Optoelectronic Engineering, Precision Opto-Mechatronics Technology Key Laboratory of Education Ministry, Beihang University, Xueyuan Road No. 37, Haidian District, Beijing, 100191, China
- Zhixiang Zhang: School of Instrumentation and Optoelectronic Engineering, Precision Opto-Mechatronics Technology Key Laboratory of Education Ministry, Beihang University, Xueyuan Road No. 37, Haidian District, Beijing, 100191, China
- Zhenhe Ma: Hebei Key Laboratory of Micro-Nano Precision Optical Sensing and Detection Technology, Northeastern University, Qinhuangdao Campus, Qinhuangdao, 066004, China
3. Hamraz M, Ali A, Mashwani WK, Aldahmani S, Khan Z. Feature selection for high dimensional microarray gene expression data via weighted signal to noise ratio. PLoS One 2023; 18:e0284619. [PMID: 37098036 PMCID: PMC10128961 DOI: 10.1371/journal.pone.0284619] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/14/2022] [Accepted: 04/04/2023] [Indexed: 04/26/2023] [Open Access]
Abstract
Feature selection in high dimensional gene expression datasets reduces not only the dimension of the data, but also the execution time and computational cost of the underlying classifier. The current study introduces a novel feature selection method called weighted signal to noise ratio (WSNR), which exploits feature weights based on support vectors and the signal to noise ratio, with the objective of identifying the most informative genes in high dimensional classification problems. The combination of two state-of-the-art procedures enables the extraction of the most informative genes. The corresponding weights of these procedures are multiplied and arranged in decreasing order. A larger weight indicates greater discriminatory power of a feature in classifying the tissue samples to their true classes. The method is validated on eight gene expression datasets. Moreover, the results of the proposed method (WSNR) are compared with four well known feature selection methods. We found that WSNR outperforms the other competing methods on 6 out of 8 datasets. Box-plots and bar-plots of the results of the proposed method and all the other methods are also constructed. The proposed method is further assessed on simulated data. Simulation analysis reveals that WSNR outperforms all the other methods included in the study.
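The multiply-and-rank step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the commonly used per-gene signal to noise ratio |mean₀ − mean₁| / (sd₀ + sd₁), and it takes the support-vector-based weights as a given input list (`svm_weights`, with hypothetical values), since the abstract does not say how they are derived.

```python
from statistics import mean, stdev


def snr_scores(X, y):
    """Per-gene signal to noise ratio for a binary problem with labels 0 and 1."""
    scores = []
    for j in range(len(X[0])):
        g0 = [row[j] for row, label in zip(X, y) if label == 0]
        g1 = [row[j] for row, label in zip(X, y) if label == 1]
        denom = stdev(g0) + stdev(g1)
        # A gene that is constant within both classes gets score 0 here (a simplification).
        scores.append(abs(mean(g0) - mean(g1)) / denom if denom > 0 else 0.0)
    return scores


def rank_by_weighted_snr(X, y, svm_weights):
    """Multiply SNR by the absolute SVM-based weight and sort genes in decreasing order."""
    snr = snr_scores(X, y)
    combined = [s * abs(w) for s, w in zip(snr, svm_weights)]
    order = sorted(range(len(combined)), key=lambda j: combined[j], reverse=True)
    return order, combined


# Rows are samples, columns are genes; labels mark the two tissue classes.
X = [[1, 4], [2, 5], [3, 6], [7, 5], [8, 6], [9, 7]]
y = [0, 0, 0, 1, 1, 1]
svm_weights = [0.1, 2.0]   # hypothetical weights, for illustration only

order, combined = rank_by_weighted_snr(X, y, svm_weights)
print(order, combined)
```

In this toy input gene 0 has the higher SNR on its own (3.0 against 0.5), but the hypothetical SVM weights reverse the final ranking, which shows how the product combines the two sources of evidence.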
Affiliation(s)
- Muhammad Hamraz: Department of Statistics, Abdul Wali Khan University Mardan, Mardan, Pakistan
- Amjad Ali: Department of Statistics, Abdul Wali Khan University Mardan, Mardan, Pakistan
- Wali Khan Mashwani: Institute of Numerical Sciences, Kohat University of Science and Technology, Kohat, Pakistan
- Saeed Aldahmani: Department of Analytics in the Digital Era, United Arab Emirates University, Al Ain, UAE
- Zardad Khan: Department of Analytics in the Digital Era, United Arab Emirates University, Al Ain, UAE
4. Liang J, Wang C, Zhang D, Xie Y, Zeng Y, Li T, Zuo Z, Ren J, Zhao Q. VSOLassoBag: a variable-selection oriented LASSO bagging algorithm for biomarker discovery in omic-based translational research. J Genet Genomics 2023; 50:151-162. [PMID: 36608930 DOI: 10.1016/j.jgg.2022.12.005] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 12/25/2022] [Accepted: 12/26/2022] [Indexed: 01/04/2023]
Abstract
Screening biomolecular markers from high-dimensional biological data is one of the long-standing tasks of biomedical translational research. With its advantages in both feature shrinkage and biological interpretability, the Least Absolute Shrinkage and Selection Operator (LASSO) algorithm is one of the most popular methods for clinical biomarker development. In practice, however, applying LASSO to omics-based data with high dimensions and low sample size often yields an excessive number of predictive variables, leading to overfitting of the model. Here, we present VSOLassoBag, a wrapped LASSO approach that integrates an ensemble learning strategy to help select efficient and stable variables with high confidence from omics-based data. Using a bagging strategy in combination with a parametric method or an inflection point search method, VSOLassoBag can integrate and vote on variables generated from multiple LASSO models to determine the optimal candidates. The application of VSOLassoBag to both simulated and real-world datasets shows that the algorithm can effectively identify markers for either case-control binary classification or prognosis prediction. In addition, in comparison with multiple existing algorithms, VSOLassoBag shows comparable performance under different scenarios while selecting fewer features than the others. In summary, VSOLassoBag, which is available at https://seqworld.com/VSOLassoBag/ under the GPL v3 license, provides an alternative strategy for selecting reliable biomarkers from high-dimensional omics data. For users' convenience, we implement VSOLassoBag as an R package that provides multithreading computing configurations.
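The bagging-and-voting idea can be sketched in a few lines. This is not the VSOLassoBag package or its API: it is a minimal pure-Python illustration that fits a small coordinate-descent LASSO on bootstrap resamples and counts how often each variable receives a nonzero coefficient. The fixed frequency cutoff of 0.8 is an assumption that stands in for the parametric and inflection point methods named in the abstract, and all function names and the synthetic data are hypothetical.

```python
import random


def soft_threshold(z, lam):
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_cd(X, y, lam, n_iter=100):
    """Coordinate descent for (1/2n)||y - Xb||^2 + lam*||b||_1 on centered data."""
    n, p = len(X), len(X[0])
    y_mean = sum(y) / n
    col_means = [sum(row[j] for row in X) / n for j in range(p)]
    Xc = [[row[j] - col_means[j] for j in range(p)] for row in X]
    resid = [yi - y_mean for yi in y]          # residual for beta = 0
    z = [sum(Xc[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            if z[j] == 0.0:
                continue                       # constant column carries no information
            rho = sum(Xc[i][j] * resid[i] for i in range(n)) / n + z[j] * beta[j]
            new_bj = soft_threshold(rho, lam) / z[j]
            delta = new_bj - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= delta * Xc[i][j]
                beta[j] = new_bj
    return beta


def lasso_bag_frequencies(X, y, lam, n_bags=30, seed=0):
    """Fraction of bootstrap LASSO fits in which each variable has a nonzero coefficient."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    counts = [0] * p
    for _ in range(n_bags):
        idx = [rng.randrange(n) for _ in range(n)]     # sample rows with replacement
        beta = lasso_cd([X[i] for i in idx], [y[i] for i in idx], lam)
        for j, b in enumerate(beta):
            if b != 0.0:
                counts[j] += 1
    return [c / n_bags for c in counts]


# Synthetic data: only variable 0 drives the response; variables 1 and 2 are noise.
data_rng = random.Random(1)
X = [[data_rng.gauss(0, 1) for _ in range(3)] for _ in range(40)]
y = [3.0 * row[0] + 0.1 * data_rng.gauss(0, 1) for row in X]

freq = lasso_bag_frequencies(X, y, lam=0.5)
selected = [j for j, f in enumerate(freq) if f >= 0.8]   # assumed fixed cutoff
print(freq, selected)
```

A variable that survives in nearly every resample is a more stable candidate than one that a single LASSO fit happens to retain, which is the motivation for voting across models.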
Affiliation(s)
- Jiaqi Liang: State Key Laboratory of Oncology in South China, Collaborative Innovation Center for Cancer Medicine, Sun Yat-sen University Cancer Center, Guangzhou, Guangdong 510060, China; State Key Laboratory of Biocontrol, School of Life Sciences, Sun Yat-sen University, Guangzhou, Guangdong 510275, China
- Chaoye Wang: State Key Laboratory of Oncology in South China, Collaborative Innovation Center for Cancer Medicine, Sun Yat-sen University Cancer Center, Guangzhou, Guangdong 510060, China
- Di Zhang: Department of Coloproctology Surgery, Guangdong Provincial Key Laboratory of Colorectal and Pelvic Floor Diseases, Guangdong Institute of Gastroenterology, The Sixth Affiliated Hospital, Sun Yat-sen University, Guangzhou, Guangdong 510655, China
- Yubin Xie: Precision Medicine Institute, The First Affiliated Hospital, Sun Yat-sen University, Guangzhou, Guangdong 510060, China
- Yanru Zeng: State Key Laboratory of Biocontrol, School of Life Sciences, Sun Yat-sen University, Guangzhou, Guangdong 510275, China
- Tianqin Li: Computer Science Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, United States
- Zhixiang Zuo: State Key Laboratory of Oncology in South China, Collaborative Innovation Center for Cancer Medicine, Sun Yat-sen University Cancer Center, Guangzhou, Guangdong 510060, China
- Jian Ren: State Key Laboratory of Oncology in South China, Collaborative Innovation Center for Cancer Medicine, Sun Yat-sen University Cancer Center, Guangzhou, Guangdong 510060, China
- Qi Zhao: State Key Laboratory of Oncology in South China, Collaborative Innovation Center for Cancer Medicine, Sun Yat-sen University Cancer Center, Guangzhou, Guangdong 510060, China
5. An Efficient AP-ANN-Based Multimethod Fusion Model to Detect Stress through EEG Signal Analysis. Comput Intell Neurosci 2022; 2022:7672297. [PMID: 36544857 PMCID: PMC9763020 DOI: 10.1155/2022/7672297] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/25/2022] [Revised: 09/30/2022] [Accepted: 10/31/2022] [Indexed: 12/14/2022]
Abstract
Stress is a universal emotion that every human experiences daily. Psychologists say stress may lead to heart attack, depression, hypertension, strokes, or even sudden death. Many approaches to stress detection have been explored, including facial expression, speech, text, and physical behavior, but no consensus has been reached on the best method. Advances in biomedical engineering have driven the rapid development of electroencephalogram (EEG) signal analysis, which inspired, for the first time, a multimethod fusion approach. It employs the discrete wavelet transform (DWT) for de-noising, adaptive synthetic sampling (ADASYN) for class balancing, and affinity propagation (AP) as a stratified sampling model, with an artificial neural network (ANN) as the classifier for human emotion classification. From the EEG recordings of the DEAP dataset, artifacts are removed, the signal is decomposed using a DWT, and features are extracted and fused to form the feature vector. As the dataset is high-dimensional, feature selection is performed, and ADASYN is used to address the class imbalance, resulting in large-scale data. The innovative idea of the proposed system is to perform sampling using affinity propagation as a stratified sampling-based clustering algorithm: it determines the number of representative samples automatically, which makes it superior to K-Means and K-Medoids, which require the value of K to be specified. The resulting samples are used as inputs to several classification models, and the AP-ANN, AP-SVM, and AP-RF models are compared on five performance metrics: accuracy, precision, recall, F1-score, and specificity. In our experiment, the AP-ANN model achieves an accuracy of 86.8%, a precision of 85.7%, an F1 score of 84.9%, a recall of 84.1%, and a specificity of 89.2%, which together are better than the results of the other existing algorithms.
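The DWT de-noising step can be illustrated with a single-level transform. The abstract does not name the wavelet family, the decomposition depth, or the thresholding rule, so this sketch assumes a one-level Haar wavelet with soft thresholding of the detail coefficients; the function names and the toy signal are illustrative only.

```python
import math


def haar_dwt(signal):
    """One-level Haar transform of an even-length signal: (approximation, detail)."""
    s = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail


def haar_idwt(approx, detail):
    """Inverse of haar_dwt."""
    s = math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) / s, (a - d) / s])
    return out


def soft(value, threshold):
    """Soft thresholding: shrink towards zero by threshold, zeroing small coefficients."""
    return math.copysign(max(abs(value) - threshold, 0.0), value)


def denoise(signal, threshold):
    if len(signal) % 2:
        raise ValueError("signal length must be even for a one-level Haar transform")
    approx, detail = haar_dwt(signal)
    return haar_idwt(approx, [soft(d, threshold) for d in detail])


# Toy signal: two plateaus with small sample-to-sample jitter.
noisy = [1.0, 1.2, 0.9, 1.1, 5.0, 5.2, 4.9, 5.1]
print(denoise(noisy, threshold=0.2))
```

With a threshold of zero the transform reconstructs the input exactly (up to floating-point rounding); a larger threshold suppresses the small detail coefficients, so the jitter within each pair of samples is smoothed while the step between the plateaus is preserved.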