1
Liu K, Chen Q, Huang GH. An Efficient Feature Selection Algorithm for Gene Families Using NMF and ReliefF. Genes (Basel) 2023; 14:421. [PMID: 36833348 PMCID: PMC9957060 DOI: 10.3390/genes14020421] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/13/2022] [Revised: 01/24/2023] [Accepted: 01/25/2023] [Indexed: 02/10/2023] Open
Abstract
Gene families, which are part of a genome's information storage hierarchy, play a significant role in the development and diversity of multicellular organisms. Several studies have focused on the characteristics of gene families, such as function, homology, or phenotype. However, statistical and correlation analyses of the distribution of gene family members in the genome have yet to be conducted. Here, a novel framework incorporating gene family analysis and genome selection based on NMF-ReliefF is reported. Specifically, the proposed method starts by obtaining gene families from the TreeFam database and determining the number of gene families within the feature matrix. Then NMF-ReliefF, a new feature selection algorithm that overcomes the inefficiencies of traditional methods, is used to select features from the gene feature matrix. Finally, a support vector machine is used to classify the selected features. The results show that the framework achieved an accuracy of 89.1% and an AUC of 0.919 on the insect genome test set. We also employed four microarray gene data sets to evaluate the performance of the NMF-ReliefF algorithm. The outcomes show that the proposed method may strike a delicate balance between robustness and discrimination. Additionally, the proposed method's classification performance is superior to that of state-of-the-art feature selection approaches.
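The feature-weighting idea behind the Relief family can be illustrated with a short sketch. This is not the authors' NMF-ReliefF: the abstract does not say how the NMF stage and ReliefF are combined, so the code below shows only basic Relief for two classes (one nearest hit and one nearest miss per sample), and it omits both the NMF step and the SVM classifier. The function names `relief_weights` and `select_top_k` and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def relief_weights(X, y):
    """Basic Relief weights for binary labels, visiting every sample once."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n_samples, n_features = X.shape
    # Scale each feature to [0, 1] so per-feature differences are comparable.
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0
    Xs = (X - X.min(axis=0)) / span
    weights = np.zeros(n_features)
    for i in range(n_samples):
        dist = np.abs(Xs - Xs[i]).sum(axis=1)  # Manhattan distance to all samples
        dist[i] = np.inf                       # never pick the sample itself
        same = (y == y[i])
        hit = np.argmin(np.where(same, dist, np.inf))    # nearest same-class sample
        miss = np.argmin(np.where(~same, dist, np.inf))  # nearest other-class sample
        # Reward features that differ on the miss, penalize those that differ on the hit.
        weights += np.abs(Xs[i] - Xs[miss]) - np.abs(Xs[i] - Xs[hit])
    return weights / n_samples

def select_top_k(X, y, k):
    """Return the indices of the k highest-weighted features, best first."""
    return np.argsort(relief_weights(X, y))[::-1][:k]

# Synthetic stand-in for a gene feature matrix (hypothetical data).
rng = np.random.default_rng(0)
y = np.array([0] * 20 + [1] * 20)
X = rng.random((40, 6))
X[:, 2] = y + 0.05 * rng.random(40)  # feature 2 tracks the class label
selected = select_top_k(X, y, k=2)
```

In a pipeline like the one the abstract outlines, the columns indexed by `selected` would then be passed to a classifier; that step is left out here.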
Affiliation(s)
- Kai Liu
  - College of Plant Protection, Hunan Agricultural University, Changsha 410128, China
  - Hunan Provincial Key Laboratory for Biology and Control of Plant Diseases and Insect Pests, Hunan Agricultural University, Nongda Road, Furong District, Changsha 410128, China
  - College of Information and Intelligence, Hunan Agricultural University, Changsha 410128, China
- Qi Chen
  - College of Plant Protection, Hunan Agricultural University, Changsha 410128, China
  - Hunan Provincial Key Laboratory for Biology and Control of Plant Diseases and Insect Pests, Hunan Agricultural University, Nongda Road, Furong District, Changsha 410128, China
- Guo-Hua Huang
  - College of Plant Protection, Hunan Agricultural University, Changsha 410128, China
  - Hunan Provincial Key Laboratory for Biology and Control of Plant Diseases and Insect Pests, Hunan Agricultural University, Nongda Road, Furong District, Changsha 410128, China
2
Gerolami J, Wong JJM, Zhang R, Chen T, Imtiaz T, Smith M, Jamaspishvili T, Koti M, Glasgow JI, Mousavi P, Renwick N, Tyryshkin K. A Computational Approach to Identification of Candidate Biomarkers in High-Dimensional Molecular Data. Diagnostics (Basel) 2022; 12:1997. [PMID: 36010347 PMCID: PMC9407361 DOI: 10.3390/diagnostics12081997] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 07/05/2022] [Revised: 08/16/2022] [Accepted: 08/17/2022] [Indexed: 12/13/2022] Open
Abstract
Complex high-dimensional datasets that are challenging to analyze are frequently produced through ‘-omics’ profiling. Typically, these datasets contain more genomic features than samples, limiting the use of multivariable statistical and machine learning-based approaches to analysis. Therefore, effective alternative approaches are urgently needed to identify features of interest in ‘-omics’ data. In this study, we present the molecular feature selection tool, a novel, ensemble-based feature selection application for identifying candidate biomarkers in ‘-omics’ data. As proof of principle, we applied the molecular feature selection tool to identify a small set of immune-related genes as potential biomarkers of three prostate adenocarcinoma subtypes. Furthermore, we tested the selected genes in a model to classify the three subtypes and compared the results to models built using all genes and all differentially expressed genes. The model built on genes identified with the molecular feature selection tool performed better than the other models in this study on all comparison metrics (accuracy, precision, recall, and F1-score) while using a significantly smaller set of genes. In addition, we developed a simple graphical user interface for the molecular feature selection tool, which is available for free download. This user-friendly interface is a valuable tool for the identification of potential biomarkers in gene expression datasets and is an asset for biomarker discovery studies.
Affiliation(s)
- Justin Gerolami
  - School of Computing, Queen’s University, Kingston, ON K7L 3N6, Canada
- Justin Jong Mun Wong
  - Department of Pathology and Molecular Medicine, Queen’s University, Kingston, ON K7L 3N6, Canada
- Ricky Zhang
  - School of Computing, Queen’s University, Kingston, ON K7L 3N6, Canada
- Tong Chen
  - School of Computing, Queen’s University, Kingston, ON K7L 3N6, Canada
- Tashifa Imtiaz
  - Department of Pathology and Molecular Medicine, Queen’s University, Kingston, ON K7L 3N6, Canada
- Miranda Smith
  - School of Computing, Queen’s University, Kingston, ON K7L 3N6, Canada
- Tamara Jamaspishvili
  - Department of Pathology and Molecular Medicine, Queen’s University, Kingston, ON K7L 3N6, Canada
  - Department of Pathology & Laboratory Medicine, SUNY Upstate Medical University, Syracuse, NY 13210, USA
- Madhuri Koti
  - Department of Biomedical and Molecular Sciences, Queen’s University, Kingston, ON K7L 3N6, Canada
- Parvin Mousavi
  - School of Computing, Queen’s University, Kingston, ON K7L 3N6, Canada
- Neil Renwick
  - Department of Pathology and Molecular Medicine, Queen’s University, Kingston, ON K7L 3N6, Canada
- Kathrin Tyryshkin
  - School of Computing, Queen’s University, Kingston, ON K7L 3N6, Canada
  - Department of Pathology and Molecular Medicine, Queen’s University, Kingston, ON K7L 3N6, Canada
  - Correspondence: Tel.: +1-613-533-2345
3
Yang D, Zhu X. Gene Correlation Guided Gene Selection for Microarray Data Classification. Biomed Res Int 2021; 2021:6490118. [PMID: 34435048 DOI: 10.1155/2021/6490118] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 06/18/2021] [Accepted: 08/09/2021] [Indexed: 12/14/2022]
Abstract
Microarray cancer data obtained by DNA microarray technology play an important role in cancer prevention, diagnosis, and treatment. However, predicting different tumor types is challenging because the sample size in microarray data is often small while the dimensionality is very high. Gene selection is an effective means of mitigating the curse of dimensionality and can boost the classification accuracy of microarray data. However, many previous gene selection methods focus on model design and neglect the correlation between different genes. In this paper, we introduce a novel unsupervised gene selection method that takes gene correlation into consideration, named gene correlation guided gene selection (G3CS). Specifically, we calculate the covariance of different gene dimension pairs and embed it into our unsupervised gene selection model to regularize the gene selection coefficient matrix. In this manner, redundant genes can be effectively excluded. In addition, we utilize a matrix factorization term to exploit the cluster structure of the original microarray data to assist the learning process. We design an iterative updating algorithm with a convergence guarantee to solve the resulting optimization problem. Experiments on six publicly available microarray datasets validate the efficacy of our proposed method.
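The role of gene-pair covariance in excluding redundant genes can be illustrated with a small sketch. This is not G3CS, which embeds the covariance into a regularized optimization model with a matrix factorization term. The code swaps in a much simpler unsupervised greedy filter that prefers high-variance genes and penalizes correlation with genes already chosen. The function name `greedy_low_redundancy`, the `penalty` parameter, and the toy expression matrix are assumptions made for illustration.

```python
import numpy as np

def greedy_low_redundancy(X, k, penalty=1.0):
    """Greedily pick k genes with high variance and low mutual correlation."""
    X = np.asarray(X, dtype=float)
    cov = np.cov(X, rowvar=False)            # gene-by-gene covariance matrix
    var = np.diag(cov).copy()
    std = np.sqrt(np.where(var > 0, var, 1.0))
    corr = np.abs(cov / np.outer(std, std))  # absolute pairwise correlation
    selected = [int(np.argmax(var))]         # start from the most variable gene
    while len(selected) < k:
        # Redundancy of each gene: its strongest correlation with a chosen gene.
        redundancy = corr[:, selected].max(axis=1)
        score = var / var.max() - penalty * redundancy
        score[selected] = -np.inf            # do not pick a gene twice
        selected.append(int(np.argmax(score)))
    return selected

# Toy expression matrix: rows are samples, columns are genes (hypothetical data).
g0 = [4, -4, 4, -4, 4, -4, 4, -4]
g1 = [4.1, -4, 4, -4, 4, -4, 4, -4]      # near-duplicate of gene 0
g2 = [2, 2, -2, -2, 2, 2, -2, -2]        # uncorrelated with gene 0
g3 = [0.1, 0.1, 0.1, 0.1, -0.1, -0.1, -0.1, -0.1]  # almost constant
X = np.column_stack([g0, g1, g2, g3])
selected = greedy_low_redundancy(X, k=2)
```

With this toy matrix, the filter keeps only one of the two near-duplicate genes and fills the second slot with the uncorrelated gene, which mirrors the stated goal of excluding redundant genes.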