1
Downie L, Bouffler SE, Amor DJ, Christodoulou J, Yeung A, Horton AE, Macciocca I, Archibald AD, Wall M, Caruana J, Lunke S, Stark Z. Gene selection for genomic newborn screening: Moving toward consensus? Genet Med 2024; 26:101077. [PMID: 38275146 DOI: 10.1016/j.gim.2024.101077] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/06/2023] [Revised: 01/15/2024] [Accepted: 01/18/2024] [Indexed: 01/27/2024] Open
Abstract
PURPOSE Gene selection for genomic newborn screening (gNBS) underpins the validity, acceptability, and ethical application of this technology. Existing gNBS gene lists are highly variable despite being based on shared principles of gene-disease validity, treatability, and age of onset. This study aimed to curate a gNBS gene list that builds upon existing efforts and provide a core consensus list of gene-disease pairs assessed by multiple expert groups worldwide. METHODS Our multidisciplinary expert team curated a gene list using an open platform and multiple existing curated resources. We included severe treatable disorders with age of disease onset <5 years with established gene-disease associations and reliable variant detection. We compared the final list with published lists from 5 other gNBS projects to determine consensus genes and to identify areas of discrepancy. RESULTS We reviewed 1279 genes and 604 met our inclusion criteria. Metabolic conditions comprised the largest group (25%), followed by immunodeficiencies (21%) and endocrine disorders (15%). We identified 55 consensus genes included by all 6 gNBS research projects. Common reasons for discrepancy included variable definitions of treatability and strength of gene-disease association. CONCLUSION We have identified a consensus gene list for gNBS that can be used as a basis for systematic harmonization efforts internationally.
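The consensus step described in this abstract (genes included by every participating project) amounts to a set intersection. A minimal sketch follows; the project names and the short gene lists are hypothetical toy data, not the lists compared in the study.

```python
def consensus_genes(gene_lists):
    """Return the genes that appear in every project's list, sorted by symbol."""
    sets = [set(genes) for genes in gene_lists.values()]
    if not sets:
        return []
    return sorted(set.intersection(*sets))


# Hypothetical toy lists for illustration only.
project_lists = {
    "project_a": ["PAH", "CFTR", "SMN1", "GAA"],
    "project_b": ["PAH", "CFTR", "SMN1"],
    "project_c": ["PAH", "SMN1", "CFTR", "HBB"],
}

consensus = consensus_genes(project_lists)
# Genes listed by at least one project but not by all of them.
discrepant = sorted(set().union(*project_lists.values()) - set(consensus))
```

Genes in `discrepant` are the ones that would need case-by-case review, for example for differing definitions of treatability.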
Affiliation(s)
- Lilian Downie
  - Victorian Clinical Genetics Services, Murdoch Children's Research Institute, Melbourne, VIC, Australia; Murdoch Children's Research Institute, Melbourne, VIC, Australia; University of Melbourne, Melbourne, VIC, Australia
- David J Amor
  - Murdoch Children's Research Institute, Melbourne, VIC, Australia; University of Melbourne, Melbourne, VIC, Australia
- John Christodoulou
  - Victorian Clinical Genetics Services, Murdoch Children's Research Institute, Melbourne, VIC, Australia; Murdoch Children's Research Institute, Melbourne, VIC, Australia; University of Melbourne, Melbourne, VIC, Australia
- Alison Yeung
  - Victorian Clinical Genetics Services, Murdoch Children's Research Institute, Melbourne, VIC, Australia; University of Melbourne, Melbourne, VIC, Australia
- Ari E Horton
  - Victorian Heart Institute, Monash University, Melbourne, VIC, Australia; Public Health Genomics, School of Public Health and Preventive Medicine, Monash University, Melbourne, VIC, Australia
- Ivan Macciocca
  - Victorian Clinical Genetics Services, Murdoch Children's Research Institute, Melbourne, VIC, Australia; University of Melbourne, Melbourne, VIC, Australia
- Alison D Archibald
  - Victorian Clinical Genetics Services, Murdoch Children's Research Institute, Melbourne, VIC, Australia; Murdoch Children's Research Institute, Melbourne, VIC, Australia; University of Melbourne, Melbourne, VIC, Australia
- Meghan Wall
  - Victorian Clinical Genetics Services, Murdoch Children's Research Institute, Melbourne, VIC, Australia; Murdoch Children's Research Institute, Melbourne, VIC, Australia; University of Melbourne, Melbourne, VIC, Australia
- Jade Caruana
  - Murdoch Children's Research Institute, Melbourne, VIC, Australia
- Sebastian Lunke
  - Victorian Clinical Genetics Services, Murdoch Children's Research Institute, Melbourne, VIC, Australia; University of Melbourne, Melbourne, VIC, Australia
- Zornitza Stark
  - Victorian Clinical Genetics Services, Murdoch Children's Research Institute, Melbourne, VIC, Australia; University of Melbourne, Melbourne, VIC, Australia; Australian Genomics, Melbourne, VIC, Australia.
2
Qiu F, Heidari AA, Chen Y, Chen H, Liang G. Advancing forensic-based investigation incorporating slime mould search for gene selection of high-dimensional genetic data. Sci Rep 2024; 14:8599. [PMID: 38615048 PMCID: PMC11016116 DOI: 10.1038/s41598-024-59064-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/02/2024] [Accepted: 04/06/2024] [Indexed: 04/15/2024] Open
Abstract
Modern medicine has produced large genetic datasets of high dimensions through advanced gene sequencing technology, and processing these data is of great significance for clinical decision-making. Gene selection (GS) is an important data preprocessing technique that aims to select a subset of feature information to improve performance and reduce data dimensionality. This study proposes an improved wrapper GS method based on forensic-based investigation (FBI). The method introduces the search mechanism of the slime mould algorithm in the FBI to improve the original FBI; the newly proposed algorithm is named SMA_FBI; then GS is performed by converting the continuous optimizer to a binary version of the optimizer through a transfer function. In order to verify the superiority of SMA_FBI, experiments are first executed on the 30-function test set of CEC2017 and compared with 10 original algorithms and 10 state-of-the-art algorithms. The experimental results show that SMA_FBI is better than other algorithms in terms of finding the optimal solution, convergence speed, and robustness. In addition, BSMA_FBI (binary version of SMA_FBI) is compared with 8 binary algorithms on 18 high-dimensional genetic data from the UCI repository. The results indicate that BSMA_FBI is able to obtain high classification accuracy with fewer features selected in GS applications. Therefore, SMA_FBI is considered an optimization tool with great potential for dealing with global optimization problems, and its binary version, BSMA_FBI, can be used for GS tasks.
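The abstract mentions converting a continuous optimizer into a binary one through a transfer function but does not say which function is used. As an illustration only, the sketch below assumes the common S-shaped (sigmoid) transfer function: each continuous coordinate is mapped to a probability, and a gene is selected when a random draw falls below it. The function names and sample values are hypothetical.

```python
import math
import random


def s_shaped_transfer(x):
    """Map a continuous position component to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def binarize(position, rng):
    """Turn a continuous position vector into a 0/1 gene-selection mask."""
    return [1 if rng.random() < s_shaped_transfer(x) else 0 for x in position]


rng = random.Random(0)              # fixed seed so the sketch is repeatable
position = [-50.0, 50.0, 0.0]       # hypothetical continuous search position
mask = binarize(position, rng)      # 1 = gene selected, 0 = gene dropped
```

Strongly negative coordinates are almost never selected and strongly positive ones almost always are, while a coordinate at zero is selected with probability 0.5.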
Affiliation(s)
- Feng Qiu
  - Institute of Big Data and Information Technology, Wenzhou University, Wenzhou, 325035, China
- Ali Asghar Heidari
  - School of Surveying and Geospatial Engineering, College of Engineering, University of Tehran, Tehran, Iran
- Yi Chen
  - Department of Computer Science and Artificial Intelligence, Wenzhou University, Wenzhou, 325035, China
- Huiling Chen
  - Institute of Big Data and Information Technology, Wenzhou University, Wenzhou, 325035, China.
- Guoxi Liang
  - Department of Artificial Intelligence, Wenzhou Polytechnic, Wenzhou, 325035, China.
3
Chen J, Wen B. Bi-level gene selection of cancer by combining clustering and sparse learning. Comput Biol Med 2024; 172:108236. [PMID: 38471351 DOI: 10.1016/j.compbiomed.2024.108236] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/03/2023] [Revised: 02/07/2024] [Accepted: 02/25/2024] [Indexed: 03/14/2024]
Abstract
The diagnosis of cancer based on gene expression profile data has attracted extensive attention in the field of biomedical science. This type of data usually has the characteristics of high dimensionality and noise. In this paper, a hybrid gene selection method based on clustering and sparse learning is proposed to choose the key genes with high precision. We first propose a filter method, which combines the k-means clustering algorithm and signal-to-noise ratio ranking method, and then, a weighted gene co-expression network has been applied to the reduced data set to identify modules corresponding to biological pathways. Moreover, we choose the key genes by using group bridge and sparse group lasso as wrapper methods. Finally, we conduct some numerical experiments on six cancer datasets. The numerical results show that our proposed method has achieved good performance in gene selection and cancer classification.
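The filter stage described above ranks genes by signal-to-noise ratio (SNR). The abstract does not give the formula; the sketch below assumes the widely used two-class definition, the difference of class means divided by the sum of class standard deviations, and omits the k-means step. The gene names and expression values are hypothetical.

```python
from statistics import mean, stdev


def snr(values_a, values_b):
    """Two-class signal-to-noise ratio: (mean_a - mean_b) / (sd_a + sd_b)."""
    denom = stdev(values_a) + stdev(values_b)
    if denom == 0:
        return 0.0
    return (mean(values_a) - mean(values_b)) / denom


def rank_by_snr(expression, labels):
    """Rank genes by |SNR|; `expression` maps gene -> per-sample values."""
    scores = {}
    for gene, values in expression.items():
        class0 = [v for v, y in zip(values, labels) if y == 0]
        class1 = [v for v, y in zip(values, labels) if y == 1]
        scores[gene] = abs(snr(class0, class1))
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy data: three samples per class.
labels = [0, 0, 0, 1, 1, 1]
expression = {
    "g1": [1.0, 2.0, 3.0, 7.0, 8.0, 9.0],   # well separated between classes
    "g2": [4.0, 5.0, 6.0, 5.0, 6.0, 7.0],   # heavily overlapping
}
ranking = rank_by_snr(expression, labels)
```

Keeping only the top-ranked genes from `ranking` is one simple way to shrink the data before a more expensive wrapper stage.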
Affiliation(s)
- Junnan Chen
  - School of Science, Hebei University of Technology, Tianjin, PR China.
- Bo Wen
  - Institute of Mathematics, Hebei University of Technology, Tianjin, PR China.
4
Li M, Cao R, Zhao Y, Li Y, Deng S. Population characteristic exploitation-based multi-orientation multi-objective gene selection for microarray data classification. Comput Biol Med 2024; 170:108089. [PMID: 38330824 DOI: 10.1016/j.compbiomed.2024.108089] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/20/2023] [Revised: 01/23/2024] [Accepted: 01/27/2024] [Indexed: 02/10/2024]
Abstract
Gene selection is a process of selecting discriminative genes from microarray data that helps to diagnose and classify cancer samples effectively. Swarm intelligence evolution-based gene selection algorithms can never circumvent the problem that the population is prone to local optima in the process of gene selection. To tackle this challenge, previous research has focused primarily on two aspects: mitigating premature convergence to local optima and escaping from local optima. In contrast to these strategies, this paper introduces a novel perspective by adopting reverse thinking, where the issue of local optima is seen as an opportunity rather than an obstacle. Building on this foundation, we propose MOMOGS-PCE, a novel gene selection approach that effectively exploits the advantageous characteristics of populations trapped in local optima to uncover global optimal solutions. Specifically, MOMOGS-PCE employs a novel population initialization strategy, which involves the initialization of multiple populations that explore diverse orientations to foster distinct population characteristics. The subsequent step involved the utilization of an enhanced NSGA-II algorithm to amplify the advantageous characteristics exhibited by the population. Finally, a novel exchange strategy is proposed to facilitate the transfer of characteristics between populations that have reached near maturity in evolution, thereby promoting further population evolution and enhancing the search for more optimal gene subsets. The experimental results demonstrated that MOMOGS-PCE exhibited significant advantages in comprehensive indicators compared with six competitive multi-objective gene selection algorithms. It is confirmed that the "reverse-thinking" approach not only avoids local optima but also leverages it to uncover superior gene subsets for cancer diagnosis.
Affiliation(s)
- Min Li
  - School of Information Engineering, Nanchang Institute of Technology, No. 289 Tianxiang Road, Nanchang, Jiangxi, PR China.
- Rutun Cao
  - School of Information Engineering, Nanchang Institute of Technology, No. 289 Tianxiang Road, Nanchang, Jiangxi, PR China
- Yangfan Zhao
  - School of Information Engineering, Nanchang Institute of Technology, No. 289 Tianxiang Road, Nanchang, Jiangxi, PR China
- Yulong Li
  - School of Information Engineering, Nanchang Institute of Technology, No. 289 Tianxiang Road, Nanchang, Jiangxi, PR China
- Shaobo Deng
  - School of Information Engineering, Nanchang Institute of Technology, No. 289 Tianxiang Road, Nanchang, Jiangxi, PR China
5
Osama S, Ali M, Ali AA, Shaban H. Gene selection and tumor identification based on a hybrid of the multi-filter embedded recursive mountain gazelle algorithm. Comput Biol Med 2023; 167:107674. [PMID: 37976816 DOI: 10.1016/j.compbiomed.2023.107674] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/02/2023] [Revised: 10/09/2023] [Accepted: 11/06/2023] [Indexed: 11/19/2023]
Abstract
Microarray gene expression data are useful for identifying gene expression patterns associated with cancer outcomes; however, their high dimensionality makes it difficult to extract meaningful information and accurately classify tumors. Hence, developing effective methods for reducing dimensionality while preserving relevant information is a crucial task. Hybrid-based gene selection methods are widely proposed in the gene expression analysis domain and can still be enhanced in terms of efficiency and reliability. This study proposes a new hybrid-based gene selection method, called multi-filter embedded mountain gazelle optimizer (MUL-MGO), which utilizes two filters and an embedded method to remove irrelevant genes, followed by selecting the most relevant genes using the recently developed MGO algorithm. To the best of our knowledge, this is the first work to exploit MGO as a gene or feature selection method. A new version of MGO, called recursive mountain gazelle optimizer (RMGO), is developed; it applies the MGO algorithm recursively to avoid local optima, minimize the search space, and obtain the minimum gene count without decreasing the classifier's performance. The proposed RMGO is used to develop a new hybrid gene selection method employing similar filters and embedded methods as MUL-MGO, but with the recursive version of the MGO algorithm. The resulting method is called multi-filter embedded recursive mountain gazelle optimizer (MUL-RMGO). Several classifiers are used for cancer classification. Accordingly, several experimental studies are performed on eight microarray gene expression datasets to demonstrate the proficiencies of the MUL-MGO and MUL-RMGO methods. The experimental findings indicate the efficiency and productivity of the suggested MUL-MGO and MUL-RMGO methods for gene selection. The methods outperform cutting-edge methods in the literature, with MUL-RMGO exceeding MUL-MGO in terms of accuracy and selected gene count.
Affiliation(s)
- Sarah Osama
  - Computer Science Department, Faculty of Computers and Information, Minia University, Minia, Egypt.
- Moatez Ali
  - Department of Internal Medicine, St. Barnabas Hospital, NY, USA.
- Abdelmgeid A Ali
  - Computer Science Department, Faculty of Computers and Information, Minia University, Minia, Egypt.
- Hassan Shaban
  - Computer Science Department, Faculty of Computers and Information, Minia University, Minia, Egypt.
6
Moslemi A, Ahmadian A. Dual regularized subspace learning using adaptive graph learning and rank constraint: Unsupervised feature selection on gene expression microarray datasets. Comput Biol Med 2023; 167:107659. [PMID: 37950946 DOI: 10.1016/j.compbiomed.2023.107659] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/05/2023] [Revised: 10/13/2023] [Accepted: 10/31/2023] [Indexed: 11/13/2023]
Abstract
High-dimensional problems have increasingly drawn attention in gene selection and analysis. Compounding the problem, the number of features in a microarray gene dataset is usually greater than the number of samples, which leads to an ill-posed, underdetermined system of equations. Poor performance and high computational time for learning algorithms are consequences of redundant features in high-dimensional data. Feature selection is a noteworthy pre-processing method to ameliorate the curse of dimensionality, with the aim of preserving information with maximum relevancy and minimum redundancy. Likewise, unsupervised feature selection has been important since collecting labels for data is expensive. In this paper, we develop a novel robust unsupervised feature selection method to select a discriminative subset of features for unlabeled data based on rank-constrained and dual-regularized nonnegative matrix factorization. The major focus of the proposed technique is to discard redundant features while keeping the informative features. The proposed feature selection technique consists of nonnegative matrix factorization to decompose the data into a feature weight matrix and a representation matrix, an inner product norm as regularization for both the feature weight matrix and the representation matrix, adaptive structure learning to preserve local information, and the Schatten-p norm as a rank constraint. To demonstrate the effectiveness of the proposed method, numerical studies are conducted on six benchmark microarray datasets. The results show that the proposed technique outperforms eight state-of-the-art unsupervised feature selection techniques in terms of clustering accuracy and normalized mutual information.
Affiliation(s)
- Amir Moslemi
  - Imaging Research and Physical Sciences, Sunnybrook Health Sciences Centre, Toronto, Ontario, Canada.
- Arash Ahmadian
  - Edward S. Rogers Sr. Department of Electrical and Computer Engineering, University of Toronto, Toronto, Ontario, Canada
7
Rahimi MR, Makarem D, Sarspy S, Mahdavi SA, Albaghdadi MF, Armaghan SM. Classification of cancer cells and gene selection based on microarray data using MOPSO algorithm. J Cancer Res Clin Oncol 2023; 149:15171-15184. [PMID: 37634207 DOI: 10.1007/s00432-023-05308-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/04/2023] [Accepted: 08/16/2023] [Indexed: 08/29/2023]
Abstract
PURPOSE Microarray information is crucial for the identification and categorisation of malignant tissues. The very limited sample size in the microarray has always been a challenge for classification design in cancer research. As a result, gene selection approaches are applied as a pre-processing step to remove genes that carry little information from the microarray data prior to categorisation. In essence, an appropriate gene selection technique can significantly increase the accuracy of illness (cancer) classification. METHODS For the classification of high-dimensional microarray data, a novel approach based on the hybrid model of multi-objective particle swarm optimisation (MOPSO) is proposed in this research. First, a binary vector representing each particle's position is generated at random. Each bit represents a gene. Bit 0 denotes that the corresponding characteristic (gene) is not selected, while bit 1 denotes that the gene is selected. Therefore, the position of each particle represents a set of genes, and the linear Bayesian discriminant analysis classification algorithm calculates each particle's degree of fitness to assess the quality of the gene set that particle has chosen. The suggested methodology is applied to four different cancer database sets, and the results are contrasted with those of other approaches currently in use. RESULTS The proposed algorithm has been applied to four cancer databases and its results have been compared with other existing methods. The results of the implementation show that the improvement in classification accuracy of the proposed algorithm compared to other methods across the four databases is 25.84% on average. Specifically, it improved by 18.63% in the blood cancer database, 24.25% in the lung cancer database, 27.73% in the breast cancer database, and 32.80% in the prostate cancer database.
Therefore, the proposed algorithm is able to identify a small set of informative genes in a way that increases the classification accuracy. CONCLUSION Our proposed solution is used for data classification, and it also improves classification accuracy. This is possible because the MOPSO model removes redundancy and reduces the number of redundant genes by considering how genes are correlated with each other.
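The particle encoding described in the METHODS section (one bit per gene, 1 = selected, 0 = not selected) can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the gene names are hypothetical, and `toy_score` is a stand-in for the classifier-based fitness that the abstract attributes to linear Bayesian discriminant analysis.

```python
import random


def random_particle(n_genes, rng):
    """Random binary position: one bit per gene."""
    return [rng.randint(0, 1) for _ in range(n_genes)]


def selected_genes(bits, gene_names):
    """Decode a particle position into the gene subset it represents."""
    return [name for bit, name in zip(bits, gene_names) if bit == 1]


def fitness(bits, gene_names, score_subset):
    """Score a particle; an empty subset gets the worst possible fitness."""
    subset = selected_genes(bits, gene_names)
    if not subset:
        return 0.0
    return score_subset(subset)


def toy_score(subset):
    """Stand-in scorer (hypothetical): fraction of chosen genes that are informative."""
    informative = {"g1", "g3"}
    return sum(1 for g in subset if g in informative) / len(subset)


gene_names = ["g1", "g2", "g3", "g4"]          # hypothetical gene identifiers
particle = random_particle(len(gene_names), random.Random(0))
best_possible = fitness([1, 0, 1, 0], gene_names, toy_score)
```

In a real wrapper method, `score_subset` would train and validate a classifier on the selected columns only.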
Affiliation(s)
- Dorna Makarem
  - Escuela Tecnica Superior de Ingenieros de Telecomunicacion Politecnica de Madrid, Madrid, Spain
- Sliva Sarspy
  - Department of Computer Science, College of Science, Cihan University-Erbil, Erbil, Iraq
8
Li Y, Wu M, Ma S, Wu M. ZINBMM: a general mixture model for simultaneous clustering and gene selection using single-cell transcriptomic data. Genome Biol 2023; 24:208. [PMID: 37697330 PMCID: PMC10496184 DOI: 10.1186/s13059-023-03046-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/11/2022] [Accepted: 08/22/2023] [Indexed: 09/13/2023] Open
Abstract
Clustering is a critical component of single-cell RNA sequencing (scRNA-seq) data analysis and can help reveal cell types and infer cell lineages. Despite considerable successes, there are few methods tailored to investigating cluster-specific genes contributing to cell heterogeneity, which can promote biological understanding of cell heterogeneity. In this study, we propose a zero-inflated negative binomial mixture model (ZINBMM) that simultaneously achieves effective scRNA-seq data clustering and gene selection. ZINBMM conducts a systemic analysis on raw counts, accommodating both batch effects and dropout events. Simulations and the analysis of five scRNA-seq datasets demonstrate the practical applicability of ZINBMM.
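The building block named in this abstract, the zero-inflated negative binomial (ZINB) distribution, models dropout by adding extra probability mass at zero. The sketch below shows a single ZINB probability mass function, not the full mixture model; the mean/dispersion parameterization and the parameter names `mu`, `r`, and `pi` are assumptions made for illustration.

```python
import math


def nb_pmf(k, mu, r):
    """Negative binomial pmf with mean mu > 0 and dispersion (size) r > 0."""
    log_p = (
        math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
        + r * math.log(r / (r + mu))
        + k * math.log(mu / (r + mu))
    )
    return math.exp(log_p)


def zinb_pmf(k, mu, r, pi):
    """ZINB pmf: with probability pi the count is a structural zero (dropout)."""
    base = (1.0 - pi) * nb_pmf(k, mu, r)
    return pi + base if k == 0 else base


# Hypothetical parameters: mean 2 counts, dispersion 1.5, 30% dropout.
p_zero = zinb_pmf(0, mu=2.0, r=1.5, pi=0.3)
total = sum(zinb_pmf(k, mu=2.0, r=1.5, pi=0.3) for k in range(500))
```

Working in log space with `math.lgamma` avoids overflow in the gamma terms for large counts.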
Affiliation(s)
- Yang Li
  - Center for Applied Statistics and School of Statistics, Renmin University of China, Beijing, China
  - RSS and China-Re Life Joint Lab on Public Health and Risk Management, Renmin University of China, Beijing, China
  - Statistical Consulting Center, Renmin University of China, Beijing, China
- Mingcong Wu
  - Center for Applied Statistics and School of Statistics, Renmin University of China, Beijing, China
  - Statistical Consulting Center, Renmin University of China, Beijing, China
- Shuangge Ma
  - Department of Biostatistics, Yale University, New Haven, USA
- Mengyun Wu
  - School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai, China.
9
Moslemi A, Bidar M, Ahmadian A. Subspace learning using structure learning and non-convex regularization: Hybrid technique with mushroom reproduction optimization in gene selection. Comput Biol Med 2023; 164:107309. [PMID: 37536092 DOI: 10.1016/j.compbiomed.2023.107309] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 01/24/2023] [Revised: 07/26/2023] [Accepted: 07/28/2023] [Indexed: 08/05/2023]
Abstract
Gene selection as a problem with high dimensions has drawn considerable attention in machine learning and computational biology over the past decade. In the field of gene selection in cancer datasets, different types of feature selection techniques in terms of strategy (filter, wrapper and embedded) and label information (supervised, unsupervised, and semi-supervised) have been developed. However, using hybrid feature selection can still improve the performance. In this paper, we propose a hybrid feature selection method based on filter and wrapper strategies. In the filter phase, we develop an unsupervised feature selection method based on non-convex regularized non-negative matrix factorization and structure learning, which we call NCNMFSL. In the wrapper phase, for the first time, mushroom reproduction optimization (MRO) is leveraged to obtain the most informative feature subset. In this hybrid feature selection method, irrelevant features are filtered out through NCNMFSL, and the most discriminative features are selected by MRO. To show the effectiveness and proficiency of the proposed method, numerical experiments are conducted on the Breast, Heart, Colon, Leukemia, Prostate, Tox-171 and GLI-85 benchmark datasets. SVM and decision tree classifiers are leveraged to analyze the proposed technique, and the top accuracies are 0.97, 0.84, 0.98, 0.95, 0.98, 0.87 and 0.85 for Breast, Heart, Colon, Leukemia, Prostate, Tox-171 and GLI-85, respectively. The computational results show the effectiveness of the proposed method in comparison with state-of-the-art feature selection techniques.
Affiliation(s)
- Amir Moslemi
  - Department of Physics, Ryerson University, Toronto, ON, Canada.
- Mahdi Bidar
  - Department of Computer Science, University of Regina, Regina, Canada
- Arash Ahmadian
  - Edward S. Rogers Sr. Department of Electrical and Computer Engineering, University of Toronto, Toronto, Ontario, Canada
10
Alweshah M, Aldabbas Y, Abu-Salih B, Oqeil S, Hasan HS, Alkhalaileh S, Kassaymeh S. Hybrid black widow optimization with iterated greedy algorithm for gene selection problems. Heliyon 2023; 9:e20133. [PMID: 37809602 PMCID: PMC10559925 DOI: 10.1016/j.heliyon.2023.e20133] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/19/2023] [Revised: 09/03/2023] [Accepted: 09/12/2023] [Indexed: 10/10/2023] Open
Abstract
Gene Selection (GS) is a strategy method targeted at reducing redundancy, limited expressiveness, and low informativeness in gene expression datasets obtained by DNA Microarray technology. These datasets contain a plethora of diverse and high-dimensional samples and genes, with a significant discrepancy in the number of samples and genes present. The complexities of GS are especially noticeable in the context of microarray expression data analysis, owing to the inherent data imbalance. The main goal of this study is to offer a simplified and computationally effective approach to dealing with the conundrum of attribute selection in microarray gene expression data. We use the Black Widow Optimization algorithm (BWO) in the context of GS to achieve this, using two unique methodologies: the unaltered BWO variation and the hybridized BWO variant combined with the Iterated Greedy algorithm (BWO-IG). By improving the local search capabilities of BWO, this hybridization attempts to promote more efficient gene selection. A series of tests was carried out using nine benchmark datasets that were obtained from the gene expression data repository in the pursuit of empirical validation. The results of these tests conclusively show that the BWO-IG technique performs better than the traditional BWO algorithm. Notably, the hybridized BWO-IG technique excels in the efficiency of local searches, making it easier to identify relevant genes and producing findings with higher levels of reliability in terms of accuracy and the degree of gene pruning. Additionally, a comparison analysis is done against five modern wrapper Feature Selection (FS) methodologies, namely BIMFOHHO, BMFO, BHHO, BCS, and BBA, in order to put the suggested BWO-IG method's effectiveness into context. The comparison that follows highlights BWO-IG's obvious superiority in reducing the number of selected genes while also obtaining remarkably high classification accuracy. 
The key findings were an average classification accuracy of 94.426, average fitness values of 0.061, and an average number of selected genes of 2933.767.
Affiliation(s)
- Mohammed Alweshah
  - Prince Abdullah Bin Ghazi Faculty of Information and Communication Technology, Al-Balqa Applied University, Al-Salt, Jordan
- Yasmeen Aldabbas
  - Prince Abdullah Bin Ghazi Faculty of Information and Communication Technology, Al-Balqa Applied University, Al-Salt, Jordan
- Bilal Abu-Salih
  - Department of Computer Science, King Abdullah II School of Information Technology, The University of Jordan, Amman, Jordan
- Saleh Oqeil
  - Prince Abdullah Bin Ghazi Faculty of Information and Communication Technology, Al-Balqa Applied University, Al-Salt, Jordan
- Hazem S. Hasan
  - Department of Plant Production and Protection, Faculty of Agricultural Technology, Al-Balqa Applied University, Al-Salt, Jordan
- Saleh Alkhalaileh
  - Prince Abdullah Bin Ghazi Faculty of Information and Communication Technology, Al-Balqa Applied University, Al-Salt, Jordan
- Sofian Kassaymeh
  - Software Engineering Department, Faculty of Information Technology, Aqaba University of Technology, Aqaba, Jordan
11
Fu Q, Li Q, Li X. An improved multi-objective marine predator algorithm for gene selection in classification of cancer microarray data. Comput Biol Med 2023; 160:107020. [PMID: 37196457 DOI: 10.1016/j.compbiomed.2023.107020] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 01/15/2023] [Revised: 04/09/2023] [Accepted: 05/05/2023] [Indexed: 05/19/2023]
Abstract
Gene selection (GS) is an important branch of interest within the field of feature selection, which is widely used in cancer classification. It provides essential insights into the pathogenesis of cancer and enables a deeper understanding of cancer data. In cancer classification, GS is essentially a multi-objective optimization problem, which aims to simultaneously optimize the two objectives of classification accuracy and the size of the gene subset. The marine predator algorithm (MPA) has been successfully employed in practical applications, however, its random initialization can lead to blindness, which may adversely affect the convergence of the algorithm. Furthermore, the elite individuals in guiding evolution are randomly chosen from the Pareto solutions, which may degrade the good exploration performance of the population. To overcome these limitations, a multi-objective improved MPA with continuous mapping initialization and leader selection strategies is proposed. In this work, a new continuous mapping initialization with ReliefF overwhelms the defects with less information in late evolution. Moreover, an improved elite selection mechanism with Gaussian distribution guides the population to evolve towards a better Pareto front. Finally, an efficient mutation method is adopted to prevent evolutionary stagnation. To evaluate its effectiveness, the proposed algorithm was compared with 9 famous algorithms. The experimental results on 16 datasets demonstrate that the proposed algorithm can significantly reduce the data dimension and obtain the highest classification accuracy on most of high-dimension cancer microarray datasets.
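The two objectives named in this abstract (higher classification accuracy, smaller gene subset) define a Pareto front of non-dominated solutions. The sketch below illustrates only that dominance test, not the marine predator algorithm itself; the candidate tuples are hypothetical.

```python
def dominates(a, b):
    """True if solution a is at least as good as b on both objectives and
    strictly better on one. Solutions are (accuracy, n_genes) pairs:
    accuracy is maximized, n_genes is minimized."""
    no_worse = a[0] >= b[0] and a[1] <= b[1]
    strictly_better = a[0] > b[0] or a[1] < b[1]
    return no_worse and strictly_better


def pareto_front(candidates):
    """Keep the candidates that no other candidate dominates."""
    return [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates)
    ]


# Hypothetical (accuracy, number of selected genes) pairs.
candidates = [(0.90, 50), (0.95, 80), (0.90, 60), (0.85, 10), (0.95, 100)]
front = pareto_front(candidates)
```

Here `(0.90, 60)` and `(0.95, 100)` drop out because another candidate reaches the same accuracy with fewer genes.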
Affiliation(s)
- Qiyong Fu
  - School of Computer Science and Technology, Zhejiang Normal University, Jinhua 321004, China
- Qi Li
  - School of Computer Science and Technology, Zhejiang Normal University, Jinhua 321004, China
- Xiaobo Li
  - School of Computer Science and Technology, Zhejiang Normal University, Jinhua 321004, China.
12
Daneshvar NHN, Masoudi-Sobhanzadeh Y, Omidi Y. A voting-based machine learning approach for classifying biological and clinical datasets. BMC Bioinformatics 2023; 24:140. [PMID: 37041456 PMCID: PMC10088226 DOI: 10.1186/s12859-023-05274-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/26/2022] [Accepted: 04/05/2023] [Indexed: 04/13/2023] Open
Abstract
BACKGROUND Different machine learning techniques have been proposed to classify a wide range of biological/clinical data. Given the practicability of these approaches, various software packages have also been designed and developed. However, the existing methods suffer from several limitations, such as overfitting on a specific dataset, ignoring feature selection in the preprocessing step, and losing performance on large datasets. To tackle these restrictions, in this study we introduced a machine learning framework consisting of two main steps. First, our previously suggested optimization algorithm (Trader) was extended to select a near-optimal subset of features/genes. Second, a voting-based framework was proposed to classify the biological/clinical data with high accuracy. To evaluate the efficiency of the proposed method, it was applied to 13 biological/clinical datasets, and the outcomes were comprehensively compared with the prior methods. RESULTS The results demonstrated that the Trader algorithm could select a near-optimal subset of features at a significance level of p < 0.01 relative to the compared algorithms. Additionally, on the large-size datasets, the proposed machine learning framework improved on prior studies by ~10% in terms of the mean values associated with fivefold cross-validation of accuracy, precision, recall, specificity, and F-measure. CONCLUSION Based on the obtained results, it can be concluded that a proper configuration of efficient algorithms and methods can increase the prediction power of machine learning approaches and help researchers in designing practical diagnosis health care systems and offering effective treatment plans.
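The abstract does not say how the voting step combines classifiers, so the following is only a minimal sketch of plain majority voting, one common way to do it. The function name `majority_vote` and the toy labels are hypothetical.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-classifier label lists into one label per sample.

    predictions: list of lists, one inner list per classifier, all of equal
    length. Ties are resolved in favor of the label seen first for that
    sample (Counter preserves insertion order).
    """
    voted = []
    for labels in zip(*predictions):
        voted.append(Counter(labels).most_common(1)[0][0])
    return voted


# Hypothetical outputs of three classifiers on three samples.
clf_outputs = [
    ["a", "b", "a"],
    ["a", "a", "b"],
    ["b", "b", "b"],
]
combined = majority_vote(clf_outputs)
print(combined)
```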
Affiliation(s)
- Yosef Masoudi-Sobhanzadeh
- Research Center for Pharmaceutical Nanotechnology, Biomedicine Institute, Tabriz University of Medical Sciences, Tabriz, Iran.
- Faculty of Advanced Medical Sciences, Tabriz University of Medical Sciences, Tabriz, Iran.
- Yadollah Omidi
- Department of Pharmaceutical Sciences, College of Pharmacy, Nova Southeastern University, Florida, 33328, USA.
|
13
|
Wang Z, Zhou Y, Takagi T, Song J, Tian YS, Shibuya T. Genetic algorithm-based feature selection with manifold learning for cancer classification using microarray data. BMC Bioinformatics 2023; 24:139. [PMID: 37031189 PMCID: PMC10082986 DOI: 10.1186/s12859-023-05267-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/18/2022] [Accepted: 04/02/2023] [Indexed: 04/10/2023] Open
Abstract
BACKGROUND Microarray data have been widely utilized for cancer classification. The main characteristic of microarray data is "large p and small n": the data contain a small number of subjects but a large number of genes. This may affect the validity of the classification. Thus, there is a pressing demand for techniques able to select genes relevant to cancer classification. RESULTS This study proposed a novel feature (gene) selection method, Iso-GA, for cancer classification. Iso-GA hybridizes the manifold learning algorithm Isomap with the genetic algorithm (GA) to account for the latent nonlinear structure of the gene expression in the microarray data. The Davies-Bouldin index is adopted to evaluate the candidate solutions in Isomap and to avoid the classifier dependency problem. Additionally, a probability-based framework is introduced to reduce the possibility of genes being randomly selected by GA. The performance of Iso-GA was evaluated on eight benchmark microarray datasets of cancers. Iso-GA outperformed other benchmarking gene selection methods, leading to good classification accuracy with fewer critical genes selected. CONCLUSIONS The proposed Iso-GA method can effectively select fewer but critical genes from microarray data to achieve competitive classification performance.
Affiliation(s)
- Zixuan Wang
- Division of Medical Data Informatics, Human Genome Center, Institute of Medical Science, The University of Tokyo, Tokyo, 108-8639, Japan.
- Yi Zhou
- Beijing International Center for Mathematical Research, Peking University, Beijing, 100871, China
- Tatsuya Takagi
- Graduate School of Pharmaceutical Sciences, Osaka University, 1-6 Yamadaoka, Suita, Osaka, 565-0871, Japan
- Jiangning Song
- Biomedicine Discovery Institute and Monash Data Futures Institute, Monash University, Melbourne, VIC, 3800, Australia
- Yu-Shi Tian
- Graduate School of Pharmaceutical Sciences, Osaka University, 1-6 Yamadaoka, Suita, Osaka, 565-0871, Japan.
- Tetsuo Shibuya
- Division of Medical Data Informatics, Human Genome Center, Institute of Medical Science, The University of Tokyo, Tokyo, 108-8639, Japan
|
14
|
Liu J, Feng H, Tang Y, Zhang L, Qu C, Zeng X, Peng X. A novel hybrid algorithm based on Harris Hawks for tumor feature gene selection. PeerJ Comput Sci 2023; 9:e1229. [PMID: 37346505 PMCID: PMC10280456 DOI: 10.7717/peerj-cs.1229] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/31/2022] [Accepted: 01/09/2023] [Indexed: 06/23/2023]
Abstract
Background Gene expression data are often used to classify cancer genes. In such high-dimensional datasets, however, only a few feature genes are closely related to tumors. Therefore, it is important to accurately select a subset of feature genes with high contributions to cancer classification. Methods In this article, a new three-stage hybrid gene selection method is proposed that combines a variance filter, extremely randomized trees, and the Harris Hawks algorithm (VEH). In the first stage, we evaluated each gene in the dataset with the variance filter and selected the feature genes that met the variance threshold. In the second stage, we used extremely randomized trees to further eliminate irrelevant genes. Finally, we used the Harris Hawks algorithm to search the genes retained by the previous two stages for the optimal feature gene subset. Results We evaluated the proposed method using three different classifiers on eight published microarray gene expression datasets. The results showed 100% classification accuracy for VEH in gastric cancer, acute lymphoblastic leukemia, and ovarian cancer, and an average classification accuracy of 95.33% across a variety of other cancers. Compared with other advanced feature selection algorithms, VEH has clear advantages when measured by many evaluation criteria.
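The first VEH stage, the variance filter, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the expression values, the threshold of 0.1, and the function name `variance_filter` are hypothetical, and population variance is assumed because the abstract does not say which estimator is used.

```python
from statistics import pvariance


def variance_filter(expression, threshold):
    """Return the genes whose expression variance across samples meets the threshold.

    expression: dict mapping gene name -> list of expression values (one per sample).
    """
    return [gene for gene, values in expression.items()
            if pvariance(values) >= threshold]


# Hypothetical expression values for three genes across four samples.
expression = {
    "G1": [1.0, 1.0, 1.0, 1.0],   # constant, variance 0
    "G2": [0.0, 2.0, 0.0, 2.0],   # variance 1
    "G3": [5.0, 5.1, 4.9, 5.0],   # near-constant, variance about 0.005
}
kept = variance_filter(expression, threshold=0.1)
print(kept)
```

Genes that survive this cheap filter would then go on to the tree-based and metaheuristic stages described in the abstract.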
Affiliation(s)
- Junjian Liu
- Department of Statistics, Hunan Normal University College of Mathematics and Statistics, Changsha, Hunan, China
- Huicong Feng
- Department of Pathology and Pathophysiology, Hunan Normal University School of Medicine, Changsha, Hunan, China
- Yifan Tang
- Department of Pathology and Pathophysiology, Hunan Normal University School of Medicine, Changsha, Hunan, China
- Lupeng Zhang
- Department of Biochemistry and Molecular Biology, Jishou University School of Medicine, Jishou, Hunan, China
- Chiwen Qu
- Department of Statistics, Hunan Normal University College of Mathematics and Statistics, Changsha, Hunan, China
- Xiaomin Zeng
- Department of Epidemiology and Health Statistics, Xiangya Public Health School, Central South University, Changsha, Hunan, China
- Xiaoning Peng
- Department of Statistics, Hunan Normal University College of Mathematics and Statistics, Changsha, Hunan, China
- Department of Pathology and Pathophysiology, Hunan Normal University School of Medicine, Changsha, Hunan, China
|
15
|
Li Z, Zhang Q, Wang P, Liu F, Song Y, Wen CF. Gene Selection in a Single Cell Gene Space Based on D-S Evidence Theory. Interdiscip Sci 2022; 14:722-744. [PMID: 35484463 DOI: 10.1007/s12539-022-00518-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/23/2021] [Revised: 03/28/2022] [Accepted: 04/01/2022] [Indexed: 06/14/2023]
Abstract
If the samples, features and information values in a real-valued information system are cells, genes and gene expression values, respectively, then for convenience, this system is said to be a single cell gene space. In the era of big data, people are faced with high dimensional gene expression data with redundancy and noise causing its strong uncertainty. D-S evidence theory excels at tackling the problem of uncertainty, and its conditions to be met are weaker than Bayesian probability theory. Therefore, this paper studies the gene selection in a single cell gene space to remove noise and redundancy with D-S evidence theory. The distance between two cells in each gene is first defined. Then, the tolerance relation is established according to the defined distance. In addition, the belief and plausibility functions to grasp the uncertainty of a single cell gene space are introduced on the basis of the tolerance classes. Statistical analysis shows that they can effectively measure the uncertainty of a single cell gene space. Furthermore, several gene selection algorithms in a single cell gene space are presented using the proposed belief and plausibility. Finally, the performance of the proposed algorithm is compared to other algorithms on some published single-cell data sets. Experimental results and statistical tests show that the classification and clustering performance of the presented algorithm not only exceeds the other three state-of-the-art algorithms, but also its gene reduction rate is very high.
Affiliation(s)
- Zhaowen Li
- Key Laboratory of Complex System Optimization and Big Data Processing in Department of Guangxi Education, Yulin Normal University, Yulin, 537000, Guangxi, People's Republic of China
- Qinli Zhang
- School of Big Data and Artificial Intelligence, Chizhou University, Chizhou, 247000, Anhui, People's Republic of China.
- Pei Wang
- Key Laboratory of Complex System Optimization and Big Data Processing in Department of Guangxi Education, Yulin Normal University, Yulin, 537000, Guangxi, People's Republic of China
- Fang Liu
- School of Mathematics and Information Science, Guangxi University, Nanning, 530004, Guangxi, People's Republic of China
- Yan Song
- School of Mathematics and Statistics, Yulin Normal University, Yulin, 537000, Guangxi, People's Republic of China
- Ching-Feng Wen
- Key Laboratory of Complex System Optimization and Big Data Processing in Department of Guangxi Education, Yulin Normal University, Yulin, 537000, Guangxi, People's Republic of China.
|
16
|
Liu S, Yao W. Prediction of lung cancer using gene expression and deep learning with KL divergence gene selection. BMC Bioinformatics 2022; 23:175. [PMID: 35549644 PMCID: PMC9103042 DOI: 10.1186/s12859-022-04689-9] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 06/19/2021] [Accepted: 04/13/2022] [Indexed: 11/24/2022] Open
Abstract
Background Lung cancer is one of the cancers with the highest mortality rate in China. With the rapid development of high-throughput sequencing technology and the research and application of deep learning methods, deep neural networks based on gene expression have become an active research direction in lung cancer diagnosis in recent years, providing an effective means of early diagnosis. Thus, building a deep neural network model is of great significance for the early diagnosis of lung cancer. However, the main challenges in mining gene expression datasets are the curse of dimensionality and imbalanced data. Existing methods cannot address the problems of high dimensionality and imbalanced data because of the overwhelming number of variables measured (genes) relative to the small number of samples, which results in poor performance in the early diagnosis of lung cancer. Method Given that gene expression datasets are small, high-dimensional, and imbalanced, this paper proposes a gene selection method based on KL divergence, which selects genes with higher KL divergence as model features. A deep neural network model is then built using Focal Loss as the loss function, and k-fold cross-validation with k = 5 is used to verify and select the best model. Result The deep learning model based on KL divergence gene selection proposed in this paper has an AUC of 0.99 on the validation set. The generalization performance of the model is high. Conclusion The deep neural network model based on KL divergence gene selection proposed in this paper is shown to be an accurate and effective method for lung cancer prediction.
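The abstract does not state how the per-gene KL divergence is computed. One plausible reading, sketched below purely as an assumption, compares each gene's discretized expression distribution in the two classes and ranks genes by the resulting divergence. The histograms, the smoothing constant `eps`, and the function names are hypothetical.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as equal-length lists.

    eps is a small smoothing constant (an assumption) that avoids division by zero.
    """
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


def rank_genes_by_kl(class_histograms):
    """Sort genes by the KL divergence between their two class-conditional histograms."""
    scores = {gene: kl_divergence(p, q) for gene, (p, q) in class_histograms.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical two-bin histograms per gene: (distribution in tumor, distribution in normal).
class_histograms = {
    "G1": ([0.5, 0.5], [0.5, 0.5]),   # identical distributions, uninformative
    "G2": ([0.9, 0.1], [0.1, 0.9]),   # very different distributions, informative
}
ranked = rank_genes_by_kl(class_histograms)
print(ranked)
```

Selecting the top-ranked genes would then give the reduced feature set that is fed to the classifier.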
Affiliation(s)
- Suli Liu
- College of Public Health, Zhengzhou University, Zhengzhou, 450001, China
- Wu Yao
- College of Public Health, Zhengzhou University, Zhengzhou, 450001, China.
|
17
|
Rostami M, Forouzandeh S, Berahmand K, Soltani M, Shahsavari M, Oussalah M. Gene selection for microarray data classification via multi-objective graph theoretic-based method. Artif Intell Med 2022; 123:102228. [PMID: 34998517 DOI: 10.1016/j.artmed.2021.102228] [Citation(s) in RCA: 28] [Impact Index Per Article: 14.0] [Received: 10/31/2020] [Revised: 11/23/2021] [Accepted: 11/27/2021] [Indexed: 12/20/2022]
Abstract
In recent decades, improvements in computer technology have accelerated the growth of high-dimensional microarray data. Thus, data mining methods for DNA microarray data classification usually involve samples consisting of thousands of genes. One efficient strategy for handling this dimensionality is gene selection, which improves the accuracy of microarray data classification and also decreases computational complexity. In this paper, a novel social network analysis-based gene selection approach is proposed. The proposed method has two main objectives: relevance maximization and redundancy minimization of the selected genes. In this method, a maximum community is selected on each iteration. Then, among the genes in this community, the appropriate genes are selected using a node centrality-based criterion. The reported results indicate that the developed gene selection algorithm increases the classification accuracy of microarray data while also decreasing the time complexity.
Affiliation(s)
- Mehrdad Rostami
- Centre of Machine Vision and Signal Processing, Faculty of Information Technology, University of Oulu, Oulu, Finland.
- Saman Forouzandeh
- Department of Computer Engineering, University of Applied Science and Technology, Center of Tehran Municipality ICT org., Tehran, Iran
- Kamal Berahmand
- School of Computer Sciences, Science and Engineering Faculty, Queensland University of Technology (QUT), Brisbane, Australia.
- Mina Soltani
- Department of Nutrition, Kashan University of Medical Sciences, Kashan, Iran
- Meisam Shahsavari
- Department of engineering physics, Tsinghua University, Beijing, China
- Mourad Oussalah
- Centre of Machine Vision and Signal Processing, Faculty of Information Technology, University of Oulu, Oulu, Finland; Research Unit of Medical Imaging, Physics, and Technology, Faculty of Medicine, University of Oulu, Finland.
|
18
|
Li J, Liang K, Song X. Logistic regression with adaptive sparse group lasso penalty and its application in acute leukemia diagnosis. Comput Biol Med 2021; 141:105154. [PMID: 34952336 DOI: 10.1016/j.compbiomed.2021.105154] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 10/22/2021] [Revised: 12/14/2021] [Accepted: 12/15/2021] [Indexed: 01/15/2023]
Abstract
Cancer diagnosis based on gene expression profile data has attracted extensive attention in computational biology and medicine. It suffers from three challenges in practical applications: noise, gene grouping, and adaptive gene selection. This paper aims to solve the above problems by developing the logistic regression with adaptive sparse group lasso penalty (LR-ASGL). A noise information processing method for cancer gene expression profile data is first presented via robust principal component analysis. Genes are then divided into groups by performing weighted gene co-expression network analysis on the clean matrix. By approximating the relative value of the noise size, gene reliability criterion and robust evaluation criterion are proposed. Finally, LR-ASGL is presented for simultaneous cancer diagnosis and adaptive gene selection. The performance of the proposed method is compared with the other four methods in three simulation settings: Gaussian noise, uniformly distributed noise, and mixed noise. The acute leukemia data are adopted as an experimental example to demonstrate the advantages of LR-ASGL in prediction and gene selection.
Affiliation(s)
- Juntao Li
- College of Mathematics and Information Science, Henan Normal University, Xinxiang, 453007, China.
- Ke Liang
- College of Mathematics and Information Science, Henan Normal University, Xinxiang, 453007, China.
- Xuekun Song
- College of Information Technology, Henan University of Chinese Medicine, Zhengzhou, 450046, China.
|
19
|
Azadifar S, Ahmadi A. A graph-based gene selection method for medical diagnosis problems using a many-objective PSO algorithm. BMC Med Inform Decis Mak 2021; 21:333. [PMID: 34838034 PMCID: PMC8627636 DOI: 10.1186/s12911-021-01696-3] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 07/19/2021] [Accepted: 11/16/2021] [Indexed: 11/16/2022] Open
Abstract
Background Gene expression data play an important role in bioinformatics applications. Although such data may contain a large number of features, they tend to contain only a few samples. This can negatively impact the performance of data mining and machine learning algorithms. One of the most effective approaches to alleviate this problem is to use gene selection methods. The aim of gene selection is to reduce the dimensions (features) of gene expression data by eliminating irrelevant and redundant genes. Methods This paper presents a hybrid gene selection method based on graph theory and a many-objective particle swarm optimization (PSO) algorithm. To this end, a filter method is first utilized to reduce the initial space of the genes. Then, the gene space is represented as a graph, and a graph clustering method is applied to group the genes into several clusters. Moreover, the many-objective PSO algorithm is utilized to search for an optimal subset of genes according to several criteria, which include classification error, node centrality, specificity, edge centrality, and the number of selected genes. A repair operator is proposed to cover the whole space of the genes and ensure that at least one gene is selected from each cluster. This leads to an increase in the diversity of the selected genes. Results To evaluate the performance of the proposed method, extensive experiments are conducted based on seven datasets and two evaluation measures. In addition, three classifiers, Decision Tree (DT), Support Vector Machine (SVM), and K-Nearest Neighbors (KNN), are utilized to compare the effectiveness of the proposed gene selection method with other state-of-the-art methods. The results of these experiments demonstrate that our proposed method not only achieves more accurate classification, but also selects fewer genes than other methods. Conclusion This study shows that the proposed multi-objective PSO algorithm simultaneously removes irrelevant and redundant features using several different criteria. Also, the use of the clustering algorithm and the repair operator has improved the performance of the proposed method by covering the whole space of the problem.
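The repair operator is described only by its guarantee: at least one gene is selected from each cluster. A minimal sketch of an operator with that property follows. Picking the added gene uniformly at random is an assumption, as are the cluster contents and the function name `repair`; the paper may choose the gene differently.

```python
import random


def repair(selected, clusters, rng):
    """Return a copy of `selected` that contains at least one gene from every cluster.

    selected: set of gene names chosen by the search.
    clusters: list of lists of gene names (output of a graph clustering step).
    rng: random.Random instance used to pick a gene for each uncovered cluster.
    """
    repaired = set(selected)
    for cluster in clusters:
        if not repaired.intersection(cluster):
            repaired.add(rng.choice(cluster))
    return repaired


# Hypothetical clusters and a candidate subset that misses two of them.
clusters = [["g1", "g2"], ["g3"], ["g4", "g5"]]
repaired = repair({"g1"}, clusters, random.Random(0))
print(sorted(repaired))
```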
Affiliation(s)
- Saeid Azadifar
- Faculty of Computer Engineering, K. N. Toosi University of Technology, Tehran, Iran.
- Ali Ahmadi
- Faculty of Computer Engineering, K. N. Toosi University of Technology, Tehran, Iran
|
20
|
Nouri-Moghaddam B, Ghazanfari M, Fathian M. A novel bio-inspired hybrid multi-filter wrapper gene selection method with ensemble classifier for microarray data. Neural Comput Appl 2021;:1-31. [PMID: 34539088 DOI: 10.1007/s00521-021-06459-9] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 10/10/2020] [Accepted: 08/26/2021] [Indexed: 01/04/2023]
Abstract
Microarray technology is known as one of the most important tools for collecting DNA expression data. This technology allows researchers to investigate and examine types of diseases and their origins. However, microarray data are often associated with a small sample size, a large number of genes, imbalanced data, etc., making classification models inefficient. Thus, a new hybrid solution based on a multi-filter and adaptive chaotic multi-objective forest optimization algorithm (AC-MOFOA) is presented to solve the gene selection problem and construct the Ensemble Classifier. In the proposed solution, a multi-filter model (i.e., ensemble filter) is proposed as a preprocessing step to reduce the dataset's dimensions, using a combination of five filter methods to remove redundant and irrelevant genes. The results of the five filter methods are combined using a voting-based function. The results of the proposed multi-filter indicate that it has good capability in reducing the gene subset size and selecting relevant genes. Then, an AC-MOFOA based on the concepts of non-dominated sorting, crowding distance, chaos theory, and adaptive operators is presented. AC-MOFOA is a wrapper method aimed at simultaneously reducing dataset dimensions, optimizing KELM, and increasing classification accuracy. Next, an ensemble classifier model is built from the AC-MOFOA results to classify microarray data. The performance of the proposed algorithm was evaluated on nine public microarray datasets, and its results were compared in terms of the number of selected genes, classification efficiency, execution time, time complexity, hypervolume indicator, and spacing metric with five hybrid multi-objective methods and three hybrid single-objective methods. According to the results, the proposed hybrid method could increase the accuracy of the KELM in most datasets by reducing the dataset's dimensions and achieve similar or superior performance compared to other multi-objective methods. Furthermore, the proposed Ensemble Classifier model could provide better classification accuracy and generalizability in seven of the nine microarray datasets compared to conventional ensemble methods. Moreover, a comparison of the Ensemble Classifier model with three state-of-the-art ensemble generation methods indicates its competitive performance: the proposed ensemble model achieved better results in five of the nine datasets.
|
21
|
Byrjalsen A, Diets IJ, Bakhuizen J, Hansen TVO, Schmiegelow K, Gerdes AM, Stoltze U, Kuiper RP, Merks JHM, Wadt K, Jongmans M. Selection criteria for assembling a pediatric cancer predisposition syndrome gene panel. Fam Cancer 2021; 20:279-287. [PMID: 34061292 PMCID: PMC8484084 DOI: 10.1007/s10689-021-00254-0] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 09/14/2020] [Accepted: 04/07/2021] [Indexed: 11/16/2022]
Abstract
Increasing use of genomic sequencing enables standardized screening of all childhood cancer predisposition syndromes (CPS) in children with cancer. Gene panels currently used often include adult-onset CPS genes and genes without substantial evidence linking them to cancer predisposition. We have developed criteria to select genes relevant for childhood-onset CPS and assembled a gene panel for use in children with cancer. We applied our criteria to 381 candidate genes, which were selected through two in-house panels (n = 338), a literature search (n = 39), and by assessing two Genomics England’s PanelApp panels (n = 4). We developed evaluation criteria that determined a gene’s eligibility for inclusion on a childhood-onset CPS gene panel. These criteria assessed (1) relevance in childhood cancer by a minimum of five childhood cancer patients reported carrying a pathogenic variant in the gene and (2) evidence supporting a causal relation between variants in this gene and cancer development. 138 genes fulfilled the criteria. In this study we have developed criteria to compile a childhood cancer predisposition gene panel which might ultimately be used in a clinical setting, regardless of the specific type of childhood cancer. This panel will be evaluated in a prospective study. The panel is available on (pediatric-cancer-predisposition-genepanel.nl) and will be regularly updated.
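The two inclusion criteria stated in this abstract can be expressed as a simple eligibility check. The sketch below is illustrative only: the record fields, gene names, and counts are hypothetical, and the authors' assessment of causal evidence is reduced here to a single boolean.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    gene: str                  # hypothetical gene label
    childhood_patients: int    # reported childhood cancer patients carrying a pathogenic variant
    causal_evidence: bool      # whether evidence supports a causal relation to cancer development


def eligible(candidate, min_patients=5):
    """Apply both criteria: enough reported childhood cases and causal evidence."""
    return candidate.childhood_patients >= min_patients and candidate.causal_evidence


# Hypothetical candidates, not taken from the published panel.
candidates = [
    Candidate("GENE_A", 12, True),
    Candidate("GENE_B", 3, True),
    Candidate("GENE_C", 7, False),
    Candidate("GENE_D", 5, True),
]
panel = [c.gene for c in candidates if eligible(c)]
print(panel)
```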
Affiliation(s)
- Anna Byrjalsen
- Department of Clinical Genetics, Rigshospitalet, Blegdamsvej 9, 2100, Copenhagen East, Denmark
- Illja J Diets
- Department of Human Genetics, Radboudumc, Geert Grooteplein Zuid 10, 6525 GA, Nijmegen, The Netherlands
- Jette Bakhuizen
- Princess Máxima Center for Pediatric Oncology, Heidelberglaan 25, 3584 CS, Utrecht, The Netherlands; Department of Genetics, University Medical Center Utrecht, 3508 AB, Utrecht, The Netherlands
- Thomas van Overeem Hansen
- Department of Clinical Genetics, Rigshospitalet, Blegdamsvej 9, 2100, Copenhagen East, Denmark; Department of Paediatrics and Adolescent Medicine, Rigshospitalet, Blegdamsvej 9, 2100, Copenhagen East, Denmark
- Kjeld Schmiegelow
- Department of Paediatrics and Adolescent Medicine, Rigshospitalet, Blegdamsvej 9, 2100, Copenhagen East, Denmark
- Anne-Marie Gerdes
- Department of Clinical Genetics, Rigshospitalet, Blegdamsvej 9, 2100, Copenhagen East, Denmark
- Ulrik Stoltze
- Department of Paediatrics and Adolescent Medicine, Rigshospitalet, Blegdamsvej 9, 2100, Copenhagen East, Denmark
- Roland P Kuiper
- Princess Máxima Center for Pediatric Oncology, Heidelberglaan 25, 3584 CS, Utrecht, The Netherlands; Department of Genetics, University Medical Center Utrecht, 3508 AB, Utrecht, The Netherlands
- Johannes H M Merks
- Princess Máxima Center for Pediatric Oncology, Heidelberglaan 25, 3584 CS, Utrecht, The Netherlands
- Karin Wadt
- Department of Clinical Genetics, Rigshospitalet, Blegdamsvej 9, 2100, Copenhagen East, Denmark
- Marjolijn Jongmans
- Princess Máxima Center for Pediatric Oncology, Heidelberglaan 25, 3584 CS, Utrecht, The Netherlands; Department of Genetics, University Medical Center Utrecht, 3508 AB, Utrecht, The Netherlands.
|
22
|
Pashaei E, Pashaei E. Gene selection using hybrid dragonfly black hole algorithm: A case study on RNA-seq COVID-19 data. Anal Biochem 2021; 627:114242. [PMID: 33974890 DOI: 10.1016/j.ab.2021.114242] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 12/04/2020] [Revised: 04/12/2021] [Accepted: 05/02/2021] [Indexed: 11/18/2022]
Abstract
This paper introduces a new hybrid approach (DBH) for solving the gene selection problem that incorporates the strengths of two existing metaheuristics: the binary dragonfly algorithm (BDF) and the binary black hole algorithm (BBHA). This hybridization aims to identify a limited and stable set of discriminative genes without sacrificing classification accuracy, whereas most current methods have encountered challenges in extracting disease-related information from a vast number of redundant genes. The proposed approach first applies the minimum redundancy maximum relevancy (MRMR) filter method to reduce the dimensionality of the feature space and then utilizes the suggested hybrid DBH algorithm to determine a smaller set of significant genes. The proposed approach was evaluated on eight benchmark gene expression datasets and then compared against the latest state-of-the-art techniques to demonstrate algorithm efficiency. The comparative study shows that the proposed approach achieves a significant improvement over existing methods in terms of classification accuracy and the number of selected genes. Moreover, the performance of the suggested method was examined on real RNA-Seq coronavirus-related gene expression data of asthmatic patients for selecting the most significant genes in order to improve the discriminative accuracy of angiotensin-converting enzyme 2 (ACE2). ACE2, as a coronavirus receptor, is a biomarker that helps to distinguish infected from uninfected patients in order to identify subgroups at risk for COVID-19. The results indicate that the suggested MRMR-DBH approach represents a very promising framework for finding a new combination of the most discriminative genes with high classification accuracy.
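The MRMR filter stage can be sketched as a greedy selection that trades relevance against redundancy. The version below uses the common difference form (relevance minus mean redundancy with the genes already chosen) and assumes that relevance and pairwise redundancy scores, for example mutual information estimates, have been computed beforehand. The scores, gene labels, and function name `mrmr` are hypothetical, and the paper's exact variant may differ.

```python
def mrmr(relevance, redundancy, k):
    """Greedy mRMR (difference form) over precomputed scores.

    relevance: dict gene -> relevance to the class label.
    redundancy: dict frozenset({gene_a, gene_b}) -> pairwise redundancy.
    k: number of genes to select.
    """
    selected = []
    remaining = set(relevance)

    def score(gene):
        if not selected:
            return relevance[gene]
        mean_red = sum(redundancy[frozenset((gene, s))] for s in selected) / len(selected)
        return relevance[gene] - mean_red

    while remaining and len(selected) < k:
        # sorted() makes tie-breaking deterministic (alphabetical order).
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical scores: A and B are both relevant but highly redundant with each other.
relevance = {"A": 0.9, "B": 0.85, "C": 0.5}
redundancy = {
    frozenset(("A", "B")): 0.8,
    frozenset(("A", "C")): 0.1,
    frozenset(("B", "C")): 0.2,
}
chosen = mrmr(relevance, redundancy, k=2)
print(chosen)
```

In this toy case the second pick is C rather than the more relevant B, because B adds little beyond A. That is the behavior a redundancy-aware filter is meant to show before the metaheuristic search runs on the reduced gene set.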
Affiliation(s)
- Elnaz Pashaei
- Department of Software Engineering, Istanbul Aydin University, Istanbul, Turkey.
- Elham Pashaei
- Department of Computer Engineering, Istanbul Gelisim University, Istanbul, Turkey.
|
23
|
Molina Mora JA, Montero-Manso P, García-Batán R, Campos-Sánchez R, Vilar-Fernández J, García F. A first perturbome of Pseudomonas aeruginosa: Identification of core genes related to multiple perturbations by a machine learning approach. Biosystems 2021; 205:104411. [PMID: 33757842 DOI: 10.1016/j.biosystems.2021.104411] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 07/24/2020] [Revised: 03/11/2021] [Accepted: 03/12/2021] [Indexed: 01/27/2023]
Abstract
Tolerance to stress conditions is vital for organismal survival, including bacteria under specific environmental conditions, antibiotics, and other perturbations. Some studies have described common modulation and shared genes during stress response to different types of disturbances (termed as perturbome), leading to the idea of central control at the molecular level. We implemented a robust machine learning approach to identify and describe genes associated with multiple perturbations or perturbome in a Pseudomonas aeruginosa PAO1 model. Using microarray datasets from the Gene Expression Omnibus (GEO), we evaluated six approaches to rank and select genes: using two methodologies, data single partition (SP method) or multiple partitions (MP method) for training and testing datasets, we evaluated three classification algorithms (SVM Support Vector Machine, KNN K-Nearest neighbor and RF Random Forest). Gene expression patterns and topological features at the systems level were included to describe the perturbome elements. We were able to select and describe 46 core response genes associated with multiple perturbations in P. aeruginosa PAO1 and it can be considered a first report of the P. aeruginosa perturbome. Molecular annotations, patterns in expression levels, and topological features in molecular networks revealed biological functions of biosynthesis, binding, and metabolism, many of them related to DNA damage repair and aerobic respiration in the context of tolerance to stress. We also discuss different issues related to implemented and assessed algorithms, including data partitioning, classification approaches, and metrics. Altogether, this work offers a different and robust framework to select genes using a machine learning approach.
Affiliation(s)
- Jose Arturo Molina Mora
- Centro de Investigacion en Enfermedades Tropicales (CIET) and Facultad de Microbiología, Universidad de Costa Rica, San Jose, Costa Rica.
- Raquel García-Batán
- Centro de Investigacion en Enfermedades Tropicales (CIET) and Facultad de Microbiología, Universidad de Costa Rica, San Jose, Costa Rica.
- Rebeca Campos-Sánchez
- Centro de Investigación en Biología Celular y Molecular (CIBCM), Universidad de Costa Rica, San José, Costa Rica.
- Fernando García
- Centro de Investigacion en Enfermedades Tropicales (CIET) and Facultad de Microbiología, Universidad de Costa Rica, San Jose, Costa Rica.
24
Acharya S, Cui L, Pan Y. Multi-view feature selection for identifying gene markers: a diversified biological data driven approach. BMC Bioinformatics 2020; 21:483. [PMID: 33375940 PMCID: PMC7772934 DOI: 10.1186/s12859-020-03810-0] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 10/01/2020] [Accepted: 10/13/2020] [Indexed: 12/02/2022] Open
Abstract
BACKGROUND In recent years, the use of multiple genomic and proteomic sources to investigate challenging bioinformatics problems has become immensely popular among researchers. One such problem is feature (gene) selection: identifying relevant and non-redundant marker genes from high-dimensional gene expression data sets. In that context, designing an efficient feature selection algorithm that exploits knowledge from multiple potential biological resources may be an effective way to understand the spectrum of cancer or other diseases, with applications in specific epidemiology for a particular population. RESULTS In the current article, we formulate feature selection and marker gene detection as a multi-view multi-objective clustering problem. To that end, we propose an Unsupervised Multi-View Multi-Objective clustering-based gene selection approach called UMVMO-select. Three important resources of biological data (gene ontology, protein interaction data, protein sequence), along with gene expression values, are collectively used to design two different views. UMVMO-select aims to reduce the gene space with little or no loss of sample classification efficiency, and determines relevant and non-redundant gene markers from three cancer gene expression benchmark data sets. CONCLUSION A thorough comparative analysis has been performed against five clustering and nine existing feature selection methods with respect to several internal and external validity metrics. The results indicate the superiority of the proposed method. Reported results are also validated through a biological significance test and heatmap plotting.
Affiliation(s)
- Sudipta Acharya
- College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, People’s Republic of China
- Laizhong Cui
- College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, People’s Republic of China
- Yi Pan
- Department of Computer Science, Georgia State University, Atlanta, USA
25
Abstract
Background The main goal of gene selection for microarray data is to find compact and predictive gene subsets that improve classification accuracy. Though a large pool of methods exists, selecting the optimal gene subset for accurate classification is still very challenging for the diagnosis and treatment of cancer. Results To obtain the most predictive gene subsets without filtering out critical genes, a gene selection method based on the least absolute shrinkage and selection operator (LASSO) and an improved binary particle swarm optimization (BPSO) is proposed in this paper. To avoid overfitting of LASSO, the initial gene pool is divided into clusters based on their structure. LASSO is then employed to select highly predictive genes and to calculate a contribution value that indicates each gene's sensitivity to the samples' classes. With the second-level gene pool established by the double filter strategy, the BPSO, which encodes the contribution information obtained from LASSO, is improved to perform gene selection. Moreover, from the perspective of the bit change probability, a new mapping function is defined to guide the updating of the particles toward more predictive genes in the improved BPSO. Conclusions With the compact gene pool obtained by the double filter strategy, the improved BPSO could select the optimal gene subsets with high probability. The experimental results on several public microarray datasets with an extreme learning machine verify the effectiveness of the proposed method compared to the relevant methods.
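The position update in binary PSO can be illustrated with a minimal sketch. It shows the standard sigmoid mapping from velocity to bit probability, not the new mapping function defined in the paper, whose form is not given in the abstract. The names `sigmoid`, `update_position`, and the example velocities are hypothetical.

```python
import math
import random


def sigmoid(v):
    """Map a real-valued velocity to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))


def update_position(velocity, rng):
    """Standard binary PSO rule: bit d becomes 1 with probability sigmoid(v_d).

    A bit value of 1 means "gene d is selected". This is the textbook
    mapping, used here as an assumption in place of the paper's own.
    """
    return [1 if rng.random() < sigmoid(v) else 0 for v in velocity]


rng = random.Random(0)                   # seeded for reproducibility
velocity = [-6.0, -1.0, 0.0, 1.0, 6.0]   # one velocity per candidate gene
mask = update_position(velocity, rng)
selected = [i for i, bit in enumerate(mask) if bit]
```

Genes with strongly positive velocity are almost always selected and those with strongly negative velocity almost never, which is the lever a modified mapping function would adjust.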
Affiliation(s)
- Ying Xiong
- School of Computer Science and Communication Engineering, Jiangsu University, Zhenjiang, 212013, China; Jiangsu Key Laboratory of Security Technology for Industrial Cyberspace, Jiangsu University, Zhenjiang, 212013, China; Information Department of the First Affiliated Hospital, Nanjing Medical University, Nanjing, 210029, China
- Qing-Hua Ling
- School of Computer Science, Jiangsu University of Science and Technology, Zhenjiang, 212003, China.
- Fei Han
- School of Computer Science and Communication Engineering, Jiangsu University, Zhenjiang, 212013, China; Jiangsu Key Laboratory of Security Technology for Industrial Cyberspace, Jiangsu University, Zhenjiang, 212013, China
- Qing-Hua Liu
- School of Computer Science, Jiangsu University of Science and Technology, Zhenjiang, 212003, China
26
Alanni R, Hou J, Azzawi H, Xiang Y. Deep gene selection method to select genes from microarray datasets for cancer classification. BMC Bioinformatics 2019; 20:608. [PMID: 31775613 PMCID: PMC6880643 DOI: 10.1186/s12859-019-3161-2] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 06/06/2019] [Accepted: 10/15/2019] [Indexed: 12/15/2022] Open
Abstract
Background Microarray datasets consist of complex and high-dimensional samples and genes, and generally the number of samples is much smaller than the number of genes. Due to this data imbalance, gene selection is a demanding task in microarray expression data analysis. Results The gene set selected by the deep gene selection (DGS) method showed superior performance in cancer classification. DGS is highly capable of reducing the number of genes in the original microarray datasets. Experimental comparisons with other representative and state-of-the-art gene selection methods also showed that DGS achieved the best performance in terms of the number of selected genes, classification accuracy, and computational cost. Conclusions We provide an efficient gene selection algorithm that can select relevant genes that are significantly sensitive to the samples' classes. With few discriminative genes and lower time cost, the proposed algorithm achieved high prediction accuracy on several public microarray datasets, which in turn verifies the efficiency and effectiveness of the proposed gene selection method.
Affiliation(s)
- Russul Alanni
- School of Information Technology, Deakin University, Geelong, Victoria, Australia.
- Jingyu Hou
- School of Information Technology, Deakin University, Geelong, Victoria, Australia
- Hasseeb Azzawi
- School of Information Technology, Deakin University, Geelong, Victoria, Australia
- Yong Xiang
- School of Information Technology, Deakin University, Geelong, Victoria, Australia
27
Soufan O, Ewald J, Viau C, Crump D, Hecker M, Basu N, Xia J. T1000: a reduced gene set prioritized for toxicogenomic studies. PeerJ 2019; 7:e7975. [PMID: 31681519 PMCID: PMC6824333 DOI: 10.7717/peerj.7975] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Received: 07/02/2019] [Accepted: 10/02/2019] [Indexed: 12/12/2022] Open
Abstract
There is growing interest within regulatory agencies and toxicological research communities in developing, testing, and applying new approaches, such as toxicogenomics, to evaluate chemical hazards more efficiently. Given the complexity of analyzing thousands of genes simultaneously, there is a need to identify reduced gene sets. Though several gene sets have been defined for toxicological applications, few of these were purposefully derived using toxicogenomics data. Here, we developed and applied a systematic approach to identify 1,000 genes (called Toxicogenomics-1000 or T1000) highly responsive to chemical exposures. First, a co-expression network of 11,210 genes was built by leveraging microarray data from the Open TG-GATEs program. This network was then re-weighted based on prior knowledge of the genes' biological (KEGG, MSigDB) and toxicological (CTD) relevance. Finally, weighted correlation network analysis was applied to identify 258 gene clusters. T1000 was defined by selecting genes from each cluster that were most associated with outcome measures. For model evaluation, we compared the performance of T1000 to that of other gene sets (L1000, S1500, genes selected by Limma, and a random set) using two external datasets based on the rat model. Additionally, a smaller (T384) and a larger version (T1500) of T1000 were used for dose-response modeling to test the effect of gene set size. Our findings demonstrated that the T1000 gene set is predictive of apical outcomes across a range of conditions (e.g., in vitro and in vivo, dose-response, multiple species, tissues, and chemicals), and generally performs as well as, or better than, other available gene sets.
Affiliation(s)
- Othman Soufan
- Institute of Parasitology, McGill University, Montreal, Canada
- Jessica Ewald
- Faculty of Agricultural and Environmental Sciences, McGill University, Montreal, Canada
- Charles Viau
- Institute of Parasitology, McGill University, Montreal, Canada
- Doug Crump
- Ecotoxicology and Wildlife Health Division, Environment and Climate Change Canada, National Wildlife Research Centre, Carleton University, Ottawa, Canada
- Markus Hecker
- School of the Environment & Sustainability and Toxicology Centre, University of Saskatchewan, Saskatoon, Canada
- Niladri Basu
- Faculty of Agricultural and Environmental Sciences, McGill University, Montreal, Canada
- Jianguo Xia
- Institute of Parasitology, McGill University, Montreal, Canada; Department of Animal Science, McGill University, Montreal, Canada
28
Sharma A, Rani R. C-HMOSHSSA: Gene selection for cancer classification using multi-objective meta-heuristic and machine learning methods. Comput Methods Programs Biomed 2019; 178:219-235. [PMID: 31416551 DOI: 10.1016/j.cmpb.2019.06.029] [Citation(s) in RCA: 20] [Impact Index Per Article: 4.0] [Received: 02/10/2019] [Revised: 06/24/2019] [Accepted: 06/27/2019] [Indexed: 05/21/2023]
Abstract
BACKGROUND AND OBJECTIVE Over the last two decades, DNA microarray technology has emerged as a powerful tool for early cancer detection and prevention. It helps to provide a detailed overview of the complex disease microenvironment. Moreover, the online availability of thousands of gene expression assays has made microarray data classification an active research area. A common goal is to find a minimum subset of genes while maximizing the classification accuracy. METHODS In pursuit of a similar objective, we propose a framework (C-HMOSHSSA) for gene selection using the multi-objective spotted hyena optimizer (MOSHO) and the salp swarm algorithm (SSA). Real-life optimization problems with more than one objective usually face the challenge of maintaining both convergence and diversity. SSA maintains diversity but suffers from the overhead of maintaining the necessary information. MOSHO, on the other hand, requires low computational effort and is therefore used for maintaining the necessary information. The proposed algorithm is thus a hybrid that uses the features of both SSA and MOSHO to strengthen its exploration and exploitation capability. RESULTS Four different classifiers are trained on seven high-dimensional datasets using a subset of features (genes) obtained by applying the proposed hybrid gene selection algorithm. The results show that the proposed technique significantly outperforms existing state-of-the-art techniques. CONCLUSION It is also shown that new sets of informative and biologically relevant genes are successfully identified by the proposed technique. The proposed approach can also be applied to other problem domains of interest that involve feature selection.
Affiliation(s)
- Aman Sharma
- Computer Science and Engineering Department, Thapar Institute of Engineering & Technology, Patiala, Punjab, India.
- Rinkle Rani
- Computer Science and Engineering Department, Thapar Institute of Engineering & Technology, Patiala, Punjab, India.
29
Jansi Rani M, Devaraj D. Two-Stage Hybrid Gene Selection Using Mutual Information and Genetic Algorithm for Cancer Data Classification. J Med Syst 2019; 43:235. [PMID: 31209677 DOI: 10.1007/s10916-019-1372-8] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.8] [Received: 03/22/2019] [Accepted: 06/05/2019] [Indexed: 01/20/2023]
Abstract
Cancer is a deadly disease which requires a very complex and costly treatment. Microarray data classification plays an important role in cancer treatment. An efficient gene selection technique to select the more promising genes is necessary for cancer classification. Here, we propose a Two-stage MI-GA Gene Selection algorithm for selecting informative genes in cancer data classification. In the first stage, Mutual Information based gene selection is applied which selects only the genes that have high information related to the cancer. The genes which have high mutual information value are given as input to the second stage. The Genetic Algorithm based gene selection is applied in the second stage to identify and select the optimal set of genes required for accurate classification. For classification, Support Vector Machine (SVM) is used. The proposed MI-GA gene selection approach is applied to Colon, Lung and Ovarian cancer datasets and the results show that the proposed gene selection approach results in higher classification accuracy compared to the existing methods.
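The first (filter) stage described above can be sketched as a mutual-information ranking. This is a minimal illustration that assumes expression values have already been discretized; the function names, the toy data, and the cutoff `k` are hypothetical, and the genetic algorithm and SVM stages are not shown.

```python
import math
from collections import Counter


def mutual_information(x, y):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(
        (c / n) * math.log2(c * n / (px[a] * py[b]))
        for (a, b), c in pxy.items()
    )


def top_k_genes(expr, labels, k):
    """Rank genes by MI with the class label and keep the k best.

    expr maps a gene name to its discretized expression values,
    one value per sample, in the same order as labels.
    """
    scores = {g: mutual_information(v, labels) for g, v in expr.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


labels = [0, 0, 0, 1, 1, 1]
expr = {
    "geneA": [0, 0, 0, 1, 1, 1],  # tracks the label exactly
    "geneB": [0, 1, 0, 1, 0, 1],  # weakly informative
    "geneC": [1, 1, 1, 1, 1, 1],  # constant, carries no information
}
stage_one = top_k_genes(expr, labels, k=2)  # candidates passed to stage two
```

In a two-stage scheme of this kind, only the genes in `stage_one` would be encoded in the second-stage search, which keeps the search space small.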
Affiliation(s)
- M Jansi Rani
- School of Computing, Kalasalingam Academy of Research and Education, Krishnankoil, Virudhunagar, India.
- D Devaraj
- School of Electronics & Electrical Technology, Kalasalingam Academy of Research and Education, Krishnankoil, Virudhunagar, India
30
Han F, Tang D, Sun YWT, Cheng Z, Jiang J, Li QW. A hybrid gene selection method based on gene scoring strategy and improved particle swarm optimization. BMC Bioinformatics 2019; 20:289. [PMID: 31182017 PMCID: PMC6557739 DOI: 10.1186/s12859-019-2773-x] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Indexed: 01/21/2023] Open
Abstract
Background Gene selection is one of the critical steps in the classification of microarray data. Since particle swarm optimization has no complicated evolutionary operators and few parameters to adjust, it has been used increasingly as an effective technique for gene selection. However, because particle swarm optimization is apt to converge prematurely to local minima, some particle swarm optimization based gene selection methods may select non-optimal genes with high probability. Selecting predictive genes with low redundancy without filtering out key genes is still a challenge. Results To obtain predictive genes with lower redundancy and to overcome the deficiencies of traditional particle swarm optimization based gene selection methods, a hybrid gene selection method based on a gene scoring strategy and improved particle swarm optimization is proposed in this paper. To select the genes highly related to the samples' classes, a gene scoring strategy based on randomization and the extreme learning machine is proposed to filter out many irrelevant genes. With the third-level gene pool established by the multiple filter strategy, an improved particle swarm optimization is proposed to perform gene selection. In the improved particle swarm optimization, to decrease the likelihood of premature convergence of the swarm, the Metropolis criterion of the simulated annealing algorithm is introduced to update the particles, and half of the swarm is reinitialized when the swarm is trapped in local minima. Conclusions Combining the gene scoring strategy with the improved particle swarm optimization, the new method could select functional gene subsets that are significantly sensitive to the samples' classes. With the few discriminative genes selected by the proposed method, extreme learning machine and support vector machine classifiers achieve high prediction accuracy on several public microarray datasets, which in turn verifies the efficiency and effectiveness of the proposed gene selection method.
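The Metropolis criterion mentioned above can be sketched in a few lines. This is the generic simulated-annealing acceptance rule for a minimization problem; how the paper combines it with the particle update is not specified in the abstract, so the function name, arguments, and temperature value here are assumptions.

```python
import math
import random


def metropolis_accept(old_fitness, new_fitness, temperature, rng):
    """Decide whether to accept a candidate when lower fitness is better.

    Improvements are always accepted. A worse candidate is accepted with
    probability exp(-delta / temperature), so occasional uphill moves
    let the search escape local minima.
    """
    delta = new_fitness - old_fitness
    if delta <= 0:
        return True
    return rng.random() < math.exp(-delta / temperature)


rng = random.Random(42)  # seeded for reproducibility
# Fraction of accepted moves that worsen fitness by 1.0 at temperature 1.0.
trials = 1000
accepted = sum(metropolis_accept(0.0, 1.0, 1.0, rng) for _ in range(trials))
acceptance_rate = accepted / trials  # close to exp(-1), about 0.37
```

Lowering the temperature over the iterations makes the rule progressively greedier, which is the usual annealing schedule.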
Affiliation(s)
- Fei Han
- School of Computer Science and Communication Engineering, Jiangsu University, Xuefu Road, Zhenjiang, Jiangsu, China; Jiangsu Key Laboratory of Security Technology for Industrial Cyberspace, Zhenjiang, Jiangsu, China.
- Di Tang
- School of Computer Science and Communication Engineering, Jiangsu University, Xuefu Road, Zhenjiang, Jiangsu, China; Jiangsu Key Laboratory of Security Technology for Industrial Cyberspace, Zhenjiang, Jiangsu, China
- Yu-Wen-Tian Sun
- School of Computer Science and Communication Engineering, Jiangsu University, Xuefu Road, Zhenjiang, Jiangsu, China; Jiangsu Key Laboratory of Security Technology for Industrial Cyberspace, Zhenjiang, Jiangsu, China
- Zhun Cheng
- School of Engineering, Nanjing Agricultural University, Weigang Road, Nanjing, Jiangsu, China
- Jing Jiang
- School of Computer Science and Communication Engineering, Jiangsu University, Xuefu Road, Zhenjiang, Jiangsu, China; Jiangsu Key Laboratory of Security Technology for Industrial Cyberspace, Zhenjiang, Jiangsu, China
- Qiu-Wei Li
- School of Computer Science and Communication Engineering, Jiangsu University, Xuefu Road, Zhenjiang, Jiangsu, China; Jiangsu Key Laboratory of Security Technology for Industrial Cyberspace, Zhenjiang, Jiangsu, China
31
Li J, Wang Y, Xiao H, Xu C. Gene selection of rat hepatocyte proliferation using adaptive sparse group lasso with weighted gene co-expression network analysis. Comput Biol Chem 2019; 80:364-373. [PMID: 31103917 DOI: 10.1016/j.compbiolchem.2019.04.010] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 12/10/2017] [Revised: 11/30/2018] [Accepted: 04/23/2019] [Indexed: 11/29/2022]
Abstract
Grouped gene selection is the most important task in analyzing the microarray data of rat liver regeneration. Many existing gene selection methods do not account for the interactions among the selected genes. In the process of rat liver regeneration, one of the most important events involved in many biological processes is the proliferation of rat hepatocytes, so it can be used as a measure of the effectiveness of the method. Here we propose an adaptive sparse group lasso to select genes in groups for rat hepatocyte proliferation. Weighted gene co-expression network analysis was used to identify modules corresponding to gene pathways, based on which a strategy of dividing genes into groups was proposed. A strategy of adaptive gene selection was also presented by assessing gene significance and introducing the adaptive lasso penalty. Moreover, an improved blockwise descent algorithm was proposed. Experimental results demonstrated that the proposed method can improve the classification accuracy and select fewer significant genes, which act jointly in groups and have direct or indirect effects on rat hepatocyte proliferation. The effectiveness of the method was verified using rat hepatocyte proliferation as the measure.
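To make the penalty concrete, the following sketch evaluates a sparse group lasso penalty with optional adaptive weights. It uses the commonly cited form, lam * (alpha * weighted L1 norm + (1 - alpha) * weighted sum of group L2 norms); the exact weighting used in the paper is not given in the abstract, so the default weights and all names here are assumptions.

```python
import math


def sparse_group_lasso_penalty(beta, groups, lam, alpha,
                               group_w=None, coef_w=None):
    """Evaluate the (optionally adaptive) sparse group lasso penalty.

    beta    : list of coefficients, one per gene
    groups  : list of index lists partitioning the genes
    alpha   : mix between the L1 term (alpha) and the group term (1 - alpha)
    group_w : per-group weights; defaults to sqrt(group size)
    coef_w  : per-coefficient weights; defaults to 1 (non-adaptive)
    """
    if group_w is None:
        group_w = [math.sqrt(len(g)) for g in groups]
    if coef_w is None:
        coef_w = [1.0] * len(beta)
    l1 = sum(w * abs(b) for w, b in zip(coef_w, beta))
    l2 = sum(
        w * math.sqrt(sum(beta[j] ** 2 for j in g))
        for w, g in zip(group_w, groups)
    )
    return lam * (alpha * l1 + (1 - alpha) * l2)


beta = [3.0, 4.0, 0.0, 0.0]       # second group is entirely zeroed out
groups = [[0, 1], [2, 3]]
penalty = sparse_group_lasso_penalty(beta, groups, lam=1.0, alpha=0.5)
```

The group term drives whole groups to zero while the L1 term sparsifies within the surviving groups; adaptive weights shrink less significant genes more strongly.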
Affiliation(s)
- Juntao Li
- School of Mathematics and Information Science, Henan Normal University, Xinxiang 453007, PR China
- Yadi Wang
- School of Computer Science and Engineering, Southeast University, Nanjing, 211189, PR China.
- Huimin Xiao
- Department of Mathematics and Information Science, Henan University of Economics and Law, Zhengzhou 450002, PR China
- Cunshuan Xu
- State Key Laboratory Cultivation Base for Cell Differentiation Regulation, Henan Normal University, Xinxiang, 453007, PR China
32
Bhowmick SS, Bhattacharjee D, Rato L. In silico markers: an evolutionary and statistical approach to select informative genes of human breast cancer subtypes. Genes Genomics 2019; 41:1371-1382. [PMID: 31004329 DOI: 10.1007/s13258-019-00816-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/25/2018] [Accepted: 04/02/2019] [Indexed: 10/27/2022]
Abstract
BACKGROUND Recent advances in bioinformatics offer the ability to identify informative genes from high-dimensional gene expression data. Selection of informative genes from these large datasets has emerged as an issue of major concern among researchers. OBJECTIVE Gene functionality and regulatory mechanisms can be understood through the analysis of these gene expression data. Here, we present a computational method to identify informative genes for breast cancer subtypes such as Basal, human epidermal growth factor receptor 2 (Her2), luminal A (LumA), and luminal B (LumB). METHODS The proposed In Silico Markers method is a wrapper feature selection method based on the Least Absolute Shrinkage and Selection Operator (LASSO) and the Covariance Matrix Adaptation Evolution Strategy (CMA-ES), with a Support Vector Machine (SVM) as the classifier. Moreover, a composite measure consisting of the relevance, redundancy, and rank score of frequently appearing genes is used to select informative genes. RESULTS The informative genes are validated by statistical and biologically relevant criteria. For a comparative evaluation of the proposed approach, a biological similarity score based on a semantic similarity measure of GO terms is investigated. Further, the proposed technique is evaluated against 7 existing gene selection techniques using two-class annotated breast cancer subtype datasets. CONCLUSION This method can support the discovery of informative genes. Furthermore, under a multiple criteria decision-making set-up, the informative genes selected by In Silico Markers are found to be preferable to the genes selected by the compared methods.
Affiliation(s)
- Shib Sankar Bhowmick
- Department of Electronics and Communication Engineering, Heritage Institute of Technology, Kolkata, 700107, India.
- Debotosh Bhattacharjee
- Department of Computer Science and Engineering, Jadavpur University, Kolkata, 700032, India
- Luis Rato
- Department of Informatics, University of Evora, 7004-516, Evora, Portugal
33
Masoudi-Sobhanzadeh Y, Motieghader H, Masoudi-Nejad A. FeatureSelect: a software for feature selection based on machine learning approaches. BMC Bioinformatics 2019; 20:170. [PMID: 30943889 PMCID: PMC6446290 DOI: 10.1186/s12859-019-2754-0] [Citation(s) in RCA: 44] [Impact Index Per Article: 8.8] [Received: 12/06/2018] [Accepted: 03/19/2019] [Indexed: 12/12/2022] Open
Abstract
BACKGROUND Feature selection, as a preprocessing stage, is a challenging problem in various sciences such as biology, engineering, computer science, and other fields. For this purpose, some studies have introduced tools and software such as WEKA. However, these tools are based on filter methods, which have lower performance than wrapper methods. In this paper, we address this limitation and introduce a software application called FeatureSelect. In addition to filter methods, FeatureSelect consists of optimisation algorithms and three types of learners. It provides a user-friendly and straightforward method of feature selection for use in any kind of research, and can easily be applied to any type of balanced and unbalanced data based on several score functions such as accuracy, sensitivity, and specificity. RESULTS In addition to our previously introduced optimisation algorithm (WCC), a total of 10 efficient, well-known and recently developed algorithms have been implemented in FeatureSelect. We applied our software to a range of different datasets and evaluated the performance of its algorithms. The results show that the performance of the algorithms varies across datasets, but WCC, LCA, FOA, and LA are more suitable than the others overall. The results also show that wrapper methods are better than filter methods. CONCLUSIONS FeatureSelect is a feature or gene selection software application based on wrapper methods. Furthermore, it includes some popular filter methods and generates various comparison diagrams and statistical measurements. It is available from GitHub (https://github.com/LBBSoft/FeatureSelect) and is free open source software under an MIT license.
Affiliation(s)
- Yosef Masoudi-Sobhanzadeh
- Laboratory of system Biology and Bioinformatics, Institute of Biochemistry and Biophysics, University of Tehran, Tehran, Iran
- Habib Motieghader
- Laboratory of system Biology and Bioinformatics, Institute of Biochemistry and Biophysics, University of Tehran, Tehran, Iran
- Ali Masoudi-Nejad
- Laboratory of system Biology and Bioinformatics, Institute of Biochemistry and Biophysics, University of Tehran, Tehran, Iran
34
Alanni R, Hou J, Azzawi H, Xiang Y. A novel gene selection algorithm for cancer classification using microarray datasets. BMC Med Genomics 2019; 12:10. [PMID: 30646919 PMCID: PMC6334429 DOI: 10.1186/s12920-018-0447-6] [Citation(s) in RCA: 36] [Impact Index Per Article: 7.2] [Received: 09/27/2018] [Accepted: 12/07/2018] [Indexed: 12/18/2022] Open
Abstract
Background Microarray datasets are an important medical diagnostic tool as they represent the states of a cell at the molecular level. Available microarray datasets for classifying cancer types generally have a fairly small sample size compared to the large number of genes involved. This is known as the curse of dimensionality, which is a challenging problem. Gene selection is a promising approach that addresses this problem and plays an important role in the development of efficient cancer classification, because only a small number of genes are related to the classification problem. Gene selection addresses many problems in microarray datasets, such as reducing the number of irrelevant and noisy genes and selecting the most related genes to improve the classification results. Methods An innovative Gene Selection Programming (GSP) method is proposed to select relevant genes for effective and efficient cancer classification. GSP is based on the Gene Expression Programming (GEP) method, with a newly defined population initialization algorithm, a new fitness function definition, and improved mutation and recombination operators. A Support Vector Machine (SVM) with a linear kernel serves as the classifier of the GSP. Results Experimental results on ten microarray cancer datasets demonstrate that GSP is effective and efficient in eliminating irrelevant and redundant genes/features from microarray datasets. The comprehensive evaluations and comparisons with other methods show that GSP gives a better compromise in terms of all three evaluation criteria, i.e., classification accuracy, number of selected genes, and computational cost. The gene set selected by GSP has shown superior performance in cancer classification compared to those selected by up-to-date representative gene selection methods. Conclusion The gene subset selected by GSP can achieve a higher classification accuracy with less processing time.
Affiliation(s)
- Russul Alanni
- School of Information Technology, Deakin University, Burwood, 3125, VIC, Australia.
- Jingyu Hou
- School of Information Technology, Deakin University, Burwood, 3125, VIC, Australia
- Hasseeb Azzawi
- School of Information Technology, Deakin University, Burwood, 3125, VIC, Australia
- Yong Xiang
- School of Information Technology, Deakin University, Burwood, 3125, VIC, Australia
35
Kagaris D, Khamesipour A, Yiannoutsos CT. AUCTSP: an improved biomarker gene pair class predictor. BMC Bioinformatics 2018; 19:244. [PMID: 29940833 DOI: 10.1186/s12859-018-2231-1] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Received: 01/07/2018] [Accepted: 06/04/2018] [Indexed: 11/10/2022] Open
Abstract
Background The Top Scoring Pair (TSP) classifier, based on the concept of relative ranking reversals in the expressions of pairs of genes, has been proposed as a simple, accurate, and easily interpretable decision rule for classification and class prediction of gene expression profiles. The idea that differences in gene expression ranking are associated with presence or absence of disease is compelling and has strong biological plausibility. Nevertheless, the TSP formulation ignores significant available information which can improve classification accuracy and is vulnerable to selecting genes which do not have differential expression in the two conditions (“pivot" genes). Results We introduce the AUCTSP classifier as an alternative rank-based estimator of the magnitude of the ranking reversals involved in the original TSP. The proposed estimator is based on the Area Under the Receiver Operating Characteristic (ROC) Curve (AUC) and as such, takes into account the separation of the entire distribution of gene expression levels in gene pairs under the conditions considered, as opposed to comparing gene rankings within individual subjects as in the original TSP formulation. Through extensive simulations and case studies involving classification in ovarian, leukemia, colon, breast and prostate cancers and diffuse large b-cell lymphoma, we show the superiority of the proposed approach in terms of improving classification accuracy, avoiding overfitting and being less prone to selecting non-informative (pivot) genes. Conclusions The proposed AUCTSP is a simple yet reliable and robust rank-based classifier for gene expression classification. While the AUCTSP works by the same principle as TSP, its ability to determine the top scoring gene pair based on the relative rankings of two marker genes across all subjects as opposed to each individual subject results in significant performance gains in classification accuracy. 
In addition, the proposed method tends to avoid selection of non-informative (pivot) genes as members of the top-scoring pair.
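The ranking-reversal rule that the abstract attributes to the original TSP can be sketched as follows. This is a minimal illustration of the classical TSP score only (the absolute difference, between the two classes, in the frequency with which gene i is expressed below gene j); it is not the AUC-based estimator proposed in the paper. The function names and the toy expression values are hypothetical.

```python
from itertools import combinations


def tsp_score(samples, labels, i, j):
    """Classical TSP score for genes i and j:
    |P(x_i < x_j | class 0) - P(x_i < x_j | class 1)|."""
    freq = {}
    for cls in (0, 1):
        rows = [s for s, y in zip(samples, labels) if y == cls]
        freq[cls] = sum(1 for s in rows if s[i] < s[j]) / len(rows)
    return abs(freq[0] - freq[1])


def top_scoring_pair(samples, labels):
    """Return the gene pair with the highest score (first one on ties)."""
    n_genes = len(samples[0])
    return max(combinations(range(n_genes), 2),
               key=lambda pair: tsp_score(samples, labels, *pair))


# Hypothetical toy data: rows are subjects, columns are genes.
# Genes 0 and 1 swap order between classes; gene 2 barely changes.
toy_samples = [
    [1.0, 5.0, 3.0],
    [2.0, 6.0, 2.5],
    [5.0, 1.0, 3.1],
    [6.0, 2.0, 2.9],
]
toy_labels = [0, 0, 1, 1]

score_01 = tsp_score(toy_samples, toy_labels, 0, 1)
score_02 = tsp_score(toy_samples, toy_labels, 0, 2)
best_pair = top_scoring_pair(toy_samples, toy_labels)
```

In this toy set the pair (0, 2) scores as high as the pair (0, 1) even though gene 2 is nearly constant across classes, which is the kind of "pivot" gene selection the abstract describes as a weakness of the within-subject comparison.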
|
36
|
Huang B, Zhong N, Xia L, Yu G, Cao H. Sparse Representation-Based Patient-Specific Diagnosis and Treatment for Esophageal Squamous Cell Carcinoma. Bull Math Biol 2018; 80:2124-2136. [PMID: 29869044 DOI: 10.1007/s11538-018-0449-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/11/2017] [Accepted: 05/25/2018] [Indexed: 11/28/2022]
Abstract
Precision medicine and personalized treatment have attracted attention in recent years. However, most genetic medicines mainly target one genetic site, while complex diseases like esophageal squamous cell carcinoma (ESCC) usually present heterogeneity that involves variations of many genetic markers. Here, we seek an approach to leverage genetic data and ESCC knowledge data to forward personalized diagnosis and treatment for ESCC. First, 851 ESCC-related gene markers and their druggability were studied through a comprehensive literature analysis. Then, a sparse representation-based variable selection (SRVS) was employed for patient-specific genetic marker selection using gene expression datasets. Results showed that the SRVS method could identify a unique gene vector for each patient group, leading to significantly higher classification accuracies compared to randomly selected genes (100, 97.17, 100, 100%; permutation p values: 0.0032, 0.0008, 0.0004, and 0.0008). The SRVS also outperformed an ANOVA-based gene selection method in terms of the classification ratio. The patient-specific gene markers are targets of ESCC effective drugs, providing specific guidance for medicine selection. Our results suggest the effectiveness of integrating previous database utilizing SRVS in assisting personalized medicine selection and treatment for ESCC.
Affiliation(s)
- Bin Huang
- Department of Cardiothoracic Surgery, The Affiliated Jiangyin Hospital of Southeast University Medical College, No. 163 Shoushan Rd, Jiangyin, 214400, Jiangsu, China
- Ning Zhong
- Department of Cardiothoracic Surgery, The First People's Hospital of Kunshan, Kunshan, 215300, Jiangsu, China
- Lili Xia
- Department of Ultrasound, The People's Hospital of Tongling, Tongling, 215300, Anhui, China
- Guiping Yu
- Department of Cardiothoracic Surgery, The Affiliated Jiangyin Hospital of Southeast University Medical College, No. 163 Shoushan Rd, Jiangyin, 214400, Jiangsu, China
- Hongbao Cao
- Department of Genomics Research, R&D Solutions, Elsevier Inc., Rockville, MD, 20852, USA; Unit on Statistical Genomics, National Institute of Health (NIH), Bethesda, MD, 20892, USA
|
37
|
Castellanos-Garzón JA, Ramos J, López-Sánchez D, de Paz JF, Corchado JM. An Ensemble Framework Coping with Instability in the Gene Selection Process. Interdiscip Sci 2018; 10:12-23. [PMID: 29313209 DOI: 10.1007/s12539-017-0274-z] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 08/22/2017] [Revised: 11/06/2017] [Accepted: 11/08/2017] [Indexed: 11/29/2022]
Abstract
This paper proposes an ensemble framework for gene selection, which is aimed at addressing instability problems presented in the gene filtering task. The complex process of gene selection from gene expression data faces different instability problems from the informative gene subsets found by different filter methods. This makes the identification of significant genes by the experts difficult. The instability of results can come from filter methods, gene classifier methods, different datasets of the same disease and multiple valid groups of biomarkers. Even though there is a wide number of proposals, the complexity imposed by this problem remains a challenge today. This work proposes a framework involving five stages of gene filtering to discover biomarkers for diagnosis and classification tasks. This framework performs a process of stable feature selection, facing the problems above and, thus, providing a more suitable and reliable solution for clinical and research purposes. Our proposal involves a process of multistage gene filtering, in which several ensemble strategies for gene selection were added in such a way that different classifiers simultaneously assess gene subsets to face instability. Firstly, we apply an ensemble of recent gene selection methods to obtain diversity in the genes found (stability according to filter methods). Next, we apply an ensemble of known classifiers to filter genes relevant to all classifiers at a time (stability according to classification methods). The achieved results were evaluated in two different datasets of the same disease (pancreatic ductal adenocarcinoma), in search of stability according to the disease, for which promising results were achieved.
Affiliation(s)
- José A Castellanos-Garzón
- IBSAL/BISITE Research Group, University of Salamanca, Edificio I+D+i, 37007, Salamanca, Spain; CISUC, ECOS Research Group, University of Coimbra, Pólo II-Pinhal de Marrocos, 3030-290, Coimbra, Portugal
- Juan Ramos
- IBSAL/BISITE Research Group, University of Salamanca, Edificio I+D+i, 37007, Salamanca, Spain
- Daniel López-Sánchez
- IBSAL/BISITE Research Group, University of Salamanca, Edificio I+D+i, 37007, Salamanca, Spain
- Juan F de Paz
- IBSAL/BISITE Research Group, University of Salamanca, Edificio I+D+i, 37007, Salamanca, Spain
- Juan M Corchado
- IBSAL/BISITE Research Group, University of Salamanca, Edificio I+D+i, 37007, Salamanca, Spain; Osaka Institute of Technology, Osaka, 535-8585, Japan
|
38
|
Abstract
In recent years, tumor classification based on gene expression profiles has drawn great attention, and related research results have been widely applied to the clinical diagnosis of major gene diseases. These studies are of tremendous importance for accurate cancer diagnosis and subtype recognition. However, the microarray data of gene expression profiles have small samples, high dimensionality, large noise and data redundancy. To further improve the classification performance of microarray data, a gene selection approach based on the Fisher linear discriminant (FLD) and the neighborhood rough set (NRS) is proposed. First, the FLD method is employed to reduce the preliminarily genetic data to obtain features with a strong classification ability, which can form a candidate gene subset. Then, neighborhood precision and neighborhood roughness are defined in a neighborhood decision system, and the calculation approaches for neighborhood dependency and the significance of an attribute are given. A reduction model of neighborhood decision systems is presented. Thus, a gene selection algorithm based on FLD and NRS is proposed. Finally, four public gene datasets are used in the simulation experiments. Experimental results under the SVM classifier demonstrate that the proposed algorithm is effective, and it can select a smaller and more well-classified gene subset, as well as obtain better classification performance.
Affiliation(s)
- Lin Sun
- College of Computer & Information Engineering, Henan Normal University, Xinxiang, Henan, China; Post-doctoral Mobile Station of Biology, College of Life Science, Henan Normal University, Xinxiang, Henan, China; Engineering Technology Research Center for Computing Intelligence & Data Mining of Henan Province, Xinxiang, Henan, China
- Xiaoyu Zhang
- College of Computer & Information Engineering, Henan Normal University, Xinxiang, Henan, China
- Jiucheng Xu
- College of Computer & Information Engineering, Henan Normal University, Xinxiang, Henan, China
- Wei Wang
- College of Computer & Information Engineering, Henan Normal University, Xinxiang, Henan, China; Engineering Technology Research Center for Computing Intelligence & Data Mining of Henan Province, Xinxiang, Henan, China
- Ruonan Liu
- College of Computer & Information Engineering, Henan Normal University, Xinxiang, Henan, China
|
39
|
Tang C, Cao L, Zheng X, Wang M. Gene selection for microarray data classification via subspace learning and manifold regularization. Med Biol Eng Comput 2018; 56:1271-84. [PMID: 29256006 DOI: 10.1007/s11517-017-1751-6] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.7] [Received: 05/30/2017] [Accepted: 11/03/2017] [Indexed: 10/18/2022]
Abstract
With the rapid development of DNA microarray technology, a large amount of genomic data has been generated. Classification of these microarray data is a challenging task since gene expression data often contain thousands of genes but only a small number of samples. In this paper, an effective gene selection method is proposed to select the best subset of genes for microarray data with the irrelevant and redundant genes removed. Compared with original data, the selected gene subset can benefit the classification task. We formulate the gene selection task as a manifold regularized subspace learning problem. In detail, a projection matrix is used to project the original high dimensional microarray data into a lower dimensional subspace, with the constraint that the original genes can be well represented by the selected genes. Meanwhile, the local manifold structure of original data is preserved by a Laplacian graph regularization term on the low-dimensional data space. The projection matrix can serve as an importance indicator of different genes. An iterative update algorithm is developed for solving the problem. Experimental results on six publicly available microarray datasets and one clinical dataset demonstrate that the proposed method performs better when compared with other state-of-the-art methods in terms of microarray data classification.
|
40
|
Gao L, Ye M, Lu X, Huang D. Hybrid Method Based on Information Gain and Support Vector Machine for Gene Selection in Cancer Classification. Genomics Proteomics Bioinformatics 2017; 15:389-395. [PMID: 29246519 PMCID: PMC5828665 DOI: 10.1016/j.gpb.2017.08.002] [Citation(s) in RCA: 37] [Impact Index Per Article: 5.3] [Received: 01/12/2017] [Revised: 07/25/2017] [Accepted: 08/08/2017] [Indexed: 12/30/2022]
Abstract
It remains a great challenge to achieve sufficient cancer classification accuracy with the entire set of genes, due to the high dimensionality, small sample size, and high noise of gene expression data. We thus proposed a hybrid gene selection method, Information Gain-Support Vector Machine (IG-SVM), in this study. IG was initially employed to filter irrelevant and redundant genes. Then, further removal of redundant genes was performed using SVM to eliminate the noise in the datasets more effectively. Finally, the informative genes selected by IG-SVM served as the input for the LIBSVM classifier. Compared to other related algorithms, IG-SVM showed the highest classification accuracy and superior performance as evaluated using five cancer gene expression datasets based on a few selected genes. As an example, IG-SVM achieved a classification accuracy of 90.32% for colon cancer, which is difficult to classify accurately, based on only three genes: CSRP1, MYL9, and GUCA2B.
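The first stage described above, ranking genes by information gain before any SVM step, can be sketched as below. This is a minimal illustration under assumptions: the abstract does not say how expression values are discretized, so a single fixed threshold per gene is assumed here, and the function names and toy values are hypothetical. The SVM-based second stage is not reproduced.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels, threshold):
    """IG of the class label given a gene binarized at `threshold`
    (the binarization rule is an assumption, not taken from the paper)."""
    low = [y for x, y in zip(values, labels) if x <= threshold]
    high = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    conditional = sum(len(part) / n * entropy(part)
                      for part in (low, high) if part)
    return entropy(labels) - conditional


# Hypothetical toy data: one list of expression values per gene.
toy_labels = [0, 0, 1, 1]
toy_genes = {
    "gene_a": [1.0, 2.0, 8.0, 9.0],  # separates the classes
    "gene_b": [1.0, 8.0, 2.0, 9.0],  # unrelated to the classes
}

gains = {name: information_gain(vals, toy_labels, threshold=5.0)
         for name, vals in toy_genes.items()}
ranked = sorted(gains, key=gains.get, reverse=True)
```

A filter of this kind would keep only the top-ranked genes and pass them on to the later, more expensive selection step.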
Affiliation(s)
- Lingyun Gao
- School of Medical Information, Wannan Medical College, Wuhu 241002, China
- Mingquan Ye
- School of Medical Information, Wannan Medical College, Wuhu 241002, China
- Xiaojie Lu
- School of Medical Information, Wannan Medical College, Wuhu 241002, China
- Daobin Huang
- School of Medical Information, Wannan Medical College, Wuhu 241002, China
|
41
|
Butler JM, Hall N, Narendran N, Yang YC, Paraoan L. Identification of candidate protective variants for common diseases and evaluation of their protective potential. BMC Genomics 2017; 18:575. [PMID: 28774272 PMCID: PMC5543444 DOI: 10.1186/s12864-017-3964-3] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.0] [Received: 11/24/2016] [Accepted: 07/27/2017] [Indexed: 01/26/2023] Open
Abstract
BACKGROUND Human polymorphisms with derived alleles that are protective against disease may provide powerful translational opportunities. Here we report a method to identify such candidate polymorphisms and apply it to common non-synonymous SNPs (nsSNPs) associated with common diseases. Our study also sought to establish which of the identified protective nsSNPs show evidence of positive selection, taking this as indirect evidence that the protective variant has a beneficial effect on phenotype. Further, we performed an analysis to quantify the predicted effect of each protective variant on protein function/structure. RESULTS An initial analysis of eight SNPs previously identified as associated with age-related macular degeneration (AMD), revealed that two of them have a derived allele that is protective against developing the disease. One is in the complement component 2 gene (C2; E318D) and the other is in the complement factor B gene (CFB; R32Q). Then, combining genomewide ancestral allele information with known common disease-associated nsSNPs from the GWAS catalog, we found 32 additional SNPs which have a derived allele that is disease protective. Out of the total 34 identified candidate protective variants (CPVs), we found that 30 show stronger evidence of positive selection than the protective variant in lipoprotein lipase (LPL; S447X), which has already been translated into gene therapy. Furthermore, 11 of these CPVs have a higher probability of affecting protein structure than the lipoprotein lipase protective variant (LPL; S447X). CONCLUSIONS We identify 34 CPVs from the human genome. Diseases they confer protection against include, but are not limited to, type 2 diabetes, inflammatory bowel disease, age-related macular degeneration, multiple sclerosis and rheumatoid arthritis. We propose that those 30 CPVs with evidence of stronger positive selection than the LPL protective variant, may be considered as priority candidates for therapeutic approaches. 
The next step towards translation will require testing the hypotheses generated by our analyses, specifically whether the CPV arose from a gain-of-function or a loss-of-function mutation.
Affiliation(s)
- Joe M Butler
- Department of Eye and Vision Science, Institute of Ageing and Chronic Disease, University of Liverpool, 6 West Derby Street, Liverpool, L7 8TX, UK
- Neil Hall
- The Earlham Institute, Norwich Research Park, Norwich, NR4 7UH, UK
- Niro Narendran
- Department of Ophthalmology, The Royal Wolverhampton NHS Trust, New Cross Hospital, Wolverhampton, WV10 0QP, UK
- Yit C Yang
- Department of Ophthalmology, The Royal Wolverhampton NHS Trust, New Cross Hospital, Wolverhampton, WV10 0QP, UK
- Luminita Paraoan
- Department of Eye and Vision Science, Institute of Ageing and Chronic Disease, University of Liverpool, 6 West Derby Street, Liverpool, L7 8TX, UK
|
42
|
Dashtban M, Balafar M, Suravajhala P. Gene selection for tumor classification using a novel bio-inspired multi-objective approach. Genomics 2017; 110:10-17. [PMID: 28780377 DOI: 10.1016/j.ygeno.2017.07.010] [Citation(s) in RCA: 57] [Impact Index Per Article: 8.1] [Received: 03/19/2017] [Revised: 07/12/2017] [Accepted: 07/30/2017] [Indexed: 12/21/2022]
Abstract
Identifying the informative genes has always been a major step in microarray data analysis. The complexity of various cancer datasets makes this issue still challenging. In this paper, a novel Bio-inspired Multi-objective algorithm is proposed for gene selection in microarray data classification, specifically in the binary domain of feature selection. The presented method extends the traditional Bat Algorithm with refined formulations, effective multi-objective operators, and novel local search strategies employing social learning concepts in designing random walks. A hybrid model using the Fisher criterion is then applied to three widely-used microarray cancer datasets to explore significant biomarkers, which reveals the effectiveness of the proposed method for genomic analysis. Experimental results unveil new combinations of informative biomarkers that are associated with findings of other studies.
Affiliation(s)
- M Dashtban
- Department of Computer Engineering, Faculty of Electrical & Computer Engineering, University of Tabriz, Iran
- Mohammadali Balafar
- Department of Computer Engineering, Faculty of Electrical & Computer Engineering, University of Tabriz, Iran
- Prashanth Suravajhala
- Birla Institute of Scientific Research, Statue Circle, Jaipur 302001, Rajasthan, India; Bioclues.org, Kukatpally, Hyderabad 500072, Telangana, India
|
43
|
Ramos J, Castellanos-Garzón JA, González-Briones A, de Paz JF, Corchado JM. An Agent-Based Clustering Approach for Gene Selection in Gene Expression Microarray. Interdiscip Sci 2017; 9:1-13. [PMID: 28281239 DOI: 10.1007/s12539-017-0219-6] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Received: 07/18/2016] [Revised: 01/26/2017] [Accepted: 02/07/2017] [Indexed: 12/21/2022]
Abstract
Gene selection is a major research area in microarray analysis, which seeks to discover differentially expressed genes for a particular target annotation. Such genes, often called informative genes, are able to differentiate tissue samples belonging to different classes of the studied disease. Despite the fact that there is a wide number of proposals, the complexity imposed by this problem remains a challenge today. This research proposes a gene selection approach by means of a clustering-based multi-agent system. This proposal manages different filter methods and gene clustering through coordinated agents to discover informative gene subsets. To assess the reliability of our approach, we have used four important public gene expression datasets: two lung cancer datasets, one colon cancer dataset, and one leukemia dataset. The achieved results have been validated through cluster validity measures, visual analytics, and a classifier, and compared with other gene selection methods, proving the reliability of our proposal.
|
44
|
Chen Y, Zhang Z, Zheng J, Ma Y, Xue Y. Gene selection for tumor classification using neighborhood rough sets and entropy measures. J Biomed Inform 2017; 67:59-68. [PMID: 28215562 DOI: 10.1016/j.jbi.2017.02.007] [Citation(s) in RCA: 69] [Impact Index Per Article: 9.9] [Received: 06/02/2016] [Revised: 01/25/2017] [Accepted: 02/09/2017] [Indexed: 01/04/2023]
Abstract
With the development of bioinformatics, tumor classification from gene expression data has become an important and useful technology for cancer diagnosis. Since a gene expression dataset often contains thousands of genes and a small number of samples, gene selection from gene expression data becomes a key step for tumor classification. Attribute reduction of rough sets has been successfully applied to the gene selection field, as it is data driven and requires no additional information. However, the traditional rough set method deals with discrete data only. Gene expression data containing real-valued or noisy data usually undergo a discretization preprocessing step, which may result in poor classification accuracy. In this paper, we propose a novel gene selection method based on the neighborhood rough set model, which can deal with real-valued data whilst maintaining the original gene classification information. Moreover, this paper addresses an entropy measure under the frame of neighborhood rough sets for tackling the uncertainty and noise of gene expression data. The utilization of this measure can bring about a discovery of compact gene subsets. Finally, a gene selection algorithm is designed based on neighborhood granules and the entropy measure. Experiments on two gene expression datasets show that the proposed gene selection is an effective method for improving the accuracy of tumor classification.
|
45
|
Dashtban M, Balafar M. Gene selection for microarray cancer classification using a new evolutionary method employing artificial intelligence concepts. Genomics 2017; 109:91-107. [PMID: 28159597 DOI: 10.1016/j.ygeno.2017.01.004] [Citation(s) in RCA: 51] [Impact Index Per Article: 7.3] [Received: 08/05/2016] [Revised: 01/09/2017] [Accepted: 01/24/2017] [Indexed: 12/25/2022]
Abstract
Gene selection is a demanding task for microarray data analysis. The diverse complexity of different cancers makes this issue still challenging. In this study, a novel evolutionary method based on genetic algorithms and artificial intelligence is proposed to identify predictive genes for cancer classification. A filter method was first applied to reduce the dimensionality of feature space followed by employing an integer-coded genetic algorithm with dynamic-length genotype, intelligent parameter settings, and modified operators. The algorithmic behaviors including convergence trends, mutation and crossover rate changes, and running time were studied, conceptually discussed, and shown to be coherent with literature findings. Two well-known filter methods, Laplacian and Fisher score, were examined considering similarities, the quality of selected genes, and their influences on the evolutionary approach. Several statistical tests concerning choice of classifier, choice of dataset, and choice of filter method were performed, and they revealed some significant differences between the performance of different classifiers and filter methods over datasets. The proposed method was benchmarked upon five popular high-dimensional cancer datasets; for each, top explored genes were reported. Comparing the experimental results with several state-of-the-art methods revealed that the proposed method outperforms previous methods in DLBCL dataset.
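The filter stage mentioned above can be illustrated with a Fisher score computed per gene. This is a minimal sketch that assumes one common definition of the score (class-size-weighted squared distance of class means from the overall mean, divided by the class-size-weighted within-class variance); the abstract does not state which variant the authors used, and the genetic algorithm itself is not reproduced. Function names and toy values are hypothetical.

```python
def fisher_score(values, labels):
    """Fisher score of one gene: between-class scatter of the class means
    divided by the size-weighted within-class variance."""
    overall_mean = sum(values) / len(values)
    between = 0.0
    within = 0.0
    for cls in set(labels):
        group = [x for x, y in zip(values, labels) if y == cls]
        mean = sum(group) / len(group)
        variance = sum((x - mean) ** 2 for x in group) / len(group)
        between += len(group) * (mean - overall_mean) ** 2
        within += len(group) * variance
    # A gene that is constant within every class gets an infinite score.
    return between / within if within > 0 else float("inf")


# Hypothetical toy data: one list of expression values per gene.
toy_labels = [0, 0, 1, 1]
gene_a = [1.0, 2.0, 8.0, 9.0]  # class means far apart, small spread
gene_b = [1.0, 8.0, 2.0, 9.0]  # class means close, large spread

score_a = fisher_score(gene_a, toy_labels)
score_b = fisher_score(gene_b, toy_labels)
```

Genes would then be ranked by score and only the top-ranked ones handed to the evolutionary search, which is how a filter reduces the dimensionality of the feature space before a wrapper-style method runs.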
|
46
|
Wang A, An N, Yang J, Chen G, Li L, Alterovitz G. Wrapper-based gene selection with Markov blanket. Comput Biol Med 2017; 81:11-23. [PMID: 28006702 DOI: 10.1016/j.compbiomed.2016.12.002] [Citation(s) in RCA: 53] [Impact Index Per Article: 6.6] [Received: 09/14/2016] [Revised: 11/17/2016] [Accepted: 12/02/2016] [Indexed: 11/21/2022]
Abstract
Gene selection seeks to find a small subset of discriminant genes from the gene expression profiles. Current gene selection methods such as wrapper-based models mainly address the issue of obtaining high-quality gene subsets. However, they are considerably time consuming, due to the existence of irrelevant and redundant genes. In this study, we present an improved wrapper-based gene selection method by introducing the Markov blanket technique to reduce the required wrapper evaluation time. In addition, our method can identify targeting genes while eliminating redundant ones in an efficient way. We use ten publicly available microarray datasets to evaluate the proposed method. The results show that our method can handle gene selection effectively. Our experimental results also show that wrapper-based method combined with the Markov blanket outperforms other competing methods in terms of classification accuracy and time/space complexity.
|
47
|
Hur B, Lim S, Chae H, Seo S, Lee S, Kang J, Kim S. CLIP-GENE: a web service of the condition specific context-laid integrative analysis for gene prioritization in mouse TF knockout experiments. Biol Direct 2016; 11:57. [PMID: 27776539 DOI: 10.1186/s13062-016-0158-x] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 08/02/2016] [Accepted: 10/10/2016] [Indexed: 02/06/2023] Open
Abstract
MOTIVATION Transcriptome data from the gene knockout experiment in mouse is widely used to investigate functions of genes and relationship to phenotypes. When a gene is knocked out, it is important to identify which genes are affected by the knockout gene. Existing methods, including differentially expressed gene (DEG) methods, can be used for the analysis. However, existing methods require cutoff values to select candidate genes, which can produce either too many false positives or false negatives. This hurdle can be addressed either by improving the accuracy of gene selection or by providing a method to rank candidate genes effectively, or both. Prioritization of candidate genes should consider the goals or context of the knockout experiment. As of now, there are no tools designed for both selecting and prioritizing genes from the mouse knockout data. Hence, the necessity of a new tool arises. RESULTS In this study, we present CLIP-GENE, a web service that selects gene markers by utilizing differentially expressed genes, mouse transcription factor (TF) network, and single nucleotide variant information. Then, protein-protein interaction network and literature information are utilized to find genes that are relevant to the phenotypic differences. One of the novel features is to allow researchers to specify their contexts or hypotheses in a set of keywords to rank genes according to the contexts that the user specify. We believe that CLIP-GENE will be useful in characterizing functions of TFs in mouse experiments. AVAILABILITY http://epigenomics.snu.ac.kr/CLIP-GENE REVIEWERS: This article was reviewed by Dr. Lee and Dr. Pongor.
|
48
|
Lu CL, Su TC, Lin TC, Chung IF. Systematic identification of multiple tumor types in microarray data based on hybrid differential evolution algorithm. Technol Health Care 2016; 24 Suppl 1:S237-44. [PMID: 26684567 DOI: 10.3233/thc-151080] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 11/15/2022]
Abstract
Correct classification and prediction of tumor cells are essential for microarrays to construct a diagnostic system. Differential evolution (DE) is a powerful optimization algorithm, which has been widely used in many areas. However, the standard DE and most of its variants search in the continuous space, which cannot solve the binary optimizations directly. In this paper, the hybrid framework based on the binary DE algorithm and silhouette filter, is proposed to improve searching ability to classify breast and leukemia cancers in microarray for biomarker discovery. The study is focused to use hybrid DE algorithm for gene selection and silhouette statistics as a discriminant function to classify multiple tumor types in microarray data. Distance metrics on silhouette statistics have also been discussed for high classification accuracy. Experimental results show that the hybrid method is effective to discriminate breast and leukemia cancer subtypes and find potential biomarkers for cancer diagnosis.
Affiliation(s)
- Chun-Liang Lu
- Institute of Biomedical Informatics, National Yang-Ming University, Taipei, Taiwan; Department of Applied Information and Multimedia, Ching Kuo Institute of Management and Health, Keelung County, Taiwan
- Tsan-Cheng Su
- Department of Computer Science and Information Engineering, National Dong Hwa University, Hualien County, Taiwan
- Tsun-Chen Lin
- Department of Computer and Communication Engineering, Dahan Institute of Technology, Hualien County, Taiwan
- I-Fang Chung
- Institute of Biomedical Informatics, National Yang-Ming University, Taipei, Taiwan
|
49
|
Abstract
BACKGROUND Development of biologically relevant models from gene expression data notably, microarray data has become a topic of great interest in the field of bioinformatics and clinical genetics and oncology. Only a small number of gene expression data compared to the total number of genes explored possess a significant correlation with a certain phenotype. Gene selection enables researchers to obtain substantial insight into the genetic nature of the disease and the mechanisms responsible for it. Besides improvement of the performance of cancer classification, it can also cut down the time and cost of medical diagnoses. METHODS This study presents a modified Artificial Bee Colony Algorithm (ABC) to select minimum number of genes that are deemed to be significant for cancer along with improvement of predictive accuracy. The search equation of ABC is believed to be good at exploration but poor at exploitation. To overcome this limitation we have modified the ABC algorithm by incorporating the concept of pheromones which is one of the major components of Ant Colony Optimization (ACO) algorithm and a new operation in which successive bees communicate to share their findings. RESULTS The proposed algorithm is evaluated using a suite of ten publicly available datasets after the parameters are tuned scientifically with one of the datasets. Obtained results are compared to other works that used the same datasets. The performance of the proposed method is proved to be superior. CONCLUSION The method presented in this paper can provide subset of genes leading to more accurate classification results while the number of selected genes is smaller. Additionally, the proposed modified Artificial Bee Colony Algorithm could conceivably be applied to problems in other areas as well.
Affiliation(s)
- Rameen Shakur
- Wellcome Trust - Medical Research Council Cambridge Stem Cell Institute, University of Cambridge, Cambridge, UK
- Mohammad Kaykobad
- A ℓEDA Group, Department of CSE, BUET, Dhaka-1205, Dhaka, Bangladesh
|
50
|
Lovato P, Bicego M, Kesa M, Jojic N, Murino V, Perina A. Traveling on discrete embeddings of gene expression. Artif Intell Med 2016; 70:1-11. [PMID: 27431033 DOI: 10.1016/j.artmed.2016.05.002] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 08/24/2015] [Revised: 05/20/2016] [Accepted: 05/21/2016] [Indexed: 12/24/2022]
Abstract
OBJECTIVE High-throughput technologies have generated an unprecedented amount of high-dimensional gene expression data. Algorithmic approaches could be extremely useful for distilling information and deriving compact, interpretable representations of the statistical patterns present in the data. This paper proposes a mining approach that extracts an informative representation of gene expression profiles based on a generative model called the Counting Grid (CG). METHOD Using the CG model, gene expression values are arranged on a discrete grid, learned so that "similar" co-expression patterns lie in close proximity, which results in an intuitive visualization of the dataset. Moreover, the model makes it possible to identify the genes that distinguish between classes (e.g. different types of cancer). Finally, each sample can be characterized by a discriminative signature, extracted from the model, that can be effectively employed for classification. RESULTS A thorough evaluation on several gene expression datasets demonstrates the suitability of the proposed approach from two perspectives: numerically, we reached state-of-the-art classification accuracies on 5 of 7 datasets, with similar results when the approach was tested in a gene selection setting (with a stability always above 0.87); clinically, many of the genes highlighted by the model as significant were confirmed to also play a key role in cancer biology. CONCLUSION The proposed framework can be successfully exploited to meaningfully visualize the samples, detect medically relevant genes, and properly classify samples.
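The abstract reports a gene selection stability above 0.87 but does not say which stability index was used. As a rough illustration of what such a score measures, the sketch below computes the mean pairwise Jaccard similarity between gene subsets selected in repeated runs. This is one common choice and is an assumption here, not necessarily the index used in the paper; the function names and gene labels are hypothetical.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity of two gene sets: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty selections are treated as identical
    return len(a & b) / len(a | b)

def selection_stability(gene_subsets):
    """Mean pairwise Jaccard similarity over all pairs of selected subsets.
    A value of 1.0 means every run selected exactly the same genes."""
    pairs = list(combinations(gene_subsets, 2))
    if not pairs:
        raise ValueError("need at least two gene subsets")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

if __name__ == "__main__":
    # Hypothetical subsets selected on three resampled versions of a dataset.
    runs = [{"g1", "g2", "g3"}, {"g1", "g2", "g4"}, {"g1", "g2", "g3"}]
    print(selection_stability(runs))
```

Under this definition, a score close to 1 indicates that the selected genes barely change when the training data are perturbed.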
Affiliation(s)
- Pietro Lovato
- Department of Computer Science, University of Verona, Strada le Grazie 15, 37134 Verona, Italy
- Manuele Bicego
- Department of Computer Science, University of Verona, Strada le Grazie 15, 37134 Verona, Italy
- Maria Kesa
- Tallinn University of Technology, Ehitajate tee 5, 19086 Tallinn, Estonia
- Nebojsa Jojic
- Microsoft Research, One Microsoft Way, 98052 Redmond, WA, USA
- Vittorio Murino
- Pattern Analysis and Computer Vision (PAVIS), Istituto Italiano di Tecnologia (IIT), Via Morego 30, 16163 Genova, Italy
|