1. M S K, Rajaguru H, Nair AR. Enhancement of Classifier Performance with Adam and RanAdam Hyper-Parameter Tuning for Lung Cancer Detection from Microarray Data-In Pursuit of Precision. Bioengineering (Basel) 2024; 11:314. [PMID: 38671736 PMCID: PMC11047746 DOI: 10.3390/bioengineering11040314] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/26/2024] [Revised: 03/18/2024] [Accepted: 03/20/2024] [Indexed: 04/28/2024]
Abstract
Microarray gene expression analysis is a powerful technique used in cancer classification and research to identify and understand gene expression patterns that can differentiate between different cancer types, subtypes, and stages. However, microarray databases are highly redundant, inherently nonlinear, and noisy. Therefore, extracting meaningful information from such a huge database is a challenging one. The paper adopts the Fast Fourier Transform (FFT) and Mixture Model (MM) for dimensionality reduction and utilises the Dragonfly optimisation algorithm as the feature selection technique. The classifiers employed in this research are Nonlinear Regression, Naïve Bayes, Decision Tree, Random Forest and SVM (RBF). The classifiers' performances are analysed with and without feature selection methods. Finally, Adaptive Moment Estimation (Adam) and Random Adaptive Moment Estimation (RanAdam) hyper-parameter tuning techniques are used as improvisation techniques for classifiers. The SVM (RBF) classifier with the Fast Fourier Transform Dimensionality Reduction method and Dragonfly feature selection achieved the highest accuracy of 98.343% with RanAdam hyper-parameter tuning compared to other classifiers.
Affiliation(s)
- Karthika M S
  - Department of Information Technology, Bannari Amman Institute of Technology, Sathyamangalam 638401, India
- Harikumar Rajaguru
  - Department of Electronics and Communication Engineering, Bannari Amman Institute of Technology, Sathyamangalam 638401, India
- Ajin R. Nair
  - Department of Electronics and Communication Engineering, Bannari Amman Institute of Technology, Sathyamangalam 638401, India
2. Rakhshaninejad M, Fathian M, Shirkoohi R, Barzinpour F, Gandomi AH. Refining breast cancer biomarker discovery and drug targeting through an advanced data-driven approach. BMC Bioinformatics 2024; 25:33. [PMID: 38253993 PMCID: PMC10810249 DOI: 10.1186/s12859-024-05657-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/03/2023] [Accepted: 01/15/2024] [Indexed: 01/24/2024]
Abstract
Breast cancer remains a major public health challenge worldwide. The identification of accurate biomarkers is critical for the early detection and effective treatment of breast cancer. This study utilizes an integrative machine learning approach to analyze breast cancer gene expression data for superior biomarker and drug target discovery. Gene expression datasets, obtained from the GEO database, were merged post-preprocessing. From the merged dataset, differential expression analysis between breast cancer and normal samples revealed 164 differentially expressed genes. Meanwhile, a separate gene expression dataset revealed 350 differentially expressed genes. Additionally, the BGWO_SA_Ens algorithm, integrating binary grey wolf optimization and simulated annealing with an ensemble classifier, was employed on gene expression datasets to identify predictive genes including TOP2A, AKR1C3, EZH2, MMP1, EDNRB, S100B, and SPP1. From over 10,000 genes, BGWO_SA_Ens identified 1404 in the merged dataset (F1 score: 0.981, PR-AUC: 0.998, ROC-AUC: 0.995) and 1710 in the GSE45827 dataset (F1 score: 0.965, PR-AUC: 0.986, ROC-AUC: 0.972). The intersection of DEGs and BGWO_SA_Ens selected genes revealed 35 superior genes that were consistently significant across methods. Enrichment analyses uncovered the involvement of these superior genes in key pathways such as AMPK, Adipocytokine, and PPAR signaling. Protein-protein interaction network analysis highlighted subnetworks and central nodes. Finally, a drug-gene interaction investigation revealed connections between superior genes and anticancer drugs. Collectively, the machine learning workflow identified a robust gene signature for breast cancer, illuminated their biological roles, interactions and therapeutic associations, and underscored the potential of computational approaches in biomarker discovery and precision oncology.
Affiliation(s)
- Morteza Rakhshaninejad
  - Industrial Engineering Department, Iran University of Science and Technology, Hengam Street, Tehran, 1684613114, Tehran, Iran
- Mohammad Fathian
  - Industrial Engineering Department, Iran University of Science and Technology, Hengam Street, Tehran, 1684613114, Tehran, Iran
- Reza Shirkoohi
  - Cancer Biology Research Center, Cancer Institute, Imam Khomeini Hospital Complex, Tehran University of Medical Sciences, Keshavarz Boulevard, Tehran, 1419733141, Tehran, Iran
- Farnaz Barzinpour
  - Industrial Engineering Department, Iran University of Science and Technology, Hengam Street, Tehran, 1684613114, Tehran, Iran
- Amir H Gandomi
  - Faculty of Engineering and Information Technology, University of Technology Sydney, Ultimo, 2007, NSW, Australia
  - University Research and Innovation Center (EKIK), Óbuda University, Budapest, 1034, Hungary
3. Bhandari N, Walambe R, Kotecha K, Khare SP. A comprehensive survey on computational learning methods for analysis of gene expression data. Front Mol Biosci 2022; 9:907150. [PMID: 36458095 PMCID: PMC9706412 DOI: 10.3389/fmolb.2022.907150] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 03/29/2022] [Accepted: 09/28/2022] [Indexed: 09/19/2023]
Abstract
Computational analysis methods including machine learning have a significant impact in the fields of genomics and medicine. High-throughput gene expression analysis methods such as microarray technology and RNA sequencing produce enormous amounts of data. Traditionally, statistical methods are used for comparative analysis of gene expression data. However, more complex analysis for classification of sample observations, or discovery of feature genes requires sophisticated computational approaches. In this review, we compile various statistical and computational tools used in analysis of expression microarray data. Even though the methods are discussed in the context of expression microarrays, they can also be applied for the analysis of RNA sequencing and quantitative proteomics datasets. We discuss the types of missing values, and the methods and approaches usually employed in their imputation. We also discuss methods of data normalization, feature selection, and feature extraction. Lastly, methods of classification and class discovery along with their evaluation parameters are described in detail. We believe that this detailed review will help the users to select appropriate methods for preprocessing and analysis of their data based on the expected outcome.
Affiliation(s)
- Nikita Bhandari
  - Computer Science Department, Symbiosis Institute of Technology, Symbiosis International (Deemed University), Pune, India
- Rahee Walambe
  - Electronics and Telecommunication Department, Symbiosis Institute of Technology, Symbiosis International (Deemed University), Pune, India
  - Symbiosis Center for Applied AI (SCAAI), Symbiosis International (Deemed University), Pune, India
- Ketan Kotecha
  - Computer Science Department, Symbiosis Institute of Technology, Symbiosis International (Deemed University), Pune, India
  - Symbiosis Center for Applied AI (SCAAI), Symbiosis International (Deemed University), Pune, India
- Satyajeet P. Khare
  - Symbiosis School of Biological Sciences, Symbiosis International (Deemed University), Pune, India
4. Abdulla M, Khasawneh MT. G-Forest: An ensemble method for cost-sensitive feature selection in gene expression microarrays. Artif Intell Med 2020; 108:101941. [DOI: 10.1016/j.artmed.2020.101941] [Citation(s) in RCA: 16] [Impact Index Per Article: 3.2] [Received: 03/11/2020] [Revised: 06/27/2020] [Accepted: 08/07/2020] [Indexed: 12/27/2022]
5. Tan Q, Thomassen M, Kruse TA. Feature Selection for Predicting Tumor Metastases in Microarray Experiments using Paired Design. Cancer Inform 2017. [DOI: 10.1177/117693510700300025] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Indexed: 11/17/2022]
Abstract
Among the major issues in gene expression profile classification, feature selection is an important and necessary step in achieving and creating good classification rules given the high dimensionality of microarray data. Although different feature selection methods have been reported, there has been no method specifically proposed for paired microarray experiments. In this paper, we introduce a simple procedure based on a modified t-statistic for feature selection to microarray experiments using the popular matched case-control design and apply to our recent study on tumor metastasis in a low-malignant group of breast cancer patients for selecting genes that best predict metastases. Gene or feature selection is optimized by thresholding in a leaving one-pair out cross-validation. Model comparison through empirical application has shown that our method manifests improved efficiency with high sensitivity and specificity.
Affiliation(s)
- Qihua Tan
  - Department of Biochemistry, Pharmacology and Genetics, Odense University Hospital, Odense, Denmark
  - Department of Epidemiology, Institute of Public Health, University of Southern Denmark, Odense, Denmark
- Mads Thomassen
  - Department of Biochemistry, Pharmacology and Genetics, Odense University Hospital, Odense, Denmark
- Torben A. Kruse
  - Department of Biochemistry, Pharmacology and Genetics, Odense University Hospital, Odense, Denmark
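As a rough illustration of the kind of procedure the Tan et al. abstract describes, the Python sketch below ranks genes with the ordinary paired t-statistic and counts how often each gene passes a threshold when one matched pair is left out at a time. The abstract does not specify the paper's modified statistic, so the standard paired t is swapped in here as an assumption; the function names, the threshold value, and the toy data are all hypothetical.

```python
import math
from statistics import mean, stdev

def paired_t(case, control):
    """Standard paired t-statistic for one gene across matched pairs.
    (Assumption: stands in for the paper's unspecified modified statistic.)"""
    diffs = [c - k for c, k in zip(case, control)]
    m, sd = mean(diffs), stdev(diffs)
    if sd == 0:
        # Degenerate case: all differences identical.
        return 0.0 if m == 0 else math.copysign(math.inf, m)
    return m / (sd / math.sqrt(len(diffs)))

def select_genes(cases, controls, threshold):
    """Indices of genes with |t| >= threshold.
    cases[i][g] and controls[i][g] hold gene g's expression in pair i."""
    selected = []
    for g in range(len(cases[0])):
        t = paired_t([row[g] for row in cases], [row[g] for row in controls])
        if abs(t) >= threshold:
            selected.append(g)
    return selected

def leave_one_pair_out(cases, controls, threshold):
    """Count how often each gene is selected with one pair held out."""
    counts = {}
    for held in range(len(cases)):
        train_cases = cases[:held] + cases[held + 1:]
        train_controls = controls[:held] + controls[held + 1:]
        for g in select_genes(train_cases, train_controls, threshold):
            counts[g] = counts.get(g, 0) + 1
    return counts

# Hypothetical toy data: 4 matched pairs, 2 genes.
toy_cases = [[5.0, 1.0], [6.1, 2.0], [5.5, 1.5], [6.3, 2.5]]
toy_controls = [[1.0, 1.1], [2.0, 1.9], [1.2, 1.6], [2.1, 2.4]]

print(select_genes(toy_cases, toy_controls, threshold=3.0))
print(leave_one_pair_out(toy_cases, toy_controls, threshold=3.0))
```

In this toy example only gene 0 shows a consistent case-control shift, so it is the one expected to survive every held-out round; a gene that is selected in only some rounds would be a candidate for exclusion when tuning the threshold.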
6. Development of a two-stage gene selection method that incorporates a novel hybrid approach using the cuckoo optimization algorithm and harmony search for cancer classification. J Biomed Inform 2017; 67:11-20. [DOI: 10.1016/j.jbi.2017.01.016] [Citation(s) in RCA: 56] [Impact Index Per Article: 7.0] [Received: 04/08/2016] [Revised: 01/24/2017] [Accepted: 01/31/2017] [Indexed: 12/24/2022]
7. Kumar M, Rath NK, Rath SK. Analysis of microarray leukemia data using an efficient MapReduce-based K-nearest-neighbor classifier. J Biomed Inform 2016; 60:395-409. [DOI: 10.1016/j.jbi.2016.03.002] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Received: 09/18/2015] [Revised: 02/28/2016] [Accepted: 03/02/2016] [Indexed: 10/22/2022]
8. Bonilla-Huerta E, Hernández-Montiel A, Caporal RM, López MA. Hybrid Framework Using Multiple-Filters and an Embedded Approach for an Efficient Selection and Classification of Microarray Data. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2016; 13:12-26. [PMID: 26336138 DOI: 10.1109/tcbb.2015.2474384] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.3] [Indexed: 06/05/2023]
Abstract
A hybrid framework composed of two stages for gene selection and classification of DNA microarray data is proposed. At the first stage, five traditional statistical methods are combined for preliminary gene selection (Multiple Fusion Filter). Then, different relevant gene subsets are selected by using an embedded Genetic Algorithm (GA), Tabu Search (TS), and Support Vector Machine (SVM). A gene subset, consisting of the most relevant genes, is obtained from this process, by analyzing the frequency of each gene in the different gene subsets. Finally, the most frequent genes are evaluated by the embedded approach to obtain a final relevant small gene subset with high performance. The proposed method is tested in four DNA microarray datasets. From simulation study, it is observed that the proposed approach works better than other methods reported in the literature.
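The frequency-analysis step mentioned in the Bonilla-Huerta et al. abstract (keeping the genes that recur most often across the candidate subsets) could be sketched roughly as follows. This is only a minimal illustration of counting gene occurrences across subsets; the function name, the `min_fraction` cutoff, and the gene labels are hypothetical, and the GA/TS/SVM search that would produce the subsets is not reproduced.

```python
from collections import Counter

def frequent_genes(subsets, min_fraction=0.5):
    """Return genes that appear in at least min_fraction of the subsets.
    (Assumption: a simple fraction cutoff; the paper's exact rule is not given.)"""
    # Count each gene at most once per subset.
    counts = Counter(gene for subset in subsets for gene in set(subset))
    cutoff = min_fraction * len(subsets)
    return sorted(gene for gene, n in counts.items() if n >= cutoff)

# Hypothetical subsets, e.g. from repeated runs of a wrapper search.
runs = [["g1", "g2", "g3"], ["g1", "g3"], ["g1", "g4"], ["g3", "g1"]]
print(frequent_genes(runs))
```

The resulting short list would then be passed back to the embedded classifier for a final evaluation, as the abstract describes.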
9. Classification of microarray using MapReduce based proximal support vector machine classifier. Knowl Based Syst 2015. [DOI: 10.1016/j.knosys.2015.09.005] [Citation(s) in RCA: 45] [Impact Index Per Article: 4.5] [Indexed: 11/18/2022]
10. Kumar M, Rath NK, Swain A, Rath SK. Feature Selection and Classification of Microarray Data using MapReduce based ANOVA and K-Nearest Neighbor. Procedia Comput Sci 2015. [DOI: 10.1016/j.procs.2015.06.035] [Citation(s) in RCA: 33] [Impact Index Per Article: 3.3] [Indexed: 11/30/2022]
11. Hybrid Filter-Wrapper with a Specialized Random Multi-Parent Crossover Operator for Gene Selection and Classification Problems. ACTA ACUST UNITED AC 2012. [DOI: 10.1007/978-3-642-24553-4_60] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2]
12. Unler A, Murat A, Chinnam RB. mr2PSO: A maximum relevance minimum redundancy feature selection method based on swarm intelligence for support vector machine classification. Inf Sci (N Y) 2011. [DOI: 10.1016/j.ins.2010.05.037] [Citation(s) in RCA: 205] [Impact Index Per Article: 14.6] [Indexed: 11/24/2022]
|
13
|
|
14. Huerta EB, Duval B, Hao JK. Fuzzy logic for elimination of redundant information of microarray data. Genomics Proteomics & Bioinformatics 2009; 6:61-73. [PMID: 18973862 PMCID: PMC5054105 DOI: 10.1016/s1672-0229(08)60021-2] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.3] [Indexed: 12/17/2022]
Abstract
Gene subset selection is essential for classification and analysis of microarray data. However, gene selection is known to be a very difficult task since gene expression data not only have high dimensionalities, but also contain redundant information and noises. To cope with these difficulties, this paper introduces a fuzzy logic based pre-processing approach composed of two main steps. First, we use fuzzy inference rules to transform the gene expression levels of a given dataset into fuzzy values. Then we apply a similarity relation to these fuzzy values to define fuzzy equivalence groups, each group containing strongly similar genes. Dimension reduction is achieved by considering for each group of similar genes a single representative based on mutual information. To assess the usefulness of this approach, extensive experimentations were carried out on three well-known public datasets with a combined classification model using three statistic filters and three classifiers.
15
|
|