1
Li K, Wang Z, Zhou Y, Li S. Lung adenocarcinoma identification based on hybrid feature selections and attentional convolutional neural networks. MATHEMATICAL BIOSCIENCES AND ENGINEERING : MBE 2024; 21:2991-3015. [PMID: 38454716 DOI: 10.3934/mbe.2024133] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/09/2024]
Abstract
Lung adenocarcinoma, a chronic non-small cell lung cancer, needs to be detected early. Tumor gene expression data analysis is effective for early detection, but it is challenged by small sample sizes, high dimensionality, and noisy data. In this study, we propose a lung adenocarcinoma convolutional neural network (LATCNN), a deep learning model tailored for accurate lung adenocarcinoma prediction and identification of key genes. During the feature selection stage, we introduce a hybrid algorithm. First, the fast correlation-based filter (FCBF) algorithm filters out irrelevant features, and the k-means-synthetic minority over-sampling technique (k-means-SMOTE) is then applied to address class imbalance. Next, we enhance the particle swarm optimization (PSO) algorithm with fast-decay dynamic inertia weights and use the classification and regression tree (CART) as the fitness function for the second stage of feature selection, which further eliminates redundant features. In the classifier construction stage, we present an attention convolutional neural network (atCNN) that incorporates an attention mechanism. This improved model classifies and predicts lung adenocarcinoma from the gene expression data after feature selection. The results show that LATCNN effectively reduces the feature dimensions and accurately identifies 12 key genes, with accuracy, recall, F1 score, and MCC of 99.70%, 99.33%, 99.98%, and 98.67%, respectively. These performance metrics surpass those of the comparison models, highlighting the significance of this research for advancing lung adenocarcinoma treatment.
Affiliation(s)
- Kunpeng Li
- School of Information Engineering, Gansu University of Chinese Medicine, Lanzhou 730000, China
- Zepeng Wang
- School of Information Engineering, Gansu University of Chinese Medicine, Lanzhou 730000, China
- Yu Zhou
- School of Information Engineering, Gansu University of Chinese Medicine, Lanzhou 730000, China
- Sihai Li
- School of Information Engineering, Gansu University of Chinese Medicine, Lanzhou 730000, China
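The abstract of entry 1 mentions a PSO variant with fast-decay dynamic inertia weights but does not give the schedule. The following is only a minimal sketch under stated assumptions: the exponential decay in `decayed_inertia`, the constants `w_max`, `w_min`, `rate`, `c1`, and `c2`, and the toy `sphere` objective are all hypothetical stand-ins, and the CART-based fitness used by the authors is not reproduced here.

```python
import math
import random


def decayed_inertia(t, t_max, w_max=0.9, w_min=0.4, rate=5.0):
    """Assumed schedule: inertia falls exponentially from w_max toward w_min."""
    return w_min + (w_max - w_min) * math.exp(-rate * t / t_max)


def pso_minimize(f, dim, n_particles=20, iters=100, bounds=(-5.0, 5.0), seed=0):
    """Plain global-best PSO that minimizes f; only the inertia schedule is unusual."""
    rng = random.Random(seed)
    lo, hi = bounds
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_val = [f(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]
    c1 = c2 = 1.5  # assumed cognitive and social coefficients
    for t in range(iters):
        w = decayed_inertia(t, iters)  # large early (exploration), small later
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                # Clamp positions to the search bounds.
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            val = f(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val


def sphere(x):
    """Toy objective standing in for a classifier-based fitness."""
    return sum(v * v for v in x)


best, best_val = pso_minimize(sphere, dim=3)
```

For feature selection, the continuous positions would typically be thresholded into a feature mask and `f` replaced by a classifier's validation error; that mapping is a common convention, not something the abstract specifies.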
2
Manikandan P, Durga U, Ponnuraja C. An integrative machine learning framework for classifying SEER breast cancer. Sci Rep 2023; 13:5362. [PMID: 37005484 PMCID: PMC10067827 DOI: 10.1038/s41598-023-32029-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/23/2023] [Accepted: 03/21/2023] [Indexed: 04/04/2023] Open
Abstract
Breast cancer is the commonest type of cancer in women worldwide and the leading cause of mortality for females. The aim of this research is to classify breast cancer patients as alive or dead using the Surveillance, Epidemiology, and End Results (SEER) dataset. Because they can handle enormous data sets systematically, machine learning and deep learning have been widely employed in biomedical research to solve diverse classification problems. Pre-processing the data enables its visualization and analysis for use in making important decisions. This research presents a feasible machine learning-based approach for categorizing the SEER breast cancer dataset. A two-step feature selection method based on Variance Threshold and Principal Component Analysis was employed to select the features from the SEER breast cancer dataset. After feature selection, the breast cancer dataset is classified using supervised and ensemble learning techniques such as Ada Boosting, XG Boosting, Gradient Boosting, Naive Bayes and Decision Tree. The performance of the machine learning algorithms is examined using the train-test split and k-fold cross-validation approaches. The Decision Tree achieved 98% accuracy under both the train-test split and cross-validation. In this study, the Decision Tree algorithm is observed to outperform the other supervised and ensemble learning approaches on the SEER breast cancer dataset.
Affiliation(s)
- P Manikandan
- Department of Data Science, Loyola College, Chennai, 600 034, India
- U Durga
- Department of Data Science, Loyola College, Chennai, 600 034, India
- C Ponnuraja
- ICMR-National Institute for Research in Tuberculosis, Chennai, 600 031, India
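The two-step selection named in the abstract of entry 2 (Variance Threshold, then Principal Component Analysis) can be illustrated with a small pure-Python sketch. Everything here is an assumption for illustration: the function names, the zero threshold, the toy `rows` matrix, and the use of power iteration to obtain only the first principal component. It is not the authors' pipeline and uses no SEER data.

```python
import math


def variance_threshold(rows, threshold=0.0):
    """Return indices of columns whose population variance exceeds threshold."""
    n = len(rows)
    keep = []
    for j in range(len(rows[0])):
        col = [r[j] for r in rows]
        mean = sum(col) / n
        var = sum((v - mean) ** 2 for v in col) / n
        if var > threshold:
            keep.append(j)
    return keep


def first_principal_component(rows, iters=200):
    """Leading eigenvector of the sample covariance matrix via power iteration."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Start from the first axis; assumes it is not orthogonal to the answer.
    vec = [1.0] + [0.0] * (d - 1)
    for _ in range(iters):
        nxt = [sum(cov[a][b] * vec[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(v * v for v in nxt))
        if norm == 0.0:  # degenerate case: no variance along the current vector
            break
        vec = [v / norm for v in nxt]
    return vec


# Toy data: column 0 is constant, columns 1 and 2 are perfectly anticorrelated.
rows = [
    [1.0, 2.0, 7.0],
    [1.0, 4.0, 5.0],
    [1.0, 6.0, 3.0],
    [1.0, 8.0, 1.0],
]
kept = variance_threshold(rows)                 # step 1: drop constant columns
reduced = [[r[j] for j in kept] for r in rows]
pc1 = first_principal_component(reduced)        # step 2: main direction of variance
```

In practice a library implementation would normally be preferred; the sketch only shows the order of the two steps, with the cheap variance filter running before the covariance-based projection.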
3
Tahmouresi A, Rashedi E, Yaghoobi MM, Rezaei M. Gene selection using pyramid gravitational search algorithm. PLoS One 2022; 17:e0265351. [PMID: 35290401 PMCID: PMC8923457 DOI: 10.1371/journal.pone.0265351] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/08/2021] [Accepted: 02/28/2022] [Indexed: 11/24/2022] Open
Abstract
Genetics play a prominent role in the development and progression of malignant neoplasms. Identification of the relevant genes is a high-dimensional data processing problem. The pyramid gravitational search algorithm (PGSA), a hybrid method in which the number of genes is cyclically reduced, is proposed to overcome the curse of dimensionality. PGSA consists of two elements, a filter and a wrapper method (inspired by the gravitational search algorithm), and iterates through cycles. The genes selected in each cycle are passed on to the subsequent cycles to further reduce the dimension. PGSA tries to maximize the classification accuracy using the most informative genes while reducing the number of genes. Results are reported on a multi-class microarray gene expression dataset for breast cancer. Several feature selection algorithms were implemented for a fair comparison. PGSA ranked first in terms of accuracy (84.5%) with 73 genes. To check whether the selected genes are meaningful in terms of patient survival and response to therapy, protein-protein interaction network analysis was applied to the genes. An interesting pattern emerged when examining the genetic network: HSP90AA1, PTK2 and SRC were amongst the top-rated bottleneck genes, and DNA damage, cell adhesion and migration pathways were highly enriched in the network.
Affiliation(s)
- Esmat Rashedi
- Department of Electrical and Computer Engineering, Graduate University of Advanced Technology, Kerman, Iran
- Mohammad Mehdi Yaghoobi
- Department of Biotechnology, Institute of Science and High Technology and Environmental Sciences, Graduate University of Advanced Technology, Kerman, Iran
- Masoud Rezaei
- Faculty of Medicine, Kerman University of Medical Sciences, Kerman, Iran
4
Shaikh TA, Ali R. An automated machine learning tool for breast cancer diagnosis for healthcare professionals. Health Syst (Basingstoke) 2021; 11:303-333. [DOI: 10.1080/20476965.2021.1966324] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/20/2022] Open
Affiliation(s)
- Tawseef Ayoub Shaikh
- Department of Computer Science & Engineering, Baba Ghulam Shah Badshah University Rajouri, Rajouri, J&K, India
- Rashid Ali
- Department of Computer Engineering, Aligarh Muslim University, Aligarh, Uttar Pradesh, India
5
Chennuru VK, Timmappareddy SR. Simulated annealing based undersampling (SAUS): a hybrid multi-objective optimization method to tackle class imbalance. APPL INTELL 2021. [DOI: 10.1007/s10489-021-02369-4] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Indexed: 11/29/2022]
6
Al-Betar MA, Alomari OA, Abu-Romman SM. A TRIZ-inspired bat algorithm for gene selection in cancer classification. Genomics 2020; 112:114-126. [DOI: 10.1016/j.ygeno.2019.09.015] [Citation(s) in RCA: 21] [Impact Index Per Article: 4.2] [Received: 06/17/2019] [Revised: 09/05/2019] [Accepted: 09/17/2019] [Indexed: 10/25/2022]
7
Mabu AM, Prasad R, Yadav R. Gene Expression Dataset Classification Using Artificial Neural Network and Clustering-Based Feature Selection. INTERNATIONAL JOURNAL OF SWARM INTELLIGENCE RESEARCH 2020. [DOI: 10.4018/ijsir.2020010104] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Indexed: 11/09/2022]
Abstract
With the progression of bioinformatics, the application of gene expression (GE) profiles to cancer diagnosis and classification has become an intriguing subject in the field. GE data hold numerous genes with few samples, which makes them arduous to examine and process. This paper proposes a novel strategy for classifying GE datasets together with clustering-centered feature selection. The proposed technique first preprocesses the dataset using normalization, and feature selection is then performed with the feature clustering support vector machine (FCSVM), which has two phases, gene clustering and gene representation. To make the chosen top-ranked features suitable for classification, feature reduction is performed using the SVM-recursive feature elimination (SVM-RFE) algorithm. Finally, the feature-reduced data set is classified using an artificial neural network (ANN) classifier. When compared with some recent swarm intelligence feature reduction approaches, FCSVM-ANN showed an elegant performance.
Affiliation(s)
- Rajesh Prasad
- African University of Science and Technology, Abuja, Nigeria
8
Xu X, Liang T, Zhu J, Zheng D, Sun T. Review of classical dimensionality reduction and sample selection methods for large-scale data processing. Neurocomputing 2019. [DOI: 10.1016/j.neucom.2018.02.100] [Citation(s) in RCA: 31] [Impact Index Per Article: 5.2] [Indexed: 01/21/2023]
9
Sari M, Tuna C, Akogul S. Prediction of Tibial Rotation Pathologies Using Particle Swarm Optimization and K-Means Algorithms. J Clin Med 2018; 7:jcm7040065. [PMID: 29597270 PMCID: PMC5920439 DOI: 10.3390/jcm7040065] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 01/24/2018] [Revised: 03/22/2018] [Accepted: 03/22/2018] [Indexed: 11/18/2022] Open
Abstract
The aim of this article is to investigate pathological subjects from a population through different physical factors. To achieve this, particle swarm optimization (PSO) and K-means (KM) clustering algorithms have been combined (PSO-KM). Datasets provided by the literature were divided into three clusters based on age and weight parameters, and each of the right tibial external rotation (RTER), right tibial internal rotation (RTIR), left tibial external rotation (LTER), and left tibial internal rotation (LTIR) values was divided into three types, Type 1, Type 2 and Type 3 (Type 2 is non-pathological (normal) and the other two types are pathological (abnormal)). The rotation values of every subject in any cluster were noted. Then the algorithm was run and the produced values were also considered. The values produced by the PSO-KM algorithm have been compared with the real values. The hybrid PSO-KM algorithm has been very successful in the optimal clustering of the tibial rotation types through the physical criteria. In this investigation, Type 2 (pathological subjects) is of especially high predictability and the PSO-KM algorithm has been very successful as an operation system for clustering and optimizing the tibial motion data assessments. These research findings are expected to be very useful for health providers, such as physiotherapists and orthopedists, and may help clinicians design appropriate treatment schedules for patients.
Affiliation(s)
- Murat Sari
- Department of Mathematics, Yildiz Technical University, Istanbul 34220, Turkey
- Can Tuna
- Department of Mathematics, Yildiz Technical University, Istanbul 34220, Turkey
- Serkan Akogul
- Department of Statistics, Yildiz Technical University, Istanbul 34220, Turkey
10
Biswas S, Dutta S, Acharyya S. Identification of Disease Critical Genes Using Collective Meta-heuristic Approaches: An Application to Preeclampsia. Interdiscip Sci 2017; 11:444-459. [PMID: 29196984 DOI: 10.1007/s12539-017-0276-x] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 01/06/2017] [Revised: 11/06/2017] [Accepted: 11/21/2017] [Indexed: 10/18/2022]
Abstract
Identifying a small subset of disease critical genes out of a large body of microarray gene expression data is a challenge in computational life sciences. This paper applies four meta-heuristic algorithms, namely, honey bee mating optimization (HBMO), harmony search (HS), differential evolution (DE) and genetic algorithm (basic version GA), to find disease critical genes of preeclampsia, which affects women during gestation. Two hybrid algorithms, namely, HBMO-kNN and HS-kNN, are newly proposed here, where kNN (k nearest neighbor classifier) is used for sample classification. The performance of these new approaches has been compared with that of two other hybrid algorithms, namely, DE-kNN and SGA-kNN. Three datasets of different sizes have been used. In a dataset, the set of genes found in common in the output of each algorithm is considered here as the disease critical genes. Across the datasets, the classification accuracy of the meta-heuristic algorithms varied between 92.46 and 100%. HBMO-kNN has the best performance (99.64-100%) in almost all data sets. DE-kNN secures the second position (99.42-100%). The disease critical genes obtained here match clinically revealed preeclampsia genes to a large extent.
Affiliation(s)
- Surama Biswas
- Department of Computer Science and Engineering, Maulana Abul Kalam Azad University of Technology, West Bengal (MAKAUT, WB), BF-142, Sector-I, Salt Lake, Kolkata, West Bengal, 700064, India
- Subarna Dutta
- Department of Computer Science and Engineering, Maulana Abul Kalam Azad University of Technology, West Bengal (MAKAUT, WB), BF-142, Sector-I, Salt Lake, Kolkata, West Bengal, 700064, India
- Sriyankar Acharyya
- Department of Computer Science and Engineering, Maulana Abul Kalam Azad University of Technology, West Bengal (MAKAUT, WB), BF-142, Sector-I, Salt Lake, Kolkata, West Bengal, 700064, India
11
Zhang Y, Li T, Luo C, Zhang J, Chen H. Incremental updating of rough approximations in interval-valued information systems under attribute generalization. Inf Sci (N Y) 2016. [DOI: 10.1016/j.ins.2016.09.018] [Citation(s) in RCA: 43] [Impact Index Per Article: 4.8] [Indexed: 11/27/2022]