1
|
Arukonda S, Cheruku R. Nested genetic algorithm-based classifier selection and placement in multi-level ensemble framework for effective disease diagnosis. Comput Methods Biomech Biomed Engin 2025; 28:487-510. [PMID: 38126276 DOI: 10.1080/10255842.2023.2294264] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/20/2023] [Revised: 10/15/2023] [Accepted: 12/05/2023] [Indexed: 12/23/2023]
Abstract
Effective disease diagnosis is a critical unmet need on a global scale. The intricacies of the numerous disease mechanisms and underlying symptoms make developing a model for early diagnosis and effective treatment extremely difficult. Machine learning (ML) can help to solve some of these issues. Recently, various ensemble-based ML models have benefited clinicians in early diagnosis. However, one of the most difficult challenges in multi-level ensemble approaches is selecting the classifiers and placing them in the ensemble framework, since both choices affect overall performance. Selecting $m$ classifiers from $n$ candidates can be done in $\binom{n}{m}$ ways, and each selection can be arranged in $m!$ ways. Finding the best $m$ classifiers and their positions among the $\binom{n}{m}\,m!$ possibilities is a hard problem. To address this challenge, a dynamic three-level ensemble framework is proposed. A nested Genetic Algorithm (GA) and an ensemble-based fitness function are employed to optimize classifier selection and placement in the three-level ensemble framework. The approach started from eleven classifiers and chose seven by maximizing the fitness function. The proposed model was evaluated on 12 disease datasets. On the Chronic Kidney Disease (CKD) dataset it achieved accuracy, F1, and G-measure of 0.987, 0.988, and 0.989, respectively; on the Heart disease dataset (HDD) it achieved an AUC of 0.998; and on the Hypothyroid disease dataset (HyDD) it achieved a recall of 0.988. In addition, the superiority of the proposed model was statistically evaluated with the Wilcoxon signed-rank (WSR) test against other ensemble models, namely random forest (RF), bagging classifier (BC), XGBoost (XGB), and gradient boost classifier (GBC). At p < 0.05, all of these traditional ensemble models differ significantly from the proposed model. Effect size was evaluated using the matched-pairs rank-biserial correlation coefficient wc; it is large for RF and BC and medium for XGB and GBC. The proposed model also outperformed State-Of-The-Art (SOTA) ensemble and non-ensemble models. Further, the proposed model outperformed the others in terms of the ROC curve on the majority of the disease datasets. The results suggest the usage of the proposed model for disease diagnosis applications.
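To give a sense of scale for the search space described in this abstract, the short Python sketch below evaluates the abstract's own count, C(n, m) · m!, for the stated figures of eleven candidate classifiers and seven selected. The function name `ensemble_search_space` is a hypothetical helper introduced here for illustration; it is not taken from the paper.

```python
from math import comb, factorial, perm

def ensemble_search_space(n: int, m: int) -> int:
    """Ways to choose m of n classifiers and then order them: C(n, m) * m!."""
    return comb(n, m) * factorial(m)

# Figures stated in the abstract: 11 candidate classifiers, 7 selected.
n_candidates, n_selected = 11, 7
subsets = comb(n_candidates, n_selected)
total = ensemble_search_space(n_candidates, n_selected)

print(subsets)  # 330 distinct subsets of seven classifiers
print(total)    # 1663200 ordered placements

# C(n, m) * m! equals the number of m-permutations of n items.
assert total == perm(n_candidates, n_selected)
```

Even at this modest size, exhaustive evaluation would mean training and scoring more than 1.6 million ensembles, which is the motivation the abstract gives for a genetic-algorithm search.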
Affiliation(s)
- Srinivas Arukonda
- Department of Computer Science and Engineering, National Institute of Technology Warangal, Hanamkonda, India
| | - Ramalingaswamy Cheruku
- Department of Computer Science and Engineering, National Institute of Technology Warangal, Hanamkonda, India
|
2
|
Abousaber I, Abdallah HF, El-Ghaish H. Robust predictive framework for diabetes classification using optimized machine learning on imbalanced datasets. Front Artif Intell 2025; 7:1499530. [PMID: 39839971 PMCID: PMC11747138 DOI: 10.3389/frai.2024.1499530] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/21/2024] [Accepted: 12/12/2024] [Indexed: 01/23/2025] Open
Abstract
Introduction: Diabetes prediction using clinical datasets is crucial for medical data analysis. However, class imbalances, where non-diabetic cases dominate, can significantly affect machine learning model performance, leading to biased predictions and reduced generalization. Methods: A novel predictive framework employing cutting-edge machine learning algorithms and advanced imbalance handling techniques was developed. The framework integrates feature engineering and resampling strategies to enhance predictive accuracy. Results: Rigorous testing was conducted on three datasets (PIMA, Diabetes Dataset 2019, and BIT_2019), demonstrating the robustness and adaptability of the methodology across varying data environments. Discussion: The experimental results highlight the critical role of model selection and imbalance mitigation in achieving reliable and generalizable diabetes predictions. This study offers significant contributions to medical informatics by proposing a robust data-driven framework that addresses class imbalance challenges, thereby advancing diabetes prediction accuracy.
Affiliation(s)
- Inam Abousaber
- Department of Information Technology, Faculty of Computers and Information Technology, University of Tabuk, Tabuk, Saudi Arabia
| | - Haitham F. Abdallah
- Department of Electronics and Electrical Communication, Higher Institute of Engineering and Technology, Kafr El Sheikh, Egypt
| | - Hany El-Ghaish
- Department of Computer and Automatic Control, Faculty of Engineering, Tanta University, Tanta, Egypt
|
3
|
Cattelani L, Ghosh A, Rintala TJ, Fortino V. A Comprehensive Evaluation Framework for Benchmarking Multi-Objective Feature Selection in Omics-Based Biomarker Discovery. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2024; 21:2432-2446. [PMID: 39401114 DOI: 10.1109/tcbb.2024.3480150] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/17/2024]
Abstract
Machine learning algorithms have been extensively used for accurate classification of cancer subtypes driven by gene expression-based biomarkers. However, biomarker models combining multiple gene expression signatures are often not reproducible in external validation datasets and their feature set size is often not optimized, jeopardizing their translatability into cost-effective clinical tools. We investigated how to solve the multi-objective problem of finding the best trade-offs between classification performance and set size by applying seven algorithms for machine learning-driven feature subset selection, and analysed how they perform in a benchmark with eight large-scale transcriptome datasets of cancer, covering both training and external validation sets. The benchmark includes evaluation metrics assessing the performance of the individual biomarkers and the solution sets, according to their accuracy, diversity, and stability of the composing genes. Moreover, a new evaluation metric for cross-validation studies is proposed that generalizes the hypervolume, which is commonly used to assess the performance of multi-objective optimization algorithms. Biomarkers exhibiting a balanced accuracy of 0.8 on the external dataset for breast, kidney and ovarian cancer, using 4, 2 and 7 features respectively, were obtained. Genetic algorithms often provided better performance than the other algorithms considered, and the recently proposed NSGA2-CH and NSGA2-CHS were the best performing methods in most cases.
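The trade-off between classification performance and feature set size mentioned in this abstract can be illustrated with a small Python sketch. It filters a Pareto front and computes the standard two-objective hypervolume; it does not implement the generalized cross-validation metric the paper proposes. The candidate values, the reference point, and the names `pareto_front` and `hypervolume_2d` are assumptions made for illustration only.

```python
from typing import List, Tuple

# (balanced accuracy, number of features): maximize the first, minimize the second.
Solution = Tuple[float, int]

def pareto_front(solutions: List[Solution]) -> List[Solution]:
    """Keep solutions that no other solution beats on both objectives."""
    front = []
    for acc, size in solutions:
        dominated = any(
            a >= acc and s <= size and (a > acc or s < size)
            for a, s in solutions
        )
        if not dominated:
            front.append((acc, size))
    return sorted(set(front), key=lambda p: p[1])

def hypervolume_2d(front: List[Solution], ref_acc: float, ref_size: int) -> float:
    """Area dominated by a non-dominated front, measured from a reference
    point (worst accuracy, largest size) that every front point improves on."""
    area, prev_acc = 0.0, ref_acc
    # Sorted by size, accuracy rises along a Pareto front, so each point
    # adds one rectangular strip on top of the previous ones.
    for acc, size in sorted(front, key=lambda p: p[1]):
        area += (acc - prev_acc) * (ref_size - size)
        prev_acc = acc
    return area

# Hypothetical candidate biomarkers, not results from the paper.
candidates = [(0.70, 2), (0.80, 4), (0.80, 5), (0.78, 6), (0.85, 7)]
front = pareto_front(candidates)
print(front)
print(hypervolume_2d(front, ref_acc=0.5, ref_size=10))
```

A larger hypervolume indicates a solution set that offers better accuracy at smaller feature counts, which is why it is a common summary score for multi-objective optimizers.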
|
4
|
Wang X, Wang W, Ren H, Li X, Wen Y. Prediction and analysis of risk factors for diabetic retinopathy based on machine learning and interpretable models. Heliyon 2024; 10:e29497. [PMID: 38699007 PMCID: PMC11064081 DOI: 10.1016/j.heliyon.2024.e29497] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/24/2024] [Revised: 04/09/2024] [Accepted: 04/09/2024] [Indexed: 05/05/2024] Open
Abstract
Objective: Diabetic retinopathy is one of the major complications of diabetes. In this study, a diabetic retinopathy risk prediction model integrating machine learning models and SHAP was established to increase the accuracy of risk prediction for diabetic retinopathy, explain the rationality of the findings from model prediction and improve the reliability of prediction results. Methods: Data were preprocessed for missing values and outliers, features were selected through information gain, a diabetic retinopathy risk prediction model was established using CatBoost, and the outputs of the model were interpreted using SHAP. Results: One thousand records from the diabetes complication early warning dataset of the National Clinical Medical Sciences Data Center were used in this study. The CatBoost-based model for diabetic retinopathy prediction performed the best in the comparative model test. ALB_CR, HbA1c, UPR_24, NEPHROPATHY and SCR were positively correlated with diabetic retinopathy, while CP, HB, ALB, DBILI and CRP were negatively correlated with diabetic retinopathy. The relationships between HEIGHT, WEIGHT and ESR characteristics and diabetic retinopathy were not significant. Conclusion: The risk factors for diabetic retinopathy include poor renal function, elevated blood glucose level, liver disease, hematonosis and dysarteriotony, among others. Diabetic retinopathy can be prevented by monitoring and effectively controlling relevant indices. In this study, the relationships between the features were also analyzed to further explore the potential factors of diabetic retinopathy, which can provide new methods and new ideas for the early prevention and clinical diagnosis of diabetic retinopathy.
Affiliation(s)
- Xu Wang
- Institute of Medical Information/Medical Library, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing, China
| | - Weijie Wang
- Institute of Medical Information/Medical Library, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing, China
| | - Huiling Ren
- Institute of Medical Information/Medical Library, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing, China
| | - Xiaoying Li
- Institute of Medical Information/Medical Library, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing, China
| | - Yili Wen
- Institute of Medical Information/Medical Library, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing, China
|
5
|
Brar AS, Singh K. A multi-objective stacked regression method for distance based colour measuring device. Sci Rep 2024; 14:5530. [PMID: 38448462 PMCID: PMC10918078 DOI: 10.1038/s41598-024-54785-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/30/2023] [Accepted: 02/16/2024] [Indexed: 03/08/2024] Open
Abstract
Identifying colour from a distance is challenging due to the external noise associated with the measurement process. The present study focuses on developing a colour measuring system and a novel Multi-target Regression (MTR) model for accurate colour measurement from a distance. Herein, a novel MTR method, referred to as Multi-Objective Stacked Regression (MOSR), is proposed. The core idea behind MOSR is stacking as an ensemble approach combined with multi-objective evolutionary learning using NSGA-II. A multi-objective optimization approach is used to select base learners that maximise prediction accuracy while minimising ensemble complexity, and the method is compared with six state-of-the-art methods on the colour dataset. Classification and regression tree (CART), Random Forest (RF) and Support Vector Machine (SVM) were used as regressor algorithms. MOSR outperformed all compared methods with the highest coefficient of determination values for all three targets of the colour dataset. Rigorous comparison with state-of-the-art methods over 18 benchmark datasets showed that MOSR outperformed them on 15 datasets when CART was used as the regressor algorithm and on 11 datasets when RF and SVM were used as regressor algorithms. The MOSR method was statistically superior to the compared methods and can be effectively used to measure accurate colour values in the distance-based colour measuring device.
Affiliation(s)
- Amrinder Singh Brar
- Department of Computer Science and Engineering, Punjabi University, Patiala, 147002, India.
| | - Kawaljeet Singh
- University Computer Centre, Punjabi University, Patiala, 147002, India
|
6
|
Nilashi M, Abumalloh RA, Alyami S, Alghamdi A, Alrizq M. Parkinson’s Disease Diagnosis Using Laplacian Score, Gaussian Process Regression and Self-Organizing Maps. Brain Sci 2023; 13:brainsci13040543. [PMID: 37190508 DOI: 10.3390/brainsci13040543] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 01/28/2023] [Revised: 03/10/2023] [Accepted: 03/18/2023] [Indexed: 03/29/2023] Open
Abstract
Parkinson’s disease (PD) is a complex degenerative brain disease that affects nerve cells in the brain responsible for body movement. Machine learning is widely used to track the progression of PD in its early stages by predicting unified Parkinson’s disease rating scale (UPDRS) scores. In this paper, we aim to develop a new method for PD diagnosis with the aid of supervised and unsupervised learning techniques. Our method is developed using the Laplacian score, Gaussian process regression (GPR) and self-organizing maps (SOM). SOM is used to segment the data to handle large PD datasets. The models are then constructed using GPR for the prediction of the UPDRS scores. To select the important features in the PD dataset, we use the Laplacian score in the method. We evaluate the developed approach on a PD dataset including a set of speech signals. The method was evaluated through root-mean-square error (RMSE) and adjusted R-squared (adjusted R²). Our findings reveal that the proposed method is efficient in the prediction of UPDRS scores through a set of speech signals (dysphonia measures). The method evaluation showed that SOM combined with the Laplacian score and Gaussian process regression with the exponential kernel provides the best results for R-squared (Motor-UPDRS = 0.9489; Total-UPDRS = 0.9516) and RMSE (Motor-UPDRS = 0.5144; Total-UPDRS = 0.5105) in predicting UPDRS compared with the other kernels in Gaussian process regression.
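For readers unfamiliar with the two evaluation measures named in this abstract, the following Python sketch computes RMSE and adjusted R² from their standard definitions. The sample values and the feature count are made up for illustration and are unrelated to the UPDRS results reported above; the function names are likewise assumptions.

```python
from math import sqrt
from typing import Sequence

def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root-mean-square error: sqrt(mean((y - y_hat)^2))."""
    n = len(y_true)
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

def adjusted_r_squared(y_true: Sequence[float], y_pred: Sequence[float],
                       n_features: int) -> float:
    """R^2 penalized for the number of predictors:
    1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    n = len(y_true)
    r2 = r_squared(y_true, y_pred)
    return 1 - (1 - r2) * (n - 1) / (n - n_features - 1)

# Hypothetical targets and predictions, assuming two input features.
y_true = [1.0, 2.0, 3.0, 4.0, 5.0]
y_pred = [1.1, 1.9, 3.2, 3.8, 5.0]
print(rmse(y_true, y_pred))
print(adjusted_r_squared(y_true, y_pred, n_features=2))
```

Adjusted R² is never larger than plain R² when at least one feature is used, so it discourages adding predictors that do not improve the fit.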
|
7
|
El-Sappagh S, Alonso-Moral JM, Abuhmed T, Ali F, Bugarín-Diz A. Trustworthy artificial intelligence in Alzheimer’s disease: state of the art, opportunities, and challenges. Artif Intell Rev 2023. [DOI: 10.1007/s10462-023-10415-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/28/2023]
|
8
|
An Ensemble of Light Gradient Boosting Machine and Adaptive Boosting for Prediction of Type-2 Diabetes. INT J COMPUT INT SYS 2023. [DOI: 10.1007/s44196-023-00184-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/16/2023] Open
Abstract
Machine learning helps construct predictive models in clinical data analysis, predicting stock prices, picture recognition, financial modelling, disease prediction, and diagnostics. This paper proposes machine learning ensemble algorithms to forecast diabetes. The ensemble combines k-NN, Naive Bayes (Gaussian), Random Forest (RF), Adaboost, and a recently designed Light Gradient Boosting Machine. The proposed ensembles inherit detection ability of LightGBM to boost accuracy. Under fivefold cross-validation, the proposed ensemble models perform better than other recent models. The k-NN, Adaboost, and LightGBM jointly achieve 90.76% detection accuracy. The receiver operating curve analysis shows that k-NN, RF, and LightGBM successfully solve class imbalance issue of the underlying dataset.
|
9
|
Simaiya S, Kaur R, Sandhu JK, Alsafyani M, Alroobaea R, Alsekait DM, Margala M, Chakrabarti P. A novel multistage ensemble approach for prediction and classification of diabetes. Front Physiol 2022; 13:1085240. [PMID: 36601350 PMCID: PMC9807241 DOI: 10.3389/fphys.2022.1085240] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/31/2022] [Accepted: 11/22/2022] [Indexed: 12/23/2022] Open
Abstract
Diabetes mellitus is a metabolic syndrome affecting millions of people worldwide. Every year, the rate of occurrence rises drastically. Diabetes-related problems across several vital organs of the body can be fatal if left untreated. Diabetes must be detected early to receive proper treatment, preventing the condition from escalating to severe problems. Tremendous advancements in health sciences and biotechnology have generated massive volumes of Electronic Health Records and clinical information. The exponential increase of electronically gathered information has resulted in more complicated, accurate prediction models that can be updated continuously using machine learning techniques. This research mainly emphasizes discovering the best ensemble model for predicting diabetes. A new multistage ensemble model is proposed for diabetes prediction. The model's accuracy is evaluated on the Pima Indian Diabetes dataset. The accuracy of the proposed ensemble model is compared with existing machine learning models, and the experimental results demonstrate that the proposed model achieves higher precision, F-measure, recall, and area under the curve.
Affiliation(s)
- Sarita Simaiya
- Department of Computer Science and Engineering, Institute of Engineering and Technology, Chandigarh University, Mohali, Punjab, India; School of Computing and Informatics, University of Louisiana, Lafayette, LA, United States. Correspondence: Sarita Simaiya; Martin Margala
| | - Rajwinder Kaur
- Chitkara University Institute of Engineering and Technology, Chitkara University, Rajpura, Punjab, India
| | - Jasminder Kaur Sandhu
- Department of Computer Science and Engineering, Institute of Engineering and Technology, Chandigarh University, Mohali, Punjab, India
| | - Majed Alsafyani
- Department Science, College of Computers and Information Technology, Taif University, Taif, Saudi Arabia
| | - Roobaea Alroobaea
- Department of Computer Science, College of Computers and Information Technology, Taif University, Taif, Saudi Arabia
| | - Deema Mohammed Alsekait
- Department of Computer Science and Information Technology, Applied College, Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia
| | - Martin Margala
- School of Computing and Informatics, University of Louisiana, Lafayette, LA, United States. Correspondence: Sarita Simaiya; Martin Margala
|
10
|
Joseph LP, Joseph EA, Prasad R. Explainable diabetes classification using hybrid Bayesian-optimized TabNet architecture. Comput Biol Med 2022; 151:106178. [PMID: 36306578 DOI: 10.1016/j.compbiomed.2022.106178] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 04/13/2022] [Revised: 09/23/2022] [Accepted: 10/01/2022] [Indexed: 12/27/2022]
Abstract
Diabetes is a deadly chronic disease that occurs when the pancreas is not able to produce ample insulin or when the body cannot use insulin effectively. If undetected, it may lead to a host of health complications. Hence, accurate and explainable early-stage detection of diabetes is essential for the proper administration of treatment options in leading a healthy and productive life. For this, we developed an interpretable TabNet model tuned via Bayesian optimization (BO). To achieve model-specific interpretability, the attention mechanism of the TabNet architecture was used, which offered local and global model explanations of the influence of the attributes on the outcomes. The model was further explained locally and globally using the more robust model-agnostic LIME and SHAP eXplainable Artificial Intelligence (XAI) tools. The proposed model outperformed all benchmarked models by obtaining high accuracy of 92.2% and 99.4% using the Pima Indians diabetes dataset (PIDD) and the early-stage diabetes risk prediction dataset (ESDRPD), respectively. Based on the XAI results, the most influential attributes for diabetes classification using PIDD and ESDRPD were Insulin and Polyuria, respectively. The feature importance value registered was 0.301 for insulin (PIDD) and 0.206 for polyuria (ESDRPD). The high accuracy and ancillary interpretability of our objective model are expected to increase end-users' trust and confidence in early-stage detection of diabetes.
Affiliation(s)
- Lionel P Joseph
- School of Mathematics, Physics, and Computing, University of Southern Queensland, Springfield, QLD, 4300, Australia
| | - Erica A Joseph
- Umanand Prasad School of Medicine and Health Sciences, The University of Fiji, Saweni, Lautoka, Fiji
| | - Ramendra Prasad
- Department of Science, School of Science and Technology, The University of Fiji, Saweni, Lautoka, Fiji.
|
11
|
Predictive Analysis of Diabetes-Risk with Class Imbalance. COMPUTATIONAL INTELLIGENCE AND NEUROSCIENCE 2022; 2022:3078025. [PMID: 36268149 PMCID: PMC9578843 DOI: 10.1155/2022/3078025] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/09/2022] [Revised: 08/31/2022] [Accepted: 09/05/2022] [Indexed: 11/29/2022]
Abstract
Type 2 diabetes (T2DM) is a common chronic disease, increasingly leading to many complications and affecting vital organs. Hyperglycemia, its main characteristic, is caused by insufficient insulin secretion and poses a serious risk to human health. The objective is to construct a type-2 diabetes prediction model with high classification accuracy. Advanced machine learning and predictive modeling techniques are applied to the early diagnosis of diabetes. This paper proposes an efficient model to predict and classify the minority class of type-2 diabetes. The impact of oversampling and undersampling approaches for reducing the effect of class imbalance on classification performance is compared. Synthetic Minority Oversampling Technique (SMOTE) and Tomek-links techniques are applied and examined. The outcomes were then compared to the original unbalanced dataset using an artificial neural network (ANN) predictive model. The model is compared with other state-of-the-art classifiers such as support vector machine (SVM), random forest (RF), and decision tree (DT). The tuned model had the best accuracy of 92.2%. The experimental findings clearly show an improvement in accuracy and in evaluation metrics such as AUC and F1-measure when the SMOTE oversampling strategy is used rather than the baseline and undersampling schemes. The study recommends adopting dynamic hyperparameter optimization to further improve accuracy.
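The core idea of SMOTE-style oversampling referenced in this abstract is to create synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. The Python sketch below shows that single step under simplifying assumptions (Euclidean distance, numeric features only, a fixed seed); it is a minimal illustration, not the paper's pipeline and not the imbalanced-learn implementation. The function name `smote_like` and the sample points are hypothetical.

```python
import random
from math import dist
from typing import List, Tuple

Point = Tuple[float, ...]

def smote_like(minority: List[Point], n_new: int, k: int = 2,
               seed: int = 0) -> List[Point]:
    """Generate n_new synthetic points on segments joining a minority
    sample to one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(b + gap * (q - b) for b, q in zip(base, neighbour)))
    return synthetic

# Hypothetical minority-class samples with two numeric features.
minority = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0)]
new_points = smote_like(minority, n_new=5)
print(len(new_points))  # 5
```

Because every synthetic point lies between two real minority samples, the new data stay inside the region the minority class already occupies, unlike plain duplication or random noise.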
|
12
|
Hybrid stacked ensemble combined with genetic algorithms for diabetes prediction. IRAN JOURNAL OF COMPUTER SCIENCE 2022. [PMCID: PMC8935256 DOI: 10.1007/s42044-022-00100-1] [Citation(s) in RCA: 14] [Impact Index Per Article: 4.7] [Indexed: 12/12/2022]
Abstract
Diabetes is currently one of the most common, dangerous, and costly diseases globally, caused by increased blood sugar or a decrease in insulin in the body. Diabetes can have detrimental effects on people’s health if diagnosed late. Today, diabetes has become one of the challenges for health and government officials. Prevention is a priority, and taking care of people’s health without compromising their comfort is an essential need. In this study, an ensemble training methodology based on genetic algorithms was used to diagnose and predict the outcomes of diabetes mellitus accurately. This study uses real experimental data on Indian diabetics obtained from the University of California website. Current developments in ICT, such as the Internet of Things, machine learning, and data mining, allow us to provide health strategies with more intelligent capabilities to accurately predict the outcomes of the disease in daily life and in the hospital and to prevent the progression of this disease and its many complications. The results show the high performance of the proposed method in diagnosing the disease, reaching 98.8% and 99% accuracy in this study.
|
13
|
Morgan-Benita JA, Galván-Tejada CE, Cruz M, Galván-Tejada JI, Gamboa-Rosales H, Arceo-Olague JG, Luna-García H, Celaya-Padilla JM. Hard Voting Ensemble Approach for the Detection of Type 2 Diabetes in Mexican Population with Non-Glucose Related Features. Healthcare (Basel) 2022; 10:healthcare10081362. [PMID: 35893185 PMCID: PMC9331873 DOI: 10.3390/healthcare10081362] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/16/2022] [Revised: 07/11/2022] [Accepted: 07/15/2022] [Indexed: 11/16/2022] Open
Abstract
Type 2 diabetes mellitus (T2DM) represents one of the biggest health problems in Mexico, and it is extremely important to detect this disease and its complications early. For noninvasive detection of T2DM, a machine learning (ML) approach can be used that relies on ensemble classification models with dichotomous output and is fast and effective for early detection and prediction of T2DM. In this article, a hard-voting ensemble technique is designed and implemented using generalized linear regression (GLM), support vector machines (SVM) and artificial neural networks (ANN) for the classification of T2DM patients. As a first step, the data are balanced, standardized, imputed and fed into the three models to classify the patients with a dichotomous result. For feature selection, an implementation of LASSO with 10-fold cross-validation is developed, and for the final validation, the Area Under the Curve (AUC) is used. LASSO selected 12 features, which are used in the implemented models to obtain the best possible scenario in the developed ensemble model. The best-performing algorithm of the three is SVM, which obtained an AUC of 92% ± 3%. The ensemble model built with GLM, SVM and ANN obtained an AUC of 90% ± 3%.
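Hard voting, the combination rule named in this abstract, assigns each patient the class predicted by the majority of the base models. The Python sketch below illustrates the rule with three hypothetical prediction vectors standing in for the GLM, SVM and ANN outputs; the labels are invented for illustration and the function name `hard_vote` is an assumption.

```python
from collections import Counter
from typing import List, Sequence

def hard_vote(predictions_per_model: Sequence[Sequence[int]]) -> List[int]:
    """Majority vote per sample across models.

    predictions_per_model holds one equal-length label sequence per model.
    With three binary classifiers a tie cannot occur.
    """
    voted = []
    for labels in zip(*predictions_per_model):
        voted.append(Counter(labels).most_common(1)[0][0])
    return voted

# Hypothetical dichotomous outputs (1 = T2DM, 0 = control) for four patients.
glm_pred = [1, 0, 1, 0]
svm_pred = [1, 1, 0, 0]
ann_pred = [0, 1, 1, 0]

ensemble_pred = hard_vote([glm_pred, svm_pred, ann_pred])
print(ensemble_pred)  # [1, 1, 1, 0]
```

Hard voting uses only the predicted labels, so a strong single model can still beat the ensemble when the other voters disagree with it, which is consistent with the SVM and ensemble AUCs reported above.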
Affiliation(s)
- Jorge A. Morgan-Benita
- Unidad Académica de Ingeniería Eléctrica, Universidad Autónoma de Zacatecas, Jardín Juárez 147, Centro, Zacatecas 98000, Mexico
| | - Carlos E. Galván-Tejada
- Unidad Académica de Ingeniería Eléctrica, Universidad Autónoma de Zacatecas, Jardín Juárez 147, Centro, Zacatecas 98000, Mexico
| | - Miguel Cruz
- Unidad de Investigación Médica en Bioquímica, Hospital de Especialidades, Centro Médico Nacional Siglo XXI, Instituto Mexicano del Seguro Social, Av. Cuauhtémoc 330, Col. Doctores, Del. Cuauhtémoc, Mexico City 06720, Mexico
| | - Jorge I. Galván-Tejada
- Unidad Académica de Ingeniería Eléctrica, Universidad Autónoma de Zacatecas, Jardín Juárez 147, Centro, Zacatecas 98000, Mexico
| | - Hamurabi Gamboa-Rosales
- Unidad Académica de Ingeniería Eléctrica, Universidad Autónoma de Zacatecas, Jardín Juárez 147, Centro, Zacatecas 98000, Mexico
| | - Jose G. Arceo-Olague
- Unidad Académica de Ingeniería Eléctrica, Universidad Autónoma de Zacatecas, Jardín Juárez 147, Centro, Zacatecas 98000, Mexico
| | - Huizilopoztli Luna-García
- Unidad Académica de Ingeniería Eléctrica, Universidad Autónoma de Zacatecas, Jardín Juárez 147, Centro, Zacatecas 98000, Mexico
- Correspondence: (H.L.-G.); (J.M.C.-P.)
| | - José M. Celaya-Padilla
- Unidad Académica de Ingeniería Eléctrica, Universidad Autónoma de Zacatecas, Jardín Juárez 147, Centro, Zacatecas 98000, Mexico
- Correspondence: (H.L.-G.); (J.M.C.-P.)
|
14
|
Olisah CC, Smith L, Smith M. Diabetes mellitus prediction and diagnosis from a data preprocessing and machine learning perspective. COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE 2022; 220:106773. [PMID: 35429810 DOI: 10.1016/j.cmpb.2022.106773] [Citation(s) in RCA: 20] [Impact Index Per Article: 6.7] [Received: 08/03/2021] [Revised: 01/25/2022] [Accepted: 03/22/2022] [Indexed: 06/14/2023]
Abstract
BACKGROUND AND OBJECTIVE: Diabetes mellitus is a metabolic disorder characterized by hyperglycemia, which results from the inadequacy of the body to secrete and respond to insulin. If not properly managed or diagnosed on time, diabetes can pose a risk to vital body organs such as the eyes, kidneys, nerves, heart, and blood vessels and so can be life-threatening. The many years of research in computational diagnosis of diabetes have pointed to machine learning as a viable solution for the prediction of diabetes. However, the accuracy rate to date suggests that there is still much room for improvement. In this paper, we propose a machine learning framework for diabetes prediction and diagnosis using the PIMA Indian dataset and the laboratory of the Medical City Hospital (LMCH) diabetes dataset. We hypothesize that adopting feature selection and missing value imputation methods can scale up the performance of classification models in diabetes prediction and diagnosis. METHODS: In this paper, a robust framework for building a diabetes prediction model to aid in the clinical diagnosis of diabetes is proposed. The framework includes the adoption of Spearman correlation and polynomial regression for feature selection and missing value imputation, respectively, from a perspective that strengthens their performances. Further, different supervised machine learning models, the random forest (RF) model, support vector machine (SVM) model, and our designed twice-growth deep neural network (2GDNN) model, are proposed for classification. The models are optimized by tuning their hyperparameters using grid search and repeated stratified k-fold cross-validation and evaluated for their ability to scale to the prediction problem. RESULTS: Through experiments on the PIMA Indian and LMCH diabetes datasets, precision, sensitivity, F1-score, train-accuracy, and test-accuracy scores of 97.34%, 97.24%, 97.26%, 99.01%, and 97.25% (PIMA Indian) and 97.28%, 97.33%, 97.27%, 99.57%, and 97.33% (LMCH), respectively, are achieved with the proposed 2GDNN model. CONCLUSION: The data preprocessing approaches and the classifiers with hyperparameter optimization proposed within the machine learning framework yield a robust machine learning model that outperforms state-of-the-art results in diabetes mellitus prediction and diagnosis. The source code for the models of the proposed machine learning framework has been made publicly available.
Affiliation(s)
- Chollette C Olisah
- Centre for Machine Vision, Bristol Robotics Laboratory, University of the West of England, Bristol, UK.
- Lyndon Smith
- Centre for Machine Vision, Bristol Robotics Laboratory, University of the West of England, Bristol, UK.
- Melvyn Smith
- Centre for Machine Vision, Bristol Robotics Laboratory, University of the West of England, Bristol, UK.
|
15
|
Machine Learning Based Diabetes Classification and Prediction for Healthcare Applications. JOURNAL OF HEALTHCARE ENGINEERING 2021; 2021:9930985. [PMID: 34631003 PMCID: PMC8500744 DOI: 10.1155/2021/9930985] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 03/26/2021] [Revised: 05/17/2021] [Accepted: 08/16/2021] [Indexed: 11/17/2022]
Abstract
The remarkable advancements in biotechnology and public healthcare infrastructures have led to the production of large volumes of critical and sensitive healthcare data. By applying intelligent data analysis techniques, many interesting patterns can be identified for the early detection and prevention of several fatal diseases. Diabetes mellitus is an extremely life-threatening disease because it contributes to other lethal conditions, i.e., heart, kidney, and nerve damage. In this paper, a machine learning based approach is proposed for the classification, early-stage identification, and prediction of diabetes. Furthermore, the paper presents a hypothetical IoT-based diabetes monitoring system that lets healthy and affected people monitor their blood glucose (BG) level. For diabetes classification, three different classifiers are employed, i.e., random forest (RF), multilayer perceptron (MLP), and logistic regression (LR). For predictive analysis, long short-term memory (LSTM), moving averages (MA), and linear regression (LR) are employed. For experimental evaluation, the benchmark PIMA Indian Diabetes dataset is used. The analysis shows that MLP outperforms the other classifiers with 86.08% accuracy and that LSTM significantly improves diabetes prediction, reaching 87.26% accuracy. Moreover, a comparative analysis of the proposed approach with existing state-of-the-art techniques is performed, demonstrating the adaptability of the proposed approach to many public healthcare applications.
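Of the three predictive techniques named in this abstract, the moving average is the simplest to show concretely. The sketch below is an assumption-laden illustration, not the paper's implementation: the function name, the default window of 3, and the sample readings are hypothetical.

```python
from typing import Sequence


def moving_average_forecast(readings: Sequence[float], window: int = 3) -> float:
    """Forecast the next blood glucose reading as the mean of the last
    `window` readings (simple moving average)."""
    if window <= 0:
        raise ValueError("window must be positive")
    if len(readings) < window:
        raise ValueError("not enough readings for the chosen window")
    recent = readings[-window:]
    return sum(recent) / window


if __name__ == "__main__":
    # Invented BG readings in mg/dL, for illustration only.
    print(moving_average_forecast([100, 110, 120, 130], window=3))
```

A larger window smooths out measurement noise but reacts more slowly to a real change in glucose level, which is one reason such a baseline is usually compared against a sequence model like LSTM.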
|
16
|
A novel bio-inspired hybrid multi-filter wrapper gene selection method with ensemble classifier for microarray data. Neural Comput Appl 2021; 35:11531-11561. [PMID: 34539088 PMCID: PMC8435304 DOI: 10.1007/s00521-021-06459-9] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 10/10/2020] [Accepted: 08/26/2021] [Indexed: 01/04/2023]
Abstract
Microarray technology is known as one of the most important tools for collecting DNA expression data. This technology allows researchers to investigate and examine types of diseases and their origins. However, microarray data are often associated with a small sample size, a significant number of genes, imbalanced data, etc., making classification models inefficient. Thus, a new hybrid solution based on a multi-filter and an adaptive chaotic multi-objective forest optimization algorithm (AC-MOFOA) is presented to solve the gene selection problem and construct the Ensemble Classifier. In the proposed solution, a multi-filter model (i.e., ensemble filter) is proposed as a preprocessing step to reduce the dataset's dimensions, using a combination of five filter methods to remove redundant and irrelevant genes. The results of the five filter methods are combined using a voting-based function. The results of the proposed multi-filter indicate that it has good capability in reducing the gene subset size and selecting relevant genes. Then, an AC-MOFOA based on the concepts of non-dominated sorting, crowding distance, chaos theory, and adaptive operators is presented. As a wrapper method, AC-MOFOA aims to reduce dataset dimensions, optimize the kernel extreme learning machine (KELM), and increase classification accuracy simultaneously. Next, an ensemble classifier model is built from the AC-MOFOA results to classify microarray data. The performance of the proposed algorithm was evaluated on nine public microarray datasets, and its results were compared in terms of the number of selected genes, classification efficiency, execution time, time complexity, hypervolume indicator, and spacing metric with five hybrid multi-objective methods and three hybrid single-objective methods.
According to the results, the proposed hybrid method increased the accuracy of the KELM on most datasets by reducing the dataset's dimensions and achieved similar or superior performance compared to other multi-objective methods. Furthermore, the proposed Ensemble Classifier model provided better classification accuracy and generalizability on seven of the nine microarray datasets compared to conventional ensemble methods. Moreover, a comparison of the Ensemble Classifier model with three state-of-the-art ensemble generation methods indicates its competitive performance: the proposed ensemble model achieved better results on five of the nine datasets.
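The voting-based combination of filter outputs can be sketched as a simple vote count over the gene subsets each filter returns. The abstract does not specify the voting function, so the majority-style rule, the `min_votes` parameter, and the gene identifiers below are assumptions for illustration only.

```python
from collections import Counter
from typing import Iterable, List, Set


def vote_genes(filter_outputs: Iterable[Set[str]], min_votes: int) -> List[str]:
    """Keep genes selected by at least `min_votes` of the filter methods.

    Each element of `filter_outputs` is the gene subset chosen by one filter.
    """
    votes: Counter = Counter()
    for selected in filter_outputs:
        # A filter contributes at most one vote per gene.
        votes.update(set(selected))
    return sorted(gene for gene, count in votes.items() if count >= min_votes)


if __name__ == "__main__":
    # Five hypothetical filter outputs with invented gene names.
    outputs = [
        {"g1", "g2", "g3"},
        {"g1", "g3"},
        {"g1", "g4"},
        {"g2", "g3"},
        {"g5"},
    ]
    print(vote_genes(outputs, min_votes=3))
```

Raising `min_votes` yields a smaller, more conservative gene subset for the wrapper stage, while lowering it keeps genes that only a few filters consider relevant.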
|
17
|
Development of ensemble learning classification with density peak decomposition-based evolutionary multi-objective optimization. INT J MACH LEARN CYB 2021. [DOI: 10.1007/s13042-020-01271-8] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Indexed: 02/07/2023]
|
18
|
García-Ordás MT, Benavides C, Benítez-Andrades JA, Alaiz-Moretón H, García-Rodríguez I. Diabetes detection using deep learning techniques with oversampling and feature augmentation. COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE 2021; 202:105968. [PMID: 33631638 DOI: 10.1016/j.cmpb.2021.105968] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Received: 07/30/2020] [Accepted: 01/30/2021] [Indexed: 06/12/2023]
Abstract
BACKGROUND AND OBJECTIVE Diabetes is a chronic pathology that affects more and more people every year and causes a large number of deaths. Furthermore, many people living with the disease do not realize the seriousness of their health status early enough. Late diagnosis brings about numerous health problems, so the development of methods for the early diagnosis of this pathology is essential. METHODS In this paper, a pipeline based on deep learning techniques is proposed to identify people with diabetes. It includes data augmentation using a variational autoencoder (VAE), feature augmentation using a sparse autoencoder (SAE), and a convolutional neural network (CNN) for classification. The pipeline has been evaluated on the Pima Indians Diabetes Database, which contains patient information such as the number of pregnancies, glucose or insulin level, blood pressure, and age. RESULTS An accuracy of 92.31% was obtained when the CNN classifier was trained jointly with the SAE for feature augmentation on a well-balanced dataset. This is an increase of 3.17% in accuracy with respect to the state of the art. CONCLUSIONS Using a full deep learning pipeline for data preprocessing and classification has proven very promising in the diabetes detection field, outperforming state-of-the-art proposals.
Affiliation(s)
- María Teresa García-Ordás
- SECOMUCI Research Groups, Escuela de Ingenierías Industrial e Informática, Universidad de León, Campus de Vegazana s/n, León C.P. 24071, Spain.
- Carmen Benavides
- SALBIS Research Group, Department of Electric, Systems and Automatics Engineering, Universidad de León, Campus of Vegazana s/n, León, León, 24071, Spain.
- José Alberto Benítez-Andrades
- SALBIS Research Group, Department of Electric, Systems and Automatics Engineering, Universidad de León, Campus of Vegazana s/n, León, León, 24071, Spain.
- Héctor Alaiz-Moretón
- SECOMUCI Research Groups, Escuela de Ingenierías Industrial e Informática, Universidad de León, Campus de Vegazana s/n, León C.P. 24071, Spain.
- Isaías García-Rodríguez
- SECOMUCI Research Groups, Escuela de Ingenierías Industrial e Informática, Universidad de León, Campus de Vegazana s/n, León C.P. 24071, Spain.
|
19
|
Kanimozhi N, Singaravel G. Hybrid artificial fish particle swarm optimizer and kernel extreme learning machine for type-II diabetes predictive model. Med Biol Eng Comput 2021; 59:841-867. [PMID: 33738640 DOI: 10.1007/s11517-021-02333-x] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 06/15/2020] [Accepted: 02/03/2021] [Indexed: 10/21/2022]
Abstract
The World Health Organization (WHO) estimated that 1.6 million deaths in 2016 were caused by diabetes. Precise and on-time diagnosis of type-II diabetes is crucial to reduce the risk of various diseases such as heart disease, stroke, kidney disease, diabetic retinopathy, diabetic neuropathy, and macrovascular problems. Non-invasive methods such as machine learning are reliable and efficient in separating people at risk of type-II diabetes from healthy people. The present study aims to develop a stacking-based integrated kernel extreme learning machine (KELM) model for identifying the risk of type-II diabetes in patients based on the follow-up time on the diabetes research center dataset. The Pima Indian Diabetic Dataset (PIDD) and a Diabetic Research Center dataset are used in this study. Min-max normalization is used to preprocess the noisy datasets. The Hybrid Particle Swarm Optimization-Artificial Fish Swarm Optimization (HAFPSO) algorithm addresses the multi-objective problem by increasing the Classification Accuracy (CA) and decreasing the kernel complexity of the optimal learners (NBC) selected. Finally, the model is integrated by using the KELM as a meta-classifier that combines the predictions of the twenty Base Learners. The proposed classification method helps clinicians predict which patients are at high risk of type-II diabetes in the future, with a highest accuracy of 98.5%. The proposed method is evaluated with accuracy, sensitivity, specificity, the Matthews Correlation Coefficient, and Kappa Statistics. The results obtained show that the KELM-HAFPSO approach is a promising new tool for identifying type-II diabetes.
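The evaluation measures listed in this abstract can all be derived from a binary confusion matrix. The sketch below uses the standard textbook definitions; the function name and the example counts are hypothetical and are not results from the paper.

```python
from math import sqrt
from typing import Dict


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Compute accuracy, sensitivity, specificity, Matthews Correlation
    Coefficient, and Cohen's kappa from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0

    # MCC: correlation between predicted and actual labels.
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    # Kappa: agreement beyond what chance alone would produce.
    p_yes = ((tp + fp) / total) * ((tp + fn) / total)
    p_no = ((tn + fn) / total) * ((tn + fp) / total)
    p_chance = p_yes + p_no
    kappa = (accuracy - p_chance) / (1 - p_chance) if p_chance != 1 else 0.0

    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
        "kappa": kappa,
    }


if __name__ == "__main__":
    # Invented counts, for illustration only.
    for name, value in binary_metrics(tp=40, fp=10, tn=45, fn=5).items():
        print(f"{name}: {value:.4f}")
```

MCC and kappa are reported alongside accuracy because accuracy alone can look high on an imbalanced dataset even when the minority class is poorly predicted.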
Affiliation(s)
- N Kanimozhi
- Department of Computer Science and Engineering, GKM College of Engineering and Technology, Chennai, India.
- G Singaravel
- Department of Information Technology, K S Rangasamy College of Engineering, Tiruchengode, India.
|
20
|
Ray A, Chaudhuri AK. Smart healthcare disease diagnosis and patient management: Innovation, improvement and skill development. MACHINE LEARNING WITH APPLICATIONS 2021. [DOI: 10.1016/j.mlwa.2020.100011] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Indexed: 01/13/2023] Open
|
21
|
Asadi S, Roshan SE. A bi-objective optimization method to produce a near-optimal number of classifiers and increase diversity in Bagging. Knowl Based Syst 2021. [DOI: 10.1016/j.knosys.2020.106656] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Indexed: 10/22/2022]
|
22
|
Jadhav AS, Patil PB, Biradar S. Analysis on diagnosing diabetic retinopathy by segmenting blood vessels, optic disc and retinal abnormalities. J Med Eng Technol 2020; 44:299-316. [PMID: 32729345 DOI: 10.1080/03091902.2020.1791986] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 01/11/2023]
Abstract
The main aim of mass screening programmes for Diabetic Retinopathy (DR) is to detect and diagnose the disorder before it leads to vision loss. Automated analysis of retinal images has the potential to improve the efficacy of screening programmes compared with manual image analysis. This article develops a framework for the detection of DR from retinal fundus images using three evaluations based on the optic disc, blood vessels and retinal abnormalities. Initially, pre-processing steps such as green channel conversion and Contrast Limited Adaptive Histogram Equalisation are applied. The segmentation procedure then starts with optic disc segmentation by open-close watershed transform, followed by blood vessel segmentation by grey level thresholding and abnormality segmentation (hard exudates, haemorrhages, microaneurysms and soft exudates) by top hat transform and Gabor filtering mechanisms. From the three segmented images, features such as local binary pattern, texture energy measurement, and Shannon's and Kapur's entropy are extracted and subjected to an optimal feature selection process using a new hybrid optimisation algorithm termed the Trial-based Bypass Improved Dragonfly Algorithm (TB - DA). These features are given to a hybrid machine learning algorithm that combines NN and DBN. As a modification, the same hybrid TB - DA is used to enhance the training of the hybrid classifier, which categorises images as normal, mild, moderate or severe based on the three components.
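One of the extracted features, Shannon's entropy, can be computed directly from the grey-level distribution of a segmented region. This is a minimal sketch of the standard definition only; the function name and the flat pixel-list input are assumptions, and the article's own feature extraction may differ in detail.

```python
from collections import Counter
from math import log2
from typing import Sequence


def shannon_entropy(pixels: Sequence[int]) -> float:
    """Shannon entropy (in bits) of the grey-level distribution of `pixels`."""
    if not pixels:
        return 0.0
    counts = Counter(pixels)
    total = len(pixels)
    entropy = 0.0
    for count in counts.values():
        probability = count / total
        entropy -= probability * log2(probability)
    return entropy


if __name__ == "__main__":
    # Invented grey levels from a hypothetical segmented region.
    print(shannon_entropy([0, 0, 255, 255]))
```

A uniform region gives an entropy of zero, while a region with many equally likely grey levels gives a high value, so the feature reflects how varied the texture of the segmented area is.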
Affiliation(s)
- Ambaji S Jadhav
- Department of Electrical and Electronics, B.L.D.E.A's V.P. Dr. P.G. Halakatti College of Engineering & Technology (Affiliated to Visvesvaraya Technological University, Belagavi), Vijayapur, India
- Pushpa B Patil
- Department of Computer Science & Engineering, B.L.D.E.A's V.P. Dr. P.G. Halakatti College of Engineering & Technology (Affiliated to Visvesvaraya Technological University, Belagavi), Vijayapur, India
- Sunil Biradar
- Department of Ophthalmology, Shri B.M. Patil Medical College Hospital and Research Center, Vijayapur, India
|