1
Abdelwanis M, Moawad K, Mohammed S, Hummieda A, Syed S, Maalouf M, Jelinek HF. Sequential classification approach for enhancing the assessment of cardiac autonomic neuropathy. Comput Biol Med 2025; 190:109999. [PMID: 40112561 DOI: 10.1016/j.compbiomed.2025.109999] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/29/2024] [Revised: 03/03/2025] [Accepted: 03/04/2025] [Indexed: 03/22/2025]
Abstract
Cardiac autonomic neuropathy (CAN) is a progressive condition associated with chronic diseases like diabetes, requiring regular reviews. Current CAN diagnostic methods are often time-consuming and lack precision. This study presents a novel, two-stage classification model designed to improve CAN diagnostic efficiency. Using a dataset of 1335 patient entries, including inflammatory markers and autonomic function tests (CARTs), the model first classifies patients based on six inflammatory markers: Interleukin-6 (IL-6), C-reactive protein (CRP), Interleukin-1 beta (IL-1beta), Interleukin-10 (IL-10), Monocyte Chemoattractant Protein-1 (MCP-1), and Insulin-like growth factor-1 (IGF-1). In this initial stage, the model achieves 0.893 accuracy for 31.46% of cases in the three-class CAN model at a 0.80 threshold. For cases requiring further assessment, the second stage incorporates CARTs, improving overall accuracy to 0.933. Notably, 98.87% of cases are accurately classified using only a subset of CARTs, with just 1.12% needing all five tests. Additionally, we developed a web application that utilizes Shapley plots to visualize and explain the contribution of each marker, facilitating interpretation for clinical use. This two-stage approach underscores the diagnostic relevance of inflammatory markers, providing clinicians with a streamlined, resource-efficient tool for timely CAN diagnosis and intervention.
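The threshold-gated hand-off between the two stages described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the toy stand-in models, and the probability values are hypothetical. Only the 0.80 confidence threshold and the three-class setting come from the abstract.

```python
from typing import Callable, List, Sequence, Tuple

Probabilities = List[float]


def argmax(probs: Sequence[float]) -> int:
    """Index of the largest probability (the first one wins on ties)."""
    return max(range(len(probs)), key=lambda i: probs[i])


def two_stage_predict(
    stage1: Callable[[Sequence[float]], Probabilities],
    stage2: Callable[[Sequence[float], Sequence[float]], Probabilities],
    markers: Sequence[float],
    carts: Sequence[float],
    threshold: float = 0.80,
) -> Tuple[int, int]:
    """Return (predicted_class, stage_used).

    Stage 1 sees only the inflammatory markers. Its answer is accepted
    when the top class probability reaches the threshold; otherwise the
    case is deferred to stage 2, which also sees the CART results.
    """
    p1 = stage1(markers)
    best = argmax(p1)
    if p1[best] >= threshold:
        return best, 1
    p2 = stage2(markers, carts)
    return argmax(p2), 2


# Hypothetical stand-in models that return fixed class probabilities.
def toy_stage1(markers: Sequence[float]) -> Probabilities:
    return [0.90, 0.06, 0.04] if markers[0] < 1.0 else [0.50, 0.30, 0.20]


def toy_stage2(markers: Sequence[float], carts: Sequence[float]) -> Probabilities:
    return [0.10, 0.70, 0.20]


confident = two_stage_predict(toy_stage1, toy_stage2, [0.5], [1.0])
deferred = two_stage_predict(toy_stage1, toy_stage2, [2.0], [1.0])
print(confident, deferred)
```

The sketch models a single deferral step. The abstract also reports that most deferred cases need only a subset of the CARTs, which would correspond to repeating the same confidence check as each additional test result is added.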
Affiliation(s)
- Moustafa Abdelwanis
  - Department of Management Science and Engineering, Khalifa University, Abu Dhabi, 127788, United Arab Emirates.
- Karim Moawad
  - Department of Management Science and Engineering, Khalifa University, Abu Dhabi, 127788, United Arab Emirates.
- Shahmir Mohammed
  - Department of Electrical Engineering & Computer Science, Khalifa University, Abu Dhabi, 127788, United Arab Emirates.
- Ammar Hummieda
  - Department of Management Science and Engineering, Khalifa University, Abu Dhabi, 127788, United Arab Emirates.
- Shayaan Syed
  - Department of Management Science and Engineering, Khalifa University, Abu Dhabi, 127788, United Arab Emirates.
- Maher Maalouf
  - Department of Management Science and Engineering, Khalifa University, Abu Dhabi, 127788, United Arab Emirates.
- Herbert F Jelinek
  - Department of Medical Sciences & Biotechnology Center, Khalifa University, Abu Dhabi, 127788, United Arab Emirates; Biotechnology Center, Khalifa University, Abu Dhabi, 127788, United Arab Emirates.
2
Sambyal N, Saini P, Syal R. A Review of Statistical and Machine Learning Techniques for Microvascular Complications in Type 2 Diabetes. Curr Diabetes Rev 2021; 17:143-155. [PMID: 32389114 DOI: 10.2174/1573399816666200511003357] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 11/25/2019] [Revised: 04/16/2020] [Accepted: 04/22/2020] [Indexed: 11/22/2022]
Abstract
BACKGROUND AND INTRODUCTION Diabetes mellitus is a metabolic disorder that has emerged as a serious public health issue worldwide. According to the World Health Organization (WHO), without interventions, the number of diabetic incidences is expected to be at least 629 million by 2045. Uncontrolled diabetes gradually leads to progressive damage to eyes, heart, kidneys, blood vessels, and nerves. METHODS The paper presents a critical review of existing statistical and Artificial Intelligence (AI) based machine learning techniques with respect to DM complications, mainly retinopathy, neuropathy, and nephropathy. The statistical and machine learning analytic techniques are used to structure the subsequent content review. RESULTS It has been observed that statistical analysis can help only in inferential and descriptive analysis, whereas AI-based machine learning models can even provide actionable prediction models for faster and accurate diagnosis of complications associated with DM. CONCLUSION The integration of AI-based analytics techniques, like machine learning and deep learning in clinical medicine, will result in improved disease management through faster disease detection and cost reduction for the treatment.
Affiliation(s)
- Nitigya Sambyal
  - Department of Computer Science & Engineering, Punjab Engineering College, Sector 12, Chandigarh-160012, India
- Poonam Saini
  - Department of Computer Science & Engineering, Punjab Engineering College, Sector 12, Chandigarh-160012, India
- Rupali Syal
  - Department of Computer Science & Engineering, Punjab Engineering College, Sector 12, Chandigarh-160012, India
3
Verda D, Parodi S, Ferrari E, Muselli M. Analyzing gene expression data for pediatric and adult cancer diagnosis using logic learning machine and standard supervised methods. BMC Bioinformatics 2019; 20:390. [PMID: 31757200 PMCID: PMC6873393 DOI: 10.1186/s12859-019-2953-8] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 06/05/2019] [Accepted: 06/14/2019] [Indexed: 12/27/2022] Open
Abstract
BACKGROUND Logic Learning Machine (LLM) is an innovative method of supervised analysis capable of constructing models based on simple and intelligible rules. In this investigation the performance of LLM in classifying patients with cancer was evaluated using a set of eight publicly available gene expression databases for cancer diagnosis. LLM accuracy was assessed by summary ROC curve (sROC) analysis and estimated by the area under an sROC curve (sAUC). Its performance was compared in cross validation with that of standard supervised methods, namely: decision tree, artificial neural network, support vector machine (SVM) and k-nearest neighbor classifier. RESULTS LLM showed an excellent accuracy (sAUC = 0.99, 95%CI: 0.98–1.0) and outperformed any other method except SVM. CONCLUSIONS LLM is a new powerful tool for the analysis of gene expression data for cancer diagnosis. Simple rules generated by LLM could contribute to a better understanding of cancer biology, potentially addressing therapeutic approaches.
Affiliation(s)
- Stefano Parodi
  - Epidemiology and Biostatistics Unit, IRCCS Istituto Giannina Gaslini, Genoa, Italy
- Marco Muselli
  - Rulex Inc., Newton, MA, USA; Institute of Electronics, Computer and Telecommunication Engineering, National Research Council of Italy, Via De Marini, 6, 16149, Genoa, Italy.
4
Murphree DH, Arabmakki E, Ngufor C, Storlie CB, McCoy RG. Stacked classifiers for individualized prediction of glycemic control following initiation of metformin therapy in type 2 diabetes. Comput Biol Med 2018; 103:109-115. [PMID: 30347342 DOI: 10.1016/j.compbiomed.2018.10.017] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.4] [Received: 04/08/2018] [Revised: 10/14/2018] [Accepted: 10/15/2018] [Indexed: 01/11/2023]
Abstract
OBJECTIVE Metformin is the preferred first-line medication for management of type 2 diabetes and prediabetes. However, over a third of patients experience primary or secondary therapeutic failure. We developed machine learning models to predict which patients initially prescribed metformin will achieve and maintain control of their blood glucose after one year of therapy. MATERIALS AND METHODS We performed a retrospective analysis of administrative claims data for 12,147 commercially-insured adults and Medicare Advantage beneficiaries with prediabetes or diabetes. Several machine learning models were trained using variables available at the time of metformin initiation to predict achievement and maintenance of hemoglobin A1c (HbA1c) < 7.0% after one year of therapy. RESULTS AUC performances based on five-fold cross-validation ranged from 0.58 to 0.75. The most influential variables driving the predictions were baseline HbA1c, starting metformin dosage, and presence of diabetes with complications. CONCLUSIONS Machine learning models can effectively predict primary or secondary metformin treatment failure within one year. This information can help identify effective individualized treatment strategies. Most of the implemented models outperformed traditional logistic regression, highlighting the potential for applying machine learning to problems in medicine.
Affiliation(s)
- Dennis H Murphree
  - Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, MN 55905, USA.
- Elaheh Arabmakki
  - Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, MN 55905, USA
- Che Ngufor
  - Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, MN 55905, USA
- Curtis B Storlie
  - Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, MN 55905, USA
- Rozalina G McCoy
  - Division of Community Internal Medicine, Department of Medicine, Mayo Clinic, Rochester, MN, 55905, USA; Division of Health Care Policy & Research, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, 55905, USA; Mayo Clinic Robert D. and Patricia E. Kern Center for the Science of Health Care Delivery, Mayo Clinic, Rochester, MN 55905, USA
5
Abawajy J, Kelarev A, Yi X, Jelinek HF. Minimal ensemble based on subset selection using ECG to diagnose categories of CAN. Computer Methods and Programs in Biomedicine 2018; 160:85-94. [PMID: 29728250 DOI: 10.1016/j.cmpb.2018.01.019] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 05/22/2017] [Revised: 12/06/2017] [Accepted: 01/15/2018] [Indexed: 06/08/2023]
Abstract
BACKGROUND AND OBJECTIVE Early diagnosis of cardiac autonomic neuropathy (CAN) is critical for reversing or decreasing its progression and preventing complications. Diagnostic accuracy or precision is one of the core requirements of CAN detection. As the standard Ewing battery tests suffer from a number of shortcomings, research in automating and improving the early detection of CAN has recently received serious attention in identifying additional clinical variables and designing advanced ensembles of classifiers to improve the accuracy or precision of CAN diagnostics. Although large ensembles are commonly proposed for the automated diagnosis of CAN, large ensembles are characterized by slow processing speed and computational complexity. This paper applies ECG features and proposes a new ensemble-based approach for diagnosis of CAN progression. METHODS We introduce a Minimal Ensemble Based On Subset Selection (MEBOSS) for the diagnosis of all categories of CAN including early, definite and atypical CAN. MEBOSS is based on a novel multi-tier architecture applying classifier subset selection as well as the training subset selection during several steps of its operation. Our experiments determined the diagnostic accuracy or precision obtained in 5 × 2 cross-validation for various options employed in MEBOSS and other classification systems. RESULTS The experiments demonstrate the operation of the MEBOSS procedure invoking the most effective classifiers available in the open source software environment SageMath. The results of our experiments show that for the large DiabHealth database of CAN related parameters MEBOSS outperformed other classification systems available in SageMath and achieved 94% to 97% precision in 5 × 2 cross-validation correctly distinguishing any two CAN categories to a maximum of five categorizations including control, early, definite, severe and atypical CAN. CONCLUSIONS These results show that MEBOSS architecture is effective and can be recommended for practical implementations in systems for the diagnosis of CAN progression.
Affiliation(s)
- Jemal Abawajy
  - School of Information Technology, Deakin University, 221 Burwood Hwy, Victoria 3125, Australia.
- Andrei Kelarev
  - School of Information Technology, Deakin University, 221 Burwood Hwy, Victoria 3125, Australia; School of Science, RMIT University, GPO Box 2476, Melbourne, VIC 3001, Australia.
- Xun Yi
  - School of Science, RMIT University, GPO Box 2476, Melbourne, VIC 3001, Australia.
- Herbert F Jelinek
  - School of Community Health, Charles Sturt University, PO Box 789, Albury, NSW 2640, Australia.
6
Kelarev AV, Yi X, Cui H, Rylands L, Jelinek HF. A survey of state-of-the-art methods for securing medical databases. AIMS Medical Science 2018. [DOI: 10.3934/medsci.2018.1.1] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 11/18/2022] Open
7
A Time-Series Water Level Forecasting Model Based on Imputation and Variable Selection Method. Computational Intelligence and Neuroscience 2017; 2017:8734214. [PMID: 29250110 PMCID: PMC5700551 DOI: 10.1155/2017/8734214] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.0] [Received: 06/20/2017] [Revised: 09/18/2017] [Accepted: 09/27/2017] [Indexed: 11/26/2022]
Abstract
Reservoirs are important for households and impact the national economy. This paper proposed a time-series forecasting model based on estimating a missing value followed by variable selection to forecast the reservoir's water level. This study collected data from the Taiwan Shimen Reservoir as well as daily atmospheric data from 2008 to 2015. The two datasets are concatenated into an integrated dataset based on ordering of the data as a research dataset. The proposed time-series forecasting model summarily has three foci. First, this study uses five imputation methods to directly delete the missing value. Second, we identified the key variable via factor analysis and then deleted the unimportant variables sequentially via the variable selection method. Finally, the proposed model uses a Random Forest to build the forecasting model of the reservoir's water level. This was done to compare with the listing method under the forecasting error. These experimental results indicate that the Random Forest forecasting model when applied to variable selection with full variables has better forecasting performance than the listing model. In addition, this experiment shows that the proposed variable selection can help determine five forecast methods used here to improve the forecasting capability.
8
Kavakiotis I, Tsave O, Salifoglou A, Maglaveras N, Vlahavas I, Chouvarda I. Machine Learning and Data Mining Methods in Diabetes Research. Comput Struct Biotechnol J 2017; 15:104-116. [PMID: 28138367 PMCID: PMC5257026 DOI: 10.1016/j.csbj.2016.12.005] [Citation(s) in RCA: 358] [Impact Index Per Article: 44.8] [Received: 09/15/2016] [Revised: 12/20/2016] [Accepted: 12/27/2016] [Indexed: 12/14/2022] Open
Abstract
The remarkable advances in biotechnology and health sciences have led to a significant production of data, such as high throughput genetic data and clinical information, generated from large Electronic Health Records (EHRs). To this end, application of machine learning and data mining methods in biosciences is presently, more than ever before, vital and indispensable in efforts to transform intelligently all available information into valuable knowledge. Diabetes mellitus (DM) is defined as a group of metabolic disorders exerting significant pressure on human health worldwide. Extensive research in all aspects of diabetes (diagnosis, etiopathophysiology, therapy, etc.) has led to the generation of huge amounts of data. The aim of the present study is to conduct a systematic review of the applications of machine learning, data mining techniques and tools in the field of diabetes research with respect to a) Prediction and Diagnosis, b) Diabetic Complications, c) Genetic Background and Environment, and d) Health Care and Management, with the first category appearing to be the most popular. A wide range of machine learning algorithms were employed. In general, 85% of those used were characterized by supervised learning approaches and 15% by unsupervised ones, and more specifically, association rules. Support vector machines (SVM) arise as the most successful and widely used algorithm. Concerning the type of data, clinical datasets were mainly used. The title applications in the selected articles project the usefulness of extracting valuable knowledge leading to new hypotheses targeting deeper understanding and further investigation in DM.
Affiliation(s)
- Ioannis Kavakiotis
  - Department of Informatics, Aristotle University of Thessaloniki, Thessaloniki 54124, Greece
  - Institute of Applied Biosciences, CERTH, Thessaloniki, Greece
- Olga Tsave
  - Laboratory of Inorganic Chemistry, Department of Chemical Engineering, Aristotle University of Thessaloniki, Thessaloniki 54124, Greece
- Athanasios Salifoglou
  - Laboratory of Inorganic Chemistry, Department of Chemical Engineering, Aristotle University of Thessaloniki, Thessaloniki 54124, Greece
- Nicos Maglaveras
  - Institute of Applied Biosciences, CERTH, Thessaloniki, Greece
  - Lab of Computing and Medical Informatics, Medical School, Aristotle University of Thessaloniki, Thessaloniki 54124, Greece
- Ioannis Vlahavas
  - Department of Informatics, Aristotle University of Thessaloniki, Thessaloniki 54124, Greece
- Ioanna Chouvarda
  - Institute of Applied Biosciences, CERTH, Thessaloniki, Greece
  - Lab of Computing and Medical Informatics, Medical School, Aristotle University of Thessaloniki, Thessaloniki 54124, Greece
9
Data analytics identify glycated haemoglobin co-markers for type 2 diabetes mellitus diagnosis. Comput Biol Med 2016; 75:90-97. [PMID: 27268735 DOI: 10.1016/j.compbiomed.2016.05.005] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.8] [Received: 11/22/2015] [Revised: 05/10/2016] [Accepted: 05/12/2016] [Indexed: 01/28/2023]
Abstract
Glycated haemoglobin (HbA1c) is being more commonly used as an alternative test for the identification of type 2 diabetes mellitus (T2DM) or to add to fasting blood glucose level and oral glucose tolerance test results, because it is easily obtained using point-of-care technology and represents long-term blood sugar levels. HbA1c cut-off values of 6.5% or above have been recommended for clinical use based on the presence of diabetic comorbidities from population studies. However, outcomes of large trials with a HbA1c of 6.5% as a cut-off have been inconsistent for a diagnosis of T2DM. This suggests that a HbA1c cut-off of 6.5% as a single marker may not be sensitive enough or be too simple and miss individuals at risk or with already overt, undiagnosed diabetes. In this study, data mining algorithms have been applied on a large clinical dataset to identify an optimal cut-off value for HbA1c and to identify whether additional biomarkers can be used together with HbA1c to enhance diagnostic accuracy of T2DM. T2DM classification accuracy increased if 8-hydroxy-2-deoxyguanosine (8-OhdG), an oxidative stress marker, was included in the algorithm from 78.71% for HbA1c at 6.5% to 86.64%. A similar result was obtained when interleukin-6 (IL-6) was included (accuracy=85.63%) but with a lower optimal HbA1c range between 5.73 and 6.22%. The application of data analytics to medical records from the Diabetes Screening programme demonstrates that data analytics, combined with large clinical datasets can be used to identify clinically appropriate cut-off values and identify novel biomarkers that when included improve the accuracy of T2DM diagnosis even when HbA1c levels are below or equal to the current cut-off of 6.5%.
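The idea of searching for an accuracy-maximizing cut-off on a single marker can be illustrated with a plain grid search. The abstract does not say which data mining algorithms were used, so this is a swapped-in technique for illustration only. The function names are hypothetical and the six data points are synthetic, not taken from the study.

```python
from typing import Sequence, Tuple


def accuracy_at_cutoff(
    values: Sequence[float], labels: Sequence[int], cutoff: float
) -> float:
    """Fraction of cases classified correctly when value >= cutoff means positive."""
    correct = sum((v >= cutoff) == bool(y) for v, y in zip(values, labels))
    return correct / len(values)


def best_cutoff(
    values: Sequence[float], labels: Sequence[int], candidates: Sequence[float]
) -> Tuple[float, float]:
    """Return (cutoff, accuracy) for the candidate with the highest accuracy.

    On ties the earliest candidate in the list is kept.
    """
    scored = [(accuracy_at_cutoff(values, labels, c), c) for c in candidates]
    best_acc, best_c = max(scored, key=lambda pair: pair[0])
    return best_c, best_acc


# Synthetic illustration data (NOT from the study): marker values in %
# and a 0/1 diagnosis label for six hypothetical patients.
hba1c = [5.0, 5.5, 6.0, 6.4, 6.6, 7.0]
has_t2dm = [0, 0, 1, 1, 1, 1]

result = best_cutoff(hba1c, has_t2dm, candidates=[5.8, 6.5])
print(result)
```

In this synthetic data a 6.5 cut-off misses the two positive cases at 6.0 and 6.4, which mirrors the abstract's concern that a single 6.5% threshold may not be sensitive enough. Adding a second marker would turn the one-dimensional search into a search over pairs of thresholds or a fitted classifier.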
10
Missing data imputation on the 5-year survival prediction of breast cancer patients with unknown discrete values. Comput Biol Med 2015; 59:125-133. [PMID: 25725446 DOI: 10.1016/j.compbiomed.2015.02.006] [Citation(s) in RCA: 67] [Impact Index Per Article: 6.7] [Received: 10/20/2014] [Revised: 02/07/2015] [Accepted: 02/09/2015] [Indexed: 11/22/2022]
Abstract
Breast cancer is the most frequently diagnosed cancer in women. Using historical patient information stored in clinical datasets, data mining and machine learning approaches can be applied to predict the survival of breast cancer patients. A common drawback is the absence of information, i.e., missing data, in certain clinical trials. However, most standard prediction methods are not able to handle incomplete samples and, then, missing data imputation is a widely applied approach for solving this inconvenience. Therefore, and taking into account the characteristics of each breast cancer dataset, it is required to perform a detailed analysis to determine the most appropriate imputation and prediction methods in each clinical environment. This research work analyzes a real breast cancer dataset from Institute Portuguese of Oncology of Porto with a high percentage of unknown categorical information (most clinical data of the patients are incomplete), which is a challenge in terms of complexity. Four scenarios are evaluated: (I) 5-year survival prediction without imputation and 5-year survival prediction from cleaned dataset with (II) Mode imputation, (III) Expectation-Maximization imputation and (IV) K-Nearest Neighbors imputation. Prediction models for breast cancer survivability are constructed using four different methods: K-Nearest Neighbors, Classification Trees, Logistic Regression and Support Vector Machines. Experiments are performed in a nested ten-fold cross-validation procedure and, according to the obtained results, the best results are provided by the K-Nearest Neighbors algorithm: more than 81% of accuracy and more than 0.78 of area under the Receiver Operator Characteristic curve, which constitutes very good results in this complex scenario.
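Scenario (II) combined with the K-Nearest Neighbors predictor can be sketched in a few lines: fill each missing categorical value with its column mode, then classify by majority vote among the nearest training rows. This is a minimal sketch under stated assumptions, not the study's pipeline. The Hamming distance, the function names, and the toy records are assumptions made for illustration, and the nested cross-validation is omitted.

```python
from collections import Counter
from typing import List, Optional, Sequence

Row = List[Optional[str]]


def mode_impute(rows: Sequence[Row]) -> List[List[str]]:
    """Replace each None with the most frequent observed value in its column.

    Assumes every column has at least one observed (non-None) value.
    """
    n_cols = len(rows[0])
    modes = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        modes.append(Counter(observed).most_common(1)[0][0])
    return [
        [modes[j] if r[j] is None else r[j] for j in range(n_cols)]
        for r in rows
    ]


def hamming(a: Sequence[str], b: Sequence[str]) -> int:
    """Number of positions where two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))


def knn_predict(
    train_x: Sequence[Sequence[str]],
    train_y: Sequence[int],
    query: Sequence[str],
    k: int = 3,
) -> int:
    """Majority label among the k training rows closest to the query."""
    order = sorted(range(len(train_x)), key=lambda i: hamming(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Toy categorical records with missing entries (hypothetical, not study data).
raw = [["a", "x"], ["a", None], [None, "y"], ["b", "y"]]
survived = [1, 1, 0, 0]

filled = mode_impute(raw)
prediction = knn_predict(filled, survived, ["a", "x"], k=3)
print(filled, prediction)
```

Mode imputation is the simplest of the three strategies the abstract compares. The Expectation-Maximization and K-Nearest Neighbors imputation scenarios would replace `mode_impute` while leaving the prediction step unchanged.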