1
|
A Novel Framework for Generating Personalized Network Datasets for NIDS Based on Traffic Aggregation. SENSORS 2022; 22:s22051847. [PMID: 35270994 PMCID: PMC8914796 DOI: 10.3390/s22051847] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 12/30/2021] [Revised: 01/27/2022] [Accepted: 02/06/2022] [Indexed: 12/02/2022]
Abstract
In this paper, we addressed the problem of dataset scarcity for the task of network intrusion detection. Our main contribution was to develop a framework that provides a complete process for generating network traffic datasets based on the aggregation of real network traces. In addition, we proposed a set of tools for attribute extraction and labeling of traffic sessions. A new dataset with botnet network traffic was generated by the framework to assess our proposed method with machine learning algorithms suitable for unbalanced data. The performance of the classifiers was evaluated in terms of the macro-averaged F1-score (0.97) and the Matthews Correlation Coefficient (0.94), indicating good overall performance.
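The two metrics named in this abstract can be illustrated with a minimal sketch. The labels below are toy values chosen for illustration, not data from the paper, and the function names are hypothetical helpers rather than anything the authors describe.

```python
from math import sqrt


def confusion_counts(y_true, y_pred, positive):
    """Return (tp, fp, fn, tn), treating `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    return tp, fp, fn, tn


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp, fp, fn, _ = confusion_counts(y_true, y_pred, label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def mcc_binary(y_true, y_pred, positive=1):
    """Matthews Correlation Coefficient for a two-class problem."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Toy, unbalanced labels (assumption: 1 = attack session, 0 = benign session).
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]

print(macro_f1(y_true, y_pred))
print(mcc_binary(y_true, y_pred))
```

Both metrics weigh the minority class as heavily as the majority class, which is presumably why they suit unbalanced traffic data better than plain accuracy.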
|
2
|
Dekamin A, Wahab MIM, Guergachi A, Keshavjee K. FIUS: Fixed partitioning undersampling method. Clin Chim Acta 2021; 522:174-183. [PMID: 34425104 DOI: 10.1016/j.cca.2021.08.023] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/05/2021] [Revised: 07/30/2021] [Accepted: 08/18/2021] [Indexed: 11/28/2022]
Abstract
BACKGROUND AND OBJECTIVE: In the medical field, data techniques for prediction and for finding patterns in prevalent diseases are of increasing interest. Classification is one of the methods used to predict the future onset of type 2 diabetes in those at high risk of progression from pre-diabetes to diabetes. When classification techniques are applied to real-world datasets, imbalanced class distribution is one of the most significant limitations and leads to misclassification of patients. In this paper, we propose a novel balancing method to improve the prediction performance of type 2 diabetes mellitus in imbalanced electronic medical records (EMR). METHODS: A novel undersampling method is proposed that uses a fixed partitioning distribution scheme in a regular grid. The proposed approach retains valuable information when the dataset is balanced. RESULTS: The logistic regression (LR) classifier achieved the best AUC (80%) among the classifiers compared when the EMR data were balanced with the proposed undersampling method. The new method improved the performance of the LR classifier compared to existing undersampling methods used in the balancing stage. CONCLUSION: The results demonstrate the effectiveness and high performance of the proposed method for predicting diabetes in a Canadian imbalanced dataset. Our methodology can be used in other areas to overcome the limitations of imbalanced class distributions.
Affiliation(s)
- Azam Dekamin
- Department of Mechanical and Industrial Engineering, Ryerson University, 350 Victoria Street, Toronto, ON M5B 2K3, Canada
- M I M Wahab
- Department of Mechanical and Industrial Engineering, Ryerson University, 350 Victoria Street, Toronto, ON M5B 2K3, Canada
- Aziz Guergachi
- Ted Rogers School of Information Technology Management, Ryerson University, 350 Victoria Street, Toronto, ON M5B 2K3, Canada
- Karim Keshavjee
- Institute of Health Policy, Management and Evaluation, University of Toronto, Toronto, ON M5T 3M6, Canada
|
3
|
An automatic sampling ratio detection method based on genetic algorithm for imbalanced data classification. Knowl Based Syst 2021. [DOI: 10.1016/j.knosys.2021.106800] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Indexed: 11/18/2022]
|
4
|
Jeong B, Cho H, Kim J, Kwon SK, Hong S, Lee C, Kim T, Park MS, Hong S, Heo TY. Comparison between Statistical Models and Machine Learning Methods on Classification for Highly Imbalanced Multiclass Kidney Data. Diagnostics (Basel) 2020; 10:E415. [PMID: 32570782 PMCID: PMC7345590 DOI: 10.3390/diagnostics10060415] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Received: 04/01/2020] [Revised: 06/03/2020] [Accepted: 06/16/2020] [Indexed: 11/16/2022] Open
Abstract
This study aims to compare the classification performance of statistical models on highly imbalanced kidney data. The health examination cohort database provided by the National Health Insurance Service in Korea is utilized to build models with various machine learning methods. The glomerular filtration rate (GFR) is used to diagnose chronic kidney disease (CKD). It is calculated using the Modification of Diet in Renal Disease method and classified into five stages (1, 2, 3A and 3B, 4, and 5). Different CKD stages based on the estimated GFR are considered as six classes of the response variable. This study utilizes two representative generalized linear models for classification, namely, multinomial logistic regression (multinomial LR) and ordinal logistic regression (ordinal LR), as well as two machine learning models, namely, random forest (RF) and autoencoder (AE). The classification performance of the four models is compared in terms of accuracy, sensitivity, specificity, precision, and F1-measure. To find the best model that classifies CKD stages correctly, the data are divided into a 10-fold dataset with the same rate for each CKD stage. Results indicate that RF and AE show better performance in accuracy than the multinomial and ordinal LR models when classifying the response variable. However, when a highly imbalanced dataset is modeled, accuracy can misrepresent the actual performance. This occurs because accuracy remains high even if a statistical model classifies a minority class into a majority class. To solve this problem in performance interpretation, we consider not only accuracy from the confusion matrix but also sensitivity, specificity, precision, and F1-measure for each class. To present classification performance with a single value for each model, we calculate the macro-average and micro-weighted values for each model. We conclude that AE is the best model classifying CKD stages correctly for all performance indices.
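The abstract's point that accuracy stays high even when a minority class is absorbed into the majority class can be seen in a small sketch. The labels are synthetic (a 95/5 split chosen for illustration), not the study's kidney data, and the helper names are hypothetical.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_recall(y_true, y_pred):
    """Sensitivity (recall) computed separately for every true class."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / support[label] for label in support}


def macro_recall(y_true, y_pred):
    """Unweighted mean of per-class recall, so each class counts equally."""
    recalls = per_class_recall(y_true, y_pred)
    return sum(recalls.values()) / len(recalls)


# Synthetic labels: a classifier that always predicts the majority class.
y_true = ["majority"] * 95 + ["minority"] * 5
y_pred = ["majority"] * 100

print(accuracy(y_true, y_pred))          # high, despite missing every minority case
print(per_class_recall(y_true, y_pred))  # minority recall is zero
print(macro_recall(y_true, y_pred))      # macro-average exposes the failure
```

Here accuracy is 0.95 while macro-averaged recall is 0.5, which is the kind of gap that motivates reporting per-class and macro-averaged indices alongside accuracy.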
Affiliation(s)
- Bomi Jeong
- Department of Information & Statistics, Chungbuk National University, Chungbuk 28644, Korea
- Hyunjeong Cho
- Department of Internal Medicine, Chungbuk National University College of Medicine, Chungbuk 28644, Korea
- Department of Internal Medicine, Chungbuk National University Hospital, Chungbuk 28644, Korea
- Jieun Kim
- Department of Information & Statistics, Chungbuk National University, Chungbuk 28644, Korea
- Soon Kil Kwon
- Department of Internal Medicine, Chungbuk National University College of Medicine, Chungbuk 28644, Korea
- Department of Internal Medicine, Chungbuk National University Hospital, Chungbuk 28644, Korea
- SeungWoo Hong
- Intelligent Network Research Section, Electronics and Telecommunications Research Institute, 218 Gajeong-ro, Yuseong-gu, Daejeon 34129, Korea
- ChangSik Lee
- Intelligent Network Research Section, Electronics and Telecommunications Research Institute, 218 Gajeong-ro, Yuseong-gu, Daejeon 34129, Korea
- TaeYeon Kim
- Intelligent Network Research Section, Electronics and Telecommunications Research Institute, 218 Gajeong-ro, Yuseong-gu, Daejeon 34129, Korea
- Man Sik Park
- Department of Statistics, Sungshin Women’s University, Seoul 02844, Korea
- Seoksu Hong
- Department of Information & Statistics, Chungbuk National University, Chungbuk 28644, Korea
- Tae-Young Heo
- Department of Information & Statistics, Chungbuk National University, Chungbuk 28644, Korea
|
5
|
Mohammed A, Podila PSB, Davis RL, Ataga KI, Hankins JS, Kamaleswaran R. Using Machine Learning to Predict Early Onset Acute Organ Failure in Critically Ill Intensive Care Unit Patients With Sickle Cell Disease: Retrospective Study. J Med Internet Res 2020; 22:e14693. [PMID: 32401216 PMCID: PMC7254279 DOI: 10.2196/14693] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 05/13/2019] [Revised: 08/18/2019] [Accepted: 01/28/2020] [Indexed: 12/22/2022] Open
Abstract
Background: Sickle cell disease (SCD) is a genetic disorder of the red blood cells, resulting in multiple acute and chronic complications, including pain episodes, stroke, and kidney disease. Patients with SCD develop chronic organ dysfunction, which may progress to organ failure during disease exacerbations. Early detection of acute physiological deterioration leading to organ failure is not always attainable. Machine learning techniques that allow for prediction of organ failure may enable early identification and treatment and potentially reduce mortality. Objective: The aim of this study was to test the hypothesis that machine learning physiomarkers can predict the development of organ dysfunction in a sample of adult patients with SCD admitted to intensive care units (ICUs). Methods: We applied diverse machine learning methods, statistical methods, and data visualization techniques to develop classification models to distinguish SCD from controls. Results: We studied 63 sequential SCD patients admitted to ICUs with 163 patient encounters (mean age 30.7 years, SD 9.8 years). A subset of these patient encounters, 22.7% (37/163), met the sequential organ failure assessment criteria. The other 126 SCD patient encounters served as controls. A set of signal processing features (such as fast Fourier transform, energy, and continuous wavelet transform) derived from heart rate, blood pressure, and respiratory rate was identified to distinguish patients with SCD who developed acute physiological deterioration leading to organ failure from patients with SCD who did not meet the criteria. A multilayer perceptron model accurately predicted organ failure up to 6 hours before onset, with an average sensitivity and specificity of 96% and 98%, respectively. Conclusions: This retrospective study demonstrated the viability of using machine learning to predict acute organ failure among hospitalized adults with SCD. The discovery of salient physiomarkers through machine learning techniques has the potential to further accelerate the development and implementation of innovative care delivery protocols and strategies for medically vulnerable patients.
Affiliation(s)
- Akram Mohammed
- Center for Biomedical Informatics, University of Tennessee Health Science Center, Memphis, TN, United States
- Pradeep S B Podila
- Faith and Health Division, Methodist Le Bonheur Healthcare, Memphis, TN, United States
- Robert L Davis
- Center for Biomedical Informatics, University of Tennessee Health Science Center, Memphis, TN, United States
- Kenneth I Ataga
- Center for Sickle Cell Disease, University of Tennessee Health Science Center, Memphis, TN, United States
- Jane S Hankins
- Department of Hematology, St Jude Children's Research Hospital, Memphis, TN, United States
- Rishikesan Kamaleswaran
- Department of Biomedical Informatics, Emory University School of Medicine, Atlanta, GA, United States
|
6
|
Zheng M, Li T, Zhu R, Tang Y, Tang M, Lin L, Ma Z. Conditional Wasserstein generative adversarial network-gradient penalty-based approach to alleviating imbalanced data classification. Inf Sci (N Y) 2020. [DOI: 10.1016/j.ins.2019.10.014] [Citation(s) in RCA: 34] [Impact Index Per Article: 6.8] [Indexed: 10/25/2022]
|
7
|
|
8
|
Prediction of Defective Software Modules Using Class Imbalance Learning. APPLIED COMPUTATIONAL INTELLIGENCE AND SOFT COMPUTING 2016. [DOI: 10.1155/2016/7658207] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.6] [Indexed: 11/17/2022] Open
Abstract
Software defect predictors are useful to maintain the high quality of software products effectively. The early prediction of defective software modules can help software developers allocate the available resources to deliver high-quality software products. The objective of a software defect prediction system is to find as many defective software modules as possible without affecting the overall performance. The learning process of a software defect predictor is difficult due to the imbalanced distribution of software modules between the defective and nondefective classes. Misclassifying a defective software module generally incurs a much higher cost than misclassifying a nondefective one. Therefore, considering the misclassification cost issue, we have developed a software defect prediction system using the Weighted Least Squares Twin Support Vector Machine (WLSTSVM). This system assigns a higher misclassification cost to the data samples of the defective class and a lower cost to the data samples of the nondefective class. The experiments on eight software defect prediction datasets have proved the validity of the proposed defect prediction system. The significance of the results has been tested via statistical analysis performed using the nonparametric Wilcoxon signed rank test.
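The idea of charging more for misclassifying defective modules can be sketched with simple per-class weights. This is not the WLSTSVM formulation from the paper; the inverse-frequency weighting rule, the function names, and the toy labels are all assumptions made for illustration, and the sketch assumes the defective class is the minority.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n / (k * count), so rarer classes cost more."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {label: n / (k * count) for label, count in counts.items()}


def weighted_error(y_true, y_pred, weights):
    """Share of total misclassification cost incurred by the predictions."""
    total = sum(weights[t] for t in y_true)
    wrong = sum(weights[t] for t, p in zip(y_true, y_pred) if t != p)
    return wrong / total


# Toy module labels (assumption: 10 defective, 90 nondefective).
y_true = ["defective"] * 10 + ["clean"] * 90
y_pred = ["clean"] * 100  # a predictor that never flags a defect

weights = inverse_frequency_weights(y_true)
plain_error = sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)

print(weights)
print(plain_error)                              # looks small without weighting
print(weighted_error(y_true, y_pred, weights))  # much larger once costs differ
```

With these toy values, missing every defective module costs only 10% in plain error but half of the total weighted cost, which is the asymmetry a cost-sensitive learner is meant to exploit.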
|
9
|
Yu H, Sun C, Yang X, Yang W, Shen J, Qi Y. ODOC-ELM: Optimal decision outputs compensation-based extreme learning machine for classifying imbalanced data. Knowl Based Syst 2016. [DOI: 10.1016/j.knosys.2015.10.012] [Citation(s) in RCA: 29] [Impact Index Per Article: 3.2] [Indexed: 10/22/2022]
|
10
|
Tomar D, Agarwal S. An effective Weighted Multi-class Least Squares Twin Support Vector Machine for Imbalanced data classification. INT J COMPUT INT SYS 2015. [DOI: 10.1080/18756891.2015.1061395] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.1] [Indexed: 10/23/2022] Open
|
11
|
López V, Fernández A, García S, Palade V, Herrera F. An insight into classification with imbalanced data: Empirical results and current trends on using data intrinsic characteristics. Inf Sci (N Y) 2013. [DOI: 10.1016/j.ins.2013.07.007] [Citation(s) in RCA: 932] [Impact Index Per Article: 77.7] [Indexed: 11/15/2022]
|
12
|
Mena LJ, Felix VG, Ostos R, Gonzalez JA, Cervantes A, Ochoa A, Ruiz C, Ramos R, Maestre GE. Mobile personal health system for ambulatory blood pressure monitoring. COMPUTATIONAL AND MATHEMATICAL METHODS IN MEDICINE 2013; 2013:598196. [PMID: 23762189 PMCID: PMC3665224 DOI: 10.1155/2013/598196] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.5] [Received: 11/10/2012] [Accepted: 04/12/2013] [Indexed: 01/12/2023]
Abstract
The ARVmobile v1.0 is a multiplatform mobile personal health monitor (PHM) application for ambulatory blood pressure (ABP) monitoring that has the potential to aid in the acquisition and analysis of a detailed profile of ABP and heart rate (HR), improve the early detection and intervention of hypertension, and detect potential abnormal BP and HR levels for timely medical feedback. The PHM system consisted of an ABP sensor to detect BP and HR signals and a smartphone as receiver to collect the transmitted digital data and process them to provide immediate personalized information to the user. Versions for the Android and Blackberry platforms were developed to detect and alert users to potential abnormal values, offer a friendly graphical user interface for elderly people, and provide feedback to professional healthcare providers via e-mail. ABP data were obtained from twenty-one healthy individuals (>51 years) to test the utility of the PHM application. The ARVmobile v1.0 was able to reliably receive and process the ABP readings from the volunteers. The preliminary results demonstrate that the ARVmobile v1.0 application could be used to obtain a detailed profile of ABP and HR in an ordinary daily life environment, besides estimating potential diagnostic thresholds of abnormal BP variability measured as average real variability.
Affiliation(s)
- Luis J Mena
- Department of Computer Engineering, Polytechnic University of Sinaloa, 82199 Mazatlan, SIN, Mexico.
|
13
|
Machine learning approach to extract diagnostic and prognostic thresholds: application in prognosis of cardiovascular mortality. COMPUTATIONAL AND MATHEMATICAL METHODS IN MEDICINE 2012; 2012:750151. [PMID: 22924062 PMCID: PMC3424632 DOI: 10.1155/2012/750151] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.7] [Received: 03/30/2012] [Revised: 06/25/2012] [Accepted: 07/03/2012] [Indexed: 01/25/2023]
Abstract
Machine learning has become a powerful tool for analysing medical domains, assessing the importance of clinical parameters, and extracting medical knowledge for outcomes research. In this paper, we present a machine learning method for extracting diagnostic and prognostic thresholds, based on a symbolic classification algorithm called REMED. We evaluated the performance of our method by determining new prognostic thresholds for well-known and potential cardiovascular risk factors that are used to support medical decisions in the prognosis of fatal cardiovascular diseases. Our approach predicted 36% of cardiovascular deaths with 80% specificity and 75% general accuracy. The new method provides an innovative approach that might be useful to support decisions about medical diagnoses and prognoses.
|