1
Data quality issues in software fault prediction: a systematic literature review. Artif Intell Rev 2022. [DOI: 10.1007/s10462-022-10371-6]
2
Amanullah M, Ramya S, Sudha M, Pushparathi VG, Haldorai A, Pant B. Data sampling approach using heuristic Learning Vector Quantization (LVQ) classifier for software defect prediction. Journal of Intelligent & Fuzzy Systems 2022. [DOI: 10.3233/jifs-220480]
Abstract
Early prediction and identification of software flaws, based on quality estimates, is crucial in the software field. Software defect prediction (SDP) is defined as the process of exposing software flaws through the use of prediction models and defect datasets. This study proposes a method that addresses the class imbalance problem with an Improved Random Synthetic Minority Oversampling Technique (SMOTE), followed by the Linear Pearson Correlation Technique for feature selection, to predict software failure. To address the class imbalance, the defect datasets are first processed using the Improved Random-SMOTE oversampling technique. Features are then selected using the Linear Pearson Correlation approach, and the samples are split into training and testing datasets by k-fold cross-validation. Finally, Heuristic Learning Vector Quantization is used to classify the data in order to predict software defects. The performance of the proposed strategy is compared with existing classification approaches on two separate datasets, using measures such as sensitivity, specificity, FPR, and accuracy.
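The feature-selection step mentioned in this abstract can be illustrated with a minimal sketch. It ranks features by the absolute Pearson correlation with the defect label and keeps the top k. The function names `pearson` and `select_features`, and the top-k selection rule, are assumptions made for illustration; the abstract does not state how the paper thresholds the correlations.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)


def select_features(rows, labels, k):
    """Return the indices of the k features with the largest |r| against the label.

    The top-k rule is an assumption of this sketch, not taken from the paper.
    """
    n_features = len(rows[0])
    scores = []
    for j in range(n_features):
        column = [row[j] for row in rows]
        scores.append((abs(pearson(column, labels)), j))
    # Highest correlation first; ties are broken by the lower feature index.
    scores.sort(key=lambda t: (-t[0], t[1]))
    return [j for _, j in scores[:k]]
```

For example, `select_features(rows, labels, 1)` keeps only the single feature that is most linearly related to the label.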
Affiliation(s)
- M. Amanullah: Department of Information Technology, Aalim Muhammad Salegh College of Engineering, Chennai, India
- S. Thanga Ramya: Department of Computer Science and Engineering, R.M.K. Engineering College, Kavaraipettai, Chennai, India
- M. Sudha: Department of Electronics and Communication, Srinivasa Ramanujan Centre, SASTRA Deemed to be University, Kumbakonam, India
- V.P. Gladis Pushparathi: Department of Computer Science and Engineering, Velammal Institute of Technology, Pancheeti, Chennai, India
- Anandakumar Haldorai: Department of Computer Science and Engineering, Sri Eshwar College of Engineering, Coimbatore, India
- Bhaskar Pant: Department of Computer Science and Engineering, Graphic Era Deemed to be University, Bell Road, Clement Town, Dehradun, Uttarakhand, India
3
Abstract
Software testing guarantees the delivery of high-quality software products, and software defect prediction (SDP) has become an important part of software testing. Software defect prediction is divided into traditional software defect prediction and just-in-time software defect prediction (JIT-SDP). However, most of the existing software defect prediction frameworks are relatively simplified, which makes it extremely difficult to provide developers with more detailed reference information. To improve the effectiveness of software defect prediction and realize effective software testing resource allocation, this paper proposes a software defect prediction framework based on Nested-Stacking and heterogeneous feature selection. The framework includes three stages: data set preprocessing and feature selection, Nested-Stacking classifier, and model classification performance evaluation. The novel heterogeneous feature selection and nested custom classifiers in the framework can effectively improve the accuracy of software defect prediction. This paper conducts experiments on two software defect data sets (Kamei, PROMISE), and demonstrates the classification performance of the model through two comprehensive evaluation indicators, AUC, and F1-score. The experiment carried out large-scale within-project defect prediction (WPDP) and cross-project defect prediction (CPDP). The results show that the framework proposed in this paper has an excellent classification performance on the two types of software defect data sets, and has been greatly improved compared with the baseline models.
4
Addressing Class Overlap under Imbalanced Distribution: An Improved Method and Two Metrics. Symmetry (Basel) 2021. [DOI: 10.3390/sym13091649]
Abstract
Class imbalance, as a phenomenon of asymmetry, has an adverse effect on the performance of most machine learning algorithms, and class overlap is another important factor that affects their classification performance. This paper deals with the two factors simultaneously, addressing class overlap under imbalanced distribution. First, a theoretical analysis is conducted on the existing class overlap metrics. Then, based on this analysis, an improved method and corresponding metrics for evaluating class overlap under imbalanced distributions are proposed. A well-known collection of imbalanced datasets is used to compare the performance of different metrics, and the performance is evaluated based on the Pearson correlation coefficient and the ξ correlation coefficient. The experimental results demonstrate that the proposed class overlap metrics outperform the other compared metrics on the imbalanced datasets, and that the Pearson correlation coefficient with the AUC metric of eight algorithms can be improved by 34.7488% on average.
5
Jamal A, Zahid M, Tauhidur Rahman M, Al-Ahmadi HM, Almoshaogeh M, Farooq D, Ahmad M. Injury severity prediction of traffic crashes with ensemble machine learning techniques: a comparative study. Int J Inj Contr Saf Promot 2021; 28:408-427. [PMID: 34060410 DOI: 10.1080/17457300.2021.1928233]
Abstract
A better understanding of injury severity risk factors is fundamental to improving crash prediction and to the effective implementation of appropriate mitigation strategies. Traditional statistical models widely used in this regard have predefined correlations and intrinsic assumptions which, if violated, may yield biased predictions. The present study investigates the possibility of using the eXtreme Gradient Boosting (XGBoost) model, compared with a few traditional machine learning algorithms (logistic regression, random forest, and decision tree), for crash injury severity analysis. The data used in this study were obtained from the traffic safety department, ministry of transport (MOT) at Riyadh, KSA, and contain 13,546 motor vehicle collisions along 15 rural highways reported between January 2017 and December 2019. Empirical results obtained using k-fold cross-validation (k = 10) for various performance metrics showed that the XGBoost technique outperformed the other models in terms of collective predictive performance as well as the accuracies of individual injury severity classes. XGBoost feature importance analysis indicated that collision type, weather status, road surface conditions, on-site damage type, lighting conditions, and vehicle type are a few of the sensitive variables in predicting the crash injury severity outcome. Finally, a comparative analysis of XGBoost based on different performance statistics showed that the model outperformed most previous studies.
Affiliation(s)
- Arshad Jamal: Department of Civil and Environmental Engineering, King Fahd University of Petroleum & Minerals, Dhahran, Saudi Arabia
- Muhammad Zahid: College of Metropolitan Transportation, Beijing University of Technology, Beijing, China
- Muhammad Tauhidur Rahman: Department of City and Regional Planning, King Fahd University of Petroleum & Minerals, Dhahran, Saudi Arabia
- Hassan M Al-Ahmadi: Department of Civil and Environmental Engineering, King Fahd University of Petroleum & Minerals, Dhahran, Saudi Arabia
- Meshal Almoshaogeh: Department of Civil Engineering, College of Engineering, Qassim University, Buraydah, Qassim, Saudi Arabia
- Danish Farooq: Department of Transport Technology and Economics, Budapest University of Technology and Economics, Budapest, Hungary; Department of Civil Engineering, University of Engineering and Technology Peshawar (Bannu Campus), Peshawar, Pakistan
- Mahmood Ahmad: Department of Civil Engineering, University of Engineering and Technology Peshawar (Bannu Campus), Peshawar, Pakistan
6
Singh A, Ranjan RK, Tiwari A. Credit Card Fraud Detection under Extreme Imbalanced Data: A Comparative Study of Data-level Algorithms. J Exp Theor Artif In 2021. [DOI: 10.1080/0952813x.2021.1907795]
Affiliation(s)
- Amit Singh: Indian Computer Emergency Response Team, Ministry of Electronics and Information Technology, New Delhi, India
- Abhishek Tiwari: Department of Computer Science, Central University of Haryana, Mahendergarh, India
7
Abstract
Data imbalance is a thorny issue in machine learning. SMOTE is a well-known oversampling method for imbalanced learning. However, it has some disadvantages, such as sample overlapping, noise interference, and blind neighbor selection. To address these problems, we present a new oversampling method, OS-CCD, based on a new concept, the classification contribution degree. The classification contribution degree determines the number of synthetic samples generated by SMOTE for each positive sample. OS-CCD follows the spatial distribution characteristics of the original samples on the class boundary and avoids oversampling from noisy points. Experiments on twelve benchmark datasets demonstrate that OS-CCD outperforms six classical oversampling methods in terms of accuracy, F1-score, AUC, and ROC.
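The idea of assigning each positive sample its own number of synthetic samples can be sketched with plain SMOTE interpolation. In this minimal sketch the per-sample `counts` list is a hypothetical stand-in for the classification contribution degree, which the abstract does not define, and `smote_with_counts` is an illustrative name, not the OS-CCD algorithm itself.

```python
import math
import random


def smote_with_counts(minority, counts, k=2, seed=0):
    """Generate synthetic minority samples by SMOTE-style interpolation.

    minority: list of feature vectors from the positive (minority) class
    counts:   number of synthetic samples to draw for each minority sample
              (hypothetical input; OS-CCD derives it from its own criterion)
    """
    rng = random.Random(seed)
    synthetic = []
    for i, (x, n_new) in enumerate(zip(minority, counts)):
        # k nearest minority neighbours of x, excluding x itself
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(x, p))
        neighbours = others[:k]
        if not neighbours:
            continue  # a lone minority sample has nothing to interpolate with
        for _ in range(n_new):
            nb = rng.choice(neighbours)
            u = rng.random()  # position along the segment from x to nb
            synthetic.append([a + u * (b - a) for a, b in zip(x, nb)])
    return synthetic
```

Setting a count to zero skips a sample entirely, which is one simple way to avoid generating points around a sample judged to be noise.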
8
Al-Shamaa ZZR, Kurnaz S, Duru AD, Peppa N, Mirnezami AH, Hamady ZZR. The Use of Hellinger Distance Undersampling Model to Improve the Classification of Disease Class in Imbalanced Medical Datasets. Appl Bionics Biomech 2020; 2020:8824625. [PMID: 33204304 PMCID: PMC7657693 DOI: 10.1155/2020/8824625] [Received: 08/18/2020] [Revised: 09/19/2020] [Accepted: 10/05/2020]
Abstract
Imbalanced class distribution in medical datasets is a challenging problem that hinders the correct classification of disease. It arises when the number of healthy class instances is much larger than the number of disease class instances. To solve this problem, we propose undersampling the healthy class instances to improve disease class classification. This model is named Hellinger Distance Undersampling (HDUS). It employs the Hellinger distance to measure the resemblance between a majority class instance and its neighbouring minority class instances, in order to separate the classes effectively and boost the discrimination power for each class. An extensive experiment has been conducted on four imbalanced medical datasets using three classifiers to compare HDUS with a baseline model and three state-of-the-art undersampling models. The results show that HDUS can perform better than the other models in terms of sensitivity, F1 measure, and balanced accuracy.
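For reference, the Hellinger distance between two discrete probability distributions can be computed as in the minimal sketch below. How HDUS turns individual instances into distributions is not stated in the abstract, so the `normalise` helper (scaling a non-negative feature vector to sum to 1) is an assumption made purely for illustration.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete distributions of equal length.

    The result is 0 for identical distributions and 1 for distributions
    with disjoint support.
    """
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared) / math.sqrt(2)


def normalise(v):
    """Scale a non-negative vector so its entries sum to 1 (illustrative helper)."""
    total = sum(v)
    if total <= 0:
        raise ValueError("vector must have a positive sum")
    return [x / total for x in v]
```

A small distance between a majority instance and its minority neighbours indicates that the instance sits in an overlapping region, which is the kind of instance an undersampling scheme may choose to drop.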
Affiliation(s)
- Zina Z. R. Al-Shamaa: Graduate School of Science and Engineering, Altınbaş University, Istanbul, Turkey
- Sefer Kurnaz: Graduate School of Science and Engineering, Altınbaş University, Istanbul, Turkey
- Adil Deniz Duru: Sports and Health Sciences Department, Marmara University, Istanbul, Turkey
- Nadia Peppa: Southampton University Hospital NHSFT, Southampton, UK
9
Rusdah DA, Murfi H. XGBoost in handling missing values for life insurance risk prediction. SN Applied Sciences 2020. [DOI: 10.1007/s42452-020-3128-y]