Yahaya M, Guo R, Jiang X, Bashir K, Matara C, Xu S. Ensemble-based model selection for imbalanced data to investigate the contributing factors to multiple fatality road crashes in Ghana. Accid Anal Prev 2021; 151:105851. [PMID: 33383521 DOI: 10.1016/j.aap.2020.105851] [Received: 07/27/2020] [Revised: 09/25/2020] [Accepted: 10/16/2020] [Indexed: 06/12/2023]
Abstract
The study aims to identify relevant variables to improve the prediction performance of the crash injury severity (CIS) classification model. Unfortunately, CIS databases are invariably characterized by class imbalance. For instance, samples of the multiple fatal injury (MFI) severity class are typically rare compared with other classes. This imbalance may introduce a prediction bias in favour of the majority class and affect the quality of the learning algorithm. The paper proposes an ensemble-based variable ranking scheme that incorporates data resampling. At the data pre-processing level, majority weighted minority oversampling (MWMOTE) is employed to treat the imbalanced training data. An ensemble of classifiers induced from the balanced data is used to evaluate and rank the individual variables according to their importance to injury severity prediction. The relevant variables selected are then applied to the balanced data to form a training set for CIS classification modelling. An empirical comparison is conducted against variable ranking by: 1) a single inductive algorithm learned on imbalanced data, where the relevant variables are applied to the imbalanced data to form the training data; 2) a single inductive algorithm learned on MWMOTE data, where the relevant variables identified are applied to the balanced data to form the training data; and 3) ensembles learned on imbalanced data, where the relevant variables identified are applied to the imbalanced data to form the training data. Bayesian Network (BN) classifiers are then developed for each ranking method, using nested subsets of the top-ranked variables. The model predictions are captured in four performance indicators in the comparative study. Based on three-year (2014-2016) crash data in Ghana, the empirical results show that the proposed method is effective in identifying the most prolific predictors of the CIS level. Finally, based on the inference results of the BNs developed on the best subset, the study offers the most probable explanations for the occurrence of MFI crashes in Ghana.
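The pipeline described in the abstract (balance the training data, rank variables with an ensemble, then evaluate nested subsets of the top-ranked variables) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: plain random oversampling stands in for MWMOTE, mean-rank aggregation is an assumed way of combining the ensemble members' rankings, and the variable names and importance scores are hypothetical.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate minority-class rows until every class matches the largest one.

    Stand-in for MWMOTE, which instead synthesises new minority samples
    from weighted, hard-to-learn minority instances.
    """
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


def aggregate_ranks(score_tables):
    """Combine per-classifier importance scores into one ranking by mean rank."""
    variables = list(score_tables[0])
    total_rank = {v: 0 for v in variables}
    for scores in score_tables:
        ordered = sorted(variables, key=lambda v: scores[v], reverse=True)
        for rank, v in enumerate(ordered, start=1):
            total_rank[v] += rank
    # A lower mean rank means the variable is more important.
    return sorted(variables, key=lambda v: total_rank[v] / len(score_tables))


def nested_subsets(ranking):
    """Top-1, top-2, ..., top-n variable subsets for model comparison."""
    return [ranking[:k] for k in range(1, len(ranking) + 1)]


# Hypothetical importance scores from three ensemble members.
scores = [
    {"lighting": 0.40, "road_type": 0.30, "weather": 0.20, "vehicle_type": 0.10},
    {"lighting": 0.30, "road_type": 0.40, "weather": 0.10, "vehicle_type": 0.20},
    {"lighting": 0.50, "road_type": 0.25, "weather": 0.15, "vehicle_type": 0.10},
]
ranking = aggregate_ranks(scores)
subsets = nested_subsets(ranking)
```

Each entry of `subsets` would then be used to train and score one classifier (Bayesian Networks in the study), and the best-scoring subset kept.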
Affiliation(s)
- Mahama Yahaya
  - School of Transportation and Logistics, Southwest Jiaotong University, West Park, High-Tech District, Chengdu, China 611756; National Engineering Laboratory of Integrated Transportation Big Data Application Technology, West Park, High-Tech District, Chengdu, 611756, China
- Runhua Guo
  - Department of Civil Engineering, Suite 217, Heshangheng Bldg, Tsinghua University, 100084, Beijing, China
- Xinguo Jiang
  - School of Transportation and Logistics, Southwest Jiaotong University, West Park, High-Tech District, Chengdu, China 611756; National Engineering Laboratory of Integrated Transportation Big Data Application Technology, West Park, High-Tech District, Chengdu, 611756, China
- Kamal Bashir
  - Department of Information Technology, Karare University, Omdurman, 12304, Sudan
- Caroline Matara
  - Department of Civil and Construction Engineering, University of Nairobi, 30197, Nairobi, Kenya
- Shiwei Xu
  - Guangzhou Transportation Planning Institute, 510030, Guangzhou, China