1
|
Pean CA, Buddhiraju A, Lin-Wei Chen T, Seo HH, Shimizu MR, Esposito JG, Kwon YM. Racial and Ethnic Disparities in Predictive Accuracy of Machine Learning Algorithms Developed Using a National Database for 30-Day Complications Following Total Joint Arthroplasty. J Arthroplasty 2025; 40:1139-1147. [PMID: 39433263 DOI: 10.1016/j.arth.2024.10.060] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/20/2024] [Revised: 10/11/2024] [Accepted: 10/15/2024] [Indexed: 10/23/2024] Open
Abstract
BACKGROUND While predictive capabilities of machine learning (ML) algorithms for hip and knee total joint arthroplasty (TJA) have been demonstrated in previous studies, their performance in racial and ethnic minority patients has not been investigated. This study aimed to assess the performance of ML algorithms in predicting 30-day complications following TJA in racial and ethnic minority patients. METHODS A total of 267,194 patients undergoing primary TJA between 2013 and 2020 were identified from a national outcomes database. The patient cohort was stratified according to race, with further substratification into Hispanic or non-Hispanic ethnicity. Two ML algorithms, histogram-based gradient boosting (HGB) and random forest (RF), were trained to predict 30-day complications following primary TJA in the overall population. They were subsequently assessed in each racial and ethnic subcohort using discrimination, calibration, accuracy, and potential clinical usefulness. RESULTS Both models achieved excellent (area under the curve (AUC) > 0.8) discrimination (AUC_HGB = AUC_RF = 0.86), calibration, and accuracy (HGB: slope = 1.00, intercept = -0.03, Brier score = 0.12; RF: slope = 0.97, intercept = 0.02, Brier score = 0.12) in the non-Hispanic White population (N = 224,073). Discrimination decreased in the White Hispanic (N = 10,429; AUC = 0.75 to 0.76), Black (N = 25,116; AUC = 0.77), Black Hispanic (N = 240; AUC = 0.78), Asian non-Hispanic (N = 4,809; AUC = 0.78 to 0.79), and overall (N = 267,194; AUC = 0.75 to 0.76) cohorts, but the models remained well-calibrated. We noted the poorest model discrimination (N = 1,870; AUC = 0.67 to 0.68) and calibration in the American-Indian cohort. CONCLUSIONS The ML algorithms demonstrate an inferior predictive ability for 30-day complications following primary TJA in racial and ethnic minorities when trained on existing healthcare big data. This may be attributed to the disproportionate underrepresentation of minority groups within these databases, as demonstrated by the smaller sample sizes available to train the ML models. The ML models developed using smaller datasets (e.g., in racial and ethnic minorities) may not be as accurate as those developed using larger datasets, highlighting the need for equity-conscious model development. LEVEL OF EVIDENCE III; retrospective cohort study.
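The subgroup assessment described in this abstract (discrimination and Brier score reported per racial and ethnic subcohort) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names, the toy labels, and the group codes "A" and "B" are all assumptions made for illustration, and the AUC here is computed by brute-force pairwise comparison rather than by any particular library.

```python
from collections import defaultdict

def brier_score(y_true, y_prob):
    # Mean squared difference between predicted probability and observed outcome.
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)

def auc(y_true, y_prob):
    # Probability that a random positive is scored above a random negative
    # (ties count as half a win). Returns NaN if a class is missing.
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    if not pos or not neg:
        return float("nan")
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

def metrics_by_group(y_true, y_prob, groups):
    # Split predictions by subgroup label, then score each subgroup separately.
    by_group = defaultdict(lambda: ([], []))
    for y, p, g in zip(y_true, y_prob, groups):
        by_group[g][0].append(y)
        by_group[g][1].append(p)
    return {
        g: {"n": len(ys), "auc": auc(ys, ps), "brier": brier_score(ys, ps)}
        for g, (ys, ps) in by_group.items()
    }

# Hypothetical toy data: one model, two subgroups.
y_true = [1, 0, 1, 0, 1, 0, 0, 1]
y_prob = [0.9, 0.2, 0.8, 0.1, 0.4, 0.6, 0.3, 0.7]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
report = metrics_by_group(y_true, y_prob, groups)
```

Reporting the subgroup size next to each metric matters here, since a small subgroup (for example N = 240) gives a much noisier AUC estimate than a large one.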
Affiliation(s)
- Christian A Pean
- Bioengineering Laboratory, Department of Orthopaedic Surgery, Massachusetts General Hospital, Harvard Medical School, Boston, Massachusetts
| | - Anirudh Buddhiraju
- Bioengineering Laboratory, Department of Orthopaedic Surgery, Massachusetts General Hospital, Harvard Medical School, Boston, Massachusetts
| | - Tony Lin-Wei Chen
- Bioengineering Laboratory, Department of Orthopaedic Surgery, Massachusetts General Hospital, Harvard Medical School, Boston, Massachusetts
| | - Henry Hojoon Seo
- Bioengineering Laboratory, Department of Orthopaedic Surgery, Massachusetts General Hospital, Harvard Medical School, Boston, Massachusetts
| | - Michelle R Shimizu
- Bioengineering Laboratory, Department of Orthopaedic Surgery, Massachusetts General Hospital, Harvard Medical School, Boston, Massachusetts
| | - John G Esposito
- Bioengineering Laboratory, Department of Orthopaedic Surgery, Massachusetts General Hospital, Harvard Medical School, Boston, Massachusetts
| | - Young-Min Kwon
- Bioengineering Laboratory, Department of Orthopaedic Surgery, Massachusetts General Hospital, Harvard Medical School, Boston, Massachusetts
| |
|
2
|
Lu K, Cao X, Wang L, Huang T, Chen L, Wang X, Li Q. Assessment of non-fatal injuries among university students in Hainan: a machine learning approach to exploring key factors. Front Public Health 2024; 12:1453650. [PMID: 39639893 PMCID: PMC11617571 DOI: 10.3389/fpubh.2024.1453650] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/23/2024] [Accepted: 11/08/2024] [Indexed: 12/07/2024] Open
Abstract
Background Injuries constitute a significant global public health concern, particularly among individuals aged 0-34. These injuries are affected by various social, psychological, and physiological factors and are no longer viewed merely as accidental occurrences. Existing research has identified multiple risk factors for injuries; however, these studies often focus on children or older adults, neglecting university students. Machine learning (ML) can provide advanced analytics and is better suited to complex, nonlinear data than traditional methods. That said, ML has been underutilized in injury research despite its great potential. To fill this gap, this study applies ML to analyze injury data among university students in Hainan Province. The purpose is to provide insights for developing effective prevention strategies. To explore the relationship between scores on the self-rating anxiety scale and self-rating depression scale and the risk of non-fatal injuries within 1 year, we categorized these scores into two groups using restricted cubic splines. Methods Chi-square tests and LASSO regression analysis were employed to filter factors potentially associated with non-fatal injuries. The Synthetic Minority Over-Sampling Technique (SMOTE) was applied to balance the dataset. Subsequent analyses were conducted using random forest, logistic regression, decision tree, and XGBoost models. Each model underwent 10-fold cross-validation to mitigate overfitting, with hyperparameters optimized to improve performance. SHAP was utilized to identify the primary factors influencing non-fatal injuries. Results The Random Forest model proved effective in this study. It identified three primary risk factors for non-fatal injuries: being male, a favorable household financial situation, and a stable relationship. Protective factors include reduced internet time and being an only child in the family.
Conclusion The study highlighted five key factors influencing non-fatal injuries: sex, household financial situation, relationship stability, internet time, and sibling status. In identifying these factors, the Random Forest, Logistic Regression, Decision Tree, and XGBoost models demonstrated varying effectiveness, with the Random Forest model exhibiting superior performance.
Affiliation(s)
| | | | | | | | | | | | - Qiao Li
- *Correspondence: Xiaodan Wang; Qiao Li
| |
|
3
|
Xia M, Jin C, Zheng Y, Wang J, Zhao M, Cao S, Xu T, Pei B, Irwin MG, Lin Z, Jiang H. Deep learning-based facial analysis for predicting difficult videolaryngoscopy: a feasibility study. Anaesthesia 2024; 79:399-409. [PMID: 38093485 DOI: 10.1111/anae.16194] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 11/03/2023] [Indexed: 03/07/2024]
Abstract
While videolaryngoscopy has resulted in better overall success rates of tracheal intubation, airway assessment is still an important prerequisite for safe airway management. This study aimed to create an artificial intelligence model to identify difficult videolaryngoscopy using a neural network. Baseline characteristics, medical history, bedside examination and seven facial images were included as predictor variables. ResNet-18 was introduced to recognise images and extract features. Different machine learning algorithms were utilised to develop predictive models. A videolaryngoscopy view of Cormack-Lehane grade of 1 or 2 was classified as 'non-difficult', while grade 3 or 4 was classified as 'difficult'. A total of 5849 patients were included, of whom 5335 had non-difficult and 514 had difficult videolaryngoscopy. The facial model (only including facial images) using the Light Gradient Boosting Machine algorithm showed the highest area under the curve (95%CI) of 0.779 (0.733-0.825) with a sensitivity (95%CI) of 0.757 (0.650-0.845) and specificity (95%CI) of 0.721 (0.626-0.794) in the test set. Compared with bedside examination and multivariate scores (El-Ganzouri and Wilson), the facial model had significantly higher predictive performance (p < 0.001). Artificial intelligence-based facial analysis is a feasible technique for predicting difficulty during videolaryngoscopy, and the model developed using neural networks has higher predictive performance than traditional methods.
Affiliation(s)
- M Xia
- Department of Anaesthesiology, Shanghai Ninth People's Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China
| | - C Jin
- Department of Anaesthesiology, Shanghai Ninth People's Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China
| | - Y Zheng
- State Key Laboratory of Ocean Engineering, School of Naval Architecture, Ocean and Civil Engineering, Shanghai Jiao Tong University, Shanghai, China
| | - J Wang
- Department of Anaesthesiology, Shanghai Ninth People's Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China
| | - M Zhao
- State Key Laboratory of Ocean Engineering, School of Naval Architecture, Ocean and Civil Engineering, Shanghai Jiao Tong University, Shanghai, China
| | - S Cao
- Department of Anaesthesiology, Shanghai Ninth People's Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China
| | - T Xu
- Department of Anaesthesiology, Shanghai Ninth People's Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China
| | - B Pei
- Department of Anaesthesiology, Shanghai Ninth People's Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China
| | - M G Irwin
- Department of Anaesthesiology, University of Hong Kong, Hong Kong
| | - Z Lin
- State Key Laboratory of Ocean Engineering, School of Naval Architecture, Ocean and Civil Engineering, Shanghai Jiao Tong University, Shanghai, China
| | - H Jiang
- Department of Anaesthesiology, Shanghai Ninth People's Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China
| |
|
4
|
Balanced neighbor exploration for semi-supervised node classification on imbalanced graph data. Inf Sci (N Y) 2023. [DOI: 10.1016/j.ins.2023.02.064] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/12/2023]
|
5
|
He QH, Feng JJ, Lv FJ, Jiang Q, Xiao MZ. Deep learning and radiomic feature-based blending ensemble classifier for malignancy risk prediction in cystic renal lesions. Insights Imaging 2023; 14:6. [PMID: 36629980 PMCID: PMC9834471 DOI: 10.1186/s13244-022-01349-7] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 09/27/2022] [Accepted: 12/04/2022] [Indexed: 01/12/2023] Open
Abstract
BACKGROUND The rising prevalence of cystic renal lesions (CRLs) detected by computed tomography necessitates better identification of malignant cystic renal neoplasms, since a significant majority of CRLs are benign renal cysts. Using arterial phase CT scans combined with pathology diagnosis results, a fusion feature-based blending ensemble machine learning model was created to identify malignant renal neoplasms among CRLs. Histopathology results were adopted as the diagnostic standard. A pretrained 3D-ResNet50 network was selected for non-handcrafted feature extraction and the pyradiomics toolbox was selected for handcrafted feature extraction. Tenfold cross-validated least absolute shrinkage and selection operator regression was used to identify the most discriminative candidate features in the development cohort. Feature reproducibility was evaluated by intra-class correlation coefficients and inter-class correlation coefficients. Pearson correlation coefficients for normally distributed features and Spearman's rank correlation coefficients for non-normally distributed features were utilized to remove redundant features. After that, a blending ensemble machine learning model was developed in the training cohort. Area under the receiver operator characteristic curve (AUC), accuracy score (ACC), and decision curve analysis (DCA) were employed to evaluate the performance of the final model in the testing cohort. RESULTS The fusion feature-based machine learning algorithm demonstrated excellent diagnostic performance in the external validation dataset (AUC = 0.934, ACC = 0.905). Net benefits presented by DCA are higher than those of the Bosniak-2019 version classification for stratifying patients with CRL to the appropriate surgical procedure. CONCLUSIONS The fusion feature-based classifier accurately distinguished malignant from benign CRLs, outperformed the Bosniak-2019 version classification, and showed improved clinical decision-making utility.
Affiliation(s)
- Quan-Hao He
- Department of Urology, The First Affiliated Hospital of Chongqing Medical University, Chongqing, 400016 People’s Republic of China
| | - Jia-Jun Feng
- Department of Medical Imaging, Guangzhou First People’s Hospital, School of Medicine, South China University of Technology, Guangzhou, 51000 People’s Republic of China
| | - Fa-Jin Lv
- Department of Radiology, The First Affiliated Hospital of Chongqing Medical University, Chongqing, 400016 People’s Republic of China
| | - Qing Jiang
- Department of Urology, The Second Affiliated Hospital of Chongqing Medical University, Chongqing, 400010 People’s Republic of China
| | - Ming-Zhao Xiao
- Department of Urology, The First Affiliated Hospital of Chongqing Medical University, Chongqing, 400016 People’s Republic of China
| |
|
6
|
HS-Gen: a hypersphere-constrained generation mechanism to improve synthetic minority oversampling for imbalanced classification. COMPLEX INTELL SYST 2022. [DOI: 10.1007/s40747-022-00938-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/23/2022]
Abstract
Mitigating the impact of class-imbalanced data on classifiers is a challenging task in machine learning. SMOTE is a well-known method to tackle this task by modifying the class distribution and generating synthetic instances. However, most SMOTE-based methods focus on the phase of data selection, while few consider the phase of data generation. This paper proposes a hypersphere-constrained generation mechanism (HS-Gen) to improve synthetic minority oversampling. Unlike the linear interpolation commonly used in SMOTE-based methods, HS-Gen generates a minority instance in a hypersphere rather than on a straight line. This mechanism expands the distribution range of minority instances with significant randomness and diversity. Furthermore, HS-Gen includes a noise prevention strategy that adaptively shrinks the hypersphere by determining whether new instances fall into the majority class region. HS-Gen can be regarded as an oversampling optimization mechanism and can be flexibly embedded into SMOTE-based methods. We conduct comparative experiments by embedding HS-Gen into the original SMOTE, Borderline-SMOTE, ADASYN, k-means SMOTE, and RSMOTE. Experimental results show that the embedded versions can generate higher quality synthetic instances than the original ones. Moreover, on these oversampled datasets, the conventional classifiers (C4.5 and Adaboost) obtain significant performance improvement in terms of F1 measure and G-mean.
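The contrast this abstract draws between line-based and hypersphere-based generation can be sketched as follows. This is only an illustration under stated assumptions, not the HS-Gen algorithm itself: the abstract does not specify the hypersphere's center or radius, so the sketch assumes a ball centered on the seed instance with radius equal to the distance to a chosen minority neighbor. The `in_majority_region` predicate, the `shrink` factor, and `max_tries` are hypothetical stand-ins for the paper's noise prevention strategy.

```python
import math
import random

def smote_point(x, neighbor, rng):
    # Classic SMOTE step: a random point on the segment from x to neighbor.
    t = rng.random()
    return [a + t * (b - a) for a, b in zip(x, neighbor)]

def hypersphere_point(x, radius, rng):
    # Draw a point uniformly inside the ball of the given radius centered at x:
    # random direction from a normalized Gaussian vector, radius scaled by u^(1/d).
    d = len(x)
    direction = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(v * v for v in direction)) or 1.0
    r = radius * rng.random() ** (1.0 / d)
    return [a + r * v / norm for a, v in zip(x, direction)]

def generate(x, neighbor, rng, in_majority_region=lambda p: False,
             shrink=0.5, max_tries=5):
    # Assumed noise-prevention loop: shrink the ball whenever a candidate
    # lands in what the (hypothetical) predicate calls a majority region.
    radius = math.dist(x, neighbor)
    for _ in range(max_tries):
        candidate = hypersphere_point(x, radius, rng)
        if not in_majority_region(candidate):
            return candidate
        radius *= shrink
    return list(x)  # give up and return a copy of the seed instance

rng = random.Random(0)
seed_pt, nb = [0.0, 0.0], [2.0, 0.0]
on_line = smote_point(seed_pt, nb, rng)   # always lies on the x-axis segment
in_ball = generate(seed_pt, nb, rng)      # may lie anywhere within distance 2
```

The interpolated point is confined to a one-dimensional segment, while the ball-based point can move in every direction around the seed, which is the added diversity the abstract refers to.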
|
7
|
He QH, Tan H, Liao FT, Zheng YN, Lv FJ, Jiang Q, Xiao MZ. Stratification of malignant renal neoplasms from cystic renal lesions using deep learning and radiomics features based on a stacking ensemble CT machine learning algorithm. Front Oncol 2022; 12:1028577. [DOI: 10.3389/fonc.2022.1028577] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/26/2022] [Accepted: 10/07/2022] [Indexed: 11/13/2022] Open
Abstract
Using nephrographic phase CT images combined with pathology diagnosis, we aim to develop and validate a fusion feature-based stacking ensemble machine learning model to distinguish malignant renal neoplasms from cystic renal lesions (CRLs). This retrospective study includes 166 individuals with CRLs for model training and 47 individuals with CRLs from another institution for model testing. Histopathology results are adopted as the diagnostic criterion. Nephrographic phase CT scans are selected to build the fusion feature-based machine learning algorithms. The pretrained 3D-ResNet50 CNN model and radiomics methods are selected to extract deep features and radiomics features, respectively. Fivefold cross-validated least absolute shrinkage and selection operator (LASSO) regression is adopted to identify the most discriminative candidate features in the development cohort. Intraclass correlation coefficients and interclass correlation coefficients are employed to evaluate feature reproducibility. Pearson correlation coefficients for normally distributed features and Spearman’s rank correlation coefficients for non-normally distributed features are used to eliminate redundant features. After that, stacking ensemble machine learning models are developed in the training cohort. The area under the receiver operator characteristic curve (ROC), calibration curve, and decision curve analysis (DCA) are adopted in the testing cohort to evaluate the performance of each model. The stacking ensemble machine learning algorithm reached excellent diagnostic performance in the testing dataset. The calibration plot shows good stability when using the stacking ensemble model. Net benefits presented by DCA are higher than those of the Bosniak 2019 version classification when employing any machine learning algorithm. The fusion feature-based machine learning algorithm accurately distinguishes malignant renal neoplasms from CRLs, outperforms the Bosniak 2019 version classification, and proves to be more applicable for clinical decision-making.
|
8
|
Adams J, Agyenkwa-Mawuli K, Agyapong O, Wilson MD, Kwofie SK. EBOLApred: A machine learning-based web application for predicting cell entry inhibitors of the Ebola virus. Comput Biol Chem 2022; 101:107766. [DOI: 10.1016/j.compbiolchem.2022.107766] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/18/2022] [Revised: 08/10/2022] [Accepted: 08/29/2022] [Indexed: 11/03/2022]
|
9
|
|
10
|
Klikowski J, Woźniak M. Deterministic Sampling Classifier with weighted Bagging for drifted imbalanced data stream classification. Appl Soft Comput 2022. [DOI: 10.1016/j.asoc.2022.108855] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/26/2022]
|
11
|
Dai W, Ning C, Nan J, Wang D. Stochastic configuration networks for imbalanced data classification. INT J MACH LEARN CYB 2022. [DOI: 10.1007/s13042-022-01565-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/27/2022]
|
12
|
Zhu Z, Wang Z, Li D, Du W. Globalized Multiple Balanced Subsets With Collaborative Learning for Imbalanced Data. IEEE TRANSACTIONS ON CYBERNETICS 2022; 52:2407-2417. [PMID: 32609619 DOI: 10.1109/tcyb.2020.3001158] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 06/11/2023]
Abstract
The skewed distribution of data makes it difficult to classify minority and majority samples in the imbalanced problem. Balanced bagging randomly undersamples majority samples several times and combines the selected majority samples with minority samples to form several balanced subsets, in which the numbers of minority and majority samples are roughly equal. However, balanced bagging lacks a unified learning framework. Moreover, it fails to consider the connection among all subsets and the global information of the entire data distribution. To this end, this article puts several balanced subsets into an effective learning framework with a criterion function. In the learning framework, one regularization term called RS establishes the connection and realizes the collaborative learning of all subsets by requiring consistent outputs for the minority samples in different subsets. Besides, another regularization term called RW provides the global information to each basic classifier by reducing the difference between the direction of the solution vector in each subset and that in the entire dataset. The proposed learning framework is called globalized multiple balanced subsets with collaborative learning (GMBSCL). The experimental results validate the effectiveness of the proposed GMBSCL.
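The balanced-bagging sampling step described at the start of this abstract can be sketched as below. The sketch covers only the formation of balanced subsets; it does not implement the RS or RW regularization terms of GMBSCL, and the function name, parameters, and toy data are assumptions made for illustration.

```python
import random

def balanced_subsets(majority, minority, n_subsets, seed=0):
    # Each subset keeps every minority sample and adds an equally sized
    # random draw (without replacement) from the majority samples.
    rng = random.Random(seed)
    k = min(len(minority), len(majority))
    return [rng.sample(majority, k) + list(minority) for _ in range(n_subsets)]

# Hypothetical toy data: 100 majority samples, 3 minority samples.
majority = list(range(100))
minority = ["a", "b", "c"]
subsets = balanced_subsets(majority, minority, n_subsets=4)
```

In plain balanced bagging a separate base classifier would then be trained on each subset and the outputs combined; the independence of those classifiers is the limitation the abstract's collaborative framework is meant to address.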
|
13
|
An Oversampling Method for Class Imbalance Problems on Large Datasets. APPLIED SCIENCES-BASEL 2022. [DOI: 10.3390/app12073424] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/01/2023]
Abstract
Several oversampling methods have been proposed for solving the class imbalance problem. However, most of them require searching the k-nearest neighbors to generate synthetic objects. This requirement makes them time-consuming and therefore unsuitable for large datasets. In this paper, an oversampling method for large class imbalance problems that does not require the k-nearest neighbors search is proposed. According to our experiments on large datasets with different sizes of imbalance, the proposed method is at least twice as fast as the fastest method reported in the literature while obtaining similar oversampling quality.
|
14
|
RB-CCR: Radial-Based Combined Cleaning and Resampling algorithm for imbalanced data classification. Mach Learn 2021. [DOI: 10.1007/s10994-021-06012-8] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/25/2022]
Abstract
Real-world classification domains, such as medicine, health and safety, and finance, often exhibit imbalanced class priors and have asymmetric misclassification costs. In such cases, the classification model must achieve a high recall without significantly impacting precision. Resampling the training data is the standard approach to improving classification performance on imbalanced binary data. However, the state-of-the-art methods ignore the local joint distribution of the data or correct it as a post-processing step. This can cause sub-optimal shifts in the training distribution, particularly when the target data distribution is complex. In this paper, we propose Radial-Based Combined Cleaning and Resampling (RB-CCR). RB-CCR utilizes the concept of class potential to refine the energy-based resampling approach of CCR. In particular, RB-CCR exploits the class potential to accurately locate sub-regions of the data-space for synthetic oversampling. The category sub-region for oversampling can be specified as an input parameter to meet domain-specific needs or be automatically selected via cross-validation. Our $5\times 2$ cross-validated results on 57 benchmark binary datasets with 9 classifiers show that RB-CCR achieves a better precision-recall trade-off than CCR and generally out-performs the state-of-the-art resampling methods in terms of AUC and G-mean.
|
15
|
Wang L, Cheng H, Zheng Z, Yang A, Zhu X. Ponzi scheme detection via oversampling-based Long Short-Term Memory for smart contracts. Knowl Based Syst 2021. [DOI: 10.1016/j.knosys.2021.107312] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Indexed: 10/20/2022]
|
16
|
Tang X, Machimura T, Li J, Liu W, Hong H. A novel optimized repeatedly random undersampling for selecting negative samples: A case study in an SVM-based forest fire susceptibility assessment. JOURNAL OF ENVIRONMENTAL MANAGEMENT 2020; 271:111014. [PMID: 32778297 DOI: 10.1016/j.jenvman.2020.111014] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Received: 01/25/2020] [Revised: 06/22/2020] [Accepted: 06/23/2020] [Indexed: 06/11/2023]
Abstract
The negative sample selection method is a key issue in studies that use machine learning approaches to spatially assess natural hazards. Recently, Repeatedly Random Undersampling (RRU) was proposed to address the randomness problem faced in Single Random Sampling. However, RRU cannot guarantee that the generated classifier has the best classification performance during the repeated random sampling process. To address this weakness, in this study we proposed an optimized RRU, which follows the idea of RRU but changes its rule to find the best classifier. The selected classifier, the actual most accurate classifier (MAC), was then employed to compute the probability of hazard occurrence. Support Vector Machine (SVM) was selected as the analysis method, and a Genetic Algorithm was employed to compute the parameters of the SVM. Forest fire susceptibility was assessed in Huichang County in China due to its forest values and frequent fire events. The results indicated that, compared with RRU, the optimized RRU can find an actual MAC that has the best classification performance among possible MACs; also, the fire susceptibility map generated by the actual MAC conforms to objective facts. The generated fire susceptibility map can provide useful decision support for local government to reduce forest fire risks. Moreover, the proposed sampling method, the optimized RRU, presented an enhanced approach for selecting negative samples, which makes the results of forest fire susceptibility assessment more reliable and accurate.
Affiliation(s)
- Xianzhe Tang
- Graduate School of Engineering, Osaka University, Yamadaoka 2-1, Suita, Osaka, 565-0871, Japan
| | - Takashi Machimura
- Graduate School of Engineering, Osaka University, Yamadaoka 2-1, Suita, Osaka, 565-0871, Japan
| | - Jiufeng Li
- Jiangsu Provincial Key Laboratory of Geographic Information Science and Technology, International Institute for Earth System Science, Nanjing University, Nanjing, Jiangsu, 210023, China
| | - Wei Liu
- School of Geography, South China Normal University, Guangzhou, 510631, China
| | - Haoyuan Hong
- Department of Geography and Regional Research, University of Vienna, Vienna, 1010, Austria.
| |
|
17
|
Koziarski M, Woźniak M, Krawczyk B. Combined Cleaning and Resampling algorithm for multi-class imbalanced data with label noise. Knowl Based Syst 2020. [DOI: 10.1016/j.knosys.2020.106223] [Citation(s) in RCA: 39] [Impact Index Per Article: 7.8] [Indexed: 11/30/2022]
|
18
|
Krawczyk B, Koziarski M, Wozniak M. Radial-Based Oversampling for Multiclass Imbalanced Data Classification. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2020; 31:2818-2831. [PMID: 31247563 DOI: 10.1109/tnnls.2019.2913673] [Citation(s) in RCA: 20] [Impact Index Per Article: 4.0] [Indexed: 06/09/2023]
Abstract
Learning from imbalanced data is among the most popular topics in contemporary machine learning. However, the vast majority of attention in this field is given to binary problems, while their much more difficult multiclass counterparts are relatively unexplored. Handling data sets with multiple skewed classes poses various challenges and calls for a better understanding of the relationship among classes. In this paper, we propose multiclass radial-based oversampling (MC-RBO), a novel data-sampling algorithm dedicated to multiclass problems. The main novelty of our method lies in using potential functions for generating artificial instances. We take into account information coming from all of the classes, contrary to existing multiclass oversampling approaches that use only minority class characteristics. The process of artificial instance generation is guided by exploring areas where the value of the mutual class distribution is very small. This way, we ensure a smart oversampling procedure that can cope with difficult data distributions and alleviate the shortcomings of existing methods. The usefulness of the MC-RBO algorithm is evaluated on the basis of an extensive experimental study and backed up with a thorough statistical analysis. The results show that by taking into account information coming from all of the classes and conducting smart oversampling, we can significantly improve the process of learning from multiclass imbalanced data.
|
19
|
Zhu Z, Wang Z, Li D, Zhu Y, Du W. Geometric Structural Ensemble Learning for Imbalanced Problems. IEEE TRANSACTIONS ON CYBERNETICS 2020; 50:1617-1629. [PMID: 30418931 DOI: 10.1109/tcyb.2018.2877663] [Citation(s) in RCA: 22] [Impact Index Per Article: 4.4] [Indexed: 06/09/2023]
Abstract
Classification on imbalanced data sets is a great challenge in machine learning. In this paper, a geometric structural ensemble (GSE) learning framework is proposed to address the issue. Traditional ensemble methods train and combine a series of basic classifiers according to various weights, which might lack geometric meaning. In contrast, the GSE partitions and eliminates redundant majority samples by generating hyper-spheres through the Euclidean metric and learns basic classifiers to enclose the minority samples, which achieves higher efficiency in the training process and seems easier to understand. In detail, the current weak classifier builds boundaries between the majority and the minority samples and removes the former. Then, the remaining samples are used to train the next. When the training process is done, all of the majority samples have been removed and the combination of all basic classifiers is obtained. To further improve generalization, two relaxation techniques are proposed. Theoretically, the computational complexity of GSE could approach $O(nd\log(n_{\min})\log(n_{\mathrm{maj}}))$. Comprehensive experiments validate both the effectiveness and efficiency of GSE.
|
20
|
Zhu Z, Wang Z, Li D, Du W. NearCount: Selecting critical instances based on the cited counts of nearest neighbors. Knowl Based Syst 2020. [DOI: 10.1016/j.knosys.2019.105196] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Indexed: 11/29/2022]
|
21
|
|
22
|
Zhu Z, Wang Z, Li D, Du W. Multiple Empirical Kernel Learning with Majority Projection for imbalanced problems. Appl Soft Comput 2019. [DOI: 10.1016/j.asoc.2018.11.037] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Indexed: 10/27/2022]
|
23
|
Zhang Y, Yu J, Liu W, Ota K. Ensemble Classification for Skewed Data Streams Based on Neural Network. INT J UNCERTAIN FUZZ 2018. [DOI: 10.1142/s021848851850037x] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Indexed: 11/18/2022]
Abstract
Data stream learning in non-stationary environments with skewed class distributions has been receiving more attention in machine learning communities. This paper proposes a novel ensemble classification method (ECSDS) for classifying data streams with skewed class distributions. In the proposed ensemble method, a back-propagation neural network is selected as the base classifier. In order to demonstrate the effectiveness of our proposed method, we choose three baseline methods based on ECSDS and evaluate their overall performance on ten datasets from the UCI Machine Learning Repository. Moreover, the performance of incremental learning is also evaluated on these datasets. The experimental results show that our proposed method can effectively deal with classification problems on non-stationary data streams with class imbalance.
Affiliation(s)
- Yong Zhang
  - School of Computer and Information Technology, Liaoning Normal University, Dalian 116081, China
  - State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China
- Jiaxin Yu
  - School of Computer and Information Technology, Liaoning Normal University, Dalian 116081, China
- Wenzhe Liu
  - School of Computer and Information Technology, Liaoning Normal University, Dalian 116081, China
- Kaoru Ota
  - Department of Information and Electronic Engineering, Muroran Institute of Technology, Hokkaido 050-8585, Japan
|
24
|
Zheng YJ, Zhou XH, Sheng WG, Xue Y, Chen SY. Generative adversarial network based telecom fraud detection at the receiving bank. Neural Netw 2018; 102:78-86. [DOI: 10.1016/j.neunet.2018.02.015] [Citation(s) in RCA: 54] [Impact Index Per Article: 7.7] [Received: 12/29/2017] [Revised: 02/22/2018] [Accepted: 02/26/2018] [Indexed: 11/30/2022]
|
25
|
Synthetic semi-supervised learning in imbalanced domains: Constructing a model for donor-recipient matching in liver transplantation. Knowl Based Syst 2017. [DOI: 10.1016/j.knosys.2017.02.020] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Indexed: 11/19/2022]
|
26
|
Alahmadi HH, Shen Y, Fouad S, Luft CDB, Bentham P, Kourtzi Z, Tino P. Classifying Cognitive Profiles Using Machine Learning with Privileged Information in Mild Cognitive Impairment. Front Comput Neurosci 2016; 10:117. [PMID: 27909405 PMCID: PMC5112260 DOI: 10.3389/fncom.2016.00117] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.0] [Received: 06/30/2016] [Accepted: 10/31/2016] [Indexed: 11/28/2022] Open
Abstract
Early diagnosis of dementia is critical for assessing disease progression and potential treatment. State-of-the-art machine learning techniques have been increasingly employed to take on this diagnostic task. In this study, we employed Generalized Matrix Learning Vector Quantization (GMLVQ) classifiers to discriminate patients with Mild Cognitive Impairment (MCI) from healthy controls based on their cognitive skills. Further, we adopted a “Learning with privileged information” approach to combine cognitive and fMRI data for the classification task. The resulting classifier operates solely on the cognitive data while it incorporates the fMRI data as privileged information (PI) during training. This novel classifier is of practical use as the collection of brain imaging data is not always possible with patients and older participants. MCI patients and healthy age-matched controls were trained to extract structure from temporal sequences. We ask whether machine learning classifiers can be used to discriminate patients from controls and whether differences between these groups relate to individual cognitive profiles. To this end, we tested participants in four cognitive tasks: working memory, cognitive inhibition, divided attention, and selective attention. We also collected fMRI data before and after training on a probabilistic sequence learning task and extracted fMRI responses and connectivity as features for machine learning classifiers. Our results show that the PI-guided GMLVQ classifiers outperform the baseline classifier that only used the cognitive data. In addition, we found that for the baseline classifier, divided attention is the only relevant cognitive feature. When PI was incorporated, divided attention remained the most relevant feature while cognitive inhibition also became relevant for the task.
Interestingly, this analysis for the fMRI GMLVQ classifier suggests that (1) when the overall fMRI signal is used as input to the classifier, the post-training session is most relevant; and (2) when the graph feature reflecting the underlying spatiotemporal fMRI pattern is used, the pre-training session is most relevant. Taken together, these results suggest that brain connectivity before training and overall fMRI signal after training are both diagnostic of cognitive skills in MCI.
Affiliation(s)
- Hanin H Alahmadi
  - School of Computer Science, The University of Birmingham, Birmingham, UK
- Yuan Shen
  - School of Computer Science, The University of Birmingham, Birmingham, UK
- Shereen Fouad
  - School of Dentistry, The University of Birmingham, Birmingham, UK
- Caroline Di B Luft
  - School of Biological and Chemical Sciences, Queen Mary University of London, London, UK
- Peter Bentham
  - School of Clinical and Experimental Medicine, The University of Birmingham, Birmingham, UK
- Zoe Kourtzi
  - Department of Psychology, The University of Cambridge, Cambridge, UK
- Peter Tino
  - School of Computer Science, The University of Birmingham, Birmingham, UK
|
27
|
Pérez-Ortiz M, Gutiérrez PA, Carbonero-Ruz M, Hervás-Martínez C. Semi-supervised learning for ordinal Kernel Discriminant Analysis. Neural Netw 2016; 84:57-66. [PMID: 27639724 DOI: 10.1016/j.neunet.2016.08.004] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 02/02/2016] [Revised: 08/09/2016] [Accepted: 08/15/2016] [Indexed: 10/21/2022]
Abstract
Ordinal classification considers those classification problems where the labels of the variable to predict follow a given order. Naturally, labelled data is scarce or difficult to obtain in this type of problem because, in many cases, ordinal labels are given by a user or expert (e.g. in recommendation systems). Firstly, this paper develops a new strategy for ordinal classification where both labelled and unlabelled data are used in the model construction step (a scheme which is referred to as semi-supervised learning). More specifically, the ordinal version of kernel discriminant learning is extended to this setting by considering the neighbourhood information of unlabelled data, which is proposed to be computed in the feature space induced by the kernel function. Secondly, a new method for semi-supervised kernel learning is devised in the context of ordinal classification, which is combined with our developed classification strategy to optimise the kernel parameters. The experiments conducted compare 6 different approaches for semi-supervised learning in the context of ordinal classification on a battery of 30 datasets, showing (1) the good synergy of the ordinal version of discriminant analysis and the use of unlabelled data and (2) the advantage of computing distances in the feature space induced by the kernel function.
Affiliation(s)
- M Pérez-Ortiz
  - Department of Quantitative Methods, Universidad Loyola Andalucía, 14004 - Córdoba, Spain.
- P A Gutiérrez
  - Department of Computer Science and Numerical Analysis, University of Córdoba, 14070 - Córdoba, Spain.
- M Carbonero-Ruz
  - Department of Quantitative Methods, Universidad Loyola Andalucía, 14004 - Córdoba, Spain.
- C Hervás-Martínez
  - Department of Computer Science and Numerical Analysis, University of Córdoba, 14070 - Córdoba, Spain.
|