101
|
Pérez-Castán JA, Pérez Sanz L, Fernández-Castellano M, Radišić T, Samardžić K, Tukarić I. Learning Assurance Analysis for Further Certification Process of Machine Learning Techniques: Case-Study Air Traffic Conflict Detection Predictor. Sensors (Basel, Switzerland) 2022; 22:7680. [PMID: 36236779 PMCID: PMC9573068 DOI: 10.3390/s22197680] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/16/2022] [Revised: 09/29/2022] [Accepted: 10/06/2022] [Indexed: 06/16/2023]
Abstract
Designing and developing artificial intelligence (AI)-based systems that can be trusted justifiably is one of the main issues aviation must face in the coming years. European Union Aviation Safety Agency (EASA) has developed a user guide that could be potentially transformed as means of compliance for future AI-based regulation. Designers and developers must understand how the learning assurance process of any machine learning (ML) model impacts trust. ML is a narrow branch of AI that uses statistical models to perform predictions. This work deals with the learning assurance process for ML-based systems in the field of air traffic control. A conflict detection tool has been developed to identify separation infringements among aircraft pairs, and the ML algorithm used for classification and regression was extreme gradient boosting. This paper analyses the validity and adaptability of EASA W-shaped methodology for ML-based systems. The results have identified the lack of the EASA W-shaped methodology in time-dependent analysis, by showing how time can impact ML algorithms designed in the case where no time requirements are considered. Another meaningful conclusion is, for systems that depend highly on when the prediction is made, classification and regression metrics cannot be one-size-fits-all because they vary over time.
Affiliation(s)
- Javier A. Pérez-Castán
  - ETSI Aeronáutica y del Espacio, Plaza Cardenal Cisneros, Universidad Politécnica de Madrid, 28008 Madrid, Spain
- Luis Pérez Sanz
  - ETSI Aeronáutica y del Espacio, Plaza Cardenal Cisneros, Universidad Politécnica de Madrid, 28008 Madrid, Spain
- Marta Fernández-Castellano
  - ETSI Aeronáutica y del Espacio, Plaza Cardenal Cisneros, Universidad Politécnica de Madrid, 28008 Madrid, Spain
- Tomislav Radišić
  - Faculty of Transport and Traffic Sciences, University of Zagreb, Borongajska Cesta, 10000 Zagreb, Croatia
- Kristina Samardžić
  - Faculty of Transport and Traffic Sciences, University of Zagreb, Borongajska Cesta, 10000 Zagreb, Croatia
- Ivan Tukarić
  - Faculty of Transport and Traffic Sciences, University of Zagreb, Borongajska Cesta, 10000 Zagreb, Croatia
|
102
|
Emakhu J, Monplaisir L, Aguwa C, Arslanturk S, Masoud S, Nassereddine H, Hamam MS, Miller JB. Acute coronary syndrome prediction in emergency care: A machine learning approach. Computer Methods and Programs in Biomedicine 2022; 225:107080. [PMID: 36037605 DOI: 10.1016/j.cmpb.2022.107080] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 10/12/2021] [Revised: 07/30/2022] [Accepted: 08/20/2022] [Indexed: 06/15/2023]
Abstract
BACKGROUND AND OBJECTIVE Clinical concern for acute coronary syndrome (ACS) is one of emergency medicine's most common patient encounters. This study aims to develop an ensemble learning-driven framework as a diagnostic support tool to prevent misdiagnosis. METHODS We obtained extensive clinical electronic health data on patient encounters with clinical concerns for ACS from a large urban emergency department (ED) between January 2017 and August 2020. We applied an analytical framework equipped with many well-developed algorithms to improve the data quality by addressing missing values, dimensionality reduction, and data imbalance. We trained ensemble learning algorithms to classify patients with ACS or non-ACS etiologies of their symptoms. We used performance evaluation metrics such as accuracy, sensitivity, precision, F1-score, and the area under the receiver operating characteristic curve (AUROC) to measure the model's performance. RESULTS The analysis included 31,228 patients, of whom 563 (1.8%) had ACS and 30,665 (98.2%) had alternative diagnoses. Eleven features, including systolic blood pressure, brain natriuretic peptide, chronic heart disease, coronary artery disease, creatinine, glucose, heart attack, heart rate, nephrotic syndrome, red cell distribution width, and troponin level, are reported as significantly contributing risk factors. The proposed framework successfully classifies these cohorts with sensitivity and AUROC as high as 86.3% and 93.3%. Our proposed model's accuracy, precision, specificity, Matthews correlation coefficient, and F1-score were 85.7%, 86.3%, 93%, 80%, and 86.3%, respectively. CONCLUSION With further refinement and validation, our proposed framework could identify patients with ACS early.
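The metrics named in this abstract all derive from the four cells of a binary confusion matrix. The following is a minimal Python sketch of those definitions, not the authors' code; the function name `binary_metrics` and the example counts are hypothetical and are not taken from the study.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Common binary-classification metrics from confusion-matrix counts."""

    def ratio(num, den):
        # Guard against empty denominators (e.g. no predicted positives).
        return num / den if den else 0.0

    sensitivity = ratio(tp, tp + fn)   # recall on the positive class
    specificity = ratio(tn, tn + fp)   # recall on the negative class
    precision = ratio(tp, tp + fp)
    accuracy = ratio(tp + tn, tp + fp + tn + fn)
    f1 = ratio(2 * precision * sensitivity, precision + sensitivity)
    # Matthews correlation coefficient uses all four cells.
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = ratio(tp * tn - fp * fn, mcc_den)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "accuracy": accuracy,
        "f1": f1,
        "mcc": mcc,
    }


# Hypothetical counts for illustration only (10% positive class).
m = binary_metrics(tp=80, fp=20, tn=880, fn=20)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

With these illustrative counts, accuracy is 0.96 while sensitivity, precision and F1 are all 0.80, which is why several metrics are usually reported together on imbalanced data.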
Affiliation(s)
- Joshua Emakhu
  - Department of Industrial and Systems Engineering, Wayne State University, Detroit, MI 48201, USA
- Leslie Monplaisir
  - Department of Industrial and Systems Engineering, Wayne State University, Detroit, MI 48201, USA
- Celestine Aguwa
  - Department of Industrial and Systems Engineering, Wayne State University, Detroit, MI 48201, USA
- Suzan Arslanturk
  - Department of Computer Science, Wayne State University, Detroit, MI 48202, USA
- Sara Masoud
  - Department of Industrial and Systems Engineering, Wayne State University, Detroit, MI 48201, USA
- Mohamed S Hamam
  - Emergency Department, Henry Ford Hospital, Detroit, MI 48202, USA
- Joseph B Miller
  - Emergency Department, Henry Ford Hospital, Detroit, MI 48202, USA
|
103
|
Improved AdaBoost algorithm using misclassified samples oriented feature selection and weighted non-negative matrix factorization. Neurocomputing 2022. [DOI: 10.1016/j.neucom.2022.08.015] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/22/2022]
|
104
|
Jorge AM, Smith D, Wu Z, Chowdhury T, Costenbader K, Zhang Y, Choi HK, Feldman CH, Zhao Y. Exploration of machine learning methods to predict systemic lupus erythematosus hospitalizations. Lupus 2022; 31:1296-1305. [PMID: 35835534 PMCID: PMC9547899 DOI: 10.1177/09612033221114805] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Indexed: 01/14/2023]
Abstract
OBJECTIVES Systemic lupus erythematosus (SLE) is a heterogeneous disease characterized by disease flares which can require hospitalization. Our objective was to apply machine learning methods to predict hospitalizations for SLE from electronic health record (EHR) data. METHODS We identified patients with SLE in a longitudinal EHR-based cohort with ≥2 outpatient rheumatology visits between 2012 and 2019. We applied multiple machine learning methods to predict hospitalizations with a primary diagnosis code for SLE, including decision tree, random forest, naive Bayes, logistic regression, and an ensemble method. Candidate predictors were derived from structured EHR features, including demographics, laboratory tests, medications, ICD-9/10 codes for SLE manifestations, and healthcare utilization. We used two approaches to assess these variables over longitudinal follow-up, including the incorporation of lagged features to capture changes over time of clinical data. The performance of each model was evaluated by overall accuracy, the F statistic, and the area under the receiver operator curve (AUC). RESULTS We identified 1996 patients with SLE. 4.6% were hospitalized for SLE in their most recent year of follow-up. Random forest models had highest performance in predicting SLE hospitalizations, with AUC 0.751 and AUC 0.772 for two approaches (averaging and progressive), respectively. The leading predictors of SLE hospitalizations included dsDNA positivity, C3 level, blood cell counts, and inflammatory markers as well as age and albumin. CONCLUSION We have demonstrated that machine learning methods can predict SLE hospitalizations. We identified key predictors of these events including known markers of SLE disease activity; further validation in external cohorts is warranted.
Affiliation(s)
- April M Jorge
  - Division of Rheumatology, Allergy, and Immunology, Harvard Medical School, Massachusetts General Hospital, Boston, MA, USA
- Dylan Smith
  - Department of Computer and Information Sciences, Fordham University, New York, NY, USA
- Zhiyao Wu
  - Department of Computer and Information Sciences, Fordham University, New York, NY, USA
- Tashrif Chowdhury
  - Department of Computer and Information Sciences, Fordham University, New York, NY, USA
- Karen Costenbader
  - Division of Rheumatology, Inflammation, and Immunity, Harvard Medical School, Brigham and Women's Hospital, Boston, MA, USA
- Yuqing Zhang
  - Division of Rheumatology, Allergy, and Immunology, Harvard Medical School, Massachusetts General Hospital, Boston, MA, USA
- Hyon K Choi
  - Division of Rheumatology, Allergy, and Immunology, Harvard Medical School, Massachusetts General Hospital, Boston, MA, USA
- Candace H Feldman
  - Division of Rheumatology, Inflammation, and Immunity, Harvard Medical School, Brigham and Women's Hospital, Boston, MA, USA
- Yijun Zhao
  - Department of Computer and Information Sciences, Fordham University, New York, NY, USA
|
105
|
Chen Z, Duan J, Kang L, Qiu G. Class-Imbalanced Deep Learning via a Class-Balanced Ensemble. IEEE Transactions on Neural Networks and Learning Systems 2022; 33:5626-5640. [PMID: 33900923 DOI: 10.1109/tnnls.2021.3071122] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Indexed: 06/12/2023]
Abstract
Class imbalance is a prevalent phenomenon in various real-world applications and it presents significant challenges to model learning, including deep learning. In this work, we embed ensemble learning into the deep convolutional neural networks (CNNs) to tackle the class-imbalanced learning problem. An ensemble of auxiliary classifiers branching out from various hidden layers of a CNN is trained together with the CNN in an end-to-end manner. To that end, we designed a new loss function that can rectify the bias toward the majority classes by forcing the CNN's hidden layers and its associated auxiliary classifiers to focus on the samples that have been misclassified by previous layers, thus enabling subsequent layers to develop diverse behavior and fix the errors of previous layers in a batch-wise manner. A unique feature of the new method is that the ensemble of auxiliary classifiers can work together with the main CNN to form a more powerful combined classifier, or can be removed after finished training the CNN and thus only acting the role of assisting class imbalance learning of the CNN to enhance the neural network's capability in dealing with class-imbalanced data. Comprehensive experiments are conducted on four benchmark data sets of increasing complexity (CIFAR-10, CIFAR-100, iNaturalist, and CelebA) and the results demonstrate significant performance improvements over the state-of-the-art deep imbalance learning methods.
|
106
|
Threshold prediction for detecting rare positive samples using a meta-learner. Pattern Anal Appl 2022. [DOI: 10.1007/s10044-022-01103-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/14/2022]
|
107
|
Explainable AI: A Neurally-Inspired Decision Stack Framework. Biomimetics (Basel) 2022; 7:127. [PMID: 36134931 PMCID: PMC9496620 DOI: 10.3390/biomimetics7030127] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 07/18/2022] [Revised: 09/01/2022] [Accepted: 09/07/2022] [Indexed: 11/17/2022] Open
Abstract
European law now requires AI to be explainable in the context of adverse decisions affecting the European Union (EU) citizens. At the same time, we expect increasing instances of AI failure as it operates on imperfect data. This paper puts forward a neurally inspired theoretical framework called "decision stacks" that can provide a way forward in research to develop Explainable Artificial Intelligence (X-AI). By leveraging findings from the finest memory systems in biological brains, the decision stack framework operationalizes the definition of explainability. It then proposes a test that can potentially reveal how a given AI decision was made.
|
108
|
Solving the class imbalance problem using a counterfactual method for data augmentation. Machine Learning with Applications 2022. [DOI: 10.1016/j.mlwa.2022.100375] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/22/2022] Open
|
109
|
Narwane SV, Sawarkar SD. Is handling unbalanced datasets for machine learning uplifts system performance?: A case of diabetic prediction. Diabetes Metab Syndr 2022; 16:102609. [PMID: 36099677 DOI: 10.1016/j.dsx.2022.102609] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/05/2022] [Revised: 08/21/2022] [Accepted: 08/23/2022] [Indexed: 11/30/2022]
Abstract
BACKGROUND AND AIMS Healthcare is a sensitive sector, and addressing the class imbalance in the healthcare domain is a time-consuming task for machine learning-based systems due to the vast amount of data. This study looks into the impact of socioeconomic disparities on the healthcare data of diabetic patients to make accurate disease predictions. METHODS This study proposed a systematic approach of Closest Distance Ranking and Principal Component Analysis to deal with the unbalanced dataset. A typical machine learning technique was used to analyze the proposed approach. The data set of pregnant diabetic women is analysed for accurate detection. RESULTS The results of the case are analysed using sensitivity, which demonstrates that the minority class's lack of information makes it impossible to forecast the results. On the other hand, the unbalanced dataset was treated using the proposed technique and evaluated with the machine learning algorithm which significantly increased the performance of the system. CONCLUSION The performance of the machine learning-based system was significantly enhanced by the unbalanced dataset which was processed with the proposed technique and evaluated with the machine learning algorithm. For the first time, an unbalanced dataset was treated with a combination of Closest Distance Ranking and Principal Component Analysis.
Affiliation(s)
- Swati V Narwane
  - Department of Computer Engineering, Datta Meghe College of Engineering, Navi Mumbai, Pin Code: 400 708, India
- Sudhir D Sawarkar
  - Department of Computer Engineering, Datta Meghe College of Engineering, Navi Mumbai, Pin Code: 400 708, India
|
110
|
Ahmed J, Green II RC. Predicting severely imbalanced data disk drive failures with machine learning models. Machine Learning with Applications 2022. [DOI: 10.1016/j.mlwa.2022.100361] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 12/27/2022] Open
|
111
|
Liu X, Guo L, Wang H, Guo J, Yang S, Duan L. Research on imbalance machine learning methods for MR T1WI soft tissue sarcoma data. BMC Med Imaging 2022; 22:149. [PMID: 36028803 PMCID: PMC9417078 DOI: 10.1186/s12880-022-00876-5] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 05/19/2022] [Accepted: 08/08/2022] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Soft tissue sarcoma is a rare and highly heterogeneous tumor in clinical practice. Pathological grading of soft tissue sarcoma is a key factor in patient prognosis and treatment planning, but the clinical data of soft tissue sarcoma are imbalanced. In this paper, we propose an effective solution to find the optimal imbalance machine learning model for predicting the classification of soft tissue sarcoma data. METHODS A large number of features are first obtained from T1WI images using radiomics methods. Then, we explore methods of feature selection, sampling and classification, obtain 17 imbalance machine learning models based on the above features, and perform extensive experiments to classify imbalanced soft tissue sarcoma data. We also used another dataset splitting method, which could improve the classification performance and verify the validity of the models. RESULTS The experimental results show that the combination of the extremely randomized trees (ERT) classification algorithm with SMOTETomek and the recursive feature elimination technique (RFE) performs best compared to other methods. The accuracy of RFE+STT+ERT is 81.57%, which is close to the accuracy of biopsy, and the accuracy is 95.69% when using another dataset splitting method. CONCLUSION Preoperative prediction of the pathological grade of soft tissue sarcoma in an accurate and noninvasive manner is essential. Our proposed machine learning method (RFE+STT+ERT) can make a positive contribution to solving the imbalanced data classification problem, which can favorably support the development of personalized treatment plans for soft tissue sarcoma patients.
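SMOTETomek, the sampler named in this abstract, combines SMOTE oversampling with Tomek-link removal. The Python sketch below illustrates only the SMOTE interpolation step, in which a synthetic minority sample is placed on the segment between a minority point and one of its nearest minority neighbours. It is not the authors' pipeline; the function name `smote_like`, the parameter defaults and the toy points are assumptions, and Tomek-link cleaning, RFE and the ERT classifier are not shown.

```python
import random


def smote_like(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic points by SMOTE-style interpolation.

    minority: list of equal-length numeric tuples (at least two points).
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbours of x, by squared Euclidean distance.
        others = [p for p in minority if p is not x]
        neighbours = sorted(
            others,
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
        )[:k]
        nb = rng.choice(neighbours)
        lam = rng.random()  # interpolation weight in [0, 1)
        synthetic.append(tuple(a + lam * (b - a) for a, b in zip(x, nb)))
    return synthetic


# Toy minority class: the four corners of the unit square (illustrative).
minority_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_like(minority_points, n_new=5)
print(new_points)
```

Because each synthetic point is a convex combination of two existing minority points, it always stays inside the region those points span.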
Affiliation(s)
- Xuanxuan Liu
  - College of Computer Science and Technology, Qingdao University, Qingdao, 266071 China
- Li Guo
  - College of Computer Science and Technology, Qingdao University, Qingdao, 266071 China
- Hexiang Wang
  - Department of Radiology, The Affiliated Hospital of Qingdao University, Qingdao, China
- Jia Guo
  - Department of Radiology, The Affiliated Hospital of Qingdao University, Qingdao, China
- Shifeng Yang
  - Department of Radiology, Shandong Provincial Hospital Affiliated to Shandong First Medical University, Jinan, China
- Lisha Duan
  - Department of Radiology, The Third Hospital of Hebei Medical University, Shijiazhuang, Qingdao, China
|
112
|
Wang X, Zhang R, Zhang Z. A Novel Hybrid Sampling Method ESMOTE+SSLM for Handling the Problem of Class Imbalance with Overlap in Financial Distress Detection. Neural Process Lett 2022. [DOI: 10.1007/s11063-022-10998-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/25/2022]
|
113
|
Gong X, Jia L, Li N. Research on mobile traffic data augmentation methods based on SA-ACGAN-GN. Mathematical Biosciences and Engineering: MBE 2022; 19:11512-11532. [PMID: 36124601 DOI: 10.3934/mbe.2022536] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/15/2023]
Abstract
With the rapid development and application of the mobile Internet, it is necessary to analyze and classify mobile traffic to meet the needs of users. Due to the difficulty in collecting some application data, the mobile traffic data presents a long-tailed distribution, resulting in a decrease in classification accuracy. In addition, the original GAN is difficult to train, and it is prone to "mode collapse". Therefore, this paper introduces the self-attention mechanism and gradient normalization into the auxiliary classifier generative adversarial network to form SA-ACGAN-GN model to solve the long-tailed distribution and training stability problems of mobile traffic data. This method firstly converts the traffic into images; secondly, to improve the quality of the generated images, the self-attention mechanism is introduced into the ACGAN model to obtain the global geometric features of the images; finally, the gradient normalization strategy is added to SA-ACGAN to further improve the data augmentation effect and improve the training stability. It can be seen from the cross-validation experimental data that, on the basis of using the same classifier, the SA-ACGAN-GN algorithm proposed in this paper, compared with other comparison algorithms, has the best precision reaching 93.8%; after adding gradient normalization, during the training process of the model, the classification loss decreases rapidly and the loss curve fluctuates less, indicating that the method proposed in this paper can not only effectively improve the long-tail problem of the dataset, but also enhance the stability of the model training.
Affiliation(s)
- Xingyu Gong
  - College of Computer Science and Technology, Xi'an University of Science and Technology, Xi'an 710054, China
- Ling Jia
  - College of Computer Science and Technology, Xi'an University of Science and Technology, Xi'an 710054, China
  - Hanzhong Vocational School of Science and Technology, Hanzhong 723200, China
- Na Li
  - College of Computer Science and Technology, Xi'an University of Science and Technology, Xi'an 710054, China
|
114
|
Diaz-Uriarte R, Gómez de Lope E, Giugno R, Fröhlich H, Nazarov PV, Nepomuceno-Chamorro IA, Rauschenberger A, Glaab E. Ten quick tips for biomarker discovery and validation analyses using machine learning. PLoS Comput Biol 2022; 18:e1010357. [PMID: 35951526 PMCID: PMC9371329 DOI: 10.1371/journal.pcbi.1010357] [Citation(s) in RCA: 20] [Impact Index Per Article: 6.7] [Indexed: 11/19/2022] Open
Affiliation(s)
- Ramon Diaz-Uriarte
  - Department of Biochemistry, School of Medicine, Universidad Autónoma de Madrid, Instituto de Investigaciones Biomédicas ‘Alberto Sols’ (UAM-CSIC), Madrid, Spain
- Elisa Gómez de Lope
  - Luxembourg Centre for Systems Biomedicine (LCSB), University of Luxembourg, Luxembourg
- Rosalba Giugno
  - Department of Computer Science, University of Verona, Verona, Italy
- Holger Fröhlich
  - Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Sankt Augustin, Germany
  - Bonn-Aachen International Centre for IT (b-it), Rheinische Friedrich-Wilhelms-Universität Bonn, Bonn, Germany
- Petr V. Nazarov
  - Department of Cancer Research, Luxembourg Institute of Health, Strassen, Luxembourg
- Armin Rauschenberger
  - Luxembourg Centre for Systems Biomedicine (LCSB), University of Luxembourg, Luxembourg
- Enrico Glaab
  - Luxembourg Centre for Systems Biomedicine (LCSB), University of Luxembourg, Luxembourg
|
115
|
Khaksar Fasaee MA, Pesantez J, Pieper KJ, Ling E, Benham B, Edwards M, Berglund E. Developing early warning systems to predict water lead levels in tap water for private systems. Water Research 2022; 221:118787. [PMID: 35841794 DOI: 10.1016/j.watres.2022.118787] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 09/02/2021] [Revised: 06/16/2022] [Accepted: 06/20/2022] [Indexed: 06/15/2023]
Abstract
Lead is a chemical contaminant that threatens public health, and high levels of lead have been identified in drinking water at locations across the globe. Under-served populations that use private systems for drinking water supplies may be at an elevated level of risk because utilities and governing agencies are not responsible for ensuring that lead levels meet the Lead and Copper Rule at these systems. Predictive models that can be used by residents to assess water quality threats in their households can create awareness of water lead levels (WLLs). This research explores and compares the use of statistical models (i.e., Bayesian Belief classifiers) and machine learning models (i.e., ensemble of decision trees) for predicting WLLs. Models are developed using a dataset collected by the Virginia Household Water Quality Program (VAHWQP) at approximately 8000 households in Virginia during 2012-2017. The dataset reports laboratory-tested water quality parameters at households, location information, and household and plumbing characteristics, including observations of water odor, taste, discoloration. Some water quality parameters, such as pH, iron, and copper, can be measured at low resolution by residents using at-home water test kits and can be used to predict risk of WLLs. The use of at-home water quality test kits was simulated through the discretization of water quality parameter measurements to match the resolution of at-home water quality test kits and the introduction of error in water quality readings. Using this approach, this research demonstrates that low-resolution data collected by residents can be used as input for models to estimate WLLs. Model predictability was explored for a set of at-home water quality test kits that observe a variety of water quality parameters and report parameters at a range of resolutions. The effects of the timing of water sampling (e.g., first-draw vs. flushed samples) and error in kits on model error were tested through simulations. The prediction models developed through this research provide a set of tools for private well users to assess the risk of lead contamination. Models can be implemented as early warning systems in citizen science and online platforms to improve awareness of drinking water threats.
Affiliation(s)
- Mohammad Ali Khaksar Fasaee
  - Department of Civil, Construction, and Environmental Engineering, North Carolina State University, Raleigh, NC 27695, USA
- Jorge Pesantez
  - Department of Civil, Construction, and Environmental Engineering, North Carolina State University, Raleigh, NC 27695, USA
- Kelsey J Pieper
  - Department of Civil and Environmental Engineering, Northeastern University, Boston, MA 02115, USA
- Erin Ling
  - Department of Biological Systems Engineering, Virginia Tech, Blacksburg, VA 24061, USA
- Brian Benham
  - Department of Biological Systems Engineering, Virginia Tech, Blacksburg, VA 24061, USA
- Marc Edwards
  - Department of Civil and Environmental Engineering, Virginia Tech, Blacksburg, VA 24061, USA
- Emily Berglund
  - Department of Civil, Construction, and Environmental Engineering, North Carolina State University, Raleigh, NC 27695, USA
|
116
|
|
117
|
Gu Q, Tian J, Li X, Jiang S. A novel Random Forest integrated model for imbalanced data classification problem. Knowl Based Syst 2022. [DOI: 10.1016/j.knosys.2022.109050] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/18/2022]
|
118
|
Viola R, Gautheron L, Habrard A, Sebban M. MetaAP: a meta-tree-based ranking algorithm optimizing the average precision from imbalanced data. Pattern Recognit Lett 2022. [DOI: 10.1016/j.patrec.2022.07.019] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/16/2022]
|
119
|
Kim M, Hwang KB. An empirical evaluation of sampling methods for the classification of imbalanced data. PLoS One 2022; 17:e0271260. [PMID: 35901023 PMCID: PMC9333262 DOI: 10.1371/journal.pone.0271260] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/02/2022] [Accepted: 06/28/2022] [Indexed: 11/18/2022] Open
Abstract
In numerous classification problems, class distribution is not balanced. For example, positive examples are rare in the fields of disease diagnosis and credit card fraud detection. General machine learning methods are known to be suboptimal for such imbalanced classification. One popular solution is to balance training data by oversampling the underrepresented (or undersampling the overrepresented) classes before applying machine learning algorithms. However, despite its popularity, the effectiveness of sampling has not been rigorously and comprehensively evaluated. This study assessed combinations of seven sampling methods and eight machine learning classifiers (56 varieties in total) using 31 datasets with varying degrees of imbalance. We used the areas under the precision-recall curve (AUPRC) and receiver operating characteristics curve (AUROC) as the performance measures. The AUPRC is known to be more informative for imbalanced classification than the AUROC. We observed that sampling significantly changed the performance of the classifier (paired t-tests P < 0.05) only for few cases (12.2% in AUPRC and 10.0% in AUROC). Surprisingly, sampling was more likely to reduce rather than improve the classification performance. Moreover, the adverse effects of sampling were more pronounced in AUPRC than in AUROC. Among the sampling methods, undersampling performed worse than others. Also, sampling was more effective for improving linear classifiers. Most importantly, we did not need sampling to obtain the optimal classifier for most of the 31 datasets. In addition, we found two interesting examples in which sampling significantly reduced AUPRC while significantly improving AUROC (paired t-tests P < 0.05). In conclusion, the applicability of sampling is limited because it could be ineffective or even harmful. Furthermore, the choice of the performance measure is crucial for decision making. 
Our results provide valuable insights into the effect and characteristics of sampling for imbalanced classification.
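The contrast this abstract draws between AUROC and AUPRC can be reproduced on a small ranked list. The Python sketch below is illustrative only and is not the study's evaluation code: `auroc` uses the pairwise (Mann-Whitney) formulation, `average_precision` is the usual step-wise estimate of the area under the precision-recall curve and assumes distinct scores, and the toy labels and scores are invented.

```python
def auroc(labels, scores):
    """Probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # Count pairwise wins; ties count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of precision@rank taken at the rank of each positive."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


# Toy data: 20 samples in descending score order, positives at ranks 3 and 5.
toy_scores = [1 - i / 20 for i in range(20)]
toy_labels = [1 if i in (2, 4) else 0 for i in range(20)]

roc = auroc(toy_labels, toy_scores)
ap = average_precision(toy_labels, toy_scores)
print(f"AUROC = {roc:.3f}, AP = {ap:.3f}")  # AUROC = 0.861, AP = 0.367
```

On this toy ranking the AUROC looks strong while the average precision stays low, because the few false positives ranked above the positives weigh heavily when positives are rare.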
Affiliation(s)
- Misuk Kim
  - Department of Computer Science and Engineering, Graduate School, Soongsil University, Seoul, Korea
- Kyu-Baek Hwang
  - Department of Computer Science and Engineering, Graduate School, Soongsil University, Seoul, Korea
|
120
|
Parity-based cumulative fairness-aware boosting. Knowl Inf Syst 2022. [DOI: 10.1007/s10115-022-01723-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/16/2022]
|
121
|
A Hybrid Algorithm-Level Ensemble Model for Imbalanced Credit Default Prediction in the Energy Industry. Energies 2022. [DOI: 10.3390/en15145206] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 02/06/2023]
Abstract
Credit default prediction for the energy industry is essential to promoting the healthy development of the energy industry in China. While previous studies have constructed various credit default prediction models with brilliant performance, the class-imbalance problem in the credit default dataset cannot be ignored, where the numbers of credit default cases are usually much smaller than the number of non-default ones. To address the class-imbalance problem, we proposed a novel CT-XGBoost model, which adds to XGBoost with two algorithm-level methods for class imbalance, including the cost-sensitive strategy and threshold method. Based on the credit default dataset consisting of energy corporates in western China, which suffers from the class-imbalance problem, the CT-XGBoost model achieves better performance than the conventional models. The results indicate that the proposed model can efficiently alleviate the inherent class-imbalance problem in the credit default dataset. Moreover, we analyze how the prediction performance is influenced by different parameter settings in the cost-sensitive strategy and threshold method. This study can help market investors and regulators precisely assess the credit risk in the energy industry and provides theoretical guidance to solving the class-imbalance problem in credit default prediction.
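The threshold method mentioned in this abstract can be pictured as moving the decision cut-off away from 0.5 so that missed defaults, the costlier error, become rarer. The Python sketch below is a generic illustration and not the CT-XGBoost model: the function name `pick_threshold`, the 10:1 cost ratio and the toy probabilities are assumptions, and the cost-sensitive training half of the method is not shown.

```python
def pick_threshold(labels, probs, cost_fn=10.0, cost_fp=1.0):
    """Return (threshold, cost) minimising total misclassification cost.

    A sample is predicted positive when its probability is >= threshold.
    Candidate thresholds are the observed probabilities themselves.
    """
    best_t, best_cost = 0.5, float("inf")
    for t in sorted(set(probs)):
        fn = sum(1 for y, p in zip(labels, probs) if y == 1 and p < t)
        fp = sum(1 for y, p in zip(labels, probs) if y == 0 and p >= t)
        cost = cost_fn * fn + cost_fp * fp
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t, best_cost


# Toy predicted default probabilities (illustrative); 1 = default.
toy_labels = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
toy_probs = [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.6, 0.8]

threshold, cost = pick_threshold(toy_labels, toy_probs)
print(threshold, cost)  # 0.35 1.0
```

At the default cut-off of 0.5 this toy set misses the default scored 0.35, for a cost of 10; lowering the cut-off to 0.35 trades that miss for a single false alarm.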
|
122
|
A score-based preprocessing technique for class imbalance problems. Pattern Anal Appl 2022. [DOI: 10.1007/s10044-022-01084-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/17/2022]
|
123
|
A Comparative Study on the Influence of Undersampling and Oversampling Techniques for the Classification of Physical Activities Using an Imbalanced Accelerometer Dataset. Healthcare (Basel) 2022; 10:healthcare10071255. [PMID: 35885782 PMCID: PMC9319570 DOI: 10.3390/healthcare10071255] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/21/2022] [Revised: 06/22/2022] [Accepted: 07/02/2022] [Indexed: 11/30/2022]
Abstract
Accelerometer data collected from wearable devices have recently been used to monitor physical activities (PAs) in daily life. While the intensity of PAs can be distinguished with a cut-off approach, it is important to discriminate different behaviors with similar accelerometry patterns to estimate energy expenditure. We aim to overcome the data imbalance problem that negatively affects machine learning-based PA classification by extracting well-defined features and applying undersampling and oversampling methods. We extracted various temporal, spectral, and nonlinear features from wrist-, hip-, and ankle-worn accelerometer data. Then, the influences of undersampling and oversampling were compared using various ML and DL approaches. Among various ML and DL models, ensemble methods including random forest (RF) and adaptive boosting (AdaBoost) exhibited great performance in differentiating sedentary behavior (driving) and three walking types (walking on level ground, ascending stairs, and descending stairs) even in a cross-subject paradigm. The undersampling approach, which has a low computational cost, exhibited classification results unbiased to the majority class. In addition, we found that RF could automatically select relevant features for PA classification depending on the sensor location by examining the importance of each node in multiple decision trees (DTs). This study proposes that ensemble learning using well-defined feature sets combined with the undersampling approach is robust for imbalanced datasets in PA classification. This approach will be useful for PA classification in the free-living situation, where data imbalance problems between classes are common.
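To make the undersampling idea in the abstract above concrete, here is a minimal sketch of random undersampling, in which every class is cut down to the size of the rarest class. It is not the authors' pipeline; the function name `random_undersample`, the seed, and the toy activity labels are assumptions made for illustration.

```python
import random
from collections import defaultdict

def random_undersample(samples, labels, seed=0):
    """Randomly drop samples so that every class keeps as many items as the rarest class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)
    # Size of the smallest class decides how many samples each class keeps.
    n_keep = min(len(items) for items in by_class.values())
    balanced = []
    for y, items in by_class.items():
        for x in rng.sample(items, n_keep):
            balanced.append((x, y))
    rng.shuffle(balanced)
    xs = [x for x, _ in balanced]
    ys = [y for _, y in balanced]
    return xs, ys

# Toy data (hypothetical): 8 "walking" windows versus 2 "driving" windows.
samples = list(range(10))
labels = ["walking"] * 8 + ["driving"] * 2
xs, ys = random_undersample(samples, labels)
```

Because samples are only discarded and never synthesised, the cost is a single pass over the data, which matches the low computational cost the abstract attributes to undersampling.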
|
124
|
Wei G, Mu W, Song Y, Dou J. An improved and random synthetic minority oversampling technique for imbalanced data. Knowl Based Syst 2022. [DOI: 10.1016/j.knosys.2022.108839] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/30/2022]
|
125
|
Fernando KRM, Tsokos CP. Dynamically Weighted Balanced Loss: Class Imbalanced Learning and Confidence Calibration of Deep Neural Networks. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2022; 33:2940-2951. [PMID: 33444149 DOI: 10.1109/tnnls.2020.3047335] [Citation(s) in RCA: 33] [Impact Index Per Article: 11.0] [Indexed: 06/12/2023]
Abstract
Imbalanced class distribution is an inherent problem in many real-world classification tasks where the minority class is the class of interest. Many conventional statistical and machine learning classification algorithms are subject to frequency bias, and learning discriminating boundaries between the minority and majority classes could be challenging. To address the class distribution imbalance in deep learning, we propose a class rebalancing strategy based on a class-balanced dynamically weighted loss function where weights are assigned based on the class frequency and predicted probability of ground-truth class. The ability of dynamic weighting scheme to self-adapt its weights depending on the prediction scores allows the model to adjust for instances with varying levels of difficulty resulting in gradient updates driven by hard minority class samples. We further show that the proposed loss function is classification calibrated. Experiments conducted on highly imbalanced data across different applications of cyber intrusion detection (CICIDS2017 data set) and medical imaging (ISIC2019 data set) show robust generalization. Theoretical results supported by superior empirical performance provide justification for the validity of the proposed dynamically weighted balanced (DWB) loss function.
|
126
|
Chen W, Yang K, Yu Z, Zhang W. Double-kernel based class-specific broad learning system for multiclass imbalance learning. Knowl Based Syst 2022. [DOI: 10.1016/j.knosys.2022.109535] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/16/2022]
|
128
|
Deep instance envelope network-based imbalance learning algorithm with multilayer fuzzy C-means clustering and minimum interlayer discrepancy. Appl Soft Comput 2022. [DOI: 10.1016/j.asoc.2022.108846] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/21/2022]
|
129
|
Predicting Compressive Strength of Blast Furnace Slag and Fly Ash Based Sustainable Concrete Using Machine Learning Techniques: An Application of Advanced Decision-Making Approaches. BUILDINGS 2022. [DOI: 10.3390/buildings12070914] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Indexed: 01/25/2023]
Abstract
The utilization of waste industrial materials such as Blast Furnace Slag (BFS) and Fly Ash (F. Ash) will provide an effective alternative strategy for producing eco-friendly and sustainable concrete production. However, testing is a time-consuming process, and the use of soft machine learning (ML) techniques to predict concrete strength can help speed up the procedure. In this study, artificial neural networks (ANNs) and decision trees (DTs) were used for predicting the compressive strength of the concrete. A total of 1030 datasets with eight factors (OPC, F. Ash, BFS, water, days, SP, FA, and CA) were used as input variables for the prediction of concrete compressive strength (response) with the help of training and testing individual models. The reliability and accuracy of the developed models are evaluated in terms of statistical analysis such as R2, RMSE, MAD and SSE. Both models showed a strong correlation and high accuracy between predicted and actual Compressive Strength (CS) along with the eight factors. The DT model gave a significant relation to the CS with R2 values of 0.943 and 0.836, respectively. Hence, the ANNs and DT models can be utilized to predict and train the compressive strength of high-performance concrete and to achieve long-term sustainability. This study will help in the development of prediction models for composite materials for buildings.
|
130
|
An Ensemble Transfer Learning Spiking Immune System for Adaptive Smart Grid Protection. ENERGIES 2022. [DOI: 10.3390/en15124398] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/16/2022]
Abstract
The rate of technical innovation, system interconnection, and advanced communications undoubtedly boost distributed energy networks’ efficiency. However, when an additional attack surface is made available, the possibility of an increase in attacks is an unavoidable result. The energy ecosystem’s significant variety draws attackers with various goals, making any critical infrastructure a threat, regardless of scale. Outdated technology and other antiquated countermeasures that worked years ago cannot address the complexity of current threats. As a result, robust artificial intelligence cyber-defense solutions are more important than ever. Based on the above challenge, this paper proposes an ensemble transfer learning spiking immune system for adaptive smart grid protection. It is an innovative Artificial Immune System (AIS) that uses a swarm of Evolving Izhikevich Neural Networks (EINN) in an Ensemble architecture, which optimally integrates Transfer Learning methodologies. The effectiveness of the proposed innovative system is demonstrated experimentally in multiple complex scenarios that optimally simulate the modern energy environment. The most significant findings of this work are that the transfer learning architecture’s shared learning rate significantly adds to the speed of generalization and convergence approach. In addition, the ensemble combination improves the accuracy of the model because the overall behavior of the numerous models is less noisy than a comparable individual single model. Finally, the Izhikevich Spiking Neural Network used here, due to its dynamic configuration, can reproduce different spikes and triggering behaviors of neurons, which models precisely the problem of digital security of energy infrastructures, as proved experimentally.
|
131
|
Abstract
Data analysis methods have scarcely kept pace with the rapid increase in Earth observations, spurring the development of novel algorithms, storage methods, and computational techniques. For scientists interested in Mars, the problem is always the same: there is simultaneously never enough of the right data and an overwhelming amount of data in total. Finding sufficient data needles in a haystack to test a hypothesis requires hours of manual data screening, and more needles and hay are added constantly. To date, the vast majority of Martian research has been focused on either one-off local/regional studies or on hugely time-consuming manual global studies. Machine learning in its numerous forms can be helpful for future such work. Machine learning has the potential to help map and classify a large variety of both features and properties on the surface of Mars and to aid in the planning and execution of future missions. Here, we outline the current extent of machine learning as applied to Mars, summarize why machine learning should be an important tool for planetary geomorphology in particular, and suggest numerous research avenues and funding priorities for future efforts. We conclude that: (1) moving toward methods that require less human input (i.e., self- or semi-supervised) is an important paradigm shift for Martian applications, (2) new robust methods using generative adversarial networks to generate synthetic high-resolution digital terrain models represent an exciting new avenue for Martian geomorphologists, (3) more effort and money must be directed toward developing standardized datasets and benchmark tests, and (4) the community needs a large-scale, generalized, and programmatically accessible geographic information system (GIS).
|
132
|
Majority-to-minority resampling for boosting-based classification under imbalanced data. APPL INTELL 2022. [DOI: 10.1007/s10489-022-03585-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/02/2022]
|
133
|
Ajani CK, Zhu Z, Sun DW. Microstructural Classification and Reconstruction of the Computational Geometry of Steamed Bread Using Descriptor-Based Approach. Transp Porous Media 2022. [DOI: 10.1007/s11242-022-01796-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/18/2022]
Abstract
Microstructures affect the properties of food products; accurate and relatively less complex microstructural representations are thus needed for modelling of transport phenomena during food processing. Hence, the present study aimed at developing computational microstructures of steamed bread using a descriptor-based approach. Relevant information was extracted from the scanning electron microscope (SEM) images of the steamed bread and evaluated using seven classifiers. For the automatic classification and using all descriptors, bagged trees ensembles (BTE) had the highest accuracy of 98.40%, while Gaussian Naïve Bayes was the least with 92.10% accuracy. In the “step forward” analysis, five descriptors had higher classification accuracy (98.80%) than all descriptors, implying that an increase in the number of descriptors might or might not increase classification accuracy. Microstructural validation revealed that the ellipse fitting method with a p value of 0.7984 for the area was found to be superior to the Voronoi method with a corresponding p value of 1.4554 × 10⁻⁵, confirming that the ellipse developed microstructure was more suitable for microscale modelling of transport phenomena in steamed bread.
|
134
|
Okabe M, Tsuchida J, Yadohisa H. F-measure maximizing logistic regression. COMMUN STAT-SIMUL C 2022. [DOI: 10.1080/03610918.2022.2081706] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/03/2022]
Affiliation(s)
- Masaaki Okabe
- Graduate School of Culture and Information Science, Doshisha University, Kyoto, Japan
| | - Jun Tsuchida
- Graduate School of Culture and Information Science, Doshisha University, Kyoto, Japan
| | - Hiroshi Yadohisa
- Graduate School of Culture and Information Science, Doshisha University, Kyoto, Japan
| |
|
135
|
Xu Y, Yu Z, Chen CLP. Classifier Ensemble Based on Multiview Optimization for High-Dimensional Imbalanced Data Classification. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2022; PP:870-883. [PMID: 35657843 DOI: 10.1109/tnnls.2022.3177695] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/15/2023]
Abstract
High-dimensional class imbalanced data have plagued the performance of classification algorithms seriously. Because of a large number of redundant/invalid features and the class imbalanced issue, it is difficult to construct an optimal classifier for high-dimensional imbalanced data. Classifier ensemble has attracted intensive attention since it can achieve better performance than an individual classifier. In this work, we propose a multiview optimization (MVO) to learn more effective and robust features from high-dimensional imbalanced data, based on which an accurate and robust ensemble system is designed. Specifically, an optimized subview generation (OSG) in MVO is first proposed to generate multiple optimized subviews from different scenarios, which can strengthen the classification ability of features and increase the diversity of ensemble members simultaneously. Second, a new evaluation criterion that considers the distribution of data in each optimized subview is developed based on which a selective ensemble of optimized subviews (SEOS) is designed to perform the subview selective ensemble. Finally, an oversampling approach is executed on the optimized view to obtain a new class rebalanced subset for the classifier. Experimental results on 25 high-dimensional class imbalanced datasets indicate that the proposed method outperforms other mainstream classifier ensemble methods.
|
136
|
Kanika, Singla J. A novel framework for online transaction fraud detection system based on deep neural network. JOURNAL OF INTELLIGENT & FUZZY SYSTEMS 2022. [DOI: 10.3233/jifs-212616] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/15/2022]
Abstract
Since the introduction of online payment systems, people have started making online transactions, which has also led to a rise in fraudulent transactions, causing users to lose money and creating distrust in online payment systems. Hence, fraud detection systems are the need of the hour. Among the transactions occurring on a daily basis, frauds are fewer in number than genuine transactions, so class imbalance naturally exists in fraud detection systems. In this research work, a novel framework for an online transaction fraud detection system based on a Deep Neural Network (DNN) has been proposed, utilizing an algorithm-level method capable of detecting frauds from imbalanced data while maintaining the overall performance of the model. The proposed system optimizes the decision threshold by efficiently utilizing the validation data to decide whether an incoming transaction is a fraud or not. To demonstrate the efficiency of the proposed system, three publicly available class-imbalanced datasets have been used. The proposed system has shown better performance than the data-level method. The results produced by the proposed fraud detection system have also been compared with existing machine learning techniques-based fraud detection systems. The experimental results show that the deep learning-based fraud detection system is more efficient at detecting frauds from imbalanced datasets having a large number of input features than the machine learning models.
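The abstract above does not state which criterion is used to optimize the decision threshold on the validation data. As a hedged sketch, the snippet below assumes F1 as that criterion and scans a grid of candidate thresholds; `f1_score`, `best_threshold`, the grid, and the toy validation scores are all hypothetical and are not taken from the paper.

```python
def f1_score(y_true, y_pred):
    """F1 of the positive (fraud) class from two lists of 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_threshold(y_val, scores_val, candidates):
    """Return the candidate threshold with the highest validation F1 (first one wins ties)."""
    best_t, best_f1 = None, -1.0
    for t in candidates:
        preds = [1 if s >= t else 0 for s in scores_val]
        f1 = f1_score(y_val, preds)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1

# Toy validation set (hypothetical): 6 genuine transactions, 2 frauds.
y_val = [0, 0, 0, 0, 0, 0, 1, 1]
scores_val = [0.05, 0.10, 0.20, 0.30, 0.15, 0.45, 0.40, 0.60]
candidates = [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95

threshold, val_f1 = best_threshold(y_val, scores_val, candidates)
```

The selected threshold would then be frozen and applied to incoming transactions; choosing it on validation data rather than test data keeps the final evaluation unbiased.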
Affiliation(s)
- Kanika
- School of CSE, Lovely Professional University, Punjab, India
| | - Jimmy Singla
- School of CSE, Lovely Professional University, Punjab, India
| |
|
137
|
Blanchard AE, Gao S, Yoon HJ, Christian JB, Durbin EB, Wu XC, Stroup A, Doherty J, Schwartz SM, Wiggins C, Coyle L, Penberthy L, Tourassi GD. A Keyword-Enhanced Approach to Handle Class Imbalance in Clinical Text Classification. IEEE J Biomed Health Inform 2022; 26:2796-2803. [PMID: 35020599 PMCID: PMC9533247 DOI: 10.1109/jbhi.2022.3141976] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/29/2022]
Abstract
Recent applications of deep learning have shown promising results for classifying unstructured text in the healthcare domain. However, the reliability of models in production settings has been hindered by imbalanced data sets in which a small subset of the classes dominate. In the absence of adequate training data, rare classes necessitate additional model constraints for robust performance. Here, we present a strategy for incorporating short sequences of text (i.e. keywords) into training to boost model accuracy on rare classes. In our approach, we assemble a set of keywords, including short phrases, associated with each class. The keywords are then used as additional data during each batch of model training, resulting in a training loss that has contributions from both raw data and keywords. We evaluate our approach on classification of cancer pathology reports, which shows a substantial increase in model performance for rare classes. Furthermore, we analyze the impact of keywords on model output probabilities for bigrams, providing a straightforward method to identify model difficulties for limited training data.
|
138
|
Tan Y, Zhao G. Multi-view representation learning with Kolmogorov-Smirnov to predict default based on imbalanced and complex dataset. Inf Sci (N Y) 2022. [DOI: 10.1016/j.ins.2022.03.022] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/27/2022]
|
139
|
Shi H, Zhang Y, Chen Y, Ji S, Dong Y. Resampling algorithms based on sample concatenation for imbalance learning. Knowl Based Syst 2022. [DOI: 10.1016/j.knosys.2022.108592] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 10/18/2022]
|
140
|
Adaptively Promoting Diversity in a Novel Ensemble Method for Imbalanced Credit-Risk Evaluation. MATHEMATICS 2022. [DOI: 10.3390/math10111790] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/04/2023]
Abstract
Ensemble learning techniques are widely applied to classification tasks such as credit-risk evaluation. As for most credit-risk evaluation scenarios in the real world, only imbalanced data are available for model construction, and the performance of ensemble models still needs to be improved. An ideal ensemble algorithm is supposed to improve diversity in an effective manner. Therefore, we provide an insight in considering an ensemble diversity-promotion method for imbalanced learning tasks. A novel ensemble structure is proposed, which combines self-adaptive optimization techniques and a diversity-promotion method (SA-DP Forest). Additional artificially constructed samples, generated by a fuzzy sampling method at each iteration, directly create diverse hypotheses and address the imbalanced classification problem while training the proposed model. Meanwhile, the self-adaptive optimization mechanism within the ensemble simultaneously balances the individual accuracy as the diversity increases. The results using the decision tree as a base classifier indicate that SA-DP Forest outperforms the comparative algorithms, as reflected by most evaluation metrics on three credit data sets and seven other imbalanced data sets. Our method is also more suitable for experimental data that are properly constructed with a series of artificial imbalance ratios on the original credit data set.
|
141
|
Machine Learning Models for Early Prediction of Sepsis on Large Healthcare Datasets. ELECTRONICS 2022. [DOI: 10.3390/electronics11091507] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 11/16/2022]
Abstract
Sepsis is a highly lethal syndrome with heterogeneous clinical manifestation that can be hard to identify and treat. Early diagnosis and appropriate treatment are critical to reduce mortality and promote survival in suspected cases and improve the outcomes. Several screening prediction systems have been proposed for evaluating the early detection of patient deterioration, but the efficacy is still limited at individual level. The increasing amount and the versatility of healthcare data suggest implementing machine learning techniques to develop models for predicting sepsis. This work presents an experimental study of some machine-learning-based models for sepsis prediction considering vital signs, laboratory test results, and demographics using Medical Information Mart for Intensive Care III (MIMIC-III) (v1.4), a publicly available dataset. The experimental results demonstrate an overall higher performance of machine learning models over the commonly used Sequential Organ Failure Assessment (SOFA) and Quick SOFA (qSOFA) scoring systems at the time of sepsis onset.
|
142
|
Guo Y, Jiao B, Tan Y, Zhang P, Tang F. A transfer weighted extreme learning machine for imbalanced classification. INT J INTELL SYST 2022. [DOI: 10.1002/int.22899] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/09/2022]
Affiliation(s)
- Yinan Guo
- School of Mechanical Electronic and Information Engineering China University of Mining and Technology (Beijing) Beijing China
- School of Information and Control Engineering China University of Mining and Technology Xuzhou China
| | - Botao Jiao
- School of Information and Control Engineering China University of Mining and Technology Xuzhou China
| | - Ying Tan
- School of Artificial Intelligence, Key Laboratory of Machine Perceptron (MOE), Institute for Artificial Intelligence Peking University Beijing China
| | - Pei Zhang
- School of Information and Control Engineering China University of Mining and Technology Xuzhou China
| | - Fengzhen Tang
- State Key Laboratory of Robotics, Shenyang Institute of Automation Chinese Academy of Sciences Shenyang China
- Institute for Robotics and Intelligent Manufacturing Chinese Academy of Sciences Shenyang China
| |
|
143
|
A fuzzy partition-based method to classify social messages assessing their emotional relevance. Inf Sci (N Y) 2022. [DOI: 10.1016/j.ins.2022.02.028] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/20/2022]
|
144
|
Dangut MD, Skaf Z, Jennions IK. Handling imbalanced data for aircraft predictive maintenance using the BACHE algorithm. Appl Soft Comput 2022. [DOI: 10.1016/j.asoc.2022.108924] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 11/29/2022]
|
145
|
Zhang Z, Liu H, Chen D, Zhang J, Li H, Shen M, Pu Y, Zhang Z, Zhao J, Hu J. SMOTE-based method for balanced spectral nondestructive testing of moldy apple core. Food Control 2022. [DOI: 10.1016/j.foodcont.2022.109100] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 01/07/2023]
|
146
|
Ning Z, Ye Z, Jiang Z, Zhang D. BESS: Balanced evolutionary semi-stacking for disease detection using partially labeled imbalanced data. Inf Sci (N Y) 2022. [DOI: 10.1016/j.ins.2022.02.026] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/15/2022]
|
147
|
Dai W, Ning C, Nan J, Wang D. Stochastic configuration networks for imbalanced data classification. INT J MACH LEARN CYB 2022. [DOI: 10.1007/s13042-022-01565-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/27/2022]
|
148
|
Malhotra R, Jain J. Predicting defects in imbalanced data using resampling methods: an empirical investigation. PeerJ Comput Sci 2022; 8:e573. [PMID: 35634102 PMCID: PMC9137963 DOI: 10.7717/peerj-cs.573] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 10/30/2020] [Accepted: 05/11/2021] [Indexed: 06/15/2023]
Abstract
The development of correct and effective software defect prediction (SDP) models is one of the utmost needs of the software industry. Statistics of many defect-related open-source data sets depict the class imbalance problem in object-oriented projects. Models trained on imbalanced data lead to inaccurate future predictions owing to biased learning and ineffective defect prediction. In addition, a large number of software metrics degrades the model performance. This study aims at (1) identification of useful metrics in the software using correlation feature selection, (2) extensive comparative analysis of 10 resampling methods to generate effective machine learning models for imbalanced data, (3) inclusion of stable performance evaluators (AUC, GMean, and Balance), and (4) integration of statistical validation of results. The impact of 10 resampling methods is analyzed on selected features of 12 object-oriented Apache datasets using 15 machine learning techniques. The performances of developed models are analyzed using AUC, GMean, Balance, and sensitivity. Statistical results advocate the use of resampling methods to improve SDP. Random oversampling portrays the best predictive capability of developed defect prediction models. The study provides a guideline for identifying metrics that are influential for SDP. The performances of oversampling methods are superior to undersampling methods.
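Two of the evaluators named in the abstract above, GMean and Balance, can be computed directly from a confusion matrix. The sketch below uses the definitions commonly found in the defect-prediction literature (GMean as the geometric mean of sensitivity and specificity; Balance as one minus the normalised distance from the ideal point with detection rate 1 and false-alarm rate 0). Whether the paper uses exactly these formulas is an assumption, and the function name and toy counts are hypothetical.

```python
import math

def gmean_and_balance(tp, fn, fp, tn):
    """Return (GMean, Balance) for a binary confusion matrix, under the assumed definitions."""
    detection_rate = tp / (tp + fn)    # sensitivity / recall on the defective class
    false_alarm_rate = fp / (fp + tn)  # 1 - specificity
    gmean = math.sqrt(detection_rate * (1 - false_alarm_rate))
    # Distance to the ideal point (false_alarm_rate=0, detection_rate=1), scaled by sqrt(2).
    distance = math.sqrt((0 - false_alarm_rate) ** 2 + (1 - detection_rate) ** 2)
    balance = 1 - distance / math.sqrt(2)
    return gmean, balance

# Toy confusion matrix (hypothetical): 10 defective and 90 clean modules.
gmean, balance = gmean_and_balance(tp=8, fn=2, fp=9, tn=81)
```

Both measures use only per-class rates, so unlike plain accuracy they are not inflated when the clean class dominates the data set.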
Affiliation(s)
- Ruchika Malhotra
- Department of Software Engineering, Delhi Technological University (former Delhi College of Engineering), Shahbad Daulatpur, Delhi, India
| | - Juhi Jain
- Department of Computer Science and Engineering, Delhi Technological University (former Delhi College of Engineering), Shahbad Daulatpur, Delhi, India
| |
|
149
|
Li F, Li H, Li Y, Wu H, Fu B, Ji Y, Wang C, Shi G. Decoupling Representation Learning for Imbalanced Electroencephalography Classification in Rapid Serial Visual Presentation Task. J Neural Eng 2022; 19. [PMID: 35472762 DOI: 10.1088/1741-2552/ac6a7d] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 12/02/2021] [Accepted: 04/25/2022] [Indexed: 11/11/2022]
Abstract
OBJECTIVE The class imbalance problem considerably restricts the performance of electroencephalography (EEG) classification in the rapid serial visual presentation (RSVP) task. Existing solutions typically employ re-balancing strategies (e.g. re-weighting and re-sampling) to alleviate the impact of class imbalance, which enhances the classifier learning of deep networks but unexpectedly damages the representative ability of the learned deep features as original distributions become distorted. APPROACH In this study, a novel decoupling representation learning (DRL) model has been proposed that separates the representation learning and classification processes to capture the discriminative feature of imbalanced RSVP EEG data while classifying it accurately. The representation learning process is responsible for learning universal patterns for the classification of all samples, while the classifier determines a better bounding for the target and non-target classes. Specifically, the representation learning process adopts a dual-branch architecture, which minimizes the contrastive loss to regularize the representation space. In addition, to learn more discriminative information from RSVP EEG data, a novel multi-granular information (MGI) based extractor is designed to extract spatial-temporal information. Considering the class re-balancing strategies can significantly promote classifier learning, the classifier was trained with rebalanced EEG data while freezing the parameters of the representation learning process. MAIN RESULTS To evaluate the proposed method, experiments were conducted on two public datasets and one self-conducted dataset. The results demonstrate that the proposed DRL can achieve state-of-the-art performance for EEG classification in the RSVP task. SIGNIFICANCE This is the first study to focus on the class imbalance problem and propose a generic solution in the RSVP task. Furthermore, multi-granular data was explored to extract more complementary spatial-temporal information. The code is open-source and available at https://github.com/Tammie-Li/DRL.
Affiliation(s)
- Fu Li
- Xidian University, No. 2 South Taibai Road, Xi'an, Shaanxi, Xian, Shaanxi, 710071, CHINA
| | - Hongxin Li
- Xidian University, No. 2 South Taibai Road, Xi'an, Shaanxi, Xian, Shaanxi, 710071, CHINA
| | - Yang Li
- Xidian University, No. 2 South Taibai Road, Xi'an, Shaanxi, Xian, 710071, CHINA
| | - Hao Wu
- Xidian University, No. 2 South Taibai Road, Xi'an, Shaanxi, Xian, Shaanxi, 710071, CHINA
| | - Boxun Fu
- Xidian University, No. 2 South Taibai Road, Xi'an, Shaanxi, Xian, Shaanxi, 710071, CHINA
| | - Youshuo Ji
- Xidian University, No. 2 South Taibai Road, Xi'an, Shaanxi, Xian, Shaanxi, 710071, CHINA
| | - Chong Wang
- Xidian University, No. 2 South Taibai Road, Xi'an, Shaanxi, Xian, Shaanxi, 710071, CHINA
| | | |
|
150
|
Imbalanced Fault Diagnosis of Rolling Bearing Using Data Synthesis Based on Multi-Resolution Fusion Generative Adversarial Networks. MACHINES 2022. [DOI: 10.3390/machines10050295] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 01/25/2023]
Abstract
Fault diagnosis of industrial bearings plays an invaluable role in the health monitoring of rotating machinery. In practice, there is far more normal data than faulty data, so the data usually exhibit a highly skewed class distribution. Algorithms developed using unbalanced datasets will suffer from severe model bias, reducing the accuracy and stability of the classification algorithm. To address these issues, a novel Multi-resolution Fusion Generative Adversarial Network (MFGAN) is proposed for the imbalanced fault diagnosis of rolling bearings via data augmentation. In the data-generation process, the improved feature transfer-based generator receives normal data as input to better learn the fault features, mapping the normal data into fault data space instead of random data space. A multi-scale ensemble discriminator architecture is designed to replace original single discriminator structure in the discriminative process, and multi-scale features are learned via ensemble discriminators. Finally, the proposed framework is validated on the public bearing dataset from Case Western Reserve University (CWRU), and experimental results show the superiority of our method.
|