1. Setiawan D, Wiranto Y, Girard JM, Watts A, Ashourvan A. Individualized Machine-learning-based Clinical Assessment Recommendation System. MEDRXIV : THE PREPRINT SERVER FOR HEALTH SCIENCES 2024:2024.07.24.24310941. [PMID: 39108531 PMCID: PMC11302612 DOI: 10.1101/2024.07.24.24310941] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/12/2024]
Abstract
Background: Traditional clinical assessments often lack individualization, relying on standardized procedures that may not accommodate the diverse needs of patients, especially in early stages where personalized diagnosis could offer significant benefits. We aim to provide a machine-learning framework that addresses the individualized feature addition problem and enhances diagnostic accuracy for clinical assessments.
Methods: Individualized Clinical Assessment Recommendation System (iCARE) employs locally weighted logistic regression and Shapley Additive Explanations (SHAP) value analysis to tailor feature selection to individual patient characteristics. Evaluations were conducted on synthetic and real-world datasets, including early-stage diabetes risk prediction and heart failure clinical records from the UCI Machine Learning Repository. We compared the performance of iCARE with a Global approach using statistical analysis on accuracy and area under the ROC curve (AUC) to select the best additional features.
Findings: The iCARE framework enhances predictive accuracy and AUC metrics when additional features exhibit distinct predictive capabilities, as evidenced by synthetic datasets 1-3 and the early diabetes dataset. Specifically, in synthetic dataset 1, iCARE achieved an accuracy of 0·999 and an AUC of 1·000, outperforming the Global approach with an accuracy of 0·689 and an AUC of 0·639. In the early diabetes dataset, iCARE shows improvements of 1·5-3·5% in accuracy and AUC across different numbers of initial features. Conversely, in synthetic datasets 4-5 and the heart failure dataset, where features lack discernible predictive distinctions, iCARE shows no significant advantage over global approaches on accuracy and AUC metrics.
Interpretation: iCARE provides personalized feature recommendations that enhance diagnostic accuracy in scenarios where individualized approaches are critical, improving the precision and effectiveness of medical diagnoses.
Funding: This work was supported by startup funding from the Department of Psychology at the University of Kansas provided to A.A., and the R01MH125740 award from NIH partially supported J.M.G.'s work.
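The locally weighted logistic regression named in the Methods of this abstract can be illustrated with a minimal sketch. It is not the iCARE implementation: the Gaussian kernel, the bandwidth `tau`, the learning rate, and the function names are all assumptions chosen for illustration, and the SHAP step is not shown. The idea is only that training rows close to the query patient get larger weights when the logistic model is fitted.

```python
import math


def gaussian_weights(X, query, tau):
    """Kernel weight per training row: rows near the query count more (assumed Gaussian kernel)."""
    weights = []
    for row in X:
        sq_dist = sum((a - b) ** 2 for a, b in zip(row, query))
        weights.append(math.exp(-sq_dist / (2.0 * tau ** 2)))
    return weights


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def local_logistic_predict(X, y, query, tau=1.0, lr=0.1, steps=500):
    """Fit a weighted logistic model around `query` by gradient ascent, then score `query`.

    Assumes at least some training rows lie near the query, so the weights do not all vanish.
    """
    w = gaussian_weights(X, query, tau)
    total = sum(w)
    coef = [0.0] * len(query)
    bias = 0.0
    for _ in range(steps):
        grad_coef = [0.0] * len(query)
        grad_bias = 0.0
        for row, label, wi in zip(X, y, w):
            p = sigmoid(bias + sum(c * v for c, v in zip(coef, row)))
            err = wi * (label - p)  # weighted residual of this row
            grad_bias += err
            for j, v in enumerate(row):
                grad_coef[j] += err * v
        bias += lr * grad_bias / total
        coef = [c + lr * g / total for c, g in zip(coef, grad_coef)]
    return sigmoid(bias + sum(c * v for c, v in zip(coef, query)))
```

A separate model is fitted for every query point, which is what makes the approach individualized and also what makes it more expensive than a single global fit.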
Affiliation(s)
- Devin Setiawan
- The University of Kansas, Department of Electrical Engineering and Computer Science, 1415 Jayhawk Blvd. Lawrence, KS 66045
- Yumiko Wiranto
- The University of Kansas, Department of Psychology, 1415 Jayhawk Blvd. Lawrence, KS 66045
- Jeffrey M Girard
- The University of Kansas, Department of Psychology, 1415 Jayhawk Blvd. Lawrence, KS 66045
- Amber Watts
- The University of Kansas, Department of Psychology, 1415 Jayhawk Blvd. Lawrence, KS 66045
- Arian Ashourvan
- The University of Kansas, Department of Psychology, 1415 Jayhawk Blvd. Lawrence, KS 66045
2. Kuo PF, Hsu WT, Lord D, Putra IGB. Classification of autonomous vehicle crash severity: Solving the problems of imbalanced datasets and small sample size. ACCIDENT; ANALYSIS AND PREVENTION 2024; 205:107666. [PMID: 38901160 DOI: 10.1016/j.aap.2024.107666] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/30/2023] [Revised: 05/21/2024] [Accepted: 06/03/2024] [Indexed: 06/22/2024]
Abstract
Only a few researchers have shown how environmental factors and road features relate to Autonomous Vehicle (AV) crash severity levels, and none have focused on the data limitation problems, such as small sample sizes, imbalanced datasets, and high dimensional features. To address these problems, we analyzed an AV crash dataset (2019 to 2021) from the California Department of Motor Vehicles (CA DMV), which included 266 collision reports (51 of those causing injuries). We included external environmental variables by collecting various points of interest (POIs) and roadway features from Open Street Map (OSM) and Data San Francisco (SF). Random Over-Sampling Examples (ROSE) and the Synthetic Minority Over-Sampling Technique (SMOTE) were used to balance the dataset and, at the same time, to enlarge it and so mitigate the small sample size problem. Mutual information, random forest, and XGBoost were used to address the high-dimensional feature selection problem caused by including many types of POIs as predictive variables. Because existing studies do not use consistent procedures, we compared performing feature selection first with performing data balancing first. Our results showed that AV crash severity levels are related to vehicle manufacturers, vehicle damage level, collision type, vehicle movement, the parties involved in the crash, speed limit, and some types of POIs (areas near transportation, entertainment venues, public places, schools, and medical facilities). Both resampling methods and three data preprocessing methods improved model performance, and the model that used SMOTE with data balancing performed first was the best. The results suggest that over-sampling and feature selection can improve model prediction performance and identify new factors related to AV crash severity levels.
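The core interpolation step behind SMOTE, as used in this abstract to balance and enlarge the crash dataset, can be sketched as follows. This is a simplified illustration and not the authors' pipeline: the function name, the neighbour count `k`, and the seed handling are assumptions, ROSE is not shown, and a real analysis would more likely rely on an established implementation such as the `SMOTE` class in imbalanced-learn.

```python
import random


def smote_like_samples(minority, n_new, k=3, seed=0):
    """Create n_new synthetic minority rows by interpolating towards a near neighbour.

    `minority` is a list of numeric feature lists with at least two rows.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority rows by squared Euclidean distance to the base row.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic
```

Because each synthetic row lies on a segment between two real minority rows, the new points stay inside the region the minority class already occupies, which is why the method suits numeric features better than categorical ones.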
Affiliation(s)
- Pei-Fen Kuo
- Department of Geomatics, National Cheng Kung University, Taiwan.
- Wei-Ting Hsu
- Department of Geomatics, National Cheng Kung University, Taiwan
- Dominique Lord
- Zachry Department of Civil and Environmental Engineering, Texas A&M University, USA
3. Liang J, Sawut M, Cui J, Hu X, Xue Z, Zhao M, Zhang X, Rouzi A, Ye X, Xilike A. Object-oriented multi-scale segmentation and multi-feature fusion-based method for identifying typical fruit trees in arid regions using Sentinel-1/2 satellite images. Sci Rep 2024; 14:18230. [PMID: 39107396 PMCID: PMC11303721 DOI: 10.1038/s41598-024-68991-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/30/2024] [Accepted: 07/30/2024] [Indexed: 08/10/2024] [Open Access]
Abstract
Fruit tree identification that is quick and precise lays the groundwork for scientifically evaluating orchard yields and dynamically monitoring planting areas. This study aims to evaluate the applicability of time series Sentinel-1/2 satellite data for fruit tree classification and to provide a new method for accurately extracting fruit tree species. The study area selected is the Tarim Basin, the most important fruit-growing region in northwest China. The main focus is on identifying several major fruit tree species in this region. Time series Sentinel-1/2 satellite images acquired from the Google Earth Engine (GEE) platform are used for the study. A multi-scale segmentation approach is applied, and six categories of features including spectral, phenological, texture, polarization, vegetation index, and red edge index features are constructed. A total of forty-four features are extracted and optimized using the Vi feature importance index to determine the best time phase. Based on this, an object-oriented (OO) segmentation combined with the Random Forest (RF) method is used to identify fruit tree species. To find the best method for fruit tree identification, the results are compared with three other widely used traditional machine learning algorithms: Support Vector Machine (SVM), Gradient Boosting Decision Tree (GBDT), and Classification and Regression Tree (CART). The results show that: (1) the object-oriented segmentation method helps to improve the accuracy of fruit tree identification features, and September satellite images provide the best time window for fruit tree identification, with spectral, phenological, and texture features contributing the most to fruit tree species identification.
(2) The RF model has higher accuracy in identifying fruit tree species than other machine learning models, with an overall accuracy (OA) and a kappa coefficient (KC) of 94.60% and 93.74% respectively, indicating that the combination of object-oriented segmentation and RF algorithm has great value and potential for fruit tree identification and classification. This method can be applied to large-scale fruit tree remote sensing classification and provides an effective technical means for monitoring fruit tree planting areas using medium-to-high-resolution remote sensing images.
Affiliation(s)
- Jiaxi Liang
- College of Geography and Remote Sensing Sciences, Xinjiang University, Urumqi, 830046, Xinjiang, China
- Xinjiang Key Laboratory of Oasis Ecology, Xinjiang University, Urumqi, 830046, Xinjiang, China
- Key Laboratory of Smart City and Environment Modelling of Higher Education Institute, Xinjiang University, Urumqi, 830046, Xinjiang, China
- Mamat Sawut
- College of Geography and Remote Sensing Sciences, Xinjiang University, Urumqi, 830046, Xinjiang, China.
- Xinjiang Key Laboratory of Oasis Ecology, Xinjiang University, Urumqi, 830046, Xinjiang, China.
- Key Laboratory of Smart City and Environment Modelling of Higher Education Institute, Xinjiang University, Urumqi, 830046, Xinjiang, China.
- Jintao Cui
- College of Geography and Remote Sensing Sciences, Xinjiang University, Urumqi, 830046, Xinjiang, China
- Xinjiang Key Laboratory of Oasis Ecology, Xinjiang University, Urumqi, 830046, Xinjiang, China
- Key Laboratory of Smart City and Environment Modelling of Higher Education Institute, Xinjiang University, Urumqi, 830046, Xinjiang, China
- Xin Hu
- College of Geography and Remote Sensing Sciences, Xinjiang University, Urumqi, 830046, Xinjiang, China
- Xinjiang Key Laboratory of Oasis Ecology, Xinjiang University, Urumqi, 830046, Xinjiang, China
- Key Laboratory of Smart City and Environment Modelling of Higher Education Institute, Xinjiang University, Urumqi, 830046, Xinjiang, China
- Zijing Xue
- College of Geography and Remote Sensing Sciences, Xinjiang University, Urumqi, 830046, Xinjiang, China
- Xinjiang Key Laboratory of Oasis Ecology, Xinjiang University, Urumqi, 830046, Xinjiang, China
- Key Laboratory of Smart City and Environment Modelling of Higher Education Institute, Xinjiang University, Urumqi, 830046, Xinjiang, China
- Ming Zhao
- College of Geography and Remote Sensing Sciences, Xinjiang University, Urumqi, 830046, Xinjiang, China
- Xinjiang Key Laboratory of Oasis Ecology, Xinjiang University, Urumqi, 830046, Xinjiang, China
- Key Laboratory of Smart City and Environment Modelling of Higher Education Institute, Xinjiang University, Urumqi, 830046, Xinjiang, China
- Xinyu Zhang
- College of Geography and Remote Sensing Sciences, Xinjiang University, Urumqi, 830046, Xinjiang, China
- Xinjiang Key Laboratory of Oasis Ecology, Xinjiang University, Urumqi, 830046, Xinjiang, China
- Key Laboratory of Smart City and Environment Modelling of Higher Education Institute, Xinjiang University, Urumqi, 830046, Xinjiang, China
- Areziguli Rouzi
- College of Geography and Remote Sensing Sciences, Xinjiang University, Urumqi, 830046, Xinjiang, China
- Xinjiang Key Laboratory of Oasis Ecology, Xinjiang University, Urumqi, 830046, Xinjiang, China
- Key Laboratory of Smart City and Environment Modelling of Higher Education Institute, Xinjiang University, Urumqi, 830046, Xinjiang, China
- Xiaowen Ye
- College of Geography and Remote Sensing Sciences, Xinjiang University, Urumqi, 830046, Xinjiang, China
- Xinjiang Key Laboratory of Oasis Ecology, Xinjiang University, Urumqi, 830046, Xinjiang, China
- Key Laboratory of Smart City and Environment Modelling of Higher Education Institute, Xinjiang University, Urumqi, 830046, Xinjiang, China
- Aerqing Xilike
- College of Geography and Remote Sensing Sciences, Xinjiang University, Urumqi, 830046, Xinjiang, China
- Xinjiang Key Laboratory of Oasis Ecology, Xinjiang University, Urumqi, 830046, Xinjiang, China
- Key Laboratory of Smart City and Environment Modelling of Higher Education Institute, Xinjiang University, Urumqi, 830046, Xinjiang, China
4. Ekta, Bhatia V. Auto-BCS: A Hybrid System for Real-Time Breast Cancer Screening from Pathological Images. JOURNAL OF IMAGING INFORMATICS IN MEDICINE 2024; 37:1752-1766. [PMID: 38429562 PMCID: PMC11300416 DOI: 10.1007/s10278-024-01056-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/14/2023] [Revised: 12/24/2023] [Accepted: 01/14/2024] [Indexed: 03/03/2024]
Abstract
Breast cancer is recognized as a prominent cause of cancer-related mortality among women globally, emphasizing the critical need for early diagnosis to improve survival rates. Current breast cancer diagnostic procedures depend on manual assessments of pathological images by medical professionals. However, in remote or underserved regions, the scarcity of expert healthcare resources often compromises diagnostic accuracy. Machine learning holds great promise for early detection, yet existing breast cancer screening algorithms are frequently characterized by significant computational demands, rendering them unsuitable for deployment on low-processing-power mobile devices. In this paper, a real-time automated system, "Auto-BCS", is introduced that significantly enhances the efficiency of early breast cancer screening. The system is structured into three distinct phases. In the initial phase, images undergo a pre-processing stage aimed at noise reduction. Subsequently, feature extraction is carried out using a lightweight and optimized deep learning model, followed by an extreme gradient boosting classifier, strategically employed to optimize the overall performance and prevent overfitting in the deep learning model. The system's performance is gauged through essential metrics, including accuracy, precision, recall, F1 score, and inference time. Comparative evaluations against state-of-the-art algorithms affirm that Auto-BCS outperforms existing models, excelling in both efficiency and processing speed. Computational efficiency is prioritized by Auto-BCS, making it particularly adaptable to low-processing-power mobile devices. Comparative assessments confirm the superior performance of Auto-BCS, signifying its potential to advance breast cancer screening technology.
Affiliation(s)
- Ekta
- Netaji Subhas University of Technology, Delhi, India
5. Zhou XH, Xie XL, Liu SQ, Ni ZL, Zhou YJ, Li RQ, Gui MJ, Fan CC, Feng ZQ, Bian GB, Hou ZG. Learning Skill Characteristics From Manipulations. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2023; 34:9727-9741. [PMID: 35333726 DOI: 10.1109/tnnls.2022.3160159] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 06/14/2023]
Abstract
Percutaneous coronary intervention (PCI) has increasingly become the main treatment for coronary artery disease. The procedure requires a high level of experience and dexterous manipulation. However, there are few techniques to model PCI skill so far. In this study, a learning framework with local and ensemble learning is proposed to learn skill characteristics of different skill-level subjects from their PCI manipulations. Ten interventional cardiologists (four experts and six novices) were recruited to deliver a medical guidewire to two target arteries on a porcine model for in vivo studies. Simultaneously, translation and twist manipulations of thumb, forefinger, and wrist were acquired with electromagnetic (EM) and fiber-optic bend (FOB) sensors, respectively. These behavior data are then processed with wavelet packet decomposition (WPD) under 1-10 levels for feature extraction. The feature vectors are further fed into three candidate individual classifiers in the local learning layer. Furthermore, the local learning results from different manipulation behaviors are fused in the ensemble learning layer with three rule-based ensemble learning algorithms. In subject-dependent skill characteristics learning, the ensemble learning can achieve 100% accuracy, significantly outperforming the best local result (90%). Furthermore, ensemble learning can also maintain 73% accuracy in subject-independent schemes. These promising results demonstrate the great potential of the proposed method to facilitate skill learning in surgical robotics and skill assessment in clinical practice.
6. Du G, Zhang J, Jiang M, Long J, Lin Y, Li S, Tan KC. Graph-Based Class-Imbalance Learning With Label Enhancement. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2023; 34:6081-6095. [PMID: 34928806 DOI: 10.1109/tnnls.2021.3133262] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Indexed: 06/14/2023]
Abstract
Class imbalance is a common issue in the machine learning and data mining community. An imbalanced class distribution can make most classical classification algorithms neglect the significance of the minority class and tend toward the majority class. In this article, we propose a graph-based label enhancement method to solve the class-imbalance problem, which estimates the numerical label and trains the inductive model simultaneously. It gives a new perspective on class-imbalance learning based on the numerical label rather than the original logical label. We also present an iterative optimization algorithm and analyze its computational complexity and convergence. To demonstrate the superiority of the proposed method, several single-label and multilabel datasets are used in the experiments. The experimental results show that the proposed method achieves a promising performance and outperforms some state-of-the-art single-label and multilabel class-imbalance learning methods.
7. Fousková M, Vališ J, Synytsya A, Habartová L, Petrtýl J, Petruželka L, Setnička V. In vivo Raman spectroscopy in the diagnostics of colon cancer. Analyst 2023; 148:2518-2526. [PMID: 37157993 DOI: 10.1039/d3an00103b] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Indexed: 05/10/2023]
Abstract
Early detection and accurate diagnosis of colorectal carcinoma are crucial for successful treatment, yet current methods can be invasive and even inaccurate in some cases. In this work, we present a novel approach for in vivo tissue diagnostics of colorectal carcinoma using Raman spectroscopy. This almost non-invasive technique allows for fast and accurate detection of colorectal carcinoma and its precursors, adenomatous polyps, enabling timely intervention and improved patient outcomes. Using several methods of supervised machine learning, we were able to achieve over 91% accuracy in distinguishing colorectal lesions from healthy epithelial tissue and more than 90% classification accuracy for premalignant adenomatous polyps. Moreover, our models enabled the discrimination of cancerous and precancerous lesions with a mean accuracy of almost 92%. Such results demonstrate the potential of in vivo Raman spectroscopy to become a valuable tool in the fight against colon cancer.
Affiliation(s)
- Markéta Fousková
- Department of Analytical Chemistry, University of Chemistry and Technology, Prague, Technická 5, 166 28, Prague 6, Czech Republic.
- Jan Vališ
- Department of Analytical Chemistry, University of Chemistry and Technology, Prague, Technická 5, 166 28, Prague 6, Czech Republic.
- Alla Synytsya
- Department of Analytical Chemistry, University of Chemistry and Technology, Prague, Technická 5, 166 28, Prague 6, Czech Republic.
- Lucie Habartová
- Department of Analytical Chemistry, University of Chemistry and Technology, Prague, Technická 5, 166 28, Prague 6, Czech Republic.
- Jaromír Petrtýl
- 4th Department of Internal Medicine, General University Hospital in Prague and 1st Faculty of Medicine, Charles University in Prague, U Nemocnice 2, 128 08, Prague 2, Czech Republic
- Luboš Petruželka
- Department of Oncology, General University Hospital in Prague and 1st Faculty of Medicine, Charles University in Prague, U Nemocnice 2, 128 08, Prague 2, Czech Republic
- Vladimír Setnička
- Department of Analytical Chemistry, University of Chemistry and Technology, Prague, Technická 5, 166 28, Prague 6, Czech Republic.
8. Han S, Zhu K, Zhou M, Liu X. Evolutionary Weighted Broad Learning and Its Application to Fault Diagnosis in Self-Organizing Cellular Networks. IEEE TRANSACTIONS ON CYBERNETICS 2023; 53:3035-3047. [PMID: 35113791 DOI: 10.1109/tcyb.2021.3126711] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 06/14/2023]
Abstract
As a novel neural network-based learning framework, a broad learning system (BLS) has attracted much attention due to its excellent performance on regression and balanced classification problems. However, it is found to be unsuitable for imbalanced data classification problems because it treats each class in an imbalanced dataset equally. To address this issue, this work proposes a weighted BLS (WBLS) in which the weight assigned to each class depends on the number of samples in it. In order to further boost its classification performance, an improved differential evolution algorithm is proposed to automatically optimize its parameters, including the ones in BLS and newly generated weights. We first optimize the parameters with a training dataset, and then apply them to WBLS on a test dataset. The experiments on 20 imbalanced classification problems have shown that our proposed method can achieve higher classification accuracy than the other methods in terms of several widely used performance metrics. Finally, it is applied to fault diagnosis in self-organizing cellular networks to further show its applicability to industrial application problems.
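The abstract says only that each class weight in WBLS depends on the number of samples in that class; it does not give the formula. One common choice, shown here purely as an assumed illustration and not as the authors' scheme, is the inverse-frequency ("balanced") heuristic, where rarer classes receive proportionally larger weights.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    This is an assumed heuristic for illustration, not the WBLS weighting itself.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Eight majority samples and two minority samples.
weights = inverse_frequency_weights(["a"] * 8 + ["b"] * 2)
print(weights)  # {'a': 0.625, 'b': 2.5}
```

With these weights, the summed weight of each class is the same (8 × 0.625 = 2 × 2.5 = 5), so a weighted loss no longer favours the majority class by sheer count.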
9. Wang N, Liang R, Zhao X, Gao Y. Cost-Sensitive Hypergraph Learning With F-Measure Optimization. IEEE TRANSACTIONS ON CYBERNETICS 2023; 53:2767-2778. [PMID: 34818205 DOI: 10.1109/tcyb.2021.3126756] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 06/13/2023]
Abstract
Data imbalance is common in many machine-learning applications, where samples from one or more classes are rare. To address this issue, many imbalanced machine-learning methods have been proposed. Most of these methods rely on cost-sensitive learning. However, we note that it is infeasible to determine the precise cost values for those cost-sensitive machine-learning methods, even with great domain knowledge. In this work, given the superiority of the F-measure for evaluating the performance of imbalanced data classification, we employ the F-measure to calculate the cost information and propose a cost-sensitive hypergraph learning method with F-measure optimization to solve the imbalance issue. In this method, we employ the hypergraph structure to explore the high-order relationships among the imbalanced data. Based on the constructed hypergraph structure, we optimize the cost value with the F-measure and further conduct cost-sensitive hypergraph learning with the optimized cost information. Comprehensive experiments validate the effectiveness of the proposed method.
10. Luo J, Qiao H, Zhang B. A Minimax Probability Machine for Nondecomposable Performance Measures. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2023; 34:2353-2365. [PMID: 34473631 DOI: 10.1109/tnnls.2021.3106484] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 05/04/2023]
Abstract
Imbalanced classification tasks are widespread in many real-world applications. For such classification tasks, in comparison with the accuracy rate (AR), it is usually much more appropriate to use nondecomposable performance measures such as the area under the receiver operating characteristic curve (AUC) and the Fβ measure as the classification criterion since the label class is imbalanced. On the other hand, the minimax probability machine is a popular method for binary classification problems and aims at learning a linear classifier by maximizing the AR, which makes it unsuitable to deal with imbalanced classification tasks. The purpose of this article is to develop a new minimax probability machine for the Fβ measure, called the minimax probability machine for the Fβ-measure (MPMF), which can be used to deal with imbalanced classification tasks. A brief discussion is also given on how to extend the MPMF model for several other nondecomposable performance measures listed in the article. To solve the MPMF model effectively, we derive its equivalent form, which can then be solved by an alternating descent method to learn a linear classifier. Further, the kernel trick is employed to derive a nonlinear MPMF model to learn a nonlinear classifier. Several experiments on real-world benchmark datasets demonstrate the effectiveness of our new model.
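For reference, the Fβ measure that this abstract contrasts with the accuracy rate can be computed from confusion-matrix counts as in the short sketch below. This shows only the standard metric, not the MPMF model; the function name and the zero-denominator convention are assumptions. The measure is called nondecomposable because it depends on aggregate counts over the whole dataset and cannot be written as an average of per-sample losses.

```python
def f_beta(tp, fp, fn, beta=1.0):
    """F-beta from true positives, false positives, and false negatives.

    beta > 1 weights recall more heavily, beta < 1 weights precision more heavily.
    Returns 0.0 when there are no positives at all (an assumed convention).
    """
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


# 8 true positives, 2 false positives, 4 false negatives:
# precision = 0.8, recall = 8/12.
print(round(f_beta(8, 2, 4), 4))           # 0.7273
print(round(f_beta(8, 2, 4, beta=2), 4))   # 0.6897
```

True negatives never enter the formula, so a classifier cannot score well on Fβ simply by predicting the majority (negative) class everywhere, which is the failure mode of plain accuracy on imbalanced data.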
11. Bhadra N, Chatterjee SK, Das S. Multiclass classification of environmental chemical stimuli from unbalanced plant electrophysiological data. PLoS One 2023; 18:e0285321. [PMID: 37141215 PMCID: PMC10159166 DOI: 10.1371/journal.pone.0285321] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/10/2023] [Accepted: 04/19/2023] [Indexed: 05/05/2023] [Open Access]
Abstract
Plant electrophysiological responses contain useful signatures of the plant's environment and health, which can be exploited with suitable statistical analysis to develop an inverse model that classifies the stimulus applied to the plant. In this paper, we present a statistical analysis pipeline to tackle a multiclass environmental stimuli classification problem with unbalanced plant electrophysiological data. The objective here is to classify three different environmental chemical stimuli, using fifteen statistical features extracted from the plant electrical signals, and to compare the performance of eight different classification algorithms. A comparison using reduced dimensional projection of the high dimensional features via principal component analysis (PCA) is also presented. Since the experimental data is highly unbalanced due to the varying length of the experiments, we employ a random under-sampling approach for the two majority classes to create an ensemble of confusion matrices to compare the classification performances. Along with this, three other multi-classification performance metrics commonly used for unbalanced data, viz. balanced accuracy, F1-score and Matthews correlation coefficient, are also analyzed. From the stacked confusion matrices and the derived performance metrics, we choose the best feature-classifier setting in terms of the classification performances in the original high dimensional vs. the reduced feature space, for this highly unbalanced multiclass problem of plant signal classification under different chemical stresses. Differences in the classification performances in the high vs. reduced dimensions are also quantified using multivariate analysis of variance (MANOVA) hypothesis testing. Our findings have potential real-world applications in precision agriculture for exploring multiclass classification problems with highly unbalanced datasets, employing a combination of existing machine learning algorithms.
This work also advances existing studies on environmental pollution level monitoring using plant electrophysiological data.
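Two of the ingredients named in this abstract, random under-sampling of the larger classes and balanced accuracy, can be sketched as below. This is a hedged illustration only: the function names are hypothetical, every class is cut down to the size of the smallest one (an assumption about the target size), and the paper's classifiers, stacked confusion matrices, and MANOVA step are not reproduced.

```python
import random
from collections import defaultdict


def undersample_to_smallest(samples, labels, seed=0):
    """Randomly keep as many rows per class as the smallest class has."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)
    target = min(len(rows) for rows in by_class.values())
    out_x, out_y = [], []
    for cls, rows in by_class.items():
        for x in rng.sample(rows, target):  # sampling without replacement
            out_x.append(x)
            out_y.append(cls)
    return out_x, out_y


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    return sum(correct[c] / total[c] for c in total) / len(total)
```

Repeating `undersample_to_smallest` with different `seed` values and evaluating a classifier on each draw gives a collection of confusion matrices, which is one way to read the "ensemble of confusion matrices" the abstract describes.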
Affiliation(s)
- Nivedita Bhadra
- Department of Physical Sciences, Indian Institute of Science Education and Research, Nadia, Kolkata, West Bengal, India
- Shre Kumar Chatterjee
- Department of Electronics and Computer Science, University of Southampton, Southampton, United Kingdom
- Saptarshi Das
- Centre for Environmental Mathematics, Faculty of Environment, Science and Economy, University of Exeter, Exeter, United Kingdom
- Institute for Data Science and Artificial Intelligence, University of Exeter, Exeter, United Kingdom
12. Depto DS, Rizvee MM, Rahman A, Zunair H, Rahman MS, Mahdy MRC. Quantifying imbalanced classification methods for leukemia detection. Comput Biol Med 2023; 152:106372. [PMID: 36516574 DOI: 10.1016/j.compbiomed.2022.106372] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/12/2022] [Revised: 11/01/2022] [Accepted: 11/27/2022] [Indexed: 12/03/2022]
Abstract
Uncontrolled proliferation of B-lymphoblast cells is a common characteristic of Acute Lymphoblastic Leukemia (ALL). B-lymphoblasts are found in large numbers in peripheral blood in malignant cases. Early detection of the cell in bone marrow is essential as the disease progresses rapidly if left untreated. However, automated classification of the cell is challenging, owing to its fine-grained variability with B-lymphoid precursor cells and imbalanced data points. Deep learning algorithms demonstrate potential for such fine-grained classification but also suffer from the imbalanced class problem. In this paper, we explore different deep learning-based State-Of-The-Art (SOTA) approaches to tackle imbalanced classification problems. Our experiments include input-based, GAN-based (Generative Adversarial Networks), and loss-based methods to mitigate class imbalance on the challenging C-NMC and ALLIDB-2 datasets for leukemia detection. We provide empirical evidence that loss-based methods outperform GAN-based and input-based methods in imbalanced classification scenarios.
Affiliation(s)
- Deponker Sarker Depto
- Department of Electrical and Computer Engineering, North South University, Bashundhara, Dhaka, 1229, Bangladesh.
- Md Mashfiq Rizvee
- Department of Electrical and Computer Engineering, North South University, Bashundhara, Dhaka, 1229, Bangladesh; Texas Tech University, Lubbock, TX, United States of America.
- Aimon Rahman
- Department of Electrical and Computer Engineering, North South University, Bashundhara, Dhaka, 1229, Bangladesh.
- M Sohel Rahman
- Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, ECE Building, West Palasi, Dhaka 1205, Bangladesh.
- M R C Mahdy
- Department of Electrical and Computer Engineering, North South University, Bashundhara, Dhaka, 1229, Bangladesh.
13. Deepak S, Ameer P. Brain tumor categorization from imbalanced MRI dataset using weighted loss and deep feature fusion. Neurocomputing 2022. [DOI: 10.1016/j.neucom.2022.11.039] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2022]
|
14
|
Chen Z, Duan J, Kang L, Qiu G. Class-Imbalanced Deep Learning via a Class-Balanced Ensemble. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2022; 33:5626-5640. [PMID: 33900923 DOI: 10.1109/tnnls.2021.3071122] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Indexed: 06/12/2023]
Abstract
Class imbalance is a prevalent phenomenon in various real-world applications, and it presents significant challenges to model learning, including deep learning. In this work, we embed ensemble learning into deep convolutional neural networks (CNNs) to tackle the class-imbalanced learning problem. An ensemble of auxiliary classifiers branching out from various hidden layers of a CNN is trained together with the CNN in an end-to-end manner. To that end, we designed a new loss function that can rectify the bias toward the majority classes by forcing the CNN's hidden layers and their associated auxiliary classifiers to focus on the samples that have been misclassified by previous layers, thus enabling subsequent layers to develop diverse behavior and fix the errors of previous layers in a batch-wise manner. A unique feature of the new method is that the ensemble of auxiliary classifiers can either work together with the main CNN to form a more powerful combined classifier, or be removed once the CNN has been trained, in which case it serves only to assist the CNN's class-imbalance learning and thereby enhance the network's capability in dealing with class-imbalanced data. Comprehensive experiments are conducted on four benchmark data sets of increasing complexity (CIFAR-10, CIFAR-100, iNaturalist, and CelebA), and the results demonstrate significant performance improvements over the state-of-the-art deep imbalance learning methods.
|
15
|
Zhang L, Wang K, Xu L, Sheng W, Kang Q. Evolving ensembles using multi-objective genetic programming for imbalanced classification. Knowl Based Syst 2022. [DOI: 10.1016/j.knosys.2022.109611] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/31/2022]
|
16
|
Xu Y, Yu Z, Chen CLP. Classifier Ensemble Based on Multiview Optimization for High-Dimensional Imbalanced Data Classification. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2022; PP:870-883. [PMID: 35657843 DOI: 10.1109/tnnls.2022.3177695] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/15/2023]
Abstract
High-dimensional class-imbalanced data seriously degrade the performance of classification algorithms. Because of the large number of redundant or invalid features and the class imbalance issue, it is difficult to construct an optimal classifier for high-dimensional imbalanced data. Classifier ensembles have attracted intensive attention since they can achieve better performance than an individual classifier. In this work, we propose a multiview optimization (MVO) method to learn more effective and robust features from high-dimensional imbalanced data, based on which an accurate and robust ensemble system is designed. Specifically, an optimized subview generation (OSG) step in MVO is first proposed to generate multiple optimized subviews from different scenarios, which can strengthen the classification ability of features and increase the diversity of ensemble members simultaneously. Second, a new evaluation criterion that considers the distribution of data in each optimized subview is developed, based on which a selective ensemble of optimized subviews (SEOS) is designed to perform the subview selective ensemble. Finally, an oversampling approach is executed on the optimized view to obtain a new class-rebalanced subset for the classifier. Experimental results on 25 high-dimensional class-imbalanced datasets indicate that the proposed method outperforms other mainstream classifier ensemble methods.
|
17
|
Wang Y, Zhu X, Yang L, Hu X, He K, Yu C, Jiao S, Chen J, Guo R, Yang S. IDDLncLoc: Subcellular Localization of LncRNAs Based on a Framework for Imbalanced Data Distributions. Interdiscip Sci 2022; 14:409-420. [PMID: 35192174 DOI: 10.1007/s12539-021-00497-6] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 05/21/2021] [Revised: 12/16/2021] [Accepted: 12/20/2021] [Indexed: 06/14/2023]
Abstract
Long non-coding RNAs (lncRNAs) play a crucial role in many life processes of the cell, such as genetic markers, RNA splicing, signaling, and protein regulation. Considering that identifying a lncRNA's localization in the cell through experimental methods is complicated, hard to reproduce, and expensive, we propose a novel method named IDDLncLoc in this paper, which adopts an ensemble model to solve the problem of subcellular localization. In the proposed model, dinucleotide-based auto-cross covariance features, k-mer nucleotide composition features, and composition, transition, and distribution features are introduced to encode a raw RNA sequence as a vector. To screen out reliable features, feature selection through binomial distribution and recursive feature elimination is employed. Furthermore, oversampling in mini-batch, random sampling, and stacking ensemble strategies are customized to overcome the problem of data imbalance on the benchmark dataset. Finally, compared with the latest methods, IDDLncLoc achieves an accuracy of 94.96% on the benchmark dataset, which is 2.59% higher than the best method, and the results further demonstrate that IDDLncLoc performs well on the subcellular localization of lncRNA. In addition, a user-friendly web server is available at http://lncloc.club.
Affiliation(s)
- Yan Wang
  - Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun, China
  - School of Artificial Intelligence, Jilin University, Changchun, China
- Xiaopeng Zhu
  - Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun, China
- Lili Yang
  - Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun, China
  - Department of Obstetrics, The First Hospital of Jilin University, Changchun, China
- Xuemei Hu
  - Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun, China
- Kai He
  - Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun, China
- Cuinan Yu
  - Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun, China
- Shaoqing Jiao
  - Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun, China
- Jiali Chen
  - Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun, China
- Rui Guo
  - Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun, China
- Sen Yang
  - Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun, China.
|
18
|
Research on Brand Image Evaluation Method Based on Consumer Sentiment Analysis. COMPUTATIONAL INTELLIGENCE AND NEUROSCIENCE 2022; 2022:2647515. [PMID: 35669638 PMCID: PMC9167012 DOI: 10.1155/2022/2647515] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/14/2022] [Revised: 04/28/2022] [Accepted: 05/10/2022] [Indexed: 11/17/2022]
Abstract
Brand image assessment is a key step to reasonably quantify the value of a brand and has far-reaching significance for improving the competitiveness of an enterprise. With the rapid development of Internet technology, traditional questionnaires can no longer meet the current needs of brand image assessment. In this environment, the huge amount of fragmented consumer topic data provides a rich data resource and new research ideas for brand image assessment. Therefore, a brand image assessment method based on consumer sentiment analysis is proposed. First, a topic-based brand image cognitive label extraction method is proposed by setting language rules, aggregation rules, and ranking rules according to the characteristics of online topic data. Then, the cognitive labels are fused with deep features extracted from word vectors. Finally, a supervised support vector machine is selected as the sentiment classification model. The experimental results show that, based on the obtained important cognitive labels, enterprises are able to better understand the unique attributes that consumers associate with the brand; the feature fusion approach performs better in evaluation and can accurately reflect consumers' views on brand image, quantified as a brand score.
|
19
|
Su Y, Shen Y. A Deep Learning-Based Sentiment Classification Model for Real Online Consumption. Front Psychol 2022; 13:886982. [PMID: 35496187 PMCID: PMC9047760 DOI: 10.3389/fpsyg.2022.886982] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/02/2022] [Accepted: 03/21/2022] [Indexed: 11/18/2022] Open
Abstract
Most e-commerce platforms allow consumers to post product reviews, causing more and more consumers to get into the habit of reading reviews before they buy. These online reviews serve as emotional feedback on consumers’ product experience and contain a lot of important information, but inevitably there are malicious or irrelevant reviews. It is especially important to discover and identify the real sentiment tendency in online reviews in a timely manner. Therefore, a deep learning-based real online consumer sentiment classification model is proposed. First, the mapping relationship between online reviews of goods and sentiment features is established based on expert knowledge and fuzzy mathematics, thus mapping the high-dimensional original text data into a continuous low-dimensional space. Second, after local contextual features are obtained using convolutional operations, the long-term dependencies between features are captured by a bidirectional long short-term memory network. Then, the degree of contribution of different words to the text is considered by introducing an attention mechanism, and a regularization term constraint is introduced in the objective function. The experimental results show that the proposed convolutional attention–long and short-term memory network (CA–LSTM) model achieves a test accuracy of 83.3%, higher than that of other models, indicating that the model has better classification performance.
Affiliation(s)
- Yang Su
  - School of Art, Anhui Polytechnic University, Wuhu, China
- Yan Shen
  - Ideological, Political and Basic Teaching Department, Communication University of China, Nanjing, China
|
20
|
Yang M, Wang Z, Li Y, Zhou Y, Li D, Du W. Gravitation balanced multiple kernel learning for imbalanced classification. Neural Comput Appl 2022. [DOI: 10.1007/s00521-022-07187-4] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/29/2022]
|
21
|
Liu L, Wu X, Li S, Li Y, Tan S, Bai Y. Solving the class imbalance problem using ensemble algorithm: application of screening for aortic dissection. BMC Med Inform Decis Mak 2022; 22:82. [PMID: 35346181 PMCID: PMC8962101 DOI: 10.1186/s12911-022-01821-w] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 10/04/2021] [Accepted: 03/21/2022] [Indexed: 11/25/2022] Open
Abstract
Background Imbalance between positive and negative outcomes, a so-called class imbalance, is a problem generally found in medical data. Despite various studies, class imbalance has always been a difficult issue. The main objective of this study was to find an effective integrated approach to address the problems posed by class imbalance and to validate the method in an early screening model for a rare cardiovascular disease, aortic dissection (AD). Methods Different data-level methods, cost-sensitive learning, and the bagging method were combined to solve the problem of low sensitivity caused by the imbalance of two classes of data. First, feature selection was applied to select the most relevant features using statistical analysis, including significance tests and logistic regression. Then, we assigned two different misclassification cost values to the two classes, constructed weak classifiers based on the support vector machine (SVM) model, and integrated the weak classifiers with undersampling and bagging methods to build the final strong classifier. Due to the rarity of AD, the data imbalance was particularly prominent. Therefore, we applied our method to the construction of an early screening model for AD. Clinical data of 523,213 patients from the Institute of Hypertension, Xiangya Hospital, Central South University were used to verify the validity of this method. In these data, the sample ratio of AD patients to non-AD patients was 1:65, and each sample contained 71 features. Results The proposed ensemble model achieved the highest sensitivity of 82.8%, with training time and specificity reaching 56.4 s and 71.9% respectively. Additionally, it obtained a small variance of sensitivity of 19.58 × 10⁻³ in the seven-fold cross-validation experiment. The results outperformed the common ensemble algorithms of AdaBoost, EasyEnsemble, and Random Forest (RF) as well as the single machine learning (ML) methods of logistic regression, decision tree, k nearest neighbors (KNN), back propagation neural network (BP) and SVM. Among the five single ML algorithms, the SVM model with cost-sensitive learning performed best, with a sensitivity of 79.5% and a specificity of 73.4%. Conclusions In this study, we demonstrate that the integration of feature selection, undersampling, cost-sensitive learning and bagging methods can overcome the challenge of class imbalance in a medical dataset and develop a practical screening model for AD, which could provide decision support for screening for AD at an early stage.
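To see why sensitivity rather than accuracy is the metric of interest at a 1:65 class ratio, the arithmetic can be worked through with hypothetical counts. This is only an illustrative sketch: the cohort size below (1,000 AD and 65,000 non-AD patients) is an assumption chosen to match the reported ratio, and the second confusion matrix is back-computed from the reported 82.8% sensitivity and 71.9% specificity rather than taken from the study's data. The function name `rates` is likewise hypothetical.

```python
def rates(tp, fn, tn, fp):
    """Return (accuracy, sensitivity, specificity) from confusion-matrix counts."""
    total = tp + fn + tn + fp
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)  # recall on the positive (minority) class
    specificity = tn / (tn + fp)  # recall on the negative (majority) class
    return accuracy, sensitivity, specificity


# Hypothetical cohort with a 1:65 positive-to-negative ratio (assumed sizes).
positives, negatives = 1000, 65000

# A trivial model that labels every patient as negative:
# high accuracy (65000 / 66000), but it finds no AD patient at all.
trivial = rates(tp=0, fn=positives, tn=negatives, fp=0)

# Counts back-computed from the reported 82.8% sensitivity and 71.9% specificity.
ensemble = rates(tp=828, fn=172, tn=46735, fp=18265)

for name, (acc, sens, spec) in [("all-negative", trivial), ("ensemble", ensemble)]:
    print(f"{name}: accuracy={acc:.3f} sensitivity={sens:.3f} specificity={spec:.3f}")
```

Under these assumed counts, the all-negative model has the higher accuracy while detecting no positive cases, which is why a screening model for a rare disease is better judged on sensitivity and specificity.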
|
22
|
Tan Z, Chen J, Kang Q, Zhou M, Abusorrah A, Sedraoui K. Dynamic Embedding Projection-Gated Convolutional Neural Networks for Text Classification. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2022; 33:973-982. [PMID: 33417564 DOI: 10.1109/tnnls.2020.3036192] [Citation(s) in RCA: 14] [Impact Index Per Article: 4.7] [Indexed: 06/12/2023]
Abstract
Text classification is a fundamental and important area of natural language processing that assigns a text to at least one predefined tag or category according to its content. Most advanced systems are either too simple to achieve high accuracy or centered on complex structures that capture the genuinely required category information but take a long time to converge during training. To address these issues, we propose a dynamic embedding projection-gated convolutional neural network (DEP-CNN) for multi-class and multi-label text classification. Its dynamic embedding projection gate (DEPG) transforms and carries word information by using gating units and shortcut connections to control how much context information is incorporated into each specific position of a word-embedding matrix in a text. To our knowledge, we are the first to apply DEPG over a word-embedding matrix. The experimental results on four known benchmark datasets show that DEP-CNN outperforms its recent peers.
|
23
|
Huang Z, Yang S, Zhou M, Li Z, Gong Z, Chen Y. Feature Map Distillation of Thin Nets for Low-Resolution Object Recognition. IEEE TRANSACTIONS ON IMAGE PROCESSING : A PUBLICATION OF THE IEEE SIGNAL PROCESSING SOCIETY 2022; 31:1364-1379. [PMID: 35025743 DOI: 10.1109/tip.2022.3141255] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Indexed: 06/14/2023]
Abstract
Intelligent video surveillance is an important computer vision application in natural environments. Since detected objects under surveillance are usually low-resolution and noisy, their accurate recognition represents a huge challenge. Knowledge distillation is an effective method to deal with it, but existing related work usually focuses on reducing the channel count of a student network, not feature map size. As a result, they cannot transfer "privilege information" hidden in feature maps of a wide and deep teacher network into a thin and shallow student one, leading to the latter's poor performance. To address this issue, we propose a Feature Map Distillation (FMD) framework under which the feature map size of teacher and student networks is different. FMD consists of two main components: Feature Decoder Distillation (FDD) and Feature Map Consistency-enforcement (FMC). FDD reconstructs the shallow texture features of a thin student network to approximate the corresponding samples in a teacher network, which allows the high-resolution ones to directly guide the learning of the shallow features of the student network. FMC makes the size and direction of each deep feature map consistent between student and teacher networks, which constrains each pair of feature maps to produce the same feature distribution. FDD and FMC allow a thin student network to learn rich "privilege information" in feature maps of a wide teacher network. The overall performance of FMD is verified in multiple recognition tasks by comparing it with state-of-the-art knowledge distillation methods on low-resolution and noisy objects.
|
24
|
Wang KF, An J, Wei Z, Cui C, Ma XH, Ma C, Bao HQ. Deep Learning-Based Imbalanced Classification With Fuzzy Support Vector Machine. Front Bioeng Biotechnol 2022; 9:802712. [PMID: 35127672 PMCID: PMC8815771 DOI: 10.3389/fbioe.2021.802712] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 10/27/2021] [Accepted: 12/20/2021] [Indexed: 12/23/2022] Open
Abstract
Imbalanced classification is widespread in the fields of medical diagnosis, biomedicine, smart cities and the Internet of Things. An imbalanced data distribution biases traditional classification methods towards the majority classes and leads them to ignore the importance of the minority class, which makes these methods ineffective in imbalanced classification. In this paper, a novel imbalanced classification method based on deep learning and a fuzzy support vector machine is proposed and named DFSVM. DFSVM first uses a deep neural network to obtain an embedding representation of the data. This deep neural network is trained using triplet loss to enhance similarities within classes and differences between classes. To alleviate the effects of imbalanced data distribution, oversampling is performed in the embedding space of the data. In this paper, we use an oversampling method based on feature and center distance, which can obtain more diverse new samples and prevent overfitting. To enhance the impact of the minority class, we use a fuzzy support vector machine (FSVM) based on cost-sensitive learning as the final classifier. FSVM assigns a higher misclassification cost to minority class samples to improve the classification quality. Experiments were performed on multiple biological datasets and real-world datasets. The experimental results show that DFSVM achieves promising classification performance.
Affiliation(s)
- Ke-Fan Wang
  - School of Electrical and Electronic Engineering, Shanghai Institute of Technology, Shanghai, China
- Jing An
  - School of Electrical and Electronic Engineering, Shanghai Institute of Technology, Shanghai, China
- Zhen Wei
  - School of Design, East China Normal University, Shanghai, China
  - *Correspondence: Zhen Wei,
- Can Cui
  - College of Electronic and Information Engineering, Tongji University, Shanghai, China
- Xiang-Hua Ma
  - School of Electrical and Electronic Engineering, Shanghai Institute of Technology, Shanghai, China
- Chao Ma
  - School of Electrical and Electronic Engineering, Shanghai Institute of Technology, Shanghai, China
- Han-Qiu Bao
  - College of Electronic and Information Engineering, Tongji University, Shanghai, China
|
25
|
RDPVR: Random Data Partitioning with Voting Rule for Machine Learning from Class-Imbalanced Datasets. ELECTRONICS 2022. [DOI: 10.3390/electronics11020228] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Indexed: 12/22/2022]
Abstract
Since most classifiers are biased toward the dominant class, class imbalance is a challenging problem in machine learning. The most popular approaches to solving this problem include oversampling minority examples and undersampling majority examples. Oversampling may increase the probability of overfitting, whereas undersampling eliminates examples that may be crucial to the learning process. We present a linear time resampling method based on random data partitioning and a majority voting rule to address both concerns, where an imbalanced dataset is partitioned into a number of small subdatasets, each of which must be class balanced. After that, a specific classifier is trained for each subdataset, and the final classification result is established by applying the majority voting rule to the results of all of the trained models. We compared the performance of the proposed method to some of the most well-known oversampling and undersampling methods, employing a range of classifiers, on 33 benchmark machine learning class-imbalanced datasets. The classification results produced by the classifiers employed on the generated data by the proposed method were comparable to most of the resampling methods tested, with the exception of SMOTEFUNA, which is an oversampling method that increases the probability of overfitting. The proposed method produced results that were comparable to the Easy Ensemble (EE) undersampling method. As a result, for solving the challenge of machine learning from class-imbalanced datasets, we advocate using either EE or our method.
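The partition-and-vote procedure described in this abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names (`balanced_partitions`, `fit_centroids`, `predict_centroid`, `vote`), the nearest-centroid base classifier, and the choice to pair each disjoint chunk of majority examples with all minority examples are hypothetical choices made for the example, since the abstract does not specify them.

```python
import random
from collections import Counter


def balanced_partitions(X, y, seed=0):
    """Split a binary imbalanced dataset into class-balanced subdatasets.

    Assumption: majority examples are shuffled and cut into disjoint chunks of
    minority-class size; each chunk is paired with all minority examples.
    Leftover majority examples that do not fill a chunk are dropped.
    """
    rng = random.Random(seed)
    counts = Counter(y)
    minority = min(counts, key=counts.get)
    minor_idx = [i for i, label in enumerate(y) if label == minority]
    major_idx = [i for i, label in enumerate(y) if label != minority]
    rng.shuffle(major_idx)
    k = len(minor_idx)
    parts = []
    for start in range(0, len(major_idx) - k + 1, k):
        idx = minor_idx + major_idx[start:start + k]
        parts.append(([X[i] for i in idx], [y[i] for i in idx]))
    return parts


def fit_centroids(X, y):
    """Stand-in base classifier: store one mean vector per class."""
    counts = Counter(y)
    sums = {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict_centroid(model, row):
    """Predict the class whose centroid is closest in squared Euclidean distance."""
    def dist2(centroid):
        return sum((a - b) ** 2 for a, b in zip(row, centroid))
    return min(model, key=lambda label: dist2(model[label]))


def vote(models, row):
    """Majority vote over all per-partition models (ties go to the first label seen)."""
    votes = Counter(predict_centroid(m, row) for m in models)
    return votes.most_common(1)[0][0]


# Toy data: 12 majority points near the origin, 3 minority points near (5, 5).
X = [[0.1 * (i % 3), 0.1 * (i % 2)] for i in range(12)]
X += [[5.0, 5.0], [5.2, 4.9], [4.8, 5.1]]
y = [0] * 12 + [1] * 3

parts = balanced_partitions(X, y)
models = [fit_centroids(px, py) for px, py in parts]
print(len(parts), vote(models, [4.9, 5.0]), vote(models, [0.1, 0.0]))
```

Each example is touched a constant number of times during partitioning, which is consistent with the linear-time resampling the abstract mentions; the cost of training depends on the base classifier that is plugged in.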
|
26
|
Rezvani S, Wang X. Class imbalance learning using fuzzy ART and intuitionistic fuzzy twin support vector machines. Inf Sci (N Y) 2021. [DOI: 10.1016/j.ins.2021.07.010] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Indexed: 11/25/2022]
|
27
|
Zhang B, Shang P. Cumulative Permuted Fractional Entropy and its Applications. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2021; 32:4946-4955. [PMID: 33021947 DOI: 10.1109/tnnls.2020.3026424] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 06/11/2023]
Abstract
Fractional calculus and entropy are two essential mathematical tools, and their conceptions support a productive interplay in the study of system dynamics and machine learning. In this article, we modify the fractional entropy and propose the cumulative permuted fractional entropy (CPFE). A theoretical analysis is provided to prove that CPFE not only meets the basic properties of the Shannon entropy but also has unique characteristics of its own. We apply it to typical discrete distributions, simulated data, and real-world data to prove its efficiency in the application. This article demonstrates that CPFE can measure the complexity and uncertainty of complex systems so that it can perform reliable and accurate classification. Finally, we introduce CPFE to support vector machines (SVMs) and get CPFE-SVM. The CPFE can be used to process data to make the irregular data linearly separable. Compared with the other five state-of-the-art algorithms, CPFE-SVM has significantly higher accuracy and less computational burden. Therefore, the CPFE-SVM is especially suitable for the classification of irregular large-scale data sets. Also, it is insensitive to noise. Implications of the results and future research directions are also presented.
|
28
|
Melek M, Melek N. Roza: a new and comprehensive metric for evaluating classification systems. Comput Methods Biomech Biomed Engin 2021; 25:1015-1027. [PMID: 34693834 DOI: 10.1080/10255842.2021.1995721] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 10/20/2022]
Abstract
Many metrics, such as accuracy rate (ACC), area under curve (AUC), Jaccard index (JI), and Cohen's kappa coefficient, are available to measure the success of pattern recognition and machine/deep learning systems. However, the superiority of one system over another cannot be determined based on these metrics alone, because a system can be successful under one metric but not under the others. Moreover, such metrics are insufficient when the number of samples in the classes is unequal (imbalanced data). In this case, a sensible comparison between two given systems cannot be made using these metrics. In the present study, the comprehensive, fair, and accurate Roza metric is introduced for evaluating classification systems. (Roza means rose in Persian. When different permutations of the metrics used are superimposed in a polygon format, the result looks like a flower, so we named it Roza.) This metric, which facilitates the comparison of systems, summarizes many metrics in a single value. To verify the stability and validity of the metric and to conduct a comprehensive, fair, and accurate comparison between systems, the Roza metric of systems tested under the same conditions is calculated and comparisons are made. For this, systems tested with three different strategies on three different datasets are considered. The results show that the performance of a system can be summarized by a single value and that the Roza metric can be used as a powerful metric in all systems that include classification processes.
Affiliation(s)
- Mesut Melek
  - Department of Electronics and Automation, Gumushane University, Gumushane, Turkey
- Negin Melek
  - Faculty of Engineering, Department of Electrical and Electronics Engineering, Avrasya University, Trabzon, Turkey
|
29
|
Yao L, Lin TB. Evolutionary Mahalanobis Distance-Based Oversampling for Multi-Class Imbalanced Data Classification. SENSORS (BASEL, SWITZERLAND) 2021; 21:6616. [PMID: 34640936 PMCID: PMC8512012 DOI: 10.3390/s21196616] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 07/23/2021] [Revised: 09/14/2021] [Accepted: 09/29/2021] [Indexed: 11/18/2022]
Abstract
Sensing data are often imbalanced across data classes, for which oversampling of the minority class is an effective remedy. In this paper, an effective oversampling method called evolutionary Mahalanobis distance oversampling (EMDO) is proposed for multi-class imbalanced data classification. EMDO utilizes a set of ellipsoids to approximate the decision regions of the minority class. Furthermore, multi-objective particle swarm optimization (MOPSO) is integrated with the Gustafson-Kessel algorithm in EMDO to learn the size, center, and orientation of every ellipsoid. Synthetic minority samples are generated based on Mahalanobis distance within every ellipsoid. The number of synthetic minority samples generated by EMDO in every ellipsoid is determined by the density of minority samples in that ellipsoid. The results of computer simulations conducted herein indicate that EMDO outperforms most of the widely used oversampling schemes.
Affiliation(s)
- Leehter Yao
  - Department of Electrical Engineering, National Taipei University of Technology, Taipei 10618, Taiwan
|
30
|
Abstract
Most photovoltaic (PV) plants conduct operation and maintenance (O&M) by periodical inspection and cleaning. Such O&M is costly and inefficient. It fails to detect system faults in time, thus causing heavy loss. To ensure their operations are at an ideal state, this work proposes an unsupervised method for intelligent performance evaluation and data-driven fault detection, which enables engineers to check PV panels in time and implement timely maintenance. It classifies monitoring data into three subsets: ideal period A, transition period S, and downturn period B. Based on A and B datasets, we build two non-continuous regression prediction models, which are based on a tree ensemble algorithm and then modified to fit the non-continuous characteristic of PV data. We compare real-time measured power with both upper and lower reference baselines derived from two predictive models. By calculating their threshold ranges, the proposed method achieves the instantaneous performance monitoring of PV power generation and provides failure identification and O&M suggestions to engineers. It has been assessed on a 6.95 MW PV plant. Its evaluation results indicate that it is able to accurately determine different functioning states and detect both direct and indirect faults in a PV system, thereby achieving intelligent data-driven maintenance.
|
31
|
|
32
|
Improved CBSO: A distributed fuzzy-based adaptive synthetic oversampling algorithm for imbalanced judicial data. Inf Sci (N Y) 2021. [DOI: 10.1016/j.ins.2021.04.017] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Indexed: 11/24/2022]
|
33
|
Wang X, Kang Q, Zhou M, Pan L, Abusorrah A. Multiscale Drift Detection Test to Enable Fast Learning in Nonstationary Environments. IEEE TRANSACTIONS ON CYBERNETICS 2021; 51:3483-3495. [PMID: 32544055 DOI: 10.1109/tcyb.2020.2989213] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Indexed: 06/11/2023]
Abstract
A model can be easily influenced by unseen factors in nonstationary environments and fail to fit a dynamic data distribution. In a classification scenario, this is known as concept drift. For instance, the shopping preferences of customers may change after they move from one city to another. Therefore, a shopping website or application should alter its recommendations when its predictions of such user patterns become poorer. In this article, we propose a novel approach called the multiscale drift detection test (MDDT) that efficiently localizes abrupt drift points when feature values fluctuate, meaning that the current model needs immediate adaptation. MDDT is based on a resampling scheme and a paired Student's t-test. It applies a detection procedure on two different scales. Initially, the detection is performed on a broad scale to check whether recently gathered drift indicators remain stationary. If a drift is claimed, a narrow-scale detection is performed to trace the refined change time. This multiscale structure reduces the considerable time spent on constant checking and filters noise in drift indicators. Experiments are performed to compare the proposed method with several algorithms on synthetic and real-world datasets. The results indicate that it outperforms others when abrupt shift datasets are handled, and achieves the highest recall score in localizing drift points.
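The broad-then-narrow detection idea in this abstract can be illustrated with a simplified two-scale test. This is a rough sketch under stated assumptions and does not reproduce MDDT: the resampling scheme is omitted, a fixed cutoff on the paired t statistic stands in for a proper significance test, and the names `paired_t`, `detect_drift`, and `make_series` as well as the window sizes and threshold are hypothetical.

```python
import math


def paired_t(a, b):
    """Paired t statistic for two equal-length samples a and b."""
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    mean = sum(d) / n
    var = sum((v - mean) ** 2 for v in d) / (n - 1)
    if var == 0.0:
        # Degenerate case: all differences identical.
        return 0.0 if mean == 0.0 else math.copysign(math.inf, mean)
    return mean / math.sqrt(var / n)


def detect_drift(indicators, broad=20, narrow=5, threshold=3.0):
    """Return an estimated drift index, or None if the broad test stays quiet.

    Broad scale: compare the latest `broad` indicators with the `broad` before.
    Narrow scale (only if the broad test fires): slide a small paired test over
    the same region and report the split point with the largest |t|.
    """
    if len(indicators) < 2 * broad:
        return None
    old = indicators[-2 * broad:-broad]
    new = indicators[-broad:]
    if abs(paired_t(new, old)) < threshold:
        return None
    start = len(indicators) - 2 * broad
    best_idx, best_t = None, 0.0
    for i in range(start + narrow, len(indicators) - narrow + 1):
        t = abs(paired_t(indicators[i:i + narrow], indicators[i - narrow:i]))
        if t > best_t:
            best_idx, best_t = i, t
    return best_idx


def make_series(n, drift_at=None):
    """Synthetic drift indicator (for example an error rate) with small periodic noise."""
    return [
        (0.5 if drift_at is not None and i >= drift_at else 0.1) + 0.01 * (i % 3 - 1)
        for i in range(n)
    ]


print(detect_drift(make_series(60)))               # stationary series
print(detect_drift(make_series(60, drift_at=50)))  # abrupt shift at index 50
```

The design point being illustrated is that the cheap broad-scale check runs on every update, while the more expensive narrow-scale scan runs only after the broad check has flagged a change.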
|
34
|
Yielding Multi-Fold Training Strategy for Image Classification of Imbalanced Weeds. APPLIED SCIENCES-BASEL 2021. [DOI: 10.3390/app11083331] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Indexed: 11/16/2022]
Abstract
An imbalanced dataset is a significant challenge when training a deep neural network (DNN) model for deep learning problems, such as weeds classification. An imbalanced dataset may result in a model that behaves robustly on major classes and is overly sensitive to minor classes. This article proposes a yielding multi-fold training (YMufT) strategy to train a DNN model on an imbalanced dataset. This strategy reduces the bias in training through a min-class-max-bound procedure (MCMB), which divides samples in the training set into multiple folds. The model is consecutively trained on each one of these folds. In practice, we experiment with our proposed strategy on two small (PlantSeedlings, small PlantVillage) and two large (Chonnam National University (CNU), large PlantVillage) weeds datasets. With the same training configurations and approximate training steps used in conventional training methods, YMufT helps the DNN model to converge faster, thus requiring less training time. Despite a slight decrease in accuracy on the large dataset, YMufT increases the F1 score in the NASNet model to 0.9708 on the CNU dataset and 0.9928 when using the Mobilenet model training on the large PlantVillage dataset. YMufT shows outstanding performance in both accuracy and F1 score on small datasets, with values of (0.9981, 0.9970) using the Mobilenet model for training on small PlantVillage dataset and (0.9718, 0.9689) using Resnet to train on the PlantSeedlings dataset. Grad-CAM visualization shows that conventional training methods mainly concentrate on high-level features and may capture insignificant features. In contrast, YMufT guides the model to capture essential features on the leaf surface and properly localize the weeds targets.
|
35
|
Chen Z, Duan J, Kang L, Qiu G. A hybrid data-level ensemble to enable learning from highly imbalanced dataset. Inf Sci (N Y) 2021. [DOI: 10.1016/j.ins.2020.12.023] [Citation(s) in RCA: 17] [Impact Index Per Article: 4.3] [Indexed: 11/15/2022]
|
36
|
An automatic sampling ratio detection method based on genetic algorithm for imbalanced data classification. Knowl Based Syst 2021. [DOI: 10.1016/j.knosys.2021.106800] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Indexed: 11/18/2022]
|
37
|
Abstract
Data imbalance is a thorny issue in machine learning. SMOTE is a famous oversampling method of imbalanced learning. However, it has some disadvantages such as sample overlapping, noise interference, and blindness of neighbor selection. In order to address these problems, we present a new oversampling method, OS-CCD, based on a new concept, the classification contribution degree. The classification contribution degree determines the number of synthetic samples generated by SMOTE for each positive sample. OS-CCD follows the spatial distribution characteristics of original samples on the class boundary, as well as avoids oversampling from noisy points. Experiments on twelve benchmark datasets demonstrate that OS-CCD outperforms six classical oversampling methods in terms of accuracy, F1-score, AUC, and ROC.
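The abstract above says that a per-sample quantity decides how many SMOTE samples each positive sample generates. A minimal sketch of that idea follows, assuming the per-sample counts are already given: the `counts` argument is a hypothetical stand-in for the classification contribution degree, whose computation the abstract does not describe. The function name `smote_with_counts` and the parameters `k` and `seed` are also illustrative assumptions, not part of OS-CCD.

```python
import math
import random


def smote_with_counts(minority, counts, k=3, seed=0):
    """Create counts[i] synthetic points from minority sample i.

    Each synthetic point is a random interpolation between sample i and
    one of its k nearest minority neighbors (plain SMOTE interpolation).
    `counts` is a hypothetical input; OS-CCD derives it from its
    classification contribution degree, which is not reproduced here.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    if len(counts) != len(minority):
        raise ValueError("counts must have one entry per minority sample")

    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for i, (x, count) in enumerate(zip(minority, counts)):
        # Rank the other minority samples by Euclidean distance to x.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(x, minority[j]))
        neighbors = others[:k]
        for _ in range(count):
            nb = minority[rng.choice(neighbors)]
            gap = rng.random()  # position along the segment from x to nb
            synthetic.append([a + gap * (b - a) for a, b in zip(x, nb)])
    return synthetic
```

Giving a sample a count of zero, as one might for a suspected noisy point, means no synthetic samples are generated from it, which mirrors the abstract's stated goal of avoiding oversampling from noise.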
|
38
|
Xing H, Wang G, Liu C, Suo M. PM2.5 concentration modeling and prediction by using temperature-based deep belief network. Neural Netw 2020; 133:157-165. [PMID: 33217684 DOI: 10.1016/j.neunet.2020.10.013] [Citation(s) in RCA: 20] [Impact Index Per Article: 4.0] [Received: 05/19/2020] [Revised: 09/24/2020] [Accepted: 10/26/2020] [Indexed: 10/23/2022]
Abstract
Air quality prediction is a global hot issue, and PM2.5 is an important factor affecting air quality. Due to complicated causes of formation, PM2.5 prediction is a thorny and challenging task. In this paper, a novel deep learning model named temperature-based deep belief networks (TDBN) is proposed to predict the daily concentrations of PM2.5 for the next day. First, the prediction site is Chaoyang Park in Beijing, China, with data from January 1, 2018 to October 27, 2018. The auxiliary variables are selected as input variables of TDBN by Partial Least Square (PLS), and the corresponding data is divided into three independent sections: training samples, validating samples and testing samples. Second, the TDBN is composed of temperature-based restricted Boltzmann machine (RBM), where temperature is considered as an effective physical parameter in energy balance of training RBM. The structural parameters of TDBN, including the number of hidden layers, the number of hidden neurons, and the temperature value, are determined by minimizing the error in the training process. Finally, the testing samples are used to test the performance of the proposed TDBN on PM2.5 prediction, and other similar models are tested on the same testing samples for comparison with TDBN. The experimental results demonstrate that TDBN performs better than its peers in root mean square error (RMSE), mean absolute error (MAE) and coefficient of determination (R²).
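For reference, the three evaluation metrics named in the abstract above are usually defined as follows. The abstract itself gives no formulas, so these are the standard textbook definitions, and the symbols are assumptions: $y_i$ is the observed PM2.5 concentration, $\hat{y}_i$ the predicted value, $\bar{y}$ the mean of the observations, and $n$ the number of testing samples.

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2},
\qquad
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|,
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}
```

Lower RMSE and MAE indicate smaller prediction errors, while an $R^2$ closer to 1 indicates that the model explains more of the variance in the observations.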
Affiliation(s)
- Haixia Xing: College of Computer, Jiangsu Vocational College of Electronics and Information, Huai'an 223003, China
- Gongming Wang: Center for Intelligent and Networked Systems (CFINS), Department of Automation, Tsinghua University, Beijing 100084, China
- Caixia Liu: Department of Environmental Engineering, Peking University, Beijing 100871, China
- Minghe Suo: College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China
|
39
|
Huang Z, Xu X, Zhu H, Zhou M. An Efficient Group Recommendation Model With Multiattention-Based Neural Networks. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2020; 31:4461-4474. [PMID: 31944999 DOI: 10.1109/tnnls.2019.2955567] [Citation(s) in RCA: 37] [Impact Index Per Article: 7.4] [Indexed: 06/10/2023]
Abstract
Group recommendation research has recently received much attention in the recommender system community. Currently, several deep-learning-based methods are used in group recommendation to learn preferences of groups on items and predict the next ones in which groups may be interested. However, their recommendation effectiveness is disappointing. To address this challenge, this article proposes a novel model called a multiattention-based group recommendation model (MAGRM). It makes effective use of multiattention-based deep neural network structures to achieve accurate group recommendation. We train its two closely related modules: vector representation for group features and preference learning for groups on items. The former is proposed to learn to accurately represent each group's deep semantic features. It integrates four aspects of subfeatures: group co-occurrence, group description, and external and internal social features. In particular, we employ multiattention networks to learn to capture internal social features for groups. The latter employs a neural attention mechanism to depict preference interactions between each group and its members and then combines group and item features to accurately learn group preferences on items. Through extensive experiments on two real-world databases, we show that MAGRM remarkably outperforms the state-of-the-art methods in solving a group recommendation problem.
|
40
|
Zhu H, Liu G, Zhou M, Xie Y, Abusorrah A, Kang Q. Optimizing Weighted Extreme Learning Machines for imbalanced classification and application to credit card fraud detection. Neurocomputing 2020. [DOI: 10.1016/j.neucom.2020.04.078] [Citation(s) in RCA: 34] [Impact Index Per Article: 6.8] [Indexed: 11/27/2022]
|
41
|
Shu T, Zhang B, Tang YY. Sparse Supervised Representation-Based Classifier for Uncontrolled and Imbalanced Classification. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2020; 31:2847-2856. [PMID: 30582555 DOI: 10.1109/tnnls.2018.2884444] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Indexed: 06/09/2023]
Abstract
The sparse representation-based classification (SRC) has been utilized in many applications and is an effective algorithm in machine learning. However, the performance of SRC highly depends on the data distribution. Some existing works proved that SRC could not obtain satisfactory results on uncontrolled data sets. Beyond uncontrolled data sets, SRC also cannot deal with imbalanced classification. In this paper, we propose a model named sparse supervised representation classifier (SSRC) to solve the above-mentioned issues. The SSRC involves the class label information during the test sample representation phase to deal with the uncontrolled data sets. In SSRC, each class has the opportunity to linearly represent the test sample in its subspace, which can decrease the influences of the uncontrolled data distribution. In order to classify imbalanced data sets, a class weight learning model is proposed and added to SSRC. Each class weight is learned from its corresponding training samples. The experimental results based on the AR face database (uncontrolled) and 15 KEEL data sets (imbalanced) with an imbalance ratio ranging from 1.48 to 61.18 show that SSRC can effectively classify uncontrolled and imbalanced data sets.
|
42
|
Yang J, Wu X, Liang J, Sun X, Cheng MM, Rosin PL, Wang L. Self-Paced Balance Learning for Clinical Skin Disease Recognition. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2020; 31:2832-2846. [PMID: 31199274 DOI: 10.1109/tnnls.2019.2917524] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Indexed: 06/09/2023]
Abstract
Class imbalance is a challenging problem in many classification tasks. It induces biased classification results for minority classes that contain fewer training samples than others. Most existing approaches aim to remedy the imbalanced number of instances among categories by resampling the majority and minority classes accordingly. However, the imbalanced level of difficulty of recognizing different categories is also crucial, especially for distinguishing samples with many classes. For example, in the task of clinical skin disease recognition, several rare diseases have a small number of training samples, but they are easy to diagnose because of their distinct visual properties. On the other hand, some common skin diseases, e.g., eczema, are hard to recognize due to the lack of special symptoms. To address this problem, we propose a self-paced balance learning (SPBL) algorithm in this paper. Specifically, we introduce a comprehensive metric termed the complexity of image category that is a combination of both sample number and recognition difficulty. First, the complexity is initialized using the model of the first pace, where the pace indicates one iteration in the self-paced learning paradigm. We then assign each class a penalty weight that is larger for more complex categories and smaller for easier ones, after which the curriculum is reconstructed by rearranging the training samples. Consequently, the model can iteratively learn discriminative representations via balancing the complexity in each pace. Experimental results on the SD-198 and SD-260 benchmark data sets demonstrate that the proposed SPBL algorithm performs favorably against the state-of-the-art methods. We also demonstrate the effectiveness of the SPBL algorithm's generalization capacity on various tasks, such as indoor scene image recognition and object classification.
|
43
|
Du G, Zhang J, Luo Z, Ma F, Ma L, Li S. Joint imbalanced classification and feature selection for hospital readmissions. Knowl Based Syst 2020. [DOI: 10.1016/j.knosys.2020.106020] [Citation(s) in RCA: 41] [Impact Index Per Article: 8.2] [Indexed: 12/14/2022]
|
44
|
Feng S, Zhao C, Fu P. A cluster-based hybrid sampling approach for imbalanced data classification. THE REVIEW OF SCIENTIFIC INSTRUMENTS 2020; 91:055101. [PMID: 32486749 DOI: 10.1063/5.0008935] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 03/27/2020] [Accepted: 04/15/2020] [Indexed: 06/11/2023]
Abstract
When processing instrumental data by using classification approaches, the imbalanced dataset problem is usually challenging. As the minority class instances could be overwhelmed by the majority class instances, training a typical classifier with such a dataset directly might get poor results in classifying the minority class. We propose a cluster-based hybrid sampling approach CUSS (Cluster-based Under-sampling and SMOTE) for imbalanced dataset classification, which belongs to the type of data-level methods and is different from previously proposed hybrid methods. A new cluster-based under-sampling method is designed for CUSS, and a new strategy to set the expected instance number according to data distribution in the original training dataset is also proposed in this paper. The proposed method is compared with five other popular resampling methods on 15 datasets with different instance numbers and different imbalance ratios. The experimental results show that the CUSS method has good performance and outperforms other state-of-the-art methods.
Affiliation(s)
- Shou Feng: College of Information and Communication Engineering, Harbin Engineering University, Harbin 150001, China
- Chunhui Zhao: College of Information and Communication Engineering, Harbin Engineering University, Harbin 150001, China
- Ping Fu: School of Electronics and Information Engineering, Harbin Institute of Technology, Harbin 150001, China
|