1
|
Tillmann JF, Hsu AI, Schwarz MK, Yttri EA. A-SOiD, an active-learning platform for expert-guided, data-efficient discovery of behavior. Nat Methods 2024; 21:703-711. [PMID: 38383746 DOI: 10.1038/s41592-024-02200-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/04/2022] [Accepted: 01/29/2024] [Indexed: 02/23/2024]
Abstract
To identify and extract naturalistic behavior, two methods have become popular: supervised and unsupervised. Each approach carries its own strengths and weaknesses (for example, user bias, training cost, complexity and action discovery), which the user must consider in their decision. Here, an active-learning platform, A-SOiD, blends these strengths, and in doing so, overcomes several of their inherent drawbacks. A-SOiD iteratively learns user-defined groups with a fraction of the usual training data, while attaining expansive classification through directed unsupervised classification. In socially interacting mice, A-SOiD outperformed standard methods despite requiring 85% less training data. Additionally, it isolated ethologically distinct mouse interactions via unsupervised classification. We observed similar performance and efficiency using nonhuman primate and human three-dimensional pose data. In both cases, the transparency in A-SOiD's cluster definitions revealed the defining features of the supervised classification through a game-theoretic approach. To facilitate use, A-SOiD comes as an intuitive, open-source interface for efficient segmentation of user-defined behaviors and discovered sub-actions.
Affiliation(s)
- Jens F Tillmann
  - Institute of Experimental Epileptology and Cognition Research, University of Bonn, Bonn, Germany
- Alexander I Hsu
  - Department of Biological Sciences, Carnegie Mellon University, Pittsburgh, PA, USA
- Martin K Schwarz
  - Institute of Experimental Epileptology and Cognition Research, University of Bonn, Bonn, Germany
- Eric A Yttri
  - Department of Biological Sciences, Carnegie Mellon University, Pittsburgh, PA, USA
  - Neuroscience Institute, Carnegie Mellon University, Pittsburgh, PA, USA
|
2
|
Zeinolabedini Rezaabad M, Lacey H, Marshall L, Johnson F. Influence of resampling techniques on Bayesian network performance in predicting increased algal activity. WATER RESEARCH 2023; 244:120558. [PMID: 37666153 DOI: 10.1016/j.watres.2023.120558] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/23/2023] [Revised: 08/10/2023] [Accepted: 08/30/2023] [Indexed: 09/06/2023]
Abstract
Early warning of increased algal activity is important to mitigate potential impacts on aquatic life and human health. While many methods have been developed to predict increased algal activity, an ongoing issue is that severe algal blooms often occur with low frequency in water bodies. This results in imbalanced data sets available for model specification, leading to poor predictions of the frequency of increased algal activity. One approach to address this is to resample data sets of increased algal activity to increase the prevalence of higher than normal algal activity in calibration data and ultimately improve model predictions. This study aims to investigate the use of resampling techniques to address the imbalanced dataset and determine if such methods can improve the prediction of increased algal activity. Three techniques were investigated, Kmeans under-sampling (US_Kmeans), synthetic minority over-sampling technique (SMOTE), and 'SMOTE and cluster-based under-sampling technique' (SCUT). The resampling methods were applied to a Bayesian network (BN) model of Lake Burragorang in New South Wales, Australia. The model was developed to predict chlorophyll-a (chl-a) using a range of water quality parameters as predictors. The original data and each of the balanced datasets were used for BN structures and parameter learning. The results showed that the best graphical structure was obtained by adding synthetic data from SMOTE with the highest true positive rate (TPR) and area under the curve (AUC). When compared using a fixed graphical structure for the BN, all resampling techniques increased the ability of the BN to detect events with higher probability of increased algal activity. The resampling model results can also be used to better understand the most important influences on high chl-a concentrations and suggest future data collection and model development priorities.
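The SMOTE step named in this abstract can be illustrated with a minimal sketch. It assumes plain Euclidean distance and a tiny hand-made minority set; the function name `smote` and the toy values are hypothetical and are not taken from the study.

```python
import random

def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of the base point among the other minority samples
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: sq_dist(base, p))[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> mate
        synthetic.append(tuple(b + gap * (m - b) for b, m in zip(base, mate)))
    return synthetic

# Toy minority class (hypothetical values for two scaled predictors)
minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5)]
new_points = smote(minority, n_new=5)
```

Every synthetic point lies on a segment between two existing minority samples, which is why SMOTE raises minority prevalence without duplicating records exactly.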
Affiliation(s)
- Maryam Zeinolabedini Rezaabad
  - Water Research Centre, School of Civil and Environmental Engineering, University of New South Wales, Kensington, New South Wales, Australia; ARC Training Centre Data Analytics for Resources and Environments, School of Life and Environmental Sciences, The University of Sydney, Camperdown, New South Wales, Australia
- Lucy Marshall
  - Water Research Centre, School of Civil and Environmental Engineering, University of New South Wales, Kensington, New South Wales, Australia; ARC Training Centre Data Analytics for Resources and Environments, School of Life and Environmental Sciences, The University of Sydney, Camperdown, New South Wales, Australia; Faculty of Science and Engineering, Macquarie University, North Ryde, New South Wales, Australia
- Fiona Johnson
  - Water Research Centre, School of Civil and Environmental Engineering, University of New South Wales, Kensington, New South Wales, Australia; ARC Training Centre Data Analytics for Resources and Environments, School of Life and Environmental Sciences, The University of Sydney, Camperdown, New South Wales, Australia
|
3
|
Wang J, Sun L, Xing W, Feng G, Yang J, Li J, Li W. Sugarbeet Seed Germination Prediction Using Hyperspectral Imaging Information Fusion. APPLIED SPECTROSCOPY 2023:37028231171908. [PMID: 37246428 DOI: 10.1177/00037028231171908] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/30/2023]
Abstract
Germination rate is an important indicator for seed selection, planting, and quality. In this study, hyperspectral imaging (HSI) was combined with germination tests for feature association analysis and nondestructive prediction of sugarbeet seed germination. Single-seed image segmentation was achieved through binarization, morphological operations, and contour extraction. After a comparative analysis of nine spectral pretreatment methods, SNV + 1D was used to process the average spectrum of sugarbeet seeds. Fourteen characteristic wavelengths were obtained by the Kullback-Leibler (KL) divergence as the spectral characteristics of sugarbeet seeds. Principal component analysis (PCA) and material properties verified the validity of the extracted characteristic wavelengths. Six image features were extracted from the hyperspectral image of each single seed based on the gray-level co-occurrence matrix (GLCM). The spectral features, image features, and fusion features were each used to establish partial least squares discriminant analysis (PLS-DA), CatBoost, and support vector machine radial-basis function (SVM-RBF) models to predict germination. The results showed that fusion features predicted germination better than spectral features or image features alone. In comparison with the other models, the CatBoost model reached a prediction accuracy of up to 93.52%. The results indicated that HSI combined with fusion features enables more accurate, nondestructive prediction of sugarbeet seed germination.
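One plausible way to use the Kullback-Leibler divergence for wavelength ranking is sketched below. The abstract does not say how the divergence was applied, so comparing per-wavelength reflectance histograms of germinated and non-germinated seeds is an assumption, and the wavelength names and histogram values are hypothetical.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """D_KL(P || Q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

# Hypothetical reflectance histograms per wavelength (each sums to 1)
germinated = {"w1": [0.1, 0.6, 0.3], "w2": [0.3, 0.4, 0.3]}
non_germinated = {"w1": [0.5, 0.3, 0.2], "w2": [0.3, 0.4, 0.3]}

# A larger divergence means the two classes differ more at that wavelength
scores = {w: kl_divergence(germinated[w], non_germinated[w]) for w in germinated}
ranked = sorted(scores, key=scores.get, reverse=True)
```

Under this assumption, wavelengths at the top of `ranked` would be the candidates for characteristic wavelengths.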
Affiliation(s)
- Jiaying Wang
  - Key Laboratory of Electronic Engineering, Heilongjiang University, Harbin, China
- Laijun Sun
  - Key Laboratory of Electronic Engineering, Heilongjiang University, Harbin, China
- Wang Xing
  - Key Laboratory of Sugarbeet Genetics and Breeding, Heilongjiang University, Harbin, China
- Guojun Feng
  - Key Laboratory of Sugarbeet Genetics and Breeding, Heilongjiang University, Harbin, China
- Jun Yang
  - Key Laboratory of Electronic Engineering, Heilongjiang University, Harbin, China
- Jiajia Li
  - Key Laboratory of Sugarbeet Genetics and Breeding, Heilongjiang University, Harbin, China
- Wangsheng Li
  - Key Laboratory of Sugarbeet Genetics and Breeding, Heilongjiang University, Harbin, China
|
4
|
Zhao S, Meng J, Wekesa JS, Luan Y. Identification of small open reading frames in plant lncRNA using class-imbalance learning. Comput Biol Med 2023; 157:106773. [PMID: 36924731 DOI: 10.1016/j.compbiomed.2023.106773] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/01/2023] [Revised: 02/21/2023] [Accepted: 03/09/2023] [Indexed: 03/12/2023]
Abstract
Recently, small open reading frames (sORFs) in long noncoding RNA (lncRNA) have been demonstrated to encode small peptides that can help study the mechanisms of growth and development in organisms. Since machine learning-based computational methods are less costly compared with biological experiments, they can be used to identify sORFs and provide a basis for biological experiments. However, few computational methods and data resources have been exploited for identifying sORFs in plant lncRNA. Besides, machine learning models produce underperforming classifiers when faced with a class-imbalance problem. In this study, an alternative method called SMOTE based on weighted cosine distance (WCDSMOTE) which enables interaction with feature selection is put forward to synthesize minority class samples and weighted edited nearest neighbor (WENN) is applied to clean up majority class samples, thus, hybrid sampling WCDSMOTE-ENN is proposed to deal with imbalanced datasets with the multi-angle feature. A heterogeneous classifier ensemble is introduced to complete the classification task. Therefore, a novel computational method that is based on class-imbalance learning to identify the sORFs with coding potential in plant lncRNA (sORFplnc) is presented. Experimental results manifest that sORFplnc outperforms existing computational methods in identifying sORFs with coding potential. We anticipate that the proposed work can be a reference for relevant research and contribute to agriculture and biomedicine.
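A weighted cosine distance of the kind named in this abstract might look like the sketch below. The exact definition used by WCDSMOTE is not given here, so this per-feature weighting is an assumption; the idea is that weights from feature selection control how much each feature contributes to neighbour search.

```python
import math

def weighted_cosine_distance(a, b, w):
    """1 - weighted cosine similarity, with one non-negative weight per feature."""
    dot = sum(wi * ai * bi for wi, ai, bi in zip(w, a, b))
    norm_a = math.sqrt(sum(wi * ai * ai for wi, ai in zip(w, a)))
    norm_b = math.sqrt(sum(wi * bi * bi for wi, bi in zip(w, b)))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("zero vector under the given weights")
    return 1.0 - dot / (norm_a * norm_b)

# Equal weights: orthogonal vectors are maximally distant
d_equal = weighted_cosine_distance((1.0, 0.0), (0.0, 1.0), (1.0, 1.0))
# Zero weight on the second feature: the vectors now look identical
d_masked = weighted_cosine_distance((1.0, 1.0), (1.0, 0.0), (1.0, 0.0))
```

Down-weighting an uninformative feature shrinks the distance between samples that differ only in that feature.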
Affiliation(s)
- Siyuan Zhao
  - School of Computer Science and Technology, Dalian University of Technology, Dalian, Liaoning, 116024, China
- Jun Meng
  - School of Computer Science and Technology, Dalian University of Technology, Dalian, Liaoning, 116024, China
- Jael Sanyanda Wekesa
  - Department of Information Technology, Jomo Kenyatta University of Agriculture and Technology, Nairobi, 62000-00200, Kenya
- Yushi Luan
  - School of Bioengineering, Dalian University of Technology, Dalian, Liaoning, 116024, China
|
5
|
Chen X, Ye P, Huang L, Wang C, Cai Y, Deng L, Ren H. Exploring science-technology linkages: A deep learning-empowered solution. Inf Process Manag 2023. [DOI: 10.1016/j.ipm.2022.103255] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/24/2022]
|
6
|
Lu S, Fuggle NR, Westbury LD, Ó Breasail M, Bevilacqua G, Ward KA, Dennison EM, Mahmoodi S, Niranjan M, Cooper C. Machine learning applied to HR-pQCT images improves fracture discrimination provided by DXA and clinical risk factors. Bone 2023; 168:116653. [PMID: 36581259 DOI: 10.1016/j.bone.2022.116653] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/06/2022] [Revised: 12/16/2022] [Accepted: 12/21/2022] [Indexed: 12/28/2022]
Abstract
BACKGROUND Traditional analysis of High Resolution peripheral Quantitative Computed Tomography (HR-pQCT) images results in a multitude of cortical and trabecular parameters which would be potentially cumbersome to interpret for clinicians compared to user-friendly tools utilising clinical parameters. A computer vision approach (by which the entire scan is 'read' by a computer algorithm) to ascertain fracture risk, would be far simpler. We therefore investigated whether a computer vision and machine learning technique could improve upon selected clinical parameters in assessing fracture risk. METHODS Participants of the Hertfordshire Cohort Study (HCS) attended research visits at which height and weight were measured; fracture history was determined via self-report and vertebral fracture assessment. Bone microarchitecture was assessed via HR-pQCT scans of the non-dominant distal tibia (Scanco XtremeCT), and bone mineral density measurement and lateral vertebral assessment were performed using dual-energy X-ray absorptiometry (DXA) (Lunar Prodigy Advanced). Images were cropped, pre-processed and texture analysis was performed using a three-dimensional local binary pattern method. These image data, together with age, sex, height, weight, BMI, dietary calcium and femoral neck BMD, were used in a random-forest classification algorithm. Receiver operating characteristic (ROC) analysis was used to compare fracture risk identification methods. RESULTS Overall, 180 males and 165 females were included in this study with a mean age of approximately 76 years and 97 (28 %) participants had sustained a previous fracture. Using clinical risk factors alone resulted in an area under the curve (AUC) of 0.70 (95 % CI: 0.56-0.84), which improved to 0.71 (0.57-0.85) with the addition of DXA-measured BMD. 
The addition of HR-pQCT image data to the machine learning classifier with clinical risk factors and DXA-measured BMD as inputs led to an improved AUC of 0.90 (0.83-0.96) with a sensitivity of 0.83 and specificity of 0.74. CONCLUSION These results suggest that applying a three-dimensional computer vision method to HR-pQCT scans may enhance the identification of those at risk of fracture beyond that afforded by clinical risk factors and DXA-measured BMD. This approach has the potential to make the information offered by HR-pQCT more accessible (and therefore applicable) to healthcare professionals in the clinic if the technology becomes more widely available.
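The metrics reported here (AUC, sensitivity, specificity) can be computed as in the following sketch. The labels, scores, and the 0.5 threshold are hypothetical toy values, not data from the study.

```python
def roc_auc(y_true, scores):
    """AUC as the probability that a random positive case outranks a random
    negative case (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def sensitivity_specificity(y_true, y_pred):
    """Return (true positive rate, true negative rate)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)

# Toy example: 1 = previous fracture, 0 = no fracture (hypothetical)
y_true = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

auc = roc_auc(y_true, scores)
sens, spec = sensitivity_specificity(y_true, y_pred)
```

AUC summarises ranking quality over all thresholds, whereas sensitivity and specificity describe a single operating point.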
Affiliation(s)
- Shengyu Lu
  - Faculty of Engineering and Physical Sciences, Electronics and Computer Science, University of Southampton, UK
- Nicholas R Fuggle
  - MRC Lifecourse Epidemiology Centre, University of Southampton, Southampton, UK; The Alan Turing Institute, London, UK
- Leo D Westbury
  - MRC Lifecourse Epidemiology Centre, University of Southampton, Southampton, UK
- Mícheál Ó Breasail
  - Population Health Sciences, Bristol Medical School, University of Bristol, Bristol, UK
- Gregorio Bevilacqua
  - MRC Lifecourse Epidemiology Centre, University of Southampton, Southampton, UK
- Kate A Ward
  - MRC Lifecourse Epidemiology Centre, University of Southampton, Southampton, UK; NIHR Southampton Biomedical Research Centre, University of Southampton and University Hospital Southampton NHS Foundation Trust, Southampton, UK
- Elaine M Dennison
  - MRC Lifecourse Epidemiology Centre, University of Southampton, Southampton, UK; Victoria University of Wellington, Wellington, New Zealand
- Sasan Mahmoodi
  - Faculty of Engineering and Physical Sciences, Electronics and Computer Science, University of Southampton, UK
- Mahesan Niranjan
  - Faculty of Engineering and Physical Sciences, Electronics and Computer Science, University of Southampton, UK
- Cyrus Cooper
  - MRC Lifecourse Epidemiology Centre, University of Southampton, Southampton, UK; NIHR Southampton Biomedical Research Centre, University of Southampton and University Hospital Southampton NHS Foundation Trust, Southampton, UK; NIHR Oxford Biomedical Research Centre, University of Oxford, Oxford, UK
|
7
|
Zhu QX, Zhang HT, Tian Y, Zhang N, Xu Y, He YL. Co-training based virtual sample generation for solving the small sample size problem in process industry. ISA TRANSACTIONS 2023; 134:290-301. [PMID: 36064497 DOI: 10.1016/j.isatra.2022.08.021] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/31/2022] [Revised: 08/20/2022] [Accepted: 08/22/2022] [Indexed: 06/15/2023]
Abstract
With the development of industrialization, the production scale and complexity of process industries are getting larger and larger. But, limited by the small amounts of samples and the uneven sample distribution in the process industry, it is difficult to establish accurate and efficient data-driven soft sensor models to predict some variables. To further develop the application of soft sensor models, generating new virtual samples based on the original sample distribution to extend the sample set is an ideal approach to solve this problem. In this paper, a novel virtual sample generation method based on the co-training of two K-Nearest Neighbor (KNN) models is proposed. First, according to the sparse parameter, sparse regions in each dimension of the feature space are identified. Second, the input features of virtual samples are generated in these sparse regions by performing interpolation operations. Third, the outputs of virtual samples are predicted by double KNN regressors based on co-training. The qualified virtual samples are screened and the model is updated using these virtual samples to improve the prediction accuracy of the double KNN models. To verify the effectiveness and superiority of the proposed virtual sample generation method based on the co-training (CTVSG), case studies are conducted using two standard functions and a Purified Terephthalic Acid (PTA) industrial dataset, where the effectiveness of CTVSG is confirmed.
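The three steps described above (find sparse regions, interpolate inputs, predict and screen outputs with two KNN regressors) can be sketched for a one-dimensional input. This is a heavily simplified illustration, not the CTVSG algorithm: the gap rule, the choice of k = 2 and k = 4, and the agreement tolerance are all assumptions, and simple agreement screening stands in for the iterative co-training described in the abstract.

```python
def knn_predict(train, x, k):
    """Mean target of the k training points nearest to x (1-D inputs)."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)

def generate_virtual_samples(train, gap_factor=1.5, tol=0.5):
    """Place a virtual input in each unusually wide gap and keep it only if
    two KNN regressors with different k agree on its output."""
    xs = sorted(x for x, _ in train)
    gaps = [b - a for a, b in zip(xs, xs[1:])]
    mean_gap = sum(gaps) / len(gaps)
    virtual = []
    for a, b in zip(xs, xs[1:]):
        if b - a > gap_factor * mean_gap:       # sparse region
            x_new = (a + b) / 2                 # interpolated input
            y1 = knn_predict(train, x_new, k=2)
            y2 = knn_predict(train, x_new, k=4)
            if abs(y1 - y2) <= tol:             # screening by agreement
                virtual.append((x_new, (y1 + y2) / 2))
    return virtual

# Toy data with a sparse region between x = 2 and x = 6 (hypothetical)
train = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (6.0, 6.0), (7.0, 7.0)]
virtual = generate_virtual_samples(train)
```

In this toy case the single wide gap receives one virtual sample at its midpoint, which both regressors label identically.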
Affiliation(s)
- Qun-Xiong Zhu
  - College of Information Science & Technology, Beijing University of Chemical Technology, Beijing, 100029, China; Engineering Research Center of Intelligent PSE, Ministry of Education of China, Beijing 100029, China
- Hong-Tao Zhang
  - College of Information Science & Technology, Beijing University of Chemical Technology, Beijing, 100029, China; Engineering Research Center of Intelligent PSE, Ministry of Education of China, Beijing 100029, China
- Ye Tian
  - College of Information Science & Technology, Beijing University of Chemical Technology, Beijing, 100029, China; Engineering Research Center of Intelligent PSE, Ministry of Education of China, Beijing 100029, China
- Ning Zhang
  - College of Information Science & Technology, Beijing University of Chemical Technology, Beijing, 100029, China; Engineering Research Center of Intelligent PSE, Ministry of Education of China, Beijing 100029, China
- Yuan Xu
  - College of Information Science & Technology, Beijing University of Chemical Technology, Beijing, 100029, China; Engineering Research Center of Intelligent PSE, Ministry of Education of China, Beijing 100029, China
- Yan-Lin He
  - College of Information Science & Technology, Beijing University of Chemical Technology, Beijing, 100029, China; Engineering Research Center of Intelligent PSE, Ministry of Education of China, Beijing 100029, China
|
8
|
Adaptive learning for single-output complex systems via data augmentation and data type identification. Appl Soft Comput 2022. [DOI: 10.1016/j.asoc.2022.109895] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/03/2022]
|
9
|
El Moutaouakil K, Roudani M, El Ouissari A. Optimal Entropy Genetic Fuzzy-C-Means SMOTE (OEGFCM-SMOTE). Knowl Based Syst 2022. [DOI: 10.1016/j.knosys.2022.110235] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/29/2022]
|
10
|
A Tailored Particle Swarm and Egyptian Vulture Optimization-Based Synthetic Minority-Oversampling Technique for Class Imbalance Problem. INFORMATION 2022. [DOI: 10.3390/info13080386] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022] Open
Abstract
Class imbalance is one of the significant challenges in classification problems. The uneven distribution of data samples in different classes may occur due to human error, improper/unguided collection of data samples, etc. The uneven distribution of class samples among classes may affect the classification accuracy of the developed model. The main motivation behind this study is the design and development of methodologies for handling class imbalance problems. In this study, a new variant of the synthetic minority oversampling technique (SMOTE) has been proposed with the hybridization of particle swarm optimization (PSO) and Egyptian vulture (EV). The proposed method has been termed SMOTE-PSOEV in this study. The proposed method generates an optimized set of synthetic samples from traditional SMOTE and augments the five datasets for verification and validation. The SMOTE-PSOEV is then compared with existing SMOTE variants, i.e., Tomek Link, Borderline SMOTE1, Borderline SMOTE2, Distance SMOTE, and ADASYN. After data augmentation to the minority classes, the performance of SMOTE-PSOEV has been evaluated using support vector machine (SVM), Naïve Bayes (NB), and k-nearest-neighbor (k-NN) classifiers. The results illustrate that the proposed models achieved higher accuracy than existing SMOTE variants.
|
11
|
Dritsas E, Trigka M. Machine Learning Methods for Hypercholesterolemia Long-Term Risk Prediction. SENSORS (BASEL, SWITZERLAND) 2022; 22:s22145365. [PMID: 35891045 PMCID: PMC9322993 DOI: 10.3390/s22145365] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 06/19/2022] [Revised: 07/12/2022] [Accepted: 07/16/2022] [Indexed: 06/12/2023]
Abstract
Cholesterol is a waxy substance found in blood lipids. Its role in the human body is helpful in the process of producing new cells as long as it is at a healthy level. When cholesterol exceeds the permissible limits, it works the opposite, causing serious heart health problems. When a person has high cholesterol (hypercholesterolemia), the blood vessels are blocked by fats, and thus, circulation through the arteries becomes difficult. The heart does not receive the oxygen it needs, and the risk of heart attack increases. Nowadays, machine learning (ML) has gained special interest from physicians, medical centers and healthcare providers due to its key capabilities in health-related issues, such as risk prediction, prognosis, treatment and management of various conditions. In this article, a supervised ML methodology is outlined whose main objective is to create risk prediction tools with high efficiency for hypercholesterolemia occurrence. Specifically, a data understanding analysis is conducted to explore the features association and importance to hypercholesterolemia. These factors are utilized to train and test several ML models to find the most efficient for our purpose. For the evaluation of the ML models, precision, recall, accuracy, F-measure, and AUC metrics have been taken into consideration. The derived results highlighted Soft Voting with Rotation and Random Forest trees as base models, which achieved better performance in comparison to the other models with an AUC of 94.5%, precision of 92%, recall of 91.8%, F-measure of 91.7% and an accuracy equal to 91.75%.
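Soft voting, the ensemble scheme highlighted in this abstract, averages the class probabilities of the base models and picks the most probable class. The sketch below assumes equal model weights; the probability values are hypothetical and do not come from the study.

```python
def soft_vote(prob_lists):
    """Average class-probability vectors from several models and return
    (predicted class index, averaged probabilities)."""
    n_models = len(prob_lists)
    n_classes = len(prob_lists[0])
    avg = [sum(p[c] for p in prob_lists) / n_models for c in range(n_classes)]
    return avg.index(max(avg)), avg

# Hypothetical outputs: [P(no hypercholesterolemia), P(hypercholesterolemia)]
rotation_forest = [0.40, 0.60]
random_forest = [0.55, 0.45]
label, avg = soft_vote([rotation_forest, random_forest])
```

Here the two base models disagree on the most likely class, and the averaged probabilities decide in favour of class 1 because the first model is more confident.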
|
12
|
Dritsas E, Trigka M. Data-Driven Machine-Learning Methods for Diabetes Risk Prediction. SENSORS 2022; 22:s22145304. [PMID: 35890983 PMCID: PMC9318204 DOI: 10.3390/s22145304] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 06/26/2022] [Revised: 07/10/2022] [Accepted: 07/13/2022] [Indexed: 01/11/2023]
Abstract
Diabetes mellitus is a chronic condition characterized by a disturbance in the metabolism of carbohydrates, fats and proteins. The most characteristic disorder in all forms of diabetes is hyperglycemia, i.e., elevated blood sugar levels. The modern way of life has significantly increased the incidence of diabetes. Therefore, early diagnosis of the disease is a necessity. Machine Learning (ML) has gained great popularity among healthcare providers and physicians due to its high potential in developing efficient tools for risk prediction, prognosis, treatment and the management of various conditions. In this study, a supervised learning methodology is described that aims to create risk prediction tools with high efficiency for type 2 diabetes occurrence. A features analysis is conducted to evaluate their importance and explore their association with diabetes. These features are the most common symptoms that often develop slowly with diabetes, and they are utilized to train and test several ML models. Various ML models are evaluated in terms of the Precision, Recall, F-Measure, Accuracy and AUC metrics and compared under 10-fold cross-validation and data splitting. Both validation methods highlighted Random Forest and K-NN as the best performing models in comparison to the other models.
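The 10-fold cross-validation mentioned above partitions the samples into ten folds and uses each fold once as the test set. The sketch below shows only the index bookkeeping; it does not shuffle or stratify, which a real evaluation would usually add, and the sample count of 25 is arbitrary.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, test_idx) pairs; fold sizes differ by at most one."""
    indices = list(range(n_samples))
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(25, k=10))
```

Each sample appears in exactly one test fold, so every observation contributes once to the reported metrics.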
|
13
|
Wang J, Sun L, Feng G, Bai H, Yang J, Gai Z, Zhao Z, Zhang G. Intelligent detection of hard seeds of snap bean based on hyperspectral imaging. SPECTROCHIMICA ACTA. PART A, MOLECULAR AND BIOMOLECULAR SPECTROSCOPY 2022; 275:121169. [PMID: 35358780 DOI: 10.1016/j.saa.2022.121169] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 12/28/2021] [Revised: 02/14/2022] [Accepted: 03/14/2022] [Indexed: 06/14/2023]
Abstract
As a common problem in snap beans, hard seed has seriously affected the large-scale industrial planting and yield of snap bean. Accurate, quick and non-destructive identification of the hard seeds of snap bean is of great significance for avoiding the effects of hard seeds on germination and growth. This research used hyperspectral imaging (HSI) to achieve accurate detection of hard seeds of snap bean. The study obtained the characteristic spectra from the hyperspectral image of a single seed, and then combined the synthetic minority over-sampling technique (SMOTE) and Tomek links to balance the numbers of hard and non-hard seed samples. The average spectrum was processed by first derivative (1D), after which the characteristic wavelengths were extracted using the successive projections algorithm (SPA). Finally, a radial basis function-support vector machine (RBF-SVM) model was established to realize the intelligent detection of hard seeds, and the detection accuracy reached 89.32%. The results showed that HSI technology could achieve accurate, fast and non-destructive testing of the hard seeds of snap bean, which is of great significance to the large-scale and standardized planting of snap bean and to increasing the yield per unit area.
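A Tomek link, the cleaning criterion paired with SMOTE in this abstract, is a pair of samples from different classes that are each other's nearest neighbour. The sketch below only detects such pairs, using Euclidean distance on hypothetical one-dimensional toy data; in combined resampling schemes, one or both members of each link are then typically removed.

```python
def tomek_links(points, labels):
    """Return index pairs (i, j), i < j, that are mutual nearest neighbours
    and carry different class labels."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    def nearest(i):
        return min((j for j in range(len(points)) if j != i),
                   key=lambda j: sq_dist(points[i], points[j]))

    nn = [nearest(i) for i in range(len(points))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and labels[i] != labels[j]]

# Toy data: 0 = non-hard seed, 1 = hard seed (hypothetical values)
points = [(0.0,), (0.1,), (1.0,), (1.05,), (3.0,)]
labels = [0, 0, 0, 1, 1]
links = tomek_links(points, labels)
```

The only link found sits at the class boundary, which is where removing samples sharpens the separation between classes.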
Affiliation(s)
- Jiaying Wang
  - School of Electronic Engineering (Heilongjiang University), Harbin, Heilongjiang, China
- Laijun Sun
  - School of Electronic Engineering (Heilongjiang University), Harbin, Heilongjiang, China
- Guojun Feng
  - College of Modern Agriculture and Ecological Environment (Heilongjiang University), Harbin, Heilongjiang, China
- Hongyi Bai
  - School of Electronic Engineering (Heilongjiang University), Harbin, Heilongjiang, China
- Jun Yang
  - School of Electronic Engineering (Heilongjiang University), Harbin, Heilongjiang, China
- Zhaodong Gai
  - School of Electronic Engineering (Heilongjiang University), Harbin, Heilongjiang, China
- Zhide Zhao
  - School of Electronic Engineering (Heilongjiang University), Harbin, Heilongjiang, China
- Guanghui Zhang
  - College of Modern Agriculture and Ecological Environment (Heilongjiang University), Harbin, Heilongjiang, China
|
14
|
Zhang C, Soda P, Bi J, Fan G, Almpanidis G, García S, Ding W. An empirical study on the joint impact of feature selection and data resampling on imbalance classification. APPL INTELL 2022. [DOI: 10.1007/s10489-022-03772-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/30/2022]
|
15
|
Stroke Risk Prediction with Machine Learning Techniques. SENSORS 2022; 22:s22134670. [PMID: 35808172 PMCID: PMC9268898 DOI: 10.3390/s22134670] [Citation(s) in RCA: 14] [Impact Index Per Article: 7.0] [Received: 05/27/2022] [Revised: 06/16/2022] [Accepted: 06/20/2022] [Indexed: 01/25/2023]
Abstract
A stroke is caused when blood flow to a part of the brain is stopped abruptly. Without the blood supply, the brain cells gradually die, and disability occurs depending on the area of the brain affected. Early recognition of symptoms can provide valuable information for the prediction of stroke and for promoting a healthy life. In this research work, with the aid of machine learning (ML), several models are developed and evaluated to design a robust framework for the long-term risk prediction of stroke occurrence. The main contribution of this study is a stacking method that achieves a high performance that is validated by various metrics, such as AUC, precision, recall, F-measure and accuracy. The experiment results showed that the stacking classification outperforms the other methods, with an AUC of 98.9%, F-measure, precision and recall of 97.4% and an accuracy of 98%.
|
16
|
A New Body Weight Lifelog Outliers Generation Method: Reflecting Characteristics of Body Weight Data. APPLIED SCIENCES-BASEL 2022. [DOI: 10.3390/app12094726] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022]
Abstract
Lifelogs are generated in our daily lives and contain useful information for health monitoring. Nowadays, one can easily obtain various lifelogs from a wearable device such as a smartwatch. These lifelogs could include noise and outliers. In general, the amount of noise and outliers is significantly smaller than that of normal data, resulting in class imbalance. To achieve good analytic accuracy, the noise and outliers should be filtered. Lifelogs have specific characteristics: low volatility and periodicity. It is very important to continuously analyze and manage them within a specific time. To solve the class imbalance problem of outliers in weight lifelog data, we propose a new outlier generation method that reflects the characteristics of body weight. This study compared the proposed method with the SMOTE-based data augmentation and the GAN-based data augmentation methods. Our results confirm that our proposed method for outlier detection was better than the SVM, XGBOOST, and CATBOOST algorithms. Through them, we can reduce the data imbalance level, improve data quality, and improve analytics accuracy.
|
17
|
He Y, Li X, Fournier‐Viger P, Huang JZ, Li M, Salloum S. Observation points classifier ensemble for high‐dimensional imbalanced classification. CAAI TRANSACTIONS ON INTELLIGENCE TECHNOLOGY 2022. [DOI: 10.1049/cit2.12100] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/19/2022] Open
Affiliation(s)
- Yulin He
- College of Computer Science & Software Engineering, Shenzhen University, Shenzhen, China
- Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen University, Shenzhen, China
| | - Xu Li
- College of Computer Science & Software Engineering, Shenzhen University, Shenzhen, China
| | - Joshua Zhexue Huang
- College of Computer Science & Software Engineering, Shenzhen University, Shenzhen, China
- Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen University, Shenzhen, China
| | - Mianjie Li
- Faculty of Information Technology, Macau University of Science and Technology, Macau, China
| | - Salman Salloum
- School of Computing, National University of Singapore, Singapore, Singapore
|
18
|
Huang Y, Liu DR, Lee SJ, Hsu CH, Liu YG. A boosting resampling method for regression based on a conditional variational autoencoder. Inf Sci (N Y) 2022. [DOI: 10.1016/j.ins.2021.12.100] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/05/2022]
|
19
|
An Oversampling Method for Class Imbalance Problems on Large Datasets. APPLIED SCIENCES-BASEL 2022. [DOI: 10.3390/app12073424] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 02/01/2023]
Abstract
Several oversampling methods have been proposed for solving the class imbalance problem. However, most of them require searching the k-nearest neighbors to generate synthetic objects. This requirement makes them time-consuming and therefore unsuitable for large datasets. In this paper, an oversampling method for large class imbalance problems that does not require the k-nearest neighbors search is proposed. According to our experiments on large datasets with different sizes of imbalance, the proposed method is at least twice as fast as the fastest method reported in the literature while obtaining similar oversampling quality.
|
20
|
Comparison of Predictive Models with Balanced Classes Using the SMOTE Method for the Forecast of Student Dropout in Higher Education. ELECTRONICS 2022. [DOI: 10.3390/electronics11030457] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/16/2022]
Abstract
Based on the premise that university student dropout is a social problem in the university ecosystem of any country, technological leverage is a way that allows us to build technological proposals to solve a poorly met need in university education systems. Under this scenario, the study presents and analyzes eight predictive models to forecast university dropout, based on data mining methods and techniques, using WEKA for its implementation, with a dataset of 4365 academic records of students from the National University of Moquegua (UNAM), Peru. The objective is to determine which model presents the best performance indicators to forecast and prevent student dropout. The study aims to propose and compare the accuracy of eight predictive models with balanced classes, using the SMOTE method for the generation of synthetic data. The results allow us to confirm that the predictive model based on Random Forest is the one that presents the highest accuracy and robustness. This study is of great interest to the educational community as it allows for predicting the possible dropout of a student from a university career and being able to take corrective actions both at a global and individual level. The results obtained are highly interesting for the university in which the study has been carried out, obtaining results that generally outperform the results obtained in related works.
|
21
|
Utami E, Oyong I, Raharjo S, Dwi Hartanto A, Adi S. Supervised learning and resampling techniques on DISC personality classification using Twitter information in Bahasa Indonesia. APPLIED COMPUTING AND INFORMATICS 2021. [DOI: 10.1108/aci-03-2021-0054] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/17/2022]
Abstract
Purpose: Gathering knowledge regarding personality traits has long been the interest of academics and researchers in the fields of psychology and computer science. Analyzing profile data from personal social media accounts reduces data collection time, as this method does not require users to fill in any questionnaires. A pure natural language processing (NLP) approach can give decent results, and its reliability can be improved by combining it with machine learning (as shown by previous studies). Design/methodology/approach: Cleaning the dataset and extracting relevant potential features, as assessed by psychological experts, are essential, as Indonesians tend to mix formal words, non-formal words, slang and abbreviations when writing social media posts. For this article, raw data were derived from a predefined dominance, influence, stability and conscientious (DISC) quiz website, returning 316,967 tweets from 1,244 Twitter accounts, filtered to include only personal and Indonesian-language accounts. Using a combination of NLP techniques and machine learning, the authors aim to develop a better approach and more robust model, especially for the Indonesian language. Findings: The authors find that employing a SMOTETomek re-sampling technique and hyperparameter tuning boosts the model's performance on formalized datasets by 57% (as measured through the F1-score). Originality/value: The process of cleaning the dataset and extracting relevant potential features assessed by psychological experts is essential because Indonesian people tend to mix formal words, non-formal words, slang words and abbreviations when writing tweets. Organic data derived from a predefined DISC quiz website yielded 1,244 Twitter accounts and 316,967 tweets.
|
22
|
Wong TT, Tsai HC. Multinomial naïve Bayesian classifier with generalized Dirichlet priors for high-dimensional imbalanced data. Knowl Based Syst 2021. [DOI: 10.1016/j.knosys.2021.107288] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/20/2022]
|
23
|
|
24
|
ODBOT: Outlier detection-based oversampling technique for imbalanced datasets learning. Neural Comput Appl 2021. [DOI: 10.1007/s00521-021-06198-x] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/26/2022]
|
25
|
|
26
|
Kukar M, Gunčar G, Vovko T, Podnar S, Černelč P, Brvar M, Zalaznik M, Notar M, Moškon S, Notar M. COVID-19 diagnosis by routine blood tests using machine learning. Sci Rep 2021; 11:10738. [PMID: 34031483 PMCID: PMC8144373 DOI: 10.1038/s41598-021-90265-9] [Citation(s) in RCA: 66] [Impact Index Per Article: 22.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/10/2020] [Accepted: 05/07/2021] [Indexed: 12/13/2022] Open
Abstract
Physicians taking care of patients with COVID-19 have described different changes in routine blood parameters. However, these changes hinder them from performing COVID-19 diagnoses. We constructed a machine learning model for COVID-19 diagnosis that was based and cross-validated on the routine blood tests of 5333 patients with various bacterial and viral infections, and 160 COVID-19-positive patients. We selected the operational ROC point at a sensitivity of 81.9% and a specificity of 97.9%. The cross-validated AUC was 0.97. The five most useful routine blood parameters for COVID-19 diagnosis according to the feature importance scoring of the XGBoost algorithm were: MCHC, eosinophil count, albumin, INR, and prothrombin activity percentage. t-SNE visualization showed that the blood parameters of the patients with a severe COVID-19 course are more like the parameters of a bacterial than a viral infection. The reported diagnostic accuracy is at least comparable and probably complementary to RT-PCR and chest CT studies. Patients with fever, cough, myalgia, and other symptoms can now have initial routine blood tests assessed by our diagnostic tool. All patients with a positive COVID-19 prediction would then undergo standard RT-PCR studies to confirm the diagnosis. We believe that our results represent a significant contribution to improvements in COVID-19 diagnosis.
Affiliation(s)
- Matjaž Kukar
- Smart Blood Analytics Swiss SA, Höschgasse 25, 8008, Zurich, Switzerland
- Faculty of Computer and Information Science, University of Ljubljana, Ljubljana, Slovenia
| | - Gregor Gunčar
- Smart Blood Analytics Swiss SA, Höschgasse 25, 8008, Zurich, Switzerland
- Faculty of Chemistry and Chemical Technology, University of Ljubljana, Ljubljana, Slovenia
| | - Tomaž Vovko
- Department of Infectious Diseases, University Medical Centre Ljubljana, Ljubljana, Slovenia
| | - Simon Podnar
- Division of Neurology, University Medical Centre Ljubljana, Ljubljana, Slovenia
| | - Peter Černelč
- Division of Internal Medicine, University Medical Centre Ljubljana, Ljubljana, Slovenia
| | - Miran Brvar
- Centre for Clinical Toxicology and Pharmacology, University Medical Centre Ljubljana, Ljubljana, Slovenia
| | - Mateja Zalaznik
- Department of Infectious Diseases, University Medical Centre Ljubljana, Ljubljana, Slovenia
| | - Mateja Notar
- Smart Blood Analytics Swiss SA, Höschgasse 25, 8008, Zurich, Switzerland
| | - Sašo Moškon
- Smart Blood Analytics Swiss SA, Höschgasse 25, 8008, Zurich, Switzerland
| | - Marko Notar
- Smart Blood Analytics Swiss SA, Höschgasse 25, 8008, Zurich, Switzerland.
|
27
|
Yang J, Sun L, Xing W, Feng G, Bai H, Wang J. Hyperspectral prediction of sugarbeet seed germination based on gauss kernel SVM. SPECTROCHIMICA ACTA. PART A, MOLECULAR AND BIOMOLECULAR SPECTROSCOPY 2021; 253:119585. [PMID: 33662700 DOI: 10.1016/j.saa.2021.119585] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 10/31/2020] [Revised: 01/13/2021] [Accepted: 02/01/2021] [Indexed: 06/12/2023]
Abstract
How to quickly and accurately select sugarbeet seeds with reliable germination is very important to sugarbeet planting. In this study, the hyperspectral images of 3072 sugarbeet seeds of the same variety were collected, and were successively processed by binarization, morphology, contour extraction and so on. The average spectrum of the single seed image was obtained by image segmentation. Comprehensive analysis of the evaluation parameters of the five spectral preprocessing methods revealed that the second derivative (2D) processing was optimal. Successive projections algorithm (SPA) was used to extract 16 characteristic wavelengths. Support vector machine radial basis function (SVM-RBF), k-nearest neighbor (KNN) and random forest (RF) models were established at the full wavelength and characteristic wavelength respectively to predict the germination of sugarbeet seeds. By analyzing the prediction accuracy of the three models, it was found that the SVM-RBF model provided the highest prediction accuracy in the test set (the prediction accuracy of the full wavelength was 95.5%, and the prediction accuracy of the characteristic wavelength was 92.32%). The research results showed that the hyperspectral image processing technology could accurately predict the germination rate of sugarbeet seeds, and realize the rapid and non-destructive prediction of the germination status of sugarbeet seeds.
Affiliation(s)
- Jun Yang
- Key Laboratory of Electronic Engineering of Heilongjiang Province, Heilongjiang University, Harbin 150080, China
| | - Laijun Sun
- Key Laboratory of Electronic Engineering of Heilongjiang Province, Heilongjiang University, Harbin 150080, China.
| | - Wang Xing
- Key Laboratory of Sugarbeet Genetics and Breeding, Heilongjiang University, Harbin 150080, China.
| | - Guojun Feng
- Key Laboratory of Sugarbeet Genetics and Breeding, Heilongjiang University, Harbin 150080, China.
| | - Hongyi Bai
- Key Laboratory of Electronic Engineering of Heilongjiang Province, Heilongjiang University, Harbin 150080, China.
| | - Jiaying Wang
- Key Laboratory of Electronic Engineering of Heilongjiang Province, Heilongjiang University, Harbin 150080, China.
|
28
|
Thielmann A, Weisser C, Krenz A, Säfken B. Unsupervised document classification integrating web scraping, one-class SVM and LDA topic modelling. J Appl Stat 2021; 50:574-591. [PMID: 36819086 PMCID: PMC9930816 DOI: 10.1080/02664763.2021.1919063] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 09/30/2022]
Abstract
Unsupervised document classification for imbalanced data sets poses a major challenge. To obtain accurate classification results, training data sets are often created manually by humans which requires expert knowledge, time and money. Depending on the imbalance of the data set, this approach also either requires human labelling of all of the data or it fails to adequately recognize underrepresented categories. We propose an integration of web scraping, one-class Support Vector Machines (SVM) and Latent Dirichlet Allocation (LDA) topic modelling as a multi-step classification rule that circumvents manual labelling. Unsupervised one-class document classification with the integration of out-of-domain training data is achieved and >80% of the target data is correctly classified. The proposed method thus even outperforms common machine learning classifiers and is validated on multiple data sets.
Affiliation(s)
- Anton Thielmann
- Center for Statistics, Georg-August-Universität Göttingen, Göttingen, Germany
| | - Christoph Weisser
- Center for Statistics, Georg-August-Universität Göttingen, Göttingen, Germany; Campus-Institut Data Science (CIDAS), Göttingen, Germany
| | - Astrid Krenz
- Center for Statistics, Georg-August-Universität Göttingen, Göttingen, Germany; Digital Futures at Work Research Centre, University of Sussex, Brighton, UK
| | - Benjamin Säfken
- Center for Statistics, Georg-August-Universität Göttingen, Göttingen, Germany; Campus-Institut Data Science (CIDAS), Göttingen, Germany
|
29
|
García M, Maldonado S, Vairetti C. Efficient n-gram construction for text categorization using feature selection techniques. INTELL DATA ANAL 2021. [DOI: 10.3233/ida-205154] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/15/2022]
Abstract
In this paper, we present a novel approach for n-gram generation in text classification. The a-priori algorithm is adapted to prune word sequences by combining three feature selection techniques. Unlike the traditional two-step approach for text classification in which feature selection is performed after the n-gram construction process, our proposal performs an embedded feature elimination during the application of the a-priori algorithm. The proposed strategy reduces the number of branches to be explored, speeding up the process and making the construction of all the word sequences tractable. Our proposal has the additional advantage of constructing a low-dimensional dataset with only the features that are relevant for classification, that can be used directly without the need for a feature selection step. Experiments on text classification datasets for sentiment analysis demonstrate that our approach yields the best predictive performance when compared with other feature selection approaches, while also facilitating a better understanding of the words and phrases that explain a given task; in our case online reviews and ratings in various domains.
Affiliation(s)
| | - Sebastián Maldonado
- Department of Management Control and Information Systems, School of Economics and Business, University of Chile, Santiago, Chile
- Instituto Sistemas Complejos de Ingeniería (ISCI), Chile
| | - Carla Vairetti
- Universidad de los Andes, Santiago, Chile
- Instituto Sistemas Complejos de Ingeniería (ISCI), Chile
|
30
|
Imbalanced data classification based on diverse sample generation and classifier fusion. INT J MACH LEARN CYB 2021. [DOI: 10.1007/s13042-021-01321-9] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/26/2022]
|
31
|
Abstract
Data imbalance is a thorny issue in machine learning. SMOTE is a famous oversampling method of imbalanced learning. However, it has some disadvantages such as sample overlapping, noise interference, and blindness of neighbor selection. In order to address these problems, we present a new oversampling method, OS-CCD, based on a new concept, the classification contribution degree. The classification contribution degree determines the number of synthetic samples generated by SMOTE for each positive sample. OS-CCD follows the spatial distribution characteristics of original samples on the class boundary, as well as avoids oversampling from noisy points. Experiments on twelve benchmark datasets demonstrate that OS-CCD outperforms six classical oversampling methods in terms of accuracy, F1-score, AUC, and ROC.
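The abstract above says that a per-sample quantity decides how many synthetic points SMOTE generates from each positive sample. A minimal sketch of that idea follows, using only the Python standard library. It is not the authors' OS-CCD implementation: the function name `smote_interpolate` is hypothetical, and the per-sample `counts` are supplied by the caller as a stand-in for the classification contribution degree, which the abstract does not define.

```python
import random
from typing import List, Sequence

Point = List[float]


def smote_interpolate(minority: Sequence[Point], counts: Sequence[int],
                      k: int = 3, seed: int = 0) -> List[Point]:
    """SMOTE-style oversampling with a caller-supplied budget per sample.

    counts[i] is the number of synthetic points to create from minority[i].
    In OS-CCD this budget would come from the classification contribution
    degree; here it is simply an input (assumption for illustration).
    """
    rng = random.Random(seed)
    synthetic: List[Point] = []
    for i, (x, n_new) in enumerate(zip(minority, counts)):
        # Rank the other minority points by squared Euclidean distance to x.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))
        neighbours = others[:k]
        if not neighbours:
            continue
        for _ in range(n_new):
            nb = rng.choice(neighbours)   # pick one of the k nearest neighbours
            gap = rng.random()            # interpolation factor in [0, 1)
            # New point lies on the segment between x and the chosen neighbour.
            synthetic.append([a + gap * (b - a) for a, b in zip(x, nb)])
    return synthetic


if __name__ == "__main__":
    minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
    # Hypothetical budgets: two points from the first sample, one from the third.
    for point in smote_interpolate(minority, counts=[2, 0, 1, 0], k=2):
        print(point)
```

Giving a zero budget to a sample, as for the isolated point `[5.0, 5.0]` here, is one simple way to avoid oversampling from suspected noise.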
|
32
|
Soltanzadeh P, Hashemzadeh M. RCSMOTE: Range-Controlled synthetic minority over-sampling technique for handling the class imbalance problem. Inf Sci (N Y) 2021. [DOI: 10.1016/j.ins.2020.07.014] [Citation(s) in RCA: 23] [Impact Index Per Article: 7.7] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/27/2022]
|
33
|
|
34
|
RUESVMs: An Ensemble Method to Handle the Class Imbalance Problem in Land Cover Mapping Using Google Earth Engine. REMOTE SENSING 2020. [DOI: 10.3390/rs12213484] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/16/2022]
Abstract
Timely and accurate Land Cover (LC) information is required for various applications, such as climate change analysis and sustainable development. Although machine learning algorithms are most likely successful in LC mapping tasks, the class imbalance problem is known as a common challenge in this regard. This problem occurs during the training phase and reduces classification accuracy for infrequent and rare LC classes. To address this issue, this study proposes a new method by integrating random under-sampling of majority classes and an ensemble of Support Vector Machines, namely Random Under-sampling Ensemble of Support Vector Machines (RUESVMs). The performance of RUESVMs for LC classification was evaluated in Google Earth Engine (GEE) over two different case studies using Sentinel-2 time-series data and five well-known spectral indices, including the Normalized Difference Vegetation Index (NDVI), Green Normalized Difference Vegetation Index (GNDVI), Soil-Adjusted Vegetation Index (SAVI), Normalized Difference Built-up Index (NDBI), and Normalized Difference Water Index (NDWI). The performance of RUESVMs was also compared with the traditional SVM and combination of SVM with three benchmark data balancing techniques namely the Random Over-Sampling (ROS), Random Under-Sampling (RUS), and Synthetic Minority Over-sampling Technique (SMOTE). It was observed that the proposed method considerably improved the accuracy of LC classification, especially for the minority classes. After adopting RUESVMs, the overall accuracy of the generated LC map increased by approximately 4.95 percentage points, and this amount for the geometric mean of producer’s accuracies was almost 3.75 percentage points, in comparison to the most accurate data balancing method (i.e., SVM-SMOTE). Regarding the geometric mean of users’ accuracies, RUESVMs also outperformed the SVM-SMOTE method with an average increase of 6.45 percentage points.
|
35
|
Kim KH, Sohn SY. Hybrid neural network with cost-sensitive support vector machine for class-imbalanced multimodal data. Neural Netw 2020; 130:176-184. [DOI: 10.1016/j.neunet.2020.06.026] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/08/2020] [Revised: 06/13/2020] [Accepted: 06/30/2020] [Indexed: 10/23/2022]
|
36
|
Zhu QX, Zhang XH, He YL. Novel Virtual Sample Generation Based on Locally Linear Embedding for Optimizing the Small Sample Problem: Case of Soft Sensor Applications. Ind Eng Chem Res 2020. [DOI: 10.1021/acs.iecr.0c01942] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/29/2022]
Affiliation(s)
- Qun-Xiong Zhu
- College of Information Science & Technology, Beijing University of Chemical Technology, Beijing, 100029, China
- Engineering Research Center of Intelligent PSE, Ministry of Education of China, Beijing, 100029, China
| | - Xiao-Han Zhang
- College of Information Science & Technology, Beijing University of Chemical Technology, Beijing, 100029, China
- Engineering Research Center of Intelligent PSE, Ministry of Education of China, Beijing, 100029, China
| | - Yan-Lin He
- College of Information Science & Technology, Beijing University of Chemical Technology, Beijing, 100029, China
- Engineering Research Center of Intelligent PSE, Ministry of Education of China, Beijing, 100029, China
|
37
|
Combined Generative Adversarial Network and Fuzzy C-Means Clustering for Multi-Class Voice Disorder Detection with an Imbalanced Dataset. APPLIED SCIENCES-BASEL 2020. [DOI: 10.3390/app10134571] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/16/2022]
Abstract
The world has witnessed the success of artificial intelligence deployment for smart healthcare applications. Various studies have suggested that the prevalence of voice disorders in the general population is greater than 10%. An automatic diagnosis for voice disorders via machine learning algorithms is desired to reduce the cost and time needed for examination by doctors and speech-language pathologists. In this paper, a conditional generative adversarial network (CGAN) and improved fuzzy c-means clustering (IFCM) algorithm called CGAN-IFCM is proposed for the multi-class voice disorder detection of three common types of voice disorders. Existing benchmark datasets for voice disorders, the Saarbruecken Voice Database (SVD) and the Voice ICar fEDerico II Database (VOICED), use imbalanced classes. A generative adversarial network offers synthetic data to reduce bias in the detection model. Improved fuzzy c-means clustering considers the relationship between adjacent data points in the fuzzy membership function. To explain the necessity of CGAN and IFCM, a comparison is made between the algorithm with CGAN and that without CGAN. Moreover, the performance is compared between IFCM and traditional fuzzy c-means clustering. Lastly, the proposed CGAN-IFCM outperforms existing models in its true negative rate and true positive rate by 9.9–12.9% and 9.1–44.8%, respectively.
|
38
|
Hadi W, El-Khalili N, AlNashashibi M, Issa G, AlBanna AA. Application of data mining algorithms for improving stress prediction of automobile drivers: A case study in Jordan. Comput Biol Med 2019; 114:103474. [PMID: 31585402 DOI: 10.1016/j.compbiomed.2019.103474] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/25/2019] [Revised: 09/14/2019] [Accepted: 09/26/2019] [Indexed: 11/16/2022]
Abstract
Driving daily through traffic congestion has been recognised as a major cause of stress. High levels of stress while driving negatively impact the driver's decisions which could potentially lead to accidents and other long-term health hazards. Accordingly, there is a great need to determine stress levels for drivers based on measuring and predicting the major causes (features or classes) that increase stress levels. In this paper, the problem of predicting automobile drivers' stress levels, as experienced during actual driving, is investigated through the application of five different data mining algorithms, namely K-Nearest Neighbour (KNN), Decision Tree (J48), Random Forest (RF), Support Vector Machine (SVM), and Artificial Neural Networks (ANN). An experiment was conducted on 14 drivers taking various routes in Amman - Jordan, with a wearable biomedical device attached to the driver to instantly collect physiological data. The collected data (dataset) is grouped into two different categories, namely 'Yes' to signify the presence of stress and 'No' to signify the absence of stress. In order to efficiently apply data mining algorithms to the data set, oversampling was used to avoid the negative effect of driver samples with a lesser class on the prediction of stress. The findings are evaluated in relation to stress prediction and accordingly contrasted alongside standard reference approaches that do not consider oversampling and/or feature selection using the Friedman rank test. The proposed approach, in combination with RF, was seen to surpass any others in terms of accuracy, AUC, specificity, and sensitivity. The accuracy, AUC, specificity, and sensitivity rates produced by RF utilising our proposed approach were 98.92%, 99.91%, 98.46%, and 99.36%, respectively.
Affiliation(s)
- Wa'el Hadi
- Computer Information Systems, University of Petra, Jordan.
|
39
|
Parsa AB, Taghipour H, Derrible S, Mohammadian AK. Real-time accident detection: Coping with imbalanced data. ACCIDENT; ANALYSIS AND PREVENTION 2019; 129:202-210. [PMID: 31170559 DOI: 10.1016/j.aap.2019.05.014] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 01/28/2019] [Revised: 05/16/2019] [Accepted: 05/16/2019] [Indexed: 06/09/2023]
Abstract
Detecting accidents is of great importance since they often impose significant delay and inconvenience to road users. This study compares the performance of two popular machine learning models, Support Vector Machine (SVM) and Probabilistic Neural Network (PNN), to detect the occurrence of accidents on the Eisenhower expressway in Chicago. Accordingly, since the detection of accidents should be as rapid as possible, seven models are trained and tested for each machine learning technique, using traffic condition data from 1 to 7 min after the actual occurrence. The main sources of data used in this study consist of weather condition, accident, and loop detector data. Furthermore, to overcome the problem of imbalanced data (i.e., underrepresentation of accidents in the dataset), the Synthetic Minority Oversampling TEchnique (SMOTE) is used. The results show that although SVM achieves overall higher accuracy, PNN outperforms SVM regarding the Detection Rate (DR) (i.e., percentage of correct accident detections). In addition, while both models perform best at 5 min after the occurrence of accidents, models trained at 3 or 4 min after the occurrence of an accident detect accidents more rapidly while performing reasonably well. Lastly, a sensitivity analysis of PNN for Time-To-Detection (TTD) reveals that the speed difference between upstream and downstream of accidents location is particularly significant to detect the occurrence of accidents.
Affiliation(s)
- Amir Bahador Parsa
- Department of Civil and Materials Engineering, University of Illinois at Chicago, 842 W Taylor St, 2054 ERF, Chicago, IL 60607, United States.
| | - Homa Taghipour
- Department of Civil and Materials Engineering, University of Illinois at Chicago, 842 W Taylor St, 2054 ERF, Chicago, IL 60607, United States.
| | - Sybil Derrible
- Department of Civil and Materials Engineering, Institute of Environmental Science and Policy, University of Illinois at Chicago, 842 W Taylor St, 2071 ERF, Chicago, IL 60607, United States.
| | - Abolfazl Kouros Mohammadian
- Department of Civil and Materials Engineering, University of Illinois at Chicago, 842 W Taylor St, 2093 ERF, Chicago, IL 60607, United States.
|