151
|
Cost-Sensitive Variational Autoencoding Classifier for Imbalanced Data Classification. ALGORITHMS 2022. [DOI: 10.3390/a15050139] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/04/2023]
Abstract
Classification is among the core tasks in machine learning. Existing classification algorithms typically assume that the classes are at least roughly balanced. When applied to imbalanced data, such classifiers tend to ignore the minority class in favor of overall accuracy. This is inadequate because minority-class samples are often the more important ones, such as positive samples in disease diagnosis. In this study, we propose a cost-sensitive variational autoencoding classifier that combines data-level and algorithm-level methods to address imbalanced data classification. Cost-sensitive factors assign a high cost to the misclassification of minority data, which biases the classifier toward the minority class. We also design task-specific misclassification costs by embedding domain knowledge. Experimental results show that the proposed method performs well on the classification of bulk amorphous materials.
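The cost-sensitive idea in this abstract can be illustrated with a class-weighted log loss. This is only a minimal sketch of the general technique, not the paper's variational autoencoding classifier; the function name and the cost values `cost_minority` and `cost_majority` are assumptions chosen for illustration.

```python
import math

def weighted_log_loss(y_true, p_pred, cost_minority=5.0, cost_majority=1.0):
    """Mean binary cross-entropy with per-class misclassification costs.

    y_true: labels, 1 = minority (positive), 0 = majority.
    p_pred: predicted probabilities of class 1.
    cost_minority, cost_majority: hypothetical cost factors (assumptions).
    """
    eps = 1e-12  # keep probabilities away from 0 and 1 so log() is defined
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)
        if y == 1:
            # Error on a minority sample is scaled by the higher cost.
            total += -cost_minority * math.log(p)
        else:
            total += -cost_majority * math.log(1.0 - p)
    return total / len(y_true)
```

With these illustrative costs, an equally confident mistake on a minority sample is penalized five times as heavily as one on a majority sample, which is what pushes the decision boundary toward the minority class.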
|
152
|
|
153
|
Wang X, Gong J, Song Y, Hu J. Adaptively weighted three-way decision oversampling: A cluster imbalanced-ratio based approach. APPL INTELL 2022. [DOI: 10.1007/s10489-022-03394-7] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/28/2022]
|
154
|
Towards hybrid over- and under-sampling combination methods for class imbalanced datasets: an experimental study. Artif Intell Rev 2022. [DOI: 10.1007/s10462-022-10186-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/02/2022]
|
155
|
An Empirical Assessment of Performance of Data Balancing Techniques in Classification Task. APPLIED SCIENCES-BASEL 2022. [DOI: 10.3390/app12083928] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 11/17/2022]
Abstract
Many real-world classification problems, such as fraud detection, intrusion detection, churn prediction, and anomaly detection, suffer from imbalanced datasets. In such tasks, the imbalanced datasets need to be balanced before classifiers are built for prediction. Several data-balancing techniques (DBT) have been discussed in the literature to address this issue, but little work has assessed their performance. In this paper, we therefore empirically assess the performance of data-preprocessing-level data-balancing techniques, namely Under Sampling (US), Over Sampling (OS), Hybrid Sampling (HS), Random Over Sampling Examples (ROSE), Synthetic Minority Over Sampling (SMOTE), and Clustering-Based Under Sampling (CBUS). We used six different classifiers and twenty-five datasets with varying levels of imbalance ratio (IR) to assess the performance of DBT. The experimental results indicate that DBT helps to improve the performance of the classifiers. However, no significant difference was observed among the performance of US, OS, HS, SMOTE, and CBUS. The performance of DBT was also not consistent across varying levels of IR and across different classifiers.
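Of the techniques listed above, SMOTE is the one whose mechanics are least obvious from its name: it creates synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. The following is a minimal pure-Python sketch of that interpolation step, not the implementation evaluated in the paper; the function name `smote_like` and its parameters are assumptions.

```python
import random

def smote_like(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic minority points by neighbour interpolation.

    minority: list of feature tuples (at least two distinct points).
    k: number of nearest minority neighbours to choose from (assumption).
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance to x.
        others = [m for m in minority if m is not x]
        neighbours = sorted(
            others,
            key=lambda m: sum((a - b) ** 2 for a, b in zip(x, m)),
        )[:k]
        nb = rng.choice(neighbours)
        u = rng.random()  # interpolation factor in [0, 1)
        # The new point lies on the segment between x and the chosen neighbour.
        synthetic.append(tuple(a + u * (b - a) for a, b in zip(x, nb)))
    return synthetic
```

Because every synthetic point lies on a segment between two existing minority points, the new samples stay inside the region already occupied by the minority class rather than duplicating points exactly, as plain random oversampling would.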
|
156
|
Zhu Z, Wang Z, Li D, Du W. Globalized Multiple Balanced Subsets With Collaborative Learning for Imbalanced Data. IEEE TRANSACTIONS ON CYBERNETICS 2022; 52:2407-2417. [PMID: 32609619 DOI: 10.1109/tcyb.2020.3001158] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 06/11/2023]
Abstract
The skewed distribution of data makes it difficult to classify minority and majority samples in imbalanced problems. Balanced bagging randomly undersamples the majority samples several times and combines the selected majority samples with the minority samples to form several balanced subsets, in which the numbers of minority and majority samples are roughly equal. However, balanced bagging lacks a unified learning framework. Moreover, it does not consider the connection among the subsets or the global information of the entire data distribution. To this end, this article puts several balanced subsets into an effective learning framework with a criterion function. In the learning framework, one regularization term called RS establishes the connection and realizes collaborative learning across all subsets by requiring consistent outputs for the minority samples in different subsets. In addition, another regularization term called RW provides global information to each basic classifier by reducing the difference between the direction of the solution vector in each subset and that in the entire dataset. The proposed learning framework is called globalized multiple balanced subsets with collaborative learning (GMBSCL). The experimental results validate the effectiveness of the proposed GMBSCL.
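The balanced-bagging step that this abstract builds on can be sketched as follows. This covers only the subset construction (repeated random undersampling of the majority class), not the RS and RW regularization terms of GMBSCL; the function name and parameters are assumptions.

```python
import random

def balanced_subsets(majority, minority, n_subsets=5, seed=0):
    """Build n_subsets balanced training subsets.

    Each subset pairs all minority samples with an equally sized random
    draw (without replacement) from the majority samples.
    """
    rng = random.Random(seed)
    k = min(len(minority), len(majority))
    subsets = []
    for _ in range(n_subsets):
        sampled_majority = rng.sample(majority, k)
        subsets.append((sampled_majority, list(minority)))
    return subsets
```

In plain balanced bagging, one base classifier would be trained per subset and the predictions combined by voting or averaging; the abstract's contribution is to couple those classifiers during training instead of training them independently.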
|
157
|
Tarimo CS, Bhuyan SS, Zhao Y, Ren W, Mohammed A, Li Q, Gardner M, Mahande MJ, Wang Y, Wu J. Prediction of low Apgar score at five minutes following labor induction intervention in vaginal deliveries: machine learning approach for imbalanced data at a tertiary hospital in North Tanzania. BMC Pregnancy Childbirth 2022; 22:275. [PMID: 35365129 PMCID: PMC8976377 DOI: 10.1186/s12884-022-04534-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/18/2021] [Accepted: 02/28/2022] [Indexed: 11/18/2022] Open
Abstract
Background: Prediction of low Apgar score for vaginal deliveries following labor induction intervention is critical for improving neonatal health outcomes. We set out to investigate important attributes and train popular machine learning (ML) algorithms to correctly classify neonates with a low Apgar score from an imbalanced learning perspective. Methods: We analyzed 7716 induced vaginal deliveries from the electronic birth registry of the Kilimanjaro Christian Medical Centre (KCMC), of which 733 (9.5%) involved neonates with a low (< 7) Apgar score. The ‘extra-tree classifier’ was used to assess feature importance. We used Area Under Curve (AUC), recall, precision, F-score, Matthews Correlation Coefficient (MCC), balanced accuracy (BA), bookmaker informedness (BM), and markedness (MK) to evaluate the performance of the six selected machine learning classifiers. To address class imbalance, we examined three widely used resampling techniques: the Synthetic Minority Oversampling Technique (SMOTE), Random Oversampling Examples (ROS), and Random Undersampling (RUS). We applied Decision Curve Analysis (DCA) to evaluate the net benefit of the selected classifiers. Results: Birth weight, maternal age, and gestational age were found to be important predictors of a low Apgar score following induced vaginal delivery. SMOTE, ROS, and RUS were more effective at improving recall than the other metrics in all the models under investigation. A slight improvement was observed in the F1 score, BA, and BM. DCA revealed potential benefits of applying the Boosting method for predicting low Apgar scores among the tested models. Conclusion: There is an opportunity for more algorithms to be tested to arrive at theoretical guidance on more effective rebalancing techniques suitable for this particular imbalance ratio. Future research should prioritize a debate on which performance indicators to rely on when dealing with imbalanced or skewed data.
Supplementary Information The online version contains supplementary material available at 10.1186/s12884-022-04534-0.
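Several of the metrics named in the Methods (recall, precision, F1, BA, BM, MK, MCC) can all be derived from the four cells of a binary confusion matrix. The sketch below uses the standard definitions and is not the study's evaluation code; the function name is an assumption, and it assumes every denominator is non-zero.

```python
import math

def imbalance_metrics(tp, fp, tn, fn):
    """Compute imbalance-aware metrics from confusion-matrix counts.

    Assumes all denominators are non-zero (each class is both present
    in the data and predicted at least once).
    """
    recall = tp / (tp + fn)        # sensitivity, true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    precision = tp / (tp + fp)     # positive predictive value
    npv = tn / (tn + fn)           # negative predictive value
    f1 = 2 * precision * recall / (precision + recall)
    ba = (recall + specificity) / 2   # balanced accuracy
    bm = recall + specificity - 1     # bookmaker informedness
    mk = precision + npv - 1          # markedness
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "recall": recall, "specificity": specificity,
        "precision": precision, "f1": f1,
        "ba": ba, "bm": bm, "mk": mk, "mcc": mcc,
    }
```

As an illustration with made-up counts, `imbalance_metrics(tp=50, fp=100, tn=800, fn=50)` describes a classifier whose plain accuracy is 0.85 even though it recovers only half of the minority class; balanced accuracy (about 0.69) and MCC expose that gap.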
Affiliation(s)
- Clifford Silver Tarimo
  - Department of Epidemiology and Health Statistics, College of Public Health, Zhengzhou University, 100 Kexue Avenue, Zhengzhou, 450001, Henan, China; Department of Science and Laboratory Technology, Dar es Salaam Institute of Technology, P.O. Box 2958, Dar es Salaam, Tanzania
- Soumitra S Bhuyan
  - Rutgers University-New Brunswick, Edward J. Bloustein School of Planning and Public Policy, New Brunswick, USA
- Yizhen Zhao
  - Luoyang Orthopedic Traumatological Hospital of Henan Province, Luoyang, China
- Weicun Ren
  - Department of Epidemiology and Health Statistics, College of Public Health, Zhengzhou University, 100 Kexue Avenue, Zhengzhou, 450001, Henan, China; College of Sanquan, Xinxiang Medical University, Xinxiang, People's Republic of China
- Akram Mohammed
  - Center for Biomedical Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
- Quanman Li
  - Department of Epidemiology and Health Statistics, College of Public Health, Zhengzhou University, 100 Kexue Avenue, Zhengzhou, 450001, Henan, China
- Marilyn Gardner
  - Department of Public Health, Western Kentucky University, 1906 College Heights Blvd, Bowling Green, KY, 42101, USA
- Michael Johnson Mahande
  - Institute of Public Health, Kilimanjaro Christian Medical University College, P.O. Box 2240, Moshi, Tanzania
- Yuhui Wang
  - Centre for Financial and Corporate Integrity, Coventry University, Coventry, UK
- Jian Wu
  - Department of Epidemiology and Health Statistics, College of Public Health, Zhengzhou University, 100 Kexue Avenue, Zhengzhou, 450001, Henan, China; Henan Province Engineering Research Center of Health Economics & Health Technology Assessment, Henan Province, China
|
158
|
|
159
|
Witt UF, Nibe SM, Ole H, Lebech CS. A novel approach for predicting acute hospitalizations among elderly recipients of home care? A model development study. Int J Med Inform 2022; 160:104715. [DOI: 10.1016/j.ijmedinf.2022.104715] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/18/2021] [Revised: 01/25/2022] [Accepted: 02/07/2022] [Indexed: 10/19/2022]
|
160
|
Ren J, Wang Y, Mao M, Cheung YM. Equalization ensemble for large scale highly imbalanced data classification. Knowl Based Syst 2022. [DOI: 10.1016/j.knosys.2022.108295] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 10/19/2022]
|
161
|
Verma R, Pargal S, Das D, Parbat T, Kambalapalli SS, Mitra B, Chakraborty S. Impact of Driving Behavior on Commuter’s Comfort during Cab Rides: Towards a New Perspective of Driver Rating. ACM T INTEL SYST TEC 2022. [DOI: 10.1145/3523063] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/18/2022]
Abstract
Commuter comfort in cab rides affects driver rating as well as the reputation of ride-hailing firms like Uber/Lyft. Existing research has revealed that commuter comfort not only varies at a personalized level but is also perceived differently on different trips by the same commuter. Furthermore, several factors, including driving behavior and driving environment, affect the perception of comfort. Automatically extracting the perceived comfort level of a commuter due to the impact of the driving behavior is crucial for timely feedback to drivers, which can help them meet the commuter’s expectations. In light of this, we surveyed around 200 commuters who usually take such cab rides and obtained a set of features that impact comfort during cab rides. Following this, we develop a system, Ridergo, which collects smartphone sensor data from a commuter, extracts spatial time series features from the data, and then computes the level of commuter comfort on a five-point scale with respect to the driving. Ridergo uses a Hierarchical Temporal Memory model-based approach to observe anomalies in the feature distribution and then trains a multi-task learning-based neural network model to obtain the comfort level of the commuter at a personalized level. The model also intelligently queries the commuter to add new data points to the available dataset and, in turn, improves itself over periodic training. Evaluation of Ridergo on 30 participants shows that the system can provide comfort scores with high accuracy when the driving impacts the perceived comfort.
Affiliation(s)
- Bivas Mitra
  - Indian Institute of Technology Kharagpur, India
|
162
|
Suri JS, Bhagawati M, Paul S, Protogerou AD, Sfikakis PP, Kitas GD, Khanna NN, Ruzsa Z, Sharma AM, Saxena S, Faa G, Laird JR, Johri AM, Kalra MK, Paraskevas KI, Saba L. A Powerful Paradigm for Cardiovascular Risk Stratification Using Multiclass, Multi-Label, and Ensemble-Based Machine Learning Paradigms: A Narrative Review. Diagnostics (Basel) 2022; 12:722. [PMID: 35328275 PMCID: PMC8947682 DOI: 10.3390/diagnostics12030722] [Citation(s) in RCA: 21] [Impact Index Per Article: 7.0] [Received: 02/21/2022] [Revised: 03/10/2022] [Accepted: 03/13/2022] [Indexed: 12/16/2022] Open
Abstract
Background and Motivation: Cardiovascular disease (CVD) causes the highest mortality globally. With escalating healthcare costs, early non-invasive CVD risk assessment is vital. Conventional methods have shown poor performance compared to more recent and fast-evolving Artificial Intelligence (AI) methods. The proposed study reviews the three most recent paradigms for CVD risk assessment, namely multiclass, multi-label, and ensemble-based methods in (i) office-based and (ii) stress-test laboratories. Methods: A total of 265 CVD-based studies were selected using the preferred reporting items for systematic reviews and meta-analyses (PRISMA) model. Due to their popularity and recent development, the study analyzed the above three paradigms using machine learning (ML) frameworks. We comprehensively review these three methods using attributes such as architecture, applications, pros and cons, scientific validation, clinical evaluation, and AI risk-of-bias (RoB) in the CVD framework. These ML techniques were then extended under mobile and cloud-based infrastructure. Findings: The most popular biomarkers used were office-based, laboratory-based, image-based phenotypes, and medication usage. Surrogate carotid scanning for coronary artery risk prediction has shown promising results. Ground truth (GT) selection for AI-based training, along with scientific and clinical validation, is very important for CVD stratification to avoid RoB. The most popular classification paradigm is multiclass, followed by ensemble and multi-label. The use of deep learning techniques in CVD risk stratification is at a very early stage of development. Mobile and cloud-based AI technologies are more likely to be the future. Conclusions: AI-based methods for CVD risk assessment are most promising and successful. Choice of GT is most vital in AI-based models to prevent RoB. The amalgamation of image-based strategies with conventional risk factors provides the highest stability when using the three CVD paradigms in non-cloud and cloud-based frameworks.
Affiliation(s)
- Jasjit S. Suri
  - Stroke Diagnostic and Monitoring Division, AtheroPoint™, Roseville, CA 95661, USA
- Mrinalini Bhagawati
  - Department of Biomedical Engineering, North-Eastern Hill University, Shillong 793022, India
- Sudip Paul
  - Department of Biomedical Engineering, North-Eastern Hill University, Shillong 793022, India
- Athanasios D. Protogerou
  - Research Unit Clinic, Laboratory of Pathophysiology, Department of Cardiovascular Prevention, National and Kapodistrian University of Athens, 11527 Athens, Greece
- Petros P. Sfikakis
  - Rheumatology Unit, National Kapodistrian University of Athens, 11527 Athens, Greece
- George D. Kitas
  - Arthritis Research UK Centre for Epidemiology, Manchester University, Manchester 46962, UK
- Narendra N. Khanna
  - Department of Cardiology, Indraprastha APOLLO Hospitals, New Delhi 110020, India
- Zoltan Ruzsa
  - Department of Internal Medicines, Invasive Cardiology Division, University of Szeged, 6720 Szeged, Hungary
- Aditya M. Sharma
  - Division of Cardiovascular Medicine, University of Virginia, Charlottesville, VA 22903, USA
- Sanjay Saxena
  - Department of CSE, International Institute of Information Technology, Bhubaneswar 751003, India
- Gavino Faa
  - Department of Pathology, A.O.U., di Cagliari-Polo di Monserrato s.s., 09045 Cagliari, Italy
- John R. Laird
  - Cardiology Department, St. Helena Hospital, St. Helena, CA 94574, USA
- Amer M. Johri
  - Department of Medicine, Division of Cardiology, Queen’s University, Kingston, ON K7L 3N6, Canada
- Manudeep K. Kalra
  - Department of Radiology, Massachusetts General Hospital, Boston, MA 02114, USA
- Kosmas I. Paraskevas
  - Department of Vascular Surgery, Central Clinic of Athens, N. Iraklio, 14122 Athens, Greece
- Luca Saba
  - Department of Radiology, A.O.U., di Cagliari-Polo di Monserrato s.s., 09045 Cagliari, Italy
|
163
|
Propension to customer churn in a financial institution: a machine learning approach. Neural Comput Appl 2022; 34:11751-11768. [PMID: 35281625 PMCID: PMC8898559 DOI: 10.1007/s00521-022-07067-x] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 12/07/2021] [Accepted: 02/05/2022] [Indexed: 01/08/2023]
Abstract
This paper examines churn prediction of customers in the banking sector using a unique customer-level dataset from a large Brazilian bank. Our main contribution is in exploring this rich dataset, which contains prior client behavior traits that enable us to document new insights into the main determinants predicting future client churn. We conduct a horserace of many supervised machine learning algorithms under the same cross-validation and evaluation setup, enabling a fair comparison across algorithms. We find that the random forests technique outperforms decision trees, k-nearest neighbors, elastic net, logistic regression, and support vector machines models in several metrics. Our investigation reveals that customers with a stronger relationship with the institution, who have more products and services, who borrow more from the bank, are less likely to close their checking accounts. Using a back-of-the-envelope estimation, we find that our model has the potential to forecast potential losses of up to 10% of the operating result reported by the largest Brazilian banks in 2019, suggesting the model has a significant economic impact. Our results corroborate the importance of investing in cross-selling and upselling strategies focused on their current customers. These strategies can have positive side effects on customer retention.
|
164
|
RoiSeg: An Effective Moving Object Segmentation Approach Based on Region-of-Interest with Unsupervised Learning. APPLIED SCIENCES-BASEL 2022. [DOI: 10.3390/app12052674] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/27/2023]
Abstract
Traditional video object segmentation often has low detection speed and inaccurate results due to the jitter caused by pan-and-tilt or hand-held devices. Deep neural networks (DNNs) have been widely adopted to address these problems; however, they rely on a large amount of annotated data and on high-performance computing units. Therefore, DNNs are not suitable for some special scenarios (e.g., no prior knowledge or no powerful computing ability). In this paper, we propose RoiSeg, an effective moving object segmentation approach based on Region-of-Interest (ROI), which uses an unsupervised learning method to achieve automatic segmentation of moving objects. Specifically, we first hypothesize that the central n × n pixels of images act as the ROI to represent the features of the segmented moving object. Second, we pool the ROI to a central point of the foreground to simplify the segmentation problem into a classification problem based on ROI. Third, we implement a trajectory-based classifier and an online updating mechanism to address the classification problem and the compensation of class imbalance, respectively. We conduct extensive experiments to evaluate the performance of RoiSeg, and the experimental results demonstrate that RoiSeg is more accurate and faster than other segmentation algorithms. Moreover, RoiSeg not only effectively handles ambient lighting changes, fog, and salt-and-pepper noise, but also deals well with camera jitter and windy scenes.
|
165
|
Desprez M, Zawada K, Ramp D. Overcoming the ordinal imbalanced data problem by combining data processing and stacked generalizations. MACHINE LEARNING WITH APPLICATIONS 2022. [DOI: 10.1016/j.mlwa.2021.100241] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022] Open
|
166
|
Zhang J, Wang T, Ng WW, Pedrycz W. Ensembling perturbation-based oversamplers for imbalanced datasets. Neurocomputing 2022. [DOI: 10.1016/j.neucom.2022.01.049] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 10/19/2022]
|
167
|
Pei W, Xue B, Shang L, Zhang M. High-Dimensional Unbalanced Binary Classification by Genetic Programming with Multi-Criterion Fitness Evaluation and Selection. EVOLUTIONARY COMPUTATION 2022; 30:99-129. [PMID: 34902018 DOI: 10.1162/evco_a_00304] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 02/27/2020] [Accepted: 09/10/2021] [Indexed: 06/14/2023]
Abstract
High-dimensional unbalanced classification is challenging because of the joint effects of high dimensionality and class imbalance. Genetic programming (GP) has potential benefits for high-dimensional classification due to its built-in capability to select informative features. However, when data are not evenly distributed, GP tends to develop biased classifiers that achieve high accuracy on the majority class but low accuracy on the minority class. Unfortunately, the minority class is often at least as important as the majority class. It is therefore important to investigate how GP can be used effectively for high-dimensional unbalanced classification. In this article, to address the performance bias of GP, a new two-criterion fitness function is developed that considers the approximation of the area under the curve (AUC) and the classification clarity (i.e., how well a program can separate two classes). The values obtained on the two criteria are combined in pairs instead of being summed. Furthermore, this article designs a three-criterion tournament selection to effectively identify and select good programs to be used by genetic operators for generating offspring during the evolutionary learning process. The experimental results show that the proposed method achieves better classification performance than the compared methods.
Affiliation(s)
- Wenbin Pei
  - School of Engineering and Computer Science, Victoria University of Wellington, PO Box 600, Wellington 6140, New Zealand
- Bing Xue
  - School of Engineering and Computer Science, Victoria University of Wellington, PO Box 600, Wellington 6140, New Zealand
- Lin Shang
  - State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210093, China
- Mengjie Zhang
  - School of Engineering and Computer Science, Victoria University of Wellington, PO Box 600, Wellington 6140, New Zealand
|
168
|
Pankajavalli PB, Karthick GS. An Independent Constructive Multi-class Classification Algorithm for Predicting the Risk Level of Stress Using Multi-modal Data. ARABIAN JOURNAL FOR SCIENCE AND ENGINEERING 2022. [DOI: 10.1007/s13369-022-06643-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/24/2022]
|
169
|
Ratsimbazafindranahaka MN, Huetz C, Andrianarimisa A, Reidenberg JS, Saloma A, Adam O, Charrier I. Characterizing the suckling behavior by video and 3D-accelerometry in humpback whale calves on a breeding ground. PeerJ 2022; 10:e12945. [PMID: 35194528 PMCID: PMC8858581 DOI: 10.7717/peerj.12945] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/08/2021] [Accepted: 01/25/2022] [Indexed: 01/11/2023] Open
Abstract
Getting maternal milk through nursing is vital for all newborn mammals. Despite its importance, nursing has been poorly documented in humpback whales (Megaptera novaeangliae). Nursing is difficult to observe underwater without disturbing the whales and is usually impossible to observe from a ship. We attempted to observe nursing from the calf's perspective by placing CATS cam tags on three humpback whale calves in the Sainte Marie channel, Madagascar, Indian Ocean, during the breeding seasons. CATS cam tags are animal-borne multi-sensor tags equipped with a video camera, a hydrophone, and several auxiliary sensors (including a 3-axis accelerometer, a 3-axis magnetometer, and a depth sensor). The use of multi-sensor tags minimized potential disturbance from human presence. A total of 10.52 h of video recordings were collected with the corresponding auxiliary data. Video recordings were manually analyzed and correlated with the auxiliary data, allowing us to extract different kinematic features including the depth rate, speed, Fluke Stroke Rate (FSR), Overall Dynamic Body Acceleration (ODBA), pitch, roll, and roll rate. We found that suckling events lasted 18.8 ± 8.8 s on average (N = 34) and were performed mostly during dives. Suckling events represented 1.7% of the total observation time. During suckling, the calves were visually estimated to be at a 30-45° pitch angle relative to the midline of their mother's body and were always observed rolling either to the right or to the left. In our auxiliary dataset, we confirmed that suckling behavior was primarily characterized by a high average absolute roll, and we also found that it was likely characterized by a high average FSR and a low average speed. Kinematic features were used for supervised machine learning in order to subsequently detect suckling behavior automatically. Our study is a proof of method on which future investigations can build. It opens new opportunities for further investigation of suckling behavior in humpback whales and other baleen whale species.
Affiliation(s)
- Maevatiana N. Ratsimbazafindranahaka
  - Association Cétamada, Barachois Sainte Marie, Madagascar; Institut des Neurosciences Paris-Saclay, Université Paris-Saclay, CNRS, Saclay, France; Département de Zoologie et Biodiversité Animale, Université d’Antananarivo, Antananarivo, Madagascar
- Chloé Huetz
  - Institut des Neurosciences Paris-Saclay, Université Paris-Saclay, CNRS, Saclay, France
- Aristide Andrianarimisa
  - Département de Zoologie et Biodiversité Animale, Université d’Antananarivo, Antananarivo, Madagascar
- Joy S. Reidenberg
  - Center for Anatomy and Functional Morphology, Icahn School of Medicine at Mount Sinai, New York, United States of America
- Anjara Saloma
  - Association Cétamada, Barachois Sainte Marie, Madagascar
- Olivier Adam
  - Institut des Neurosciences Paris-Saclay, Université Paris-Saclay, CNRS, Saclay, France; Institut Jean Le Rond d’Alembert, Sorbonne Université, Paris, France
- Isabelle Charrier
  - Institut des Neurosciences Paris-Saclay, Université Paris-Saclay, CNRS, Saclay, France
|
170
|
Rafi-Ur-Rashid M, Mahbub M, Adnan MA. Breaking the Curse of Class Imbalance: Bangla Text Classification. ACM T ASIAN LOW-RESO 2022. [DOI: 10.1145/3511601] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/19/2022]
Abstract
This paper addresses the class imbalance issue in a low-resource language, Bengali. As a use case, we choose one of the most fundamental NLP tasks, text classification, and use three benchmark text corpora: a fake news dataset, a sentiment analysis dataset, and a song lyrics dataset. Each of them exhibits a critical class imbalance. We tackle the problem by applying several strategies, including data augmentation with synthetic samples via text and embedding generation to increase the proportion of minority samples. Moreover, we apply ensembling of deep learning models by subsetting the majority samples. Additionally, we apply the focal loss function for class-imbalanced data classification. We also apply outlier detection, data resampling, and hidden feature extraction to improve the minority-f1 score. All of our experiments focus entirely on textual content analysis and result in a minority-f1 score of more than 90% for each of the three tasks. It is an excellent outcome on such highly class-imbalanced datasets.
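The focal loss mentioned in this abstract down-weights examples the model already classifies confidently, so training concentrates on hard (often minority) examples. Below is a minimal sketch of the standard binary focal loss for a single example; the defaults `gamma=2.0` and `alpha=0.25` are commonly used values and are assumptions here, not the settings used in the paper.

```python
import math

def focal_loss(y, p, gamma=2.0, alpha=0.25):
    """Binary focal loss for one example.

    y: true label (1 or 0).
    p: predicted probability of class 1.
    gamma: focusing parameter; gamma = 0 removes the focusing effect.
    alpha: weight for class 1; class 0 gets 1 - alpha.
    """
    eps = 1e-12  # keep p away from 0 and 1 so log() is defined
    p = min(max(p, eps), 1.0 - eps)
    p_t = p if y == 1 else 1.0 - p          # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # (1 - p_t) ** gamma shrinks the loss of well-classified examples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

Setting `gamma=0.0` reduces the expression to an alpha-weighted cross-entropy, which makes it easy to check the focusing term in isolation.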
Affiliation(s)
- Muhammad Abdullah Adnan
  - Bangladesh University of Engineering & Technology (BUET), Bangladesh and United International University, Bangladesh
|
171
|
Zhang S, Khattak A, Matara CM, Hussain A, Farooq A. Hybrid feature selection-based machine learning Classification system for the prediction of injury severity in single and multiple-vehicle accidents. PLoS One 2022; 17:e0262941. [PMID: 35108288 PMCID: PMC8809572 DOI: 10.1371/journal.pone.0262941] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 10/09/2021] [Accepted: 01/07/2022] [Indexed: 11/19/2022] Open
Abstract
To undertake a reliable analysis of injury severity in road traffic accidents, a complete understanding of important attributes is essential. As a result of the shift from traditional statistical parametric procedures to computer-aided methods, machine learning approaches have become an important aspect in predicting the severity of road traffic injuries. The paper presents a hybrid feature selection-based machine learning classification approach for detecting significant attributes and predicting injury severity in single and multiple-vehicle accidents. To begin, we employed a Random Forests (RF) classifier in conjunction with an intrinsic wrapper-based feature selection approach called the Boruta Algorithm (BA) to find the relevant important attributes that determine injury severity. The influential attributes were then fed into a set of four classifiers to accurately predict injury severity (Naive Bayes (NB), K-Nearest Neighbor (K-NN), Binary Logistic Regression (BLR), and Extreme Gradient Boosting (XGBoost)). According to BA's experimental investigation, the vehicle type was the most influential factor, followed by the month of the year, the driver's age, and the alignment of the road segment. The driver's gender, the presence of a median, and the presence of a shoulder were all found to be unimportant. According to classifier performance measures, XGBoost surpasses the other classifiers in terms of prediction performance. Using the specified attributes, the accuracy, Cohen's Kappa, F1-Measure, and AUC-ROC values of the XGBoost were 82.10%, 0.607, 0.776, and 0.880 for single vehicle accidents and 79.52%, 0.569, 0.752, and 0.86 for multiple-vehicle accidents, respectively.
Affiliation(s)
- Shuguang Zhang
  - CCCC Southwest Investment & Development Company Limited, Beijing, China
- Afaq Khattak
  - The Key Laboratory of Road and Traffic Engineering, Ministry of Education, Tongji University, Jiading, Shanghai, China
- Arshad Hussain
  - NUST Institute of Civil Engineering, National University of Sciences and Technology, Islamabad, Pakistan
- Asim Farooq
  - Head of Department at Centre of Excellence in Transportation Engineering, Pak Austria Fachhochschule, Institute of Applied Sciences, Haripur, Pakistan
|
172
|
An imbalanced learning method by combining SMOTE with Center Offset Factor. Appl Soft Comput 2022. [DOI: 10.1016/j.asoc.2022.108618] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2022]
|
173
|
Ng WWY, Xu S, Zhang J, Tian X, Rong T, Kwong S. Hashing-Based Undersampling Ensemble for Imbalanced Pattern Classification Problems. IEEE TRANSACTIONS ON CYBERNETICS 2022; 52:1269-1279. [PMID: 32598288 DOI: 10.1109/tcyb.2020.3000754] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Indexed: 06/11/2023]
Abstract
Undersampling is a popular method to solve imbalanced classification problems. However, sometimes it may remove too many majority samples which may lead to loss of informative samples. In this article, the hashing-based undersampling ensemble (HUE) is proposed to deal with this problem by constructing diversified training subspaces for undersampling. Samples in the majority class are divided into many subspaces by a hashing method. Each subspace corresponds to a training subset which consists of most of the samples from this subspace and a few samples from surrounding subspaces. These training subsets are used to train an ensemble of classification and regression tree classifiers with all minority class samples. The proposed method is tested on 25 UCI datasets against state-of-the-art methods. Experimental results show that the HUE outperforms other methods and yields good results on highly imbalanced datasets.
|
174
|
Karthik VSS, Mishra A, Reddy US. Credit Card Fraud Detection by Modelling Behaviour Pattern using Hybrid Ensemble Model. ARABIAN JOURNAL FOR SCIENCE AND ENGINEERING 2022. [DOI: 10.1007/s13369-021-06147-9] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/29/2022]
|
175
|
Li J, Tao Y, Cong H, Zhu E, Cai T. Predicting liver cancers using skewed epidemiological data. Artif Intell Med 2022; 124:102234. [DOI: 10.1016/j.artmed.2021.102234] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 12/14/2020] [Revised: 11/26/2021] [Accepted: 12/21/2021] [Indexed: 01/04/2023]
|
176
|
Binary imbalanced big data classification based on fuzzy data reduction and classifier fusion. Soft comput 2022. [DOI: 10.1007/s00500-021-06654-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/19/2022]
|
177
|
Design of an ATC Tool for Conflict Detection Based on Machine Learning Techniques. AEROSPACE 2022. [DOI: 10.3390/aerospace9020067] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Indexed: 02/01/2023]
Abstract
Given the ongoing interest in the application of Machine Learning (ML) techniques, the development of new Air Traffic Control (ATC) tools is paramount for the improvement of the management of the air transport system. This article develops an ATC tool based on ML techniques for conflict detection. The methodology develops a data-driven approach that predicts separation infringements between aircraft within airspace. The methodology exploits two different ML algorithms: classification and regression. Classification algorithms denote aircraft pairs as a Situation of Interest (SI), i.e., when two aircraft are predicted to cross with a separation lower than 10 Nautical Miles (NM) and 1000 feet. Regression algorithms predict the minimum separation expected between an aircraft pair. This data-driven approach extracts ADS-B trajectories from the OpenSky Network. In addition, the historical ADS-B trajectories work as 4D trajectory predictions to be used as inputs for the database. Conflict and SI are simulated by performing temporary modifications to ensure that the aircraft pierces into the airspace in the same time period. The methodology is applied to Switzerland’s airspace. The results show that the ML algorithms could perform conflict prediction with high-accuracy metrics: 99% for SI classification and 1.5 NM for RMSE.
|
178
|
Yousef R, Gupta G, Yousef N, Khari M. A holistic overview of deep learning approach in medical imaging. MULTIMEDIA SYSTEMS 2022; 28:881-914. [PMID: 35079207 PMCID: PMC8776556 DOI: 10.1007/s00530-021-00884-5] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Received: 08/09/2021] [Accepted: 12/23/2021] [Indexed: 05/07/2023]
Abstract
Medical images are a rich source of invaluable information used by clinicians. Recent technologies have introduced many advancements for exploiting this information and using it to generate better analysis. Deep learning (DL) techniques have been applied to medical image analysis in computer-assisted imaging contexts, offering many solutions and improvements for radiologists and other specialists who analyze these images. In this paper, we present a survey of DL techniques used for a variety of tasks across different medical imaging modalities, to provide a critical review of recent developments in this direction. We have organized our paper to present the main traits and concepts of deep learning, which is in turn helpful for non-experts in the medical community. Then, we present several applications of deep learning (e.g., segmentation, classification, detection) which are commonly used for clinical purposes at different anatomical sites, and we also present the key terms for DL attributes such as basic architecture, data augmentation, transfer learning, and feature selection methods. Medical images as inputs to deep learning architectures will be mainstream in the coming years, and novel DL techniques are predicted to be the core of medical image analysis. We conclude our paper by addressing some research challenges and the solutions suggested for them in the literature, as well as future promises and directions for further development.
Affiliation(s)
- Rammah Yousef
  - Yogananda School of AI Computer and Data Sciences, Shoolini University, Solan, 173229 Himachal Pradesh, India
- Gaurav Gupta
  - Yogananda School of AI Computer and Data Sciences, Shoolini University, Solan, 173229 Himachal Pradesh, India
- Nabhan Yousef
  - Electronics and Communication Engineering, Marwadi University, Rajkot, Gujarat, India
- Manju Khari
  - Jawaharlal Nehru University, New Delhi, India
|
179
|
Yi X, Xu Y, Hu Q, Krishnamoorthy S, Li W, Tang Z. ASN-SMOTE: a synthetic minority oversampling method with adaptive qualified synthesizer selection. COMPLEX INTELL SYST 2022. [DOI: 10.1007/s40747-021-00638-w] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 10/19/2022]
Abstract
Oversampling is a promising preprocessing technique for imbalanced datasets which generates new minority instances to balance the dataset. However, improperly generated minority instances, i.e., noise instances, may interfere with the learning of the classifier and impact it negatively. Given this, in this paper, we propose a simple and effective oversampling approach known as ASN-SMOTE based on the k-nearest neighbors and the synthetic minority oversampling technique (SMOTE). ASN-SMOTE first filters noise in the minority class by determining whether the nearest neighbor of each minority instance belongs to the minority or majority class. After that, ASN-SMOTE uses the nearest majority instance of each minority instance to effectively perceive the decision boundary, inside which the qualified minority instances are selected adaptively for each minority instance by the proposed adaptive neighbor selection scheme to synthesize new minority instances. To substantiate its effectiveness, ASN-SMOTE has been applied to three different classifiers and comprehensive experiments have been conducted on 24 imbalanced benchmark datasets. ASN-SMOTE is also extensively compared with nine notable oversampling algorithms. The results show that ASN-SMOTE achieves the best results in the majority of datasets. The ASN-SMOTE implementation is available at: https://www.github.com/yixinkai123/ASN-SMOTE/.
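The filtering and synthesis steps described in this abstract can be sketched roughly as follows. This is a simplified reading of the abstract only, not the authors' implementation (which is available at the repository linked above); the function names, the `k` parameter, and the exact rule used to pick qualified neighbours are assumptions made for illustration.

```python
import math
import random


def filter_noise(minority, majority):
    """Keep minority points whose nearest other point is also a minority point."""
    kept = []
    for i, p in enumerate(minority):
        d_min = min((math.dist(p, q) for j, q in enumerate(minority) if j != i),
                    default=math.inf)
        d_maj = min((math.dist(p, q) for q in majority), default=math.inf)
        if d_min <= d_maj:  # nearest neighbour belongs to the minority class
            kept.append(p)
    return kept


def qualified_neighbors(p, minority, majority, k=5):
    """Among the k nearest minority neighbours of p, keep those that are
    closer to p than p's nearest majority point (assumed boundary rule)."""
    d_maj = min(math.dist(p, q) for q in majority)
    nearest = sorted((q for q in minority if q != p),
                     key=lambda q: math.dist(p, q))[:k]
    return [q for q in nearest if math.dist(p, q) < d_maj]


def synthesize(minority, majority, n_new, k=5, seed=0):
    """Create n_new synthetic minority points by SMOTE-style interpolation
    between a filtered minority point and one of its qualified neighbours."""
    rng = random.Random(seed)
    clean = filter_noise(minority, majority)
    pairs = [(p, q) for p in clean
             for q in qualified_neighbors(p, clean, majority, k)]
    new_points = []
    for _ in range(n_new if pairs else 0):
        p, q = rng.choice(pairs)
        t = rng.random()  # interpolation factor in [0, 1)
        new_points.append(tuple(a + t * (b - a) for a, b in zip(p, q)))
    return new_points
```

Points are plain tuples of floats here; a practical version would likely use a vectorised nearest-neighbour search instead of the quadratic loops.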
|
180
|
RDPVR: Random Data Partitioning with Voting Rule for Machine Learning from Class-Imbalanced Datasets. ELECTRONICS 2022. [DOI: 10.3390/electronics11020228] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Indexed: 12/22/2022]
Abstract
Since most classifiers are biased toward the dominant class, class imbalance is a challenging problem in machine learning. The most popular approaches to solving this problem include oversampling minority examples and undersampling majority examples. Oversampling may increase the probability of overfitting, whereas undersampling eliminates examples that may be crucial to the learning process. We present a linear time resampling method based on random data partitioning and a majority voting rule to address both concerns, where an imbalanced dataset is partitioned into a number of small subdatasets, each of which must be class balanced. After that, a specific classifier is trained for each subdataset, and the final classification result is established by applying the majority voting rule to the results of all of the trained models. We compared the performance of the proposed method to some of the most well-known oversampling and undersampling methods, employing a range of classifiers, on 33 benchmark machine learning class-imbalanced datasets. The classification results produced by the classifiers employed on the generated data by the proposed method were comparable to most of the resampling methods tested, with the exception of SMOTEFUNA, which is an oversampling method that increases the probability of overfitting. The proposed method produced results that were comparable to the Easy Ensemble (EE) undersampling method. As a result, for solving the challenge of machine learning from class-imbalanced datasets, we advocate using either EE or our method.
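The partition-and-vote idea in this abstract can be illustrated with a minimal sketch. The abstract does not say how each subdataset is balanced or which base classifier is used, so two assumptions are made here: every subdataset reuses all minority examples alongside an equally sized random slice of the majority class, and a nearest-centroid rule stands in for the base classifier. All function names are hypothetical.

```python
import random
from collections import Counter


def balanced_partitions(X, y, minority_label, seed=0):
    """Split the shuffled majority class into slices the size of the minority
    class and pair each slice with all minority examples (assumed scheme).
    Leftover majority examples that do not fill a slice are dropped."""
    rng = random.Random(seed)
    minority = [x for x, lab in zip(X, y) if lab == minority_label]
    majority = [(x, lab) for x, lab in zip(X, y) if lab != minority_label]
    rng.shuffle(majority)
    size = len(minority)
    parts = []
    for start in range(0, len(majority) - size + 1, size):
        chunk = majority[start:start + size]
        xs = minority + [x for x, _ in chunk]
        ys = [minority_label] * size + [lab for _, lab in chunk]
        parts.append((xs, ys))
    return parts


def fit_centroids(xs, ys):
    """Stand-in base classifier: one mean vector per class."""
    counts = Counter(ys)
    sums = {}
    for x, lab in zip(xs, ys):
        acc = sums.setdefault(lab, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}


def predict_one(centroids, x):
    """Return the label of the closest class centroid."""
    return min(centroids,
               key=lambda lab: sum((a - b) ** 2 for a, b in zip(x, centroids[lab])))


def fit_ensemble(X, y, minority_label, seed=0):
    """Train one model per balanced subdataset."""
    return [fit_centroids(xs, ys)
            for xs, ys in balanced_partitions(X, y, minority_label, seed)]


def predict(ensemble, x):
    """Majority vote over the per-partition models."""
    votes = Counter(predict_one(model, x) for model in ensemble)
    return votes.most_common(1)[0][0]
```

With nine majority and three minority examples, for instance, this yields three subdatasets of six examples each, and `predict` returns whichever label most of the three models choose.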
|
181
|
Ni Q, Fan Z, Zhang L, Zhang B, Zheng X, Zhang Y. Daily Activity Recognition and Tremor Quantification from Accelerometer Data for Patients with Essential Tremor Using Stacked Denoising Autoencoders. INT J COMPUT INT SYS 2022. [DOI: 10.1007/s44196-021-00052-7] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 11/28/2022] Open
Abstract
Human activity recognition (HAR) has received increasing attention and can play an important role in many fields, such as healthcare and the intelligent home. In this paper, we discuss an application of activity recognition in the healthcare field. Essential tremor (ET) is a common neurological disorder that causes involuntary tremor. At present, the disease is easily misdiagnosed as other diseases. We combine essential tremor assessment and activity recognition to recognize ET patients’ activities and evaluate the degree of ET, providing an auxiliary analysis for disease diagnosis by utilizing a stacked denoising autoencoder (SDAE) model. It is difficult for the model to learn enough useful features because the behavior dataset from ET patients is small. Thus, resampling techniques are proposed to alleviate the small sample size and imbalanced samples problems. In our experiment, acceleration data were collected from 20 patients with ET and 5 healthy people for activity recognition. The SDAE model achieved an overall accuracy of 93.33% on ET patients’ activity recognition. The model was also used to evaluate the degree of ET and achieved an accuracy of 95.74%. According to a set of experiments, the model performs well on both ET patients’ activity recognition and tremor degree assessment.
|
182
|
Islam A, Belhaouari SB, Rehman AU, Bensmail H. KNNOR: An oversampling technique for imbalanced datasets. Appl Soft Comput 2022. [DOI: 10.1016/j.asoc.2021.108288] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Indexed: 11/02/2022]
|
183
|
Qiu WR, Guan MY, Wang QK, Lou LL, Xiao X. Identifying Pupylation Proteins and Sites by Incorporating Multiple Methods. Front Endocrinol (Lausanne) 2022; 13:849549. [PMID: 35557849 PMCID: PMC9088680 DOI: 10.3389/fendo.2022.849549] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 01/06/2022] [Accepted: 03/07/2022] [Indexed: 11/20/2022] Open
Abstract
Pupylation is an important posttranslational modification in proteins and plays a key role in the cell function of microorganisms; an accurate prediction of pupylation proteins and specified sites is of great significance for the study of basic biological processes and the development of related drugs, since it would greatly save experimental costs and improve work efficiency. In this work, we first constructed a model for identifying pupylation proteins. To improve the pupylation protein prediction model, the KNN scoring matrix model based on functional domain GO annotation and the Word Embedding model were used to extract the features, and Random Under-sampling (RUS) and Synthetic Minority Over-sampling Technique (SMOTE) were applied to balance the dataset. Finally, the balanced data sets were input into Extreme Gradient Boosting (XGBoost). The performance of 10-fold cross-validation shows that accuracy (ACC), Matthews correlation coefficient (MCC), and area under the ROC curve (AUC) are 95.23%, 0.8100, and 0.9864, respectively. For the pupylation site prediction model, six feature extraction codes (i.e., TPC, AAI, One-hot, PseAAC, CKSAAP, and Word Embedding) served to extract protein sequence features, and the chi-square test was employed for feature selection. Rigorous 10-fold cross-validations indicated that the accuracies are very high and that the model outperformed its existing counterparts. Finally, for the convenience of researchers, PUP-PS-Fuse has been established at https://bioinfo.jcu.edu.cn/PUP-PS-Fuse and, as a backup, at http://121.36.221.79/PUP-PS-Fuse/.
Affiliation(s)
- Xuan Xiao
  - *Correspondence: Wang-Ren Qiu; Xuan Xiao
|
184
|
Jiang EP. A Hybrid Learning Framework for Imbalanced Classification. INTERNATIONAL JOURNAL OF INTELLIGENT INFORMATION TECHNOLOGIES 2022. [DOI: 10.4018/ijiit.306967] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/09/2022]
Abstract
Class imbalance is a well-known and challenging algorithmic research topic among the machine learning community as traditional classifiers generally perform poorly on imbalanced problems, where data to be learned have skewed distributions between their classes. This paper presents a hybrid framework named PRUSBoost for learning imbalanced classification. It combines a selective data under-sampling procedure and a powerful boosting strategy to effectively enhance classification performance on imbalanced problems. Different from the simple random under sampling algorithm, this framework constructs the training data of the majority or negative class by using a newly developed partition based under sampling approach. Experiments on several datasets from different application domains that carry skewed class distributions have shown that the proposed framework provides a very competitive, consistent, and effective solution to imbalanced classification problems.
|
185
|
Sollee J, Tang L, Igiraneza AB, Xiao B, Bai HX, Yang L. Artificial Intelligence for Medical Image Analysis in Epilepsy. Epilepsy Res 2022; 182:106861. [DOI: 10.1016/j.eplepsyres.2022.106861] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/13/2021] [Revised: 11/18/2021] [Accepted: 01/16/2022] [Indexed: 11/16/2022]
|
186
|
Liao W, Liu P. Enhanced descriptor identification and mechanism understanding for catalytic activity using a data-driven framework: revealing the importance of interactions between elementary steps. Catal Sci Technol 2022. [DOI: 10.1039/d2cy00284a] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/21/2022]
Abstract
A data-driven framework was developed which used an ML surrogate model to extract activity-controlling descriptors from a kinetics dataset. It enhanced mechanistic understanding and predicted catalytic activities more accurately than the derivative-based method.
Affiliation(s)
- Wenjie Liao
  - Department of Chemistry, State University of New York at Stony Brook, Stony Brook, New York, 11794, USA
- Ping Liu
  - Department of Chemistry, State University of New York at Stony Brook, Stony Brook, New York, 11794, USA
  - Chemistry Division, Brookhaven National Laboratory, Upton, New York, 11973, USA
|
187
|
Hong B, Ma X, Tang W, Shen Z. Recognition of Air Passengers' Willingness to Pay for Seat Selection for Imbalanced Data Based on Improved XGBoost. INTERNATIONAL JOURNAL OF COGNITIVE INFORMATICS AND NATURAL INTELLIGENCE 2022. [DOI: 10.4018/ijcini.312249] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/06/2022]
Abstract
Passenger-paid seat selection is one of the important sources of ancillary revenue for airlines, and machine learning-based willingness-to-pay identification is of great practicality for airlines to accurately tap potential willing passengers. However, affected by periodic statistical errors, air passenger order data often has some problems such as high noise, high latitude, and unbalanced category. In view of this, this paper proposes a method for identifying air passengers' willingness to pay for seat selection based on improved XGBoost, which is improved and integrated from three stages: data, feature, and algorithm. The feasibility of the proposed multi-stage improved integration method is verified by real airline passenger dataset, and the experimental results show that the proposed improved method has better classification effect when compared with the classical six imbalance classification models, which provides a basis for accurate marketing of airline paid seat selection programs.
|
188
|
Ibarguren I, Pérez JM, Muguerza J, Arbelaitz O, Yera A. PCTBagging: From inner ensembles to ensembles. A trade-off between discriminating capacity and interpretability. Inf Sci (N Y) 2022. [DOI: 10.1016/j.ins.2021.11.010] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/05/2022]
|
189
|
Mayabadi S, Saadatfar H. Two density-based sampling approaches for imbalanced and overlapping data. Knowl Based Syst 2022. [DOI: 10.1016/j.knosys.2022.108217] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 01/21/2023]
|
190
|
Razzaq M, Clément F, Yvinec R. An overview of deep learning applications in precocious puberty and thyroid dysfunction. Front Endocrinol (Lausanne) 2022; 13:959546. [PMID: 36339395 PMCID: PMC9632447 DOI: 10.3389/fendo.2022.959546] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/01/2022] [Accepted: 09/16/2022] [Indexed: 11/24/2022] Open
Abstract
In the last decade, deep learning methods have garnered a great deal of attention in endocrinology research. In this article, we provide a summary of current deep learning applications in endocrine disorders caused by either precocious onset of adult hormone or abnormal amount of hormone production. To give access to the broader audience, we start with a gentle introduction to deep learning and its most commonly used architectures, and then we focus on the research trends of deep learning applications in thyroid dysfunction classification and precocious puberty diagnosis. We highlight the strengths and weaknesses of various approaches and discuss potential solutions to different challenges. We also go through the practical considerations useful for choosing (and building) the deep learning model, as well as for understanding the thought process behind different decisions made by these models. Finally, we give concluding remarks and future directions.
Affiliation(s)
- Misbah Razzaq
  - PRC, INRAE, CNRS, Université de Tours, Nouzilly, France
  - *Correspondence: Misbah Razzaq
- Frédérique Clément
  - Université Paris-Saclay, Inria, Centre Inria de Saclay, Palaiseau, France
- Romain Yvinec
  - PRC, INRAE, CNRS, Université de Tours, Nouzilly, France
  - Université Paris-Saclay, Inria, Centre Inria de Saclay, Palaiseau, France
|
191
|
Hu R, Gan J, Zhu X, Liu T, Shi X. Multi-task multi-modality SVM for early COVID-19 Diagnosis using chest CT data. Inf Process Manag 2022; 59:102782. [PMID: 34629687 PMCID: PMC8487772 DOI: 10.1016/j.ipm.2021.102782] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Received: 07/05/2021] [Revised: 09/17/2021] [Accepted: 09/23/2021] [Indexed: 01/08/2023]
Abstract
In the early diagnosis of the Coronavirus disease (COVID-19), it is of great importance for either distinguishing severe cases from mild cases or predicting the conversion time that mild cases would possibly convert to severe cases. This study investigates both of them in a unified framework by exploring the problems such as slight appearance difference between mild cases and severe cases, the interpretability, the High Dimension and Low Sample Size (HDLSS) data, and the class imbalance. To this end, the proposed framework includes three steps: (1) feature extraction which first conducts the hierarchical segmentation on the chest Computed Tomography (CT) image data and then extracts multi-modality handcrafted features for each segment, aiming at capturing the slight appearance difference from different perspectives; (2) data augmentation which employs the over-sampling technique to augment the number of samples corresponding to the minority classes, aiming at investigating the class imbalance problem; and (3) joint construction of classification and regression by proposing a novel Multi-task Multi-modality Support Vector Machine (MM-SVM) method to solve the issue of the HDLSS data and achieve the interpretability. Experimental analysis on two synthetic and one real COVID-19 data set demonstrated that our proposed framework outperformed six state-of-the-art methods in terms of binary classification and regression performance.
Affiliation(s)
- Rongyao Hu
  - School of Computer Science and Technology, University of Electronic Science and Technology of China, Chengdu 611731, China
  - Massey University Albany Campus, Auckland 0745, New Zealand
- Jiangzhang Gan
  - School of Computer Science and Technology, University of Electronic Science and Technology of China, Chengdu 611731, China
  - Massey University Albany Campus, Auckland 0745, New Zealand
- Xiaofeng Zhu
  - School of Computer Science and Technology, University of Electronic Science and Technology of China, Chengdu 611731, China
  - Massey University Albany Campus, Auckland 0745, New Zealand
- Tong Liu
  - Massey University Albany Campus, Auckland 0745, New Zealand
- Xiaoshuang Shi
  - School of Computer Science and Technology, University of Electronic Science and Technology of China, Chengdu 611731, China
|
192
|
Ismail H, Serhani MA, Hussien N, Elabyad R, Navaz A. Public wellbeing analytics framework using social media chatter data. SOCIAL NETWORK ANALYSIS AND MINING 2022; 12:163. [PMID: 36345490 PMCID: PMC9630074 DOI: 10.1007/s13278-022-00987-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/07/2022] [Revised: 09/28/2022] [Accepted: 10/13/2022] [Indexed: 11/05/2022]
Abstract
Public wellbeing has always been crucial. Many governments around the globe prioritize the impact of their decisions on public wellbeing. In this paper, we propose an end-to-end public wellbeing analytics framework designed to predict the public’s wellbeing status and infer insights through the continuous analysis of social media content over several temporal events and across several locations. The proposed framework implements a novel distant supervision approach designed specifically to generate wellbeing-labeled datasets. In addition, it implements a wellbeing prediction model trained on contextualized sentence embeddings using BERT. Wellbeing predictions are visualized using several spatiotemporal analytics that can support decision-makers in gauging the impact of several government decisions and temporal events on the public, aiding in improving the decision-making process. Empirical experiments evaluate the effectiveness of the proposed distant supervision approach, the prediction model, and the utility of the produced analytics in gauging the public wellbeing status in a specific context.
Affiliation(s)
- Heba Ismail
  - College of Engineering, Abu Dhabi University, Abu Dhabi, UAE (GRID: grid.444459.c; ISNI: 0000 0004 1762 9315)
- M. Adel Serhani
  - College of IT, United Arab Emirates University, Al Ain, UAE (GRID: grid.43519.3a; ISNI: 0000 0001 2193 6666)
- Nada Hussien
  - College of Engineering, Abu Dhabi University, Abu Dhabi, UAE (GRID: grid.444459.c; ISNI: 0000 0004 1762 9315)
- Rawan Elabyad
  - College of Engineering, Abu Dhabi University, Abu Dhabi, UAE (GRID: grid.444459.c; ISNI: 0000 0004 1762 9315)
- Alramzana Navaz
  - College of IT, United Arab Emirates University, Al Ain, UAE (GRID: grid.43519.3a; ISNI: 0000 0001 2193 6666)
|
193
|
|
194
|
Wong SY, Ye X, Guo F, Goh HH. Computational intelligence for preventive maintenance of power transformers. Appl Soft Comput 2022. [DOI: 10.1016/j.asoc.2021.108129] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 11/02/2022]
|
195
|
Pes B, Lai G. Cost-sensitive learning strategies for high-dimensional and imbalanced data: a comparative study. PeerJ Comput Sci 2021; 7:e832. [PMID: 35036539 PMCID: PMC8725666 DOI: 10.7717/peerj-cs.832] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 09/15/2021] [Accepted: 12/06/2021] [Indexed: 05/28/2023]
Abstract
High dimensionality and class imbalance have been largely recognized as important issues in machine learning. A vast amount of literature has indeed investigated suitable approaches to address the multiple challenges that arise when dealing with high-dimensional feature spaces (where each problem instance is described by a large number of features). As well, several learning strategies have been devised to cope with the adverse effects of imbalanced class distributions, which may severely impact on the generalization ability of the induced models. Nevertheless, although both the issues have been largely studied for several years, they have mostly been addressed separately, and their combined effects are yet to be fully understood. Indeed, little research has been so far conducted to investigate which approaches might be best suited to deal with datasets that are, at the same time, high-dimensional and class-imbalanced. To make a contribution in this direction, our work presents a comparative study among different learning strategies that leverage both feature selection, to cope with high dimensionality, as well as cost-sensitive learning methods, to cope with class imbalance. Specifically, different ways of incorporating misclassification costs into the learning process have been explored. Also different feature selection heuristics have been considered, both univariate and multivariate, to comparatively evaluate their effectiveness on imbalanced data. The experiments have been conducted on three challenging benchmarks from the genomic domain, gaining interesting insight into the beneficial impact of combining feature selection and cost-sensitive learning, especially in the presence of highly skewed data distributions.
Affiliation(s)
- Barbara Pes
  - Dipartimento di Matematica e Informatica, Università degli Studi di Cagliari, Cagliari, Italy
- Giuseppina Lai
  - Dipartimento di Matematica e Informatica, Università degli Studi di Cagliari, Cagliari, Italy
|
196
|
Apicella A, Arpaia P, Giugliano S, Mastrati G, Moccaldi N. High-wearable EEG-based transducer for engagement detection in pediatric rehabilitation. BRAIN-COMPUTER INTERFACES 2021. [DOI: 10.1080/2326263x.2021.2015149] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/19/2022]
Affiliation(s)
- Andrea Apicella
  - Laboratory of Augmented Reality for Health Monitoring (Arhemlab), Department of Electrical Engineering and Information Technology, University of Naples Federico II, Naples, Italy
- Pasquale Arpaia
  - Laboratory of Augmented Reality for Health Monitoring (Arhemlab), Department of Electrical Engineering and Information Technology, University of Naples Federico II, Naples, Italy
- Salvatore Giugliano
  - Laboratory of Augmented Reality for Health Monitoring (Arhemlab), Department of Electrical Engineering and Information Technology, University of Naples Federico II, Naples, Italy
- Giovanna Mastrati
  - Laboratory of Augmented Reality for Health Monitoring (Arhemlab), Department of Electrical Engineering and Information Technology, University of Naples Federico II, Naples, Italy
- Nicola Moccaldi
  - Laboratory of Augmented Reality for Health Monitoring (Arhemlab), Department of Electrical Engineering and Information Technology, University of Naples Federico II, Naples, Italy
|
197
|
Wang Z, Jia P, Xu X, Wang B, Zhu Y, Li D. Sample and feature selecting based ensemble learning for imbalanced problems. Appl Soft Comput 2021. [DOI: 10.1016/j.asoc.2021.107884] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/20/2022]
|
198
|
An efficient chaotic salp swarm optimization approach based on ensemble algorithm for class imbalance problems. Soft comput 2021. [DOI: 10.1007/s00500-021-06080-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/20/2022]
|
199
|
Comparing machine learning to a rule-based approach for predicting suicidal behavior among adolescents: Results from a longitudinal population-based survey. J Affect Disord 2021; 295:1415-1420. [PMID: 34620490 DOI: 10.1016/j.jad.2021.09.018] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 02/09/2021] [Revised: 07/22/2021] [Accepted: 09/12/2021] [Indexed: 11/21/2022]
Abstract
INTRODUCTION: Suicidal thoughts and suicide attempts are among the most prominent public health concerns in adolescents; early detection is therefore important for initiating preventive interventions and closer monitoring. METHOD: We examined whether the machine learning models Random Forest and Lasso Regression predict future suicidal behavior better than a simple decision rule that classifies every adolescent with a history of suicide ideation at baseline as at risk (current practice). We used data from a general population of students in the second and fourth year of secondary education in Amsterdam, the Netherlands. RESULTS: Both the Random Forest and the Lasso Regression resulted in slightly better prediction. The AUCs of the Random Forest (0.79) and the Lasso Regression (0.76) were both higher than the AUC of the decision rule (0.64). The Random Forest achieved slightly (but non-significantly) higher sensitivity than the decision rule (0.37 versus 0.34), with the same specificity (0.94). With Lasso Regression the sensitivity increased significantly (0.52), but at the expense of specificity (0.85). LIMITATIONS: Limitations include the loss of cases after merging the data, the use of self-reported data, confidential data collection, and the use of only four questions to measure suicidal behavior. CONCLUSIONS: This is the first study applying machine learning techniques to predict future suicidal behavior on survey data collected in a general population of adolescents. Our study showed that integrating machine learning techniques in screening practice will result in a small improvement in the ability to predict suicide. The models need to be further optimized to improve accuracy.
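The sensitivity and specificity figures reported above are ratios of confusion-matrix counts. The following minimal sketch shows how the two measures are computed; the counts are invented for illustration and are not the study's data.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Fraction of truly positive cases that the model flags (true positive rate)."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Fraction of truly negative cases that the model clears (true negative rate)."""
    return tn / (tn + fp)


# Hypothetical confusion-matrix counts for a screening model.
tp, fn = 40, 60    # at-risk cases: detected, missed
tn, fp = 180, 20   # not-at-risk cases: correctly cleared, falsely flagged

sens = sensitivity(tp, fn)  # 40 / 100
spec = specificity(tn, fp)  # 180 / 200
```

Lowering a model's decision threshold typically moves cases from `fn` to `tp` and from `tn` to `fp`, which is the kind of trade-off the abstract describes: higher sensitivity at the expense of specificity.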
|
200
|
Lin E, Lin CH, Lane HY. Machine Learning and Deep Learning for the Pharmacogenomics of Antidepressant Treatments. CLINICAL PSYCHOPHARMACOLOGY AND NEUROSCIENCE : THE OFFICIAL SCIENTIFIC JOURNAL OF THE KOREAN COLLEGE OF NEUROPSYCHOPHARMACOLOGY 2021; 19:577-588. [PMID: 34690113 PMCID: PMC8553527 DOI: 10.9758/cpn.2021.19.4.577] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 02/09/2021] [Accepted: 04/10/2021] [Indexed: 12/31/2022]
Abstract
A growing body of evidence suggests that machine learning and deep learning techniques can serve as a vital foundation for the pharmacogenomics of antidepressant treatments in patients with major depressive disorder (MDD). In this review, we focus on the latest developments in pharmacogenomics research using machine learning and deep learning approaches together with neuroimaging and multi-omics data. First, we review relevant pharmacogenomics studies that leverage machine learning and deep learning techniques to predict treatment outcomes and identify potential biomarkers for antidepressant treatments in MDD. In addition, we describe neuroimaging pharmacogenomics studies that use various machine learning approaches to predict antidepressant treatment outcomes in MDD by integrating pharmacogenomics and neuroimaging research. Moreover, we summarize the limitations of past pharmacogenomics studies of antidepressant treatments in MDD. Finally, we discuss challenges and directions for future research. In light of the latest advancements in neuroimaging and multi-omics, various genomic variants and biomarkers associated with antidepressant treatments in MDD are being identified in pharmacogenomics research by employing machine learning and deep learning algorithms.
Affiliation(s)
- Eugene Lin
  - Department of Biostatistics, University of Washington, Seattle, WA, USA
  - Department of Electrical & Computer Engineering, University of Washington, Seattle, WA, USA
  - Graduate Institute of Biomedical Sciences, China Medical University, Taichung, Taiwan
- Chieh-Hsin Lin
  - Graduate Institute of Biomedical Sciences, China Medical University, Taichung, Taiwan
  - Department of Psychiatry, Kaohsiung Chang Gung Memorial Hospital, Chang Gung University College of Medicine, Kaohsiung, Taiwan
  - School of Medicine, Chang Gung University, Taoyuan, Taiwan
- Hsien-Yuan Lane
  - Graduate Institute of Biomedical Sciences, China Medical University, Taichung, Taiwan
  - Department of Psychiatry, China Medical University Hospital, Taichung, Taiwan
  - Department of Brain Disease Research Center, China Medical University Hospital, Taichung, Taiwan
  - Department of Psychology, College of Medical and Health Sciences, Asia University, Taichung, Taiwan
|