251
Esposito C, Landrum GA, Schneider N, Stiefl N, Riniker S. GHOST: Adjusting the Decision Threshold to Handle Imbalanced Data in Machine Learning. J Chem Inf Model 2021; 61:2623-2640. [PMID: 34100609 DOI: 10.1021/acs.jcim.1c00160] [Citation(s) in RCA: 46] [Impact Index Per Article: 11.5] [Indexed: 12/21/2022]
Abstract
Machine learning classifiers trained on class imbalanced data are prone to overpredict the majority class. This leads to a larger misclassification rate for the minority class, which in many real-world applications is the class of interest. For binary data, the classification threshold is set by default to 0.5 which, however, is often not ideal for imbalanced data. Adjusting the decision threshold is a good strategy to deal with the class imbalance problem. In this work, we present two different automated procedures for the selection of the optimal decision threshold for imbalanced classification. A major advantage of our procedures is that they do not require retraining of the machine learning models or resampling of the training data. The first approach is specific for random forest (RF), while the second approach, named GHOST, can be potentially applied to any machine learning classifier. We tested these procedures on 138 public drug discovery data sets containing structure-activity data for a variety of pharmaceutical targets. We show that both thresholding methods improve significantly the performance of RF. We tested the use of GHOST with four different classifiers in combination with two molecular descriptors, and we found that most classifiers benefit from threshold optimization. GHOST also outperformed other strategies, including random undersampling and conformal prediction. Finally, we show that our thresholding procedures can be effectively applied to real-world drug discovery projects, where the imbalance and characteristics of the data vary greatly between the training and test sets.
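The idea of moving the decision threshold away from 0.5 without retraining can be illustrated with a minimal sketch. This is not the GHOST procedure from the paper, whose details are not given in the abstract; it is a simplified stand-in that assumes predicted probabilities are already available for a labeled set and picks the candidate threshold with the highest Cohen's kappa. The function names, the toy data, and the candidate grid are all hypothetical.

```python
def cohen_kappa(y_true, y_pred):
    """Cohen's kappa for binary labels encoded as 0/1."""
    n = len(y_true)
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = n - tp - tn - fp
    observed = (tp + tn) / n
    # Agreement expected by chance from the marginal label frequencies.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    if expected == 1.0:
        return 0.0
    return (observed - expected) / (1.0 - expected)


def pick_threshold(y_true, proba, candidates):
    """Return (threshold, kappa) for the best candidate; first wins ties."""
    best_t, best_k = None, float("-inf")
    for t in candidates:
        preds = [1 if p >= t else 0 for p in proba]
        k = cohen_kappa(y_true, preds)
        if k > best_k:
            best_t, best_k = t, k
    return best_t, best_k


# Hypothetical toy data: 2 minority-class samples out of 10,
# with predicted probabilities skewed toward the majority class.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
proba = [0.05, 0.10, 0.10, 0.15, 0.20, 0.20, 0.25, 0.30, 0.35, 0.45]
candidates = [0.1, 0.2, 0.3, 0.4, 0.5]

threshold, kappa = pick_threshold(y_true, proba, candidates)
default_kappa = cohen_kappa(y_true, [1 if p >= 0.5 else 0 for p in proba])
print(threshold, round(kappa, 3), default_kappa)
```

In this toy case the default threshold of 0.5 predicts the majority class for every sample, so its kappa is zero, while a lower threshold recovers both minority samples. The model itself is untouched; only the cut-off applied to its scores changes.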
Affiliation(s)
- Carmen Esposito
  - Laboratory of Physical Chemistry, ETH Zurich, Vladimir-Prelog-Weg 2, 8093 Zurich, Switzerland
- Gregory A Landrum
  - Laboratory of Physical Chemistry, ETH Zurich, Vladimir-Prelog-Weg 2, 8093 Zurich, Switzerland; T5 Informatics GmbH, Spalenring 11, 4055 Basel, Switzerland
- Nadine Schneider
  - Novartis Institutes for BioMedical Research, Novartis Pharma AG, Novartis Campus, 4002 Basel, Switzerland
- Nikolaus Stiefl
  - Novartis Institutes for BioMedical Research, Novartis Pharma AG, Novartis Campus, 4002 Basel, Switzerland
- Sereina Riniker
  - Laboratory of Physical Chemistry, ETH Zurich, Vladimir-Prelog-Weg 2, 8093 Zurich, Switzerland
252
Chennuru VK, Timmappareddy SR. Simulated annealing based undersampling (SAUS): a hybrid multi-objective optimization method to tackle class imbalance. Appl Intell 2021. [DOI: 10.1007/s10489-021-02369-4] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Indexed: 11/29/2022]
253
Fraiwan L, Hassanin O. Computer-aided identification of degenerative neuromuscular diseases based on gait dynamics and ensemble decision tree classifiers. PLoS One 2021; 16:e0252380. [PMID: 34086723 PMCID: PMC8177554 DOI: 10.1371/journal.pone.0252380] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 03/21/2021] [Accepted: 05/15/2021] [Indexed: 11/18/2022] Open
Abstract
This study proposes a reliable computer-aided framework to identify gait fluctuations associated with a wide range of degenerative neuromuscular disease (DNDs) and health conditions. Investigated DNDs included amyotrophic lateral sclerosis (ALS), Parkinson's disease (PD), and Huntington's disease (HD). We further performed a statistical and classification comparison elucidating the discriminative capability of different gait signals, including vertical ground reaction force (VGRF), stride duration, stance duration, and swing duration. Feature representation of these gait signals was based on statistical amplitude quantification using the root mean square (RMS), variance, kurtosis, and skewness metrics. We investigated various decision tree (DT) based ensemble methods such as bagging, adaptive boosting (AdaBoost), random under-sampling boosting (RUSBoost), and random subspace to tackle the challenge of multi-class classification. Experimental results showed that AdaBoost ensembling provided a 6.49%, 0.78%, 2.31%, and 2.72% prediction rate improvement for the VGRF, stride, stance, and swing signals, respectively. The proposed approach achieved the highest classification accuracy of 99.17%, sensitivity of 98.23%, and specificity of 99.43%, using the VGRF-based features and the adaptive boosting classification model. This work demonstrates the effective capability of using simple gait fluctuation analysis and machine learning approaches to detect DNDs. Computer-aided analysis of gait fluctuations provides a promising advent to enhance clinical diagnosis of DNDs.
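The statistical amplitude features named in the abstract (RMS, variance, kurtosis, skewness) can be sketched as follows. The abstract does not say which estimator conventions the authors used, so this sketch assumes population (divide-by-n) moments and non-excess kurtosis; the function name and the sample signal are hypothetical.

```python
import math


def amplitude_features(signal):
    """RMS, variance, skewness and kurtosis of a 1-D signal.

    Assumes population moments (divide by n) and non-excess kurtosis,
    so a normal distribution would have kurtosis 3.
    """
    n = len(signal)
    mean = sum(signal) / n
    rms = math.sqrt(sum(x * x for x in signal) / n)
    m2 = sum((x - mean) ** 2 for x in signal) / n
    m3 = sum((x - mean) ** 3 for x in signal) / n
    m4 = sum((x - mean) ** 4 for x in signal) / n
    if m2 == 0:
        # A constant signal has no defined shape statistics.
        skewness = kurtosis = float("nan")
    else:
        skewness = m3 / m2 ** 1.5
        kurtosis = m4 / m2 ** 2
    return {"rms": rms, "variance": m2, "skewness": skewness, "kurtosis": kurtosis}


# Hypothetical stride-duration samples (arbitrary units).
features = amplitude_features([1.0, 2.0, 3.0, 4.0, 5.0])
print(features)
```

Each gait signal would be reduced to a feature vector of this kind before being passed to a classifier.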
Affiliation(s)
- Luay Fraiwan
  - Department of Electrical and Computer Engineering, Abu Dhabi University, Abu Dhabi, UAE
  - Department of Biomedical Engineering, Jordan University of Science and Technology, Irbid, Jordan
- Omnia Hassanin
  - Department of Electrical and Computer Engineering, Abu Dhabi University, Abu Dhabi, UAE
254
Mosquera-Lopez C, Wan E, Shastry M, Folsom J, Leitschuh J, Condon J, Rajhbeharrysingh U, Hildebrand A, Cameron M, Jacobs PG. Automated Detection of Real-World Falls: Modeled From People With Multiple Sclerosis. IEEE J Biomed Health Inform 2021; 25:1975-1984. [PMID: 33245698 DOI: 10.1109/jbhi.2020.3041035] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Indexed: 11/07/2022]
Abstract
Falls are a major health problem with one in three people over the age of 65 falling each year, oftentimes causing hip fractures, disability, reduced mobility, hospitalization and death. A major limitation in fall detection algorithm development is an absence of real-world falls data. Fall detection algorithms are typically trained on simulated fall data that contain a well-balanced number of examples of falls and activities of daily living. However, real-world falls occur infrequently, making them difficult to capture and causing severe data imbalance. People with multiple sclerosis (MS) fall frequently, and their risk of falling increases with disease progression. Because of their high fall incidence, people with MS provide an ideal model for studying falls. This paper describes the development of a context-aware fall detection system based on inertial sensors and time of flight sensors that is robust to imbalance, which is trained and evaluated on real-world falls in people with MS. The algorithm uses an auto-encoder that detects fall candidates using reconstruction error of accelerometer signals followed by a hyper-ensemble of balanced random forests trained using both acceleration and movement features. On a clinical dataset obtained from 25 people with MS monitored over eight weeks during free-living conditions, 54 falls were observed and our system achieved a sensitivity of 92.14%, and false-positive rate of 0.65 false alarms per day.
255
Ng WWY, Zhang Y, Zhang J, Wang DD, Wang FL. Stochastic Sensitivity Tree Boosting for Imbalanced Prediction Problems of Protein-Ligand Interaction Sites. IEEE Transactions on Emerging Topics in Computational Intelligence 2021. [DOI: 10.1109/tetci.2019.2922340] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/09/2022]
256
Yang D, Tan K, Huang Z, Li X, Chen B, Ren G, Xiao W. An automatic method for removing empty camera trap images using ensemble learning. Ecol Evol 2021; 11:7591-7601. [PMID: 34188837 PMCID: PMC8216933 DOI: 10.1002/ece3.7591] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 01/24/2021] [Revised: 03/26/2021] [Accepted: 03/30/2021] [Indexed: 11/10/2022] Open
Abstract
Camera traps often produce massive images, and empty images that do not contain animals are usually overwhelming. Deep learning is a machine-learning algorithm and widely used to identify empty camera trap images automatically. Existing methods with high accuracy are based on millions of training samples (images) and require a lot of time and personnel costs to label the training samples manually. Reducing the number of training samples can save the cost of manually labeling images. However, the deep learning models based on a small dataset produce a large omission error of animal images that many animal images tend to be identified as empty images, which may lead to loss of the opportunities of discovering and observing species. Therefore, it is still a challenge to build the DCNN model with small errors on a small dataset. Using deep convolutional neural networks and a small-size dataset, we proposed an ensemble learning approach based on conservative strategies to identify and remove empty images automatically. Furthermore, we proposed three automatic identifying schemes of empty images for users who accept different omission errors of animal images. Our experimental results showed that these three schemes automatically identified and removed 50.78%, 58.48%, and 77.51% of the empty images in the dataset when the omission errors were 0.70%, 1.13%, and 2.54%, respectively. The analysis showed that using our scheme to automatically identify empty images did not omit species information. It only slightly changed the frequency of species occurrence. When only a small dataset was available, our approach provided an alternative to users to automatically identify and remove empty images, which can significantly reduce the time and personnel costs required to manually remove empty images. The cost savings were comparable to the percentage of empty images removed by models.
Affiliation(s)
- Deng‐Qi Yang
  - Department of Mathematics and Computer Science, Dali University, Dali, China
  - Institute of Eastern‐Himalaya Biodiversity Research, Dali University, Dali, China
  - Collaborative Innovation Center for the Biodiversity in the Three Parallel Rivers of China, Dali, China
  - Data Security and Application Innovation Team, Dali University, Dali, China
- Kun Tan
  - Institute of Eastern‐Himalaya Biodiversity Research, Dali University, Dali, China
  - Collaborative Innovation Center for the Biodiversity in the Three Parallel Rivers of China, Dali, China
- Zhi‐Pang Huang
  - Institute of Eastern‐Himalaya Biodiversity Research, Dali University, Dali, China
  - Collaborative Innovation Center for the Biodiversity in the Three Parallel Rivers of China, Dali, China
- Xiao‐Wei Li
  - Department of Mathematics and Computer Science, Dali University, Dali, China
  - Data Security and Application Innovation Team, Dali University, Dali, China
- Ben‐Hui Chen
  - Department of Mathematics and Computer Science, Dali University, Dali, China
  - Data Security and Application Innovation Team, Dali University, Dali, China
- Guo‐Peng Ren
  - Institute of Eastern‐Himalaya Biodiversity Research, Dali University, Dali, China
  - Collaborative Innovation Center for the Biodiversity in the Three Parallel Rivers of China, Dali, China
- Wen Xiao
  - Institute of Eastern‐Himalaya Biodiversity Research, Dali University, Dali, China
  - Collaborative Innovation Center for the Biodiversity in the Three Parallel Rivers of China, Dali, China
257
Mallick C, Das AK, Nayak J, Pelusi D, Vimal S. Evolutionary Algorithm based Ensemble Extractive Summarization for Developing Smart Medical System. Interdiscip Sci 2021; 13:229-259. [PMID: 33576956 DOI: 10.1007/s12539-020-00412-5] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 09/05/2020] [Revised: 12/17/2020] [Accepted: 12/21/2020] [Indexed: 11/25/2022]
Abstract
The amount of information in the scientific literature of the bio-medical domain is growing exponentially, which makes it difficult to develop a smart medical system. Summarization techniques help for efficient searching and understanding of relevant information from the medical documents. In the paper, an evolutionary algorithm based ensemble extractive summarization technique is devised as a smart medical application with the idea of hybrid artificial intelligence on natural language processing. We have considered the abstracts of the target article and its cited articles as the base summaries and a multi-objective evolutionary algorithm is applied for generating the ensemble summary of the target article. Each sentence of the base summaries is represented by a concept vector of the medical terms contained in it with the help of the Unified Medical Language System (UMLS) tool which is widely used in various smart medical applications. These terms carry the key information of the sentence which is very useful to find out the semantic similarity among the sentences. Fitness functions of the evolutionary algorithm are mainly defined using clustering coefficient and sparsity index, the concepts of graph theory. After the convergence of the algorithm, the best solution of the final population gives the ensemble summary. Next, the semantic similarity of each sentence in the target article with the ensemble summary is calculated and the sentences which are most similar to the ensemble summary are considered as the summary of the target article. The method is applied to the articles available in the PubMed MEDLINE database system and experimental results are compared with some state of the art methods applied in the Bio-medical domain.
Experimental results and comparative study based on the performance evaluation show that the method competes with some recently proposed summarization methods and outperforms others, which express the effectiveness of the proposed methodology. Different statistical tests have also been made to observe that the method is statistically significant.
Affiliation(s)
- Chirantana Mallick
  - Department of Computer Science and Technology, Indian Institute of Engineering Science and Technology, Shibpur, Howrah, 711103, India
- Asit Kumar Das
  - Department of Computer Science and Technology, Indian Institute of Engineering Science and Technology, Shibpur, Howrah, 711103, India
- Janmenjoy Nayak
  - Department of Computer Science and Engineering, Aditya Institute of Technology and Management (AITAM), Tekkali, Andhra Pradesh, 532201, India
- Danilo Pelusi
  - Department of Communications Sciences, University of Teramo, Teramo, Italy
- S Vimal
  - Department of Information Technology, National Engineering College, K.R.Nagar, Kovilpatti, Thoothukudi District, Tamilnadu, 628503, India
258
How to identify early defaults in online lending: A cost-sensitive multi-layer learning framework. Knowl Based Syst 2021. [DOI: 10.1016/j.knosys.2021.106963] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Indexed: 11/18/2022]
259
Alhassan Z, Watson M, Budgen D, Alshammari R, Alessa A, Al Moubayed N. Improving Current Glycated Hemoglobin Prediction in Adults: Use of Machine Learning Algorithms With Electronic Health Records. JMIR Med Inform 2021; 9:e25237. [PMID: 34028357 PMCID: PMC8185616 DOI: 10.2196/25237] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 10/23/2020] [Revised: 01/05/2021] [Accepted: 04/22/2021] [Indexed: 01/30/2023] Open
Abstract
Background: Predicting the risk of glycated hemoglobin (HbA1c) elevation can help identify patients with the potential for developing serious chronic health problems, such as diabetes. Early preventive interventions based upon advanced predictive models using electronic health records data for identifying such patients can ultimately help provide better health outcomes. Objective: Our study investigated the performance of predictive models to forecast HbA1c elevation levels by employing several machine learning models. We also examined the use of patient electronic health record longitudinal data in the performance of the predictive models. Explainable methods were employed to interpret the decisions made by the black box models. Methods: This study employed multiple logistic regression, random forest, support vector machine, and logistic regression models, as well as a deep learning model (multilayer perceptron) to classify patients with normal (<5.7%) and elevated (≥5.7%) levels of HbA1c. We also integrated current visit data with historical (longitudinal) data from previous visits. Explainable machine learning methods were used to interrogate the models and provide an understanding of the reasons behind the decisions made by the models. All models were trained and tested using a large data set from Saudi Arabia with 18,844 unique patient records. Results: The machine learning models achieved promising results for predicting current HbA1c elevation risk. When coupled with longitudinal data, the machine learning models outperformed the multiple logistic regression model used in the comparative study. The multilayer perceptron model achieved an accuracy of 83.22% for the area under receiver operating characteristic curve when used with historical data. All models showed a close level of agreement on the contribution of random blood sugar and age variables with and without longitudinal data.
Conclusions: This study shows that machine learning models can provide promising results for the task of predicting current HbA1c levels (≥5.7% or less). Using patients’ longitudinal data improved the performance and affected the relative importance for the predictors used. The models showed results that are consistent with comparable studies.
Affiliation(s)
- Zakhriya Alhassan
  - Department of Computer Science, Durham University, Durham, United Kingdom; College of Computer Science and Engineering, University of Jeddah, Jeddah, Saudi Arabia
- Matthew Watson
  - Department of Computer Science, Durham University, Durham, United Kingdom
- David Budgen
  - Department of Computer Science, Durham University, Durham, United Kingdom
- Riyad Alshammari
  - National Center for Artificial Intelligence, Saudi Data and Artificial Intelligence Authority, Riyadh, Saudi Arabia
- Ali Alessa
  - Department of Information Technology Programs, Institute of Public Administration, Riyadh, Saudi Arabia
- Noura Al Moubayed
  - Department of Computer Science, Durham University, Durham, United Kingdom
260
Ahirwal J, Nath A, Brahma B, Deb S, Sahoo UK, Nath AJ. Patterns and driving factors of biomass carbon and soil organic carbon stock in the Indian Himalayan region. The Science of the Total Environment 2021; 770:145292. [PMID: 33736385 DOI: 10.1016/j.scitotenv.2021.145292] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Received: 11/07/2020] [Revised: 01/13/2021] [Accepted: 01/14/2021] [Indexed: 06/12/2023]
Abstract
Tree-based ecosystems are critical to climate change mitigation. The study analysed carbon (C) stock patterns and examined the importance of environmental variables in predicting carbon stock in biomass and soils of the Indian Himalayan Region (IHR). We conducted a synthesis of 100 studies reporting biomass carbon stock and 67 studies on soil organic carbon (SOC) stock from four land-uses: forests, plantation, agroforest, and herbaceous ecosystem from the IHR. Machine learning techniques were used to examine the importance of various environmental variables in predicting carbon stock in biomass and soils. Despite large variations in biomass C and SOC stock (mean ± SD) within the land-uses, natural forests have the highest biomass C stock (138.5 ± 87.3 Mg C ha-1), and plantation forests exhibited the highest SOC stock (168.8 ± 74.4 Mg C ha-1) in the top 1-m of soils. The relationship between the environmental variables (altitude, latitude, precipitation, and temperature) and carbon stock was not significantly correlated. The prediction of biomass carbon and SOC stock using different machine learning techniques (Adaboost, Bagging, Random Forest, and XGBoost) shows that the XGBoost model can predict the carbon stock for the IHR closely. Our study confirms that the carbon stock in the IHR vary on a large scale due to a diverse range of land-use and ecosystems within the region. Therefore, predicting the driver of carbon stock on a single environmental variable is impossible for the entire IHR. The IHR possesses a prominent carbon sink and biodiversity pool. Therefore, its protection is essential in fulfilling India's commitment to nationally determined contributions (NDC). Our data synthesis may also provide a baseline for the precise estimation of carbon stock, which will be vital for India's National Mission for Sustaining the Himalayan Ecosystem (NMSHE).
Affiliation(s)
- Amitabha Nath
  - Department of Information Technology, North-Eastern Hill University, Shillong, India
- Biplab Brahma
  - Department of Ecology and Environmental Science, Assam University, Silchar, India
- Sourabh Deb
  - Department of Forestry and Biodiversity, Tripura University, Suryamaninagar, India
- Arun Jyoti Nath
  - Department of Ecology and Environmental Science, Assam University, Silchar, India
261
Sung SF, Hung LC, Hu YH. Developing a stroke alert trigger for clinical decision support at emergency triage using machine learning. Int J Med Inform 2021; 152:104505. [PMID: 34030088 DOI: 10.1016/j.ijmedinf.2021.104505] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 02/25/2021] [Revised: 05/01/2021] [Accepted: 05/17/2021] [Indexed: 11/19/2022]
Abstract
BACKGROUND: Acute stroke is an urgent medical condition that requires immediate assessment and treatment. Prompt identification of patients with suspected stroke at emergency department (ED) triage followed by timely activation of code stroke systems is the key to successful management of stroke. While false negative detection of stroke may prevent patients from receiving optimal treatment, excessive false positive alarms will substantially burden stroke neurologists. This study aimed to develop a stroke-alert trigger to identify patients with suspected stroke at ED triage. METHODS: Patients who arrived at the ED within 12 h of symptom onset and were suspected of a stroke or transient ischemic attack or triaged with a stroke-related symptom were included. Clinical features at ED triage were collected, including the presenting complaint, triage level, self-reported medical history (hypertension, diabetes, hyperlipidemia, heart disease, and prior stroke), vital signs, and presence of atrial fibrillation. Three rule-based algorithms, i.e., Face Arm Speech Test (FAST) and two flavors of Balance, Eyes, FAST (BE-FAST), and six machine learning (ML) techniques with various resampling methods were used to build classifiers for identification of patients with suspected stroke. Logistic regression (LR) was used to find important features. RESULTS: The study population consisted of 1361 patients. The values of area under the precision-recall curve (AUPRC) were 0.737, 0.710, and 0.562 for the FAST, BE-FAST-1, and BE-FAST-2 models, respectively. The values of AUPRC for the top three ML models were 0.787 for classification and regression tree with undersampling, 0.783 for LR with synthetic minority oversampling technique (SMOTE), and 0.782 for LR with class weighting. Among the ML models, logistic regression and random forest models in general achieved higher values of AUPRC, in particular in those with class weighting or SMOTE to handle class imbalance problem.
In addition to the presenting complaint and triage level, age, diastolic blood pressure, body temperature, and pulse rate were also important features for developing a stroke-alert trigger. CONCLUSIONS: ML techniques significantly improved the performance of prediction models for identification of patients with suspected stroke. Such ML models can be embedded in the electronic triage system for clinical decision support at ED triage.
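The abstract compares models by area under the precision-recall curve (AUPRC) but does not say how it was estimated. One common estimator is average precision, sketched below under that assumption; the function name and the toy scores are hypothetical, and the sketch assumes all scores are distinct (ties are not handled).

```python
def average_precision(y_true, scores):
    """Average precision: mean of precision values at each true positive,
    visiting samples from highest to lowest score. Assumes distinct scores."""
    total_pos = sum(y_true)
    if total_pos == 0:
        raise ValueError("average precision is undefined without positives")
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = 0
    ap = 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            tp += 1
            ap += tp / rank  # precision at this recall step
    return ap / total_pos


# Hypothetical triage scores: 2 positive cases among 5 patients.
y_true = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.2]
ap = average_precision(y_true, scores)
prevalence = sum(y_true) / len(y_true)  # AUPRC of a random ranking
print(ap, prevalence)
```

Unlike the area under the ROC curve, the chance-level value of this metric equals the prevalence of the positive class, which is why it is often preferred on imbalanced data.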
Affiliation(s)
- Sheng-Feng Sung
  - Division of Neurology, Department of Internal Medicine, Ditmanson Medical Foundation Chia-Yi Christian Hospital, Chiayi City, Taiwan; Department of Information Management and Institute of Healthcare Information Management, National Chung Cheng University, Chiayi County, Taiwan; Department of Nursing, Min-Hwei Junior College of Health Care Management, Tainan, Taiwan
- Ling-Chien Hung
  - Division of Neurology, Department of Internal Medicine, Ditmanson Medical Foundation Chia-Yi Christian Hospital, Chiayi City, Taiwan
- Ya-Han Hu
  - Department of Information Management, National Central University, Taoyuan City, Taiwan
262
A Machine Learning Classifier Improves Mortality Prediction Compared With Pediatric Logistic Organ Dysfunction-2 Score: Model Development and Validation. Crit Care Explor 2021; 3:e0426. [PMID: 34036277 PMCID: PMC8133049 DOI: 10.1097/cce.0000000000000426] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 12/11/2022] Open
Abstract
Supplemental Digital Content is available in the text. Objectives: To determine whether machine learning algorithms can better predict PICU mortality than the Pediatric Logistic Organ Dysfunction-2 score. Design: Retrospective study. Setting: Quaternary care medical-surgical PICU. Patients: All patients admitted to the PICU from 2013 to 2019. Interventions: None. Measurements and Main Results: We investigated the performance of various machine learning algorithms using the same variables used to calculate the Pediatric Logistic Organ Dysfunction-2 score to predict PICU mortality. We used 10,194 patient records from 2013 to 2017 for training and 4,043 patient records from 2018 to 2019 as a holdout validation cohort. Mortality rate was 3.0% in the training cohort and 3.4% in the validation cohort. The best performing algorithm was a random forest model (area under the receiver operating characteristic curve, 0.867 [95% CI, 0.863–0.895]; area under the precision-recall curve, 0.327 [95% CI, 0.246–0.414]; F1, 0.396 [95% CI, 0.321–0.468]) and significantly outperformed the Pediatric Logistic Organ Dysfunction-2 score (area under the receiver operating characteristic curve, 0.761 [95% CI, 0.713–0.810]; area under the precision-recall curve, 0.239 [95% CI, 0.165–0.316]; F1, 0.284 [95% CI, 0.209–0.360]), although this difference was reduced after retraining the Pediatric Logistic Organ Dysfunction-2 logistic regression model at the study institution. The random forest model also showed better calibration than the Pediatric Logistic Organ Dysfunction-2 score, and calibration of the random forest model remained superior to the retrained Pediatric Logistic Organ Dysfunction-2 model. Conclusions: A machine learning model achieved better performance than a logistic regression-based score for predicting ICU mortality.
Better estimation of mortality risk can improve our ability to adjust for severity of illness in future studies, although external validation is required before this method can be widely deployed.
263
Wang Z, Tsai CF, Lin WC. Data cleaning issues in class imbalanced datasets: instance selection and missing values imputation for one-class classifiers. Data Technologies and Applications 2021. [DOI: 10.1108/dta-01-2021-0027] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 11/17/2022]
Abstract
Purpose: Class imbalance learning, which exists in many domain problem datasets, is an important research topic in data mining and machine learning. One-class classification techniques, which aim to identify anomalies as the minority class from the normal data as the majority class, are one representative solution for class imbalanced datasets. Since one-class classifiers are trained using only normal data to create a decision boundary for later anomaly detection, the quality of the training set, i.e. the majority class, is one key factor that affects the performance of one-class classifiers. Design/methodology/approach: In this paper, we focus on two data cleaning or preprocessing methods to address class imbalanced datasets. The first method examines whether performing instance selection to remove some noisy data from the majority class can improve the performance of one-class classifiers. The second method combines instance selection and missing value imputation, where the latter is used to handle incomplete datasets that contain missing values. Findings: The experimental results are based on 44 class imbalanced datasets; three instance selection algorithms, including IB3, DROP3 and the GA, the CART decision tree for missing value imputation, and three one-class classifiers, which include OCSVM, IFOREST and LOF, show that if the instance selection algorithm is carefully chosen, performing this step could improve the quality of the training data, which makes one-class classifiers outperform the baselines without instance selection. Moreover, when class imbalanced datasets contain some missing values, combining missing value imputation and instance selection, regardless of which step is first performed, can maintain similar data quality as datasets without missing values. Originality/value: The novelty of this paper is to investigate the effect of performing instance selection on the performance of one-class classifiers, which has never been done before.
Moreover, this study is the first attempt to consider the scenario of missing values that exist in the training set for training one-class classifiers. In this case, performing missing value imputation and instance selection with different orders are compared.
264
Jothi Prakash V, Karthikeyan NK. Enhanced Evolutionary Feature Selection and Ensemble Method for Cardiovascular Disease Prediction. Interdiscip Sci 2021; 13:389-412. [PMID: 33988832 DOI: 10.1007/s12539-021-00430-x] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 08/25/2020] [Revised: 04/01/2021] [Accepted: 04/09/2021] [Indexed: 11/26/2022]
Abstract
Cardiovascular Disease (CVD) is one among the main factors for the increase in mortality rate worldwide. The analysis and prediction of this disease is yet a highly formidable task in medical data analysis. Recent advancements in technology such as Big Data, Artificial Intelligence and the need for automated models have paved the way for developing a more reliable and efficient model for predicting heart disease. Several researches have been carried out in predicting heart diseases but the focus on choosing the important attributes that play a significant role in predicting CVD is inadequate. Hence the choice of right features for the classification and the diagnosis of the heart disease is important. The core aim of this work is to identify and select the important features and machine learning methodologies that can enhance the prediction capability of the classification models for accurately predicting CVD. The results show that the proposed enhanced evolutionary feature selection with the hybrid ensemble model outperforms the existing approaches in terms of precision, recall and accuracy. The experimental outcomes show that the proposed approach attains the maximum classification accuracy of 93.65% for statlog dataset, 82.81% for SPECTF dataset and 84.95% for coronary heart disease dataset. The proposed classification model performance is demonstrated using ROC curve against state-of-the-art methods in machine learning.
Affiliation(s)
- V Jothi Prakash
- Department of Information Technology, Karpagam College of Engineering, Coimbatore, Tamil Nadu, India.
| | - N K Karthikeyan
- Department of Information Technology, Coimbatore Institute of Technology, Coimbatore, Tamil Nadu, India
|
265
|
Yang PT, Wu WS, Wu CC, Shih YN, Hsieh CH, Hsu JL. Breast cancer recurrence prediction with ensemble methods and cost-sensitive learning. Open Med (Wars) 2021; 16:754-768. [PMID: 34027105 PMCID: PMC8122465 DOI: 10.1515/med-2021-0282] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/07/2020] [Revised: 03/11/2021] [Accepted: 04/03/2021] [Indexed: 11/15/2022] Open
Abstract
Breast cancer is one of the most common cancers in women all over the world. Due to the improvement of medical treatments, most of the breast cancer patients would be in remission. However, the patients have to face the next challenge, the recurrence of breast cancer which may cause more severe effects, and even death. The prediction of breast cancer recurrence is crucial for reducing mortality. This paper proposes a prediction model for the recurrence of breast cancer based on clinical nominal and numeric features. In this study, our data consist of 1,061 patients from Breast Cancer Registry from Shin Kong Wu Ho-Su Memorial Hospital between 2011 and 2016, in which 37 records are denoted as breast cancer recurrence. Each record has 85 features. Our approach consists of three stages. First, we perform data preprocessing and feature selection techniques to consolidate the dataset. Among all features, six features are identified for further processing in the following stages. Next, we apply resampling techniques to resolve the issue of class imbalance. Finally, we construct two classifiers, AdaBoost and cost-sensitive learning, to predict the risk of recurrence and carry out the performance evaluation in three-fold cross-validation. By applying the AdaBoost method, we achieve accuracy of 0.973 and sensitivity of 0.675. By combining the AdaBoost and cost-sensitive method of our model, we achieve a reasonable accuracy of 0.468 and substantially high sensitivity of 0.947 which guarantee almost no false dismissal. Our model can be used as a supporting tool in the setting and evaluation of the follow-up visit for early intervention and more advanced treatments to lower cancer mortality.
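The gap between accuracy and sensitivity reported above is easier to read against a trivial baseline. The following is a minimal sketch, not the authors' code: it uses only the cohort size (1,061 patients, 37 recurrences) stated in the abstract, and the "always predict no recurrence" classifier is a hypothetical reference point introduced here for illustration.

```python
def accuracy(tp, fp, tn, fn):
    """Fraction of all records that are classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


def sensitivity(tp, fn):
    """Fraction of true recurrences that are detected (recall)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Cohort size from the abstract: 1,061 patients, 37 with recurrence.
n_total, n_pos = 1061, 37

# Hypothetical baseline: predict "no recurrence" for every patient.
tp, fn = 0, n_pos
tn, fp = n_total - n_pos, 0

baseline_acc = accuracy(tp, fp, tn, fn)
baseline_sens = sensitivity(tp, fn)

print(f"baseline accuracy:    {baseline_acc:.3f}")
print(f"baseline sensitivity: {baseline_sens:.3f}")
```

Under these assumptions the baseline reaches high accuracy while detecting no recurrences at all, which is why a model with lower accuracy but high sensitivity can still be the more useful screening tool.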
Affiliation(s)
- Pei-Tse Yang
- Department of Computer Science and Information Engineering, Fu Jen Catholic University, New Taipei City, Taiwan, Republic of China
| | - Wen-Shuo Wu
- Department of Computer Science and Information Engineering, Fu Jen Catholic University, New Taipei City, Taiwan, Republic of China
| | - Chia-Chun Wu
- Department of Computer Science and Information Engineering, Fu Jen Catholic University, New Taipei City, Taiwan, Republic of China
| | - Yi-Nuo Shih
- Department of Occupational Therapy, Fu Jen Catholic University, New Taipei City, Taiwan, Republic of China
| | - Chung-Ho Hsieh
- Department of General Surgery, Shin Kong Wu Ho-Su Memorial Hospital, Taipei, Taiwan, Republic of China
| | - Jia-Lien Hsu
- Department of Computer Science and Information Engineering, Fu Jen Catholic University, New Taipei City, Taiwan, Republic of China
|
266
|
Accessing Imbalance Learning Using Dynamic Selection Approach in Water Quality Anomaly Detection. Symmetry (Basel) 2021. [DOI: 10.3390/sym13050818] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/17/2022] Open
Abstract
Automatic anomaly detection monitoring plays a vital role in water utilities’ distribution systems to reduce the risk posed by unclean water to consumers. One of the major problems with anomaly detection is imbalanced datasets. Dynamic selection techniques combined with ensemble models have proven to be effective for imbalanced dataset classification tasks. In this paper, water quality anomaly detection is formulated as a classification problem in the presence of class imbalance. To tackle this problem, considering the asymmetric dataset distribution between the majority and minority classes, the performance of sixteen previously proposed single and static ensemble classification methods embedded with resampling strategies is first optimised and compared. After that, six dynamic selection techniques, namely, Modified Class Rank (Rank), Local Class Accuracy (LCA), Overall-Local Accuracy (OLA), K-Nearest Oracles Eliminate (KNORA-E), K-Nearest Oracles Union (KNORA-U) and Meta-Learning for Dynamic Ensemble Selection (META-DES) in combination with homogeneous and heterogeneous ensemble models and three SMOTE-based resampling algorithms (SMOTE, SMOTE+ENN and SMOTE+Tomek Links), and one missing data method (missForest) are proposed and evaluated. A binary real-world drinking-water quality anomaly detection dataset is utilised to evaluate the models. The experimental results reveal that all the models benefit from the combined optimisation of both the classifiers and resampling methods. Considering the three performance measures (balanced accuracy, F-score and G-mean), the results also show that the dynamic classifier selection (DCS) techniques, in particular, the missForest+SMOTE+RANK and missForest+SMOTE+OLA models based on homogeneous ensemble-bagging with decision tree as the base classifier, exhibited better performance in terms of balanced accuracy and G-mean, while the Bg+mF+SMENN+LCA model based on homogeneous ensemble-bagging with random forest has a better overall F1-measure in comparison to the other models.
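The three performance measures named above can all be derived from a binary confusion matrix. The sketch below uses the standard textbook definitions; the counts are hypothetical values chosen for illustration and are not taken from the paper.

```python
import math


def imbalance_metrics(tp, fp, fn, tn):
    """Return balanced accuracy, F1-score and G-mean for a binary problem."""
    recall = tp / (tp + fn)          # true positive rate (sensitivity)
    specificity = tn / (tn + fp)     # true negative rate
    precision = tp / (tp + fp)
    balanced_accuracy = (recall + specificity) / 2
    f1 = 2 * precision * recall / (precision + recall)
    g_mean = math.sqrt(recall * specificity)
    return balanced_accuracy, f1, g_mean


# Hypothetical counts: 40 anomalies among 1,000 samples.
bal_acc, f1, g_mean = imbalance_metrics(tp=30, fp=20, fn=10, tn=940)

print(f"balanced accuracy: {bal_acc:.4f}")
print(f"F1-score:          {f1:.4f}")
print(f"G-mean:            {g_mean:.4f}")
```

Balanced accuracy and G-mean weight both classes equally, while the F1-score ignores true negatives, which is one reason different models can rank first under different measures.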
|
267
|
Chen T, Wong YD, Shi X, Yang Y. A data-driven feature learning approach based on Copula-Bayesian Network and its application in comparative investigation on risky lane-changing and car-following maneuvers. ACCIDENT; ANALYSIS AND PREVENTION 2021; 154:106061. [PMID: 33691229 DOI: 10.1016/j.aap.2021.106061] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 12/06/2020] [Revised: 02/20/2021] [Accepted: 02/22/2021] [Indexed: 06/12/2023]
Abstract
The era of 'Big Data' provides opportunities for researchers to have deep insights into traffic safety. By taking advantage of 'Big Data', this study proposes a data-driven method to develop a Copula-Bayesian Network (Copula-BN) using a large-scale naturalistic driving dataset with multiple features. The Copula-BN is able to explain the causality of a risky driving maneuver. As compared with conventional BNs, the Copula-BN developed in this study has the following advantages: it (1) has a more rational and explainable structure; (2) is less likely to overfit and can attain more satisfactory prediction performance; and (3) can handle not only discrete but also continuous features. In terms of technical innovations, Shapley Additive Explanation (SHAP) is used for feature selection, while the Gaussian Copula function is employed to build the dependency structure of the Copula-BN. As for applications, the Copula-BNs are used to investigate the causality of risky lane-changing (LC) and car-following (CF) maneuvers, upon which comparisons are made between the two essential but risky driving maneuvers. In this study, the Copula-BNs are developed based on the Second Highway Research Program (SHRP2) Naturalistic Driving Study (NDS) database. Upon network evaluation, the Copula-BNs for both risky LC and CF maneuvers demonstrate satisfactory structure performance and promising prediction performance. Feature inferences are conducted based on the Copula-BNs to respectively illustrate the causation of the two risky maneuvers. Several interesting findings related to features' contribution are discussed in this paper. To a certain extent, the Copula-BN developed using the data-driven method makes a trade-off between prediction and causality within the 'Big Data'. The comparison between risky LC and CF maneuvers also provides a valuable reference for crash risk evaluation, road safety policy-making, etc. In the future, the achievements of this study could be applied in Advanced Driver-Assistance Systems (ADAS) and accident diagnosis systems to enhance road traffic safety.
Affiliation(s)
- Tianyi Chen
- School of Civil and Environmental Engineering, Nanyang Technological University, 639798, Singapore.
| | - Yiik Diew Wong
- School of Civil and Environmental Engineering, Nanyang Technological University, 639798, Singapore.
| | - Xiupeng Shi
- School of Civil and Environmental Engineering, Nanyang Technological University, 639798, Singapore; Institute for Infocomm Research, The Agency for Science, Technology and Research (A⁎STAR), Singapore.
| | - Yaoyao Yang
- School of Business, Renmin University of China, 100872, Beijing, China.
|
268
|
Hoyos-Osorio J, Alvarez-Meza A, Daza-Santacoloma G, Orozco-Gutierrez A, Castellanos-Dominguez G. Relevant information undersampling to support imbalanced data classification. Neurocomputing 2021. [DOI: 10.1016/j.neucom.2021.01.033] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/28/2022]
|
269
|
Jin Y, Liu Y, Zhang W, Zhang S, Lou Y. A novel multi-stage ensemble model with multiple K-means-based selective undersampling: An application in credit scoring. JOURNAL OF INTELLIGENT & FUZZY SYSTEMS 2021. [DOI: 10.3233/jifs-201954] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/15/2022]
Abstract
With the advancement of machine learning, credit scoring can be performed better. As one of the widely recognized machine learning methods, ensemble learning has demonstrated significant improvements in the predictive accuracy over individual machine learning models for credit scoring. This study proposes a novel multi-stage ensemble model with multiple K-means-based selective undersampling for credit scoring. First, a new multiple K-means-based undersampling method is proposed to deal with the imbalanced data. Then, a new selective sampling mechanism is proposed to select the better-performing base classifiers adaptively. Finally, a new feature-enhanced stacking method is proposed to construct an effective ensemble model by composing the shortlisted base classifiers. In the experiments, four datasets with four evaluation indicators are used to evaluate the performance of the proposed model, and the experimental results prove the superiority of the proposed model over other benchmark models.
Affiliation(s)
- Yilun Jin
- School of Information Management and Artificial Intelligence, Zhejiang University of Finance and Economics, Hangzhou, China
| | - Yanan Liu
- School of Information Management and Artificial Intelligence, Zhejiang University of Finance and Economics, Hangzhou, China
| | - Wenyu Zhang
- School of Information Management and Artificial Intelligence, Zhejiang University of Finance and Economics, Hangzhou, China
| | - Shuai Zhang
- School of Information Management and Artificial Intelligence, Zhejiang University of Finance and Economics, Hangzhou, China
| | - Yu Lou
- School of Information Management and Artificial Intelligence, Zhejiang University of Finance and Economics, Hangzhou, China
|
270
|
Wang L, Qi S, Liu Y, Lou H, Zuo X. Bagging k-dependence Bayesian network classifiers. INTELL DATA ANAL 2021. [DOI: 10.3233/ida-205125] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/15/2022]
Abstract
Bagging has attracted much attention due to its simple implementation and the popularity of bootstrapping. By learning diverse classifiers from resampled datasets and averaging the outcomes, bagging investigates the possibility of achieving substantial classification performance of the base classifier. Diversity has been recognized as a very important characteristic in bagging. This paper presents an efficient and effective bagging approach, that learns a set of independent Bayesian network classifiers (BNCs) from disjoint data subspaces. The number of bits needed to describe the data is measured in terms of log likelihood, and redundant edges are identified to optimize the topologies of the learned BNCs. Our extensive experimental evaluation on 54 publicly available datasets from the UCI machine learning repository reveals that the proposed algorithm achieves a competitive classification performance compared with state-of-the-art BNCs that use or do not use bagging procedures, such as tree-augmented naive Bayes (TAN), k-dependence Bayesian classifier (KDB), bagging NB or bagging TAN.
Affiliation(s)
- Limin Wang
- College of Computer Science and Technology, Jilin University, Changchun, Jilin, China
- Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, Jilin, China
| | - Sikai Qi
- College of Computer Science and Technology, Jilin University, Changchun, Jilin, China
- Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, Jilin, China
| | - Yang Liu
- College of Computer Science and Technology, Jilin University, Changchun, Jilin, China
| | - Hua Lou
- Department of Software and Big Data, Changzhou College of Information Technology, Changzhou, Jiangsu, China
| | - Xin Zuo
- School of Foreign Languages, Changchun University of Technology, Changchun, Jilin, China
|
271
|
Wang Z, Sun Y. Optimization of SMOTE for imbalanced data based on AdaRBFNN and hybrid metaheuristics. INTELL DATA ANAL 2021. [DOI: 10.3233/ida-205176] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/15/2022]
Abstract
The oversampling ratio N and the number of nearest minority-class neighbors k are key hyperparameters of the synthetic minority oversampling technique (SMOTE), which reconstructs the class distribution of a dataset. No optimal default values exist for them. Therefore, it is necessary to examine how the output dataset influences classification performance when SMOTE adopts various hyperparameter combinations. In this paper, we propose a hyperparameter optimization algorithm for imbalanced data. It iterates to find reasonable N and k for SMOTE so as to build a balanced and high-quality dataset. As a result, a model with outstanding performance and strong generalization ability is trained, thus effectively addressing imbalanced classification. The proposed algorithm is based on the hybridization of the simulated annealing mechanism (SA) and the particle swarm optimization algorithm (PSO). In the optimization, Cohen’s Kappa is used to construct the fitness function, and AdaRBFNN, a new classifier, is built by integrating multiple trained RBF neural networks with the AdaBoost algorithm. The Kappa of each generation is calculated from the classification results to evaluate the quality of candidate solutions. Experiments are conducted on seven groups of KEEL datasets. Results show that the proposed algorithm delivers excellent performance and can significantly improve the classification accuracy of the minority class.
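To make the roles of N and k concrete, the sketch below implements basic SMOTE interpolation and Cohen's Kappa in plain Python. It is an illustration under stated assumptions, not the paper's algorithm: `n_per_sample` is a hypothetical stand-in for the oversampling ratio N (synthetic points generated per minority sample), the toy data and confusion counts are invented, and the SA/PSO search and the AdaRBFNN classifier are not reproduced.

```python
import math
import random


def smote(minority, n_per_sample, k, seed=0):
    """Create synthetic minority points by interpolating towards one of the
    k nearest minority neighbours. Assumes at least two minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for i, x in enumerate(minority):
        # k nearest minority neighbours of x, excluding x itself.
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(x, p))[:k]
        for _ in range(n_per_sample):
            nb = rng.choice(neighbours)
            u = rng.random()  # interpolation weight in [0, 1)
            synthetic.append(tuple(a + u * (b - a) for a, b in zip(x, nb)))
    return synthetic


def cohen_kappa(tp, fp, fn, tn):
    """Cohen's Kappa for a binary confusion matrix."""
    n = tp + fp + fn + tn
    p_observed = (tp + tn) / n
    p_expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
    return (p_observed - p_expected) / (1 - p_expected)


# Toy minority class on the unit square (hypothetical data).
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
synthetic = smote(minority, n_per_sample=2, k=2)

# Hypothetical classification result used as a fitness value.
kappa = cohen_kappa(tp=20, fp=5, fn=10, tn=65)
print(len(synthetic), round(kappa, 3))
```

In a search such as the one described above, each candidate (N, k) pair would be scored by training a classifier on the resampled data and computing Kappa on its predictions; the sketch only shows the two building blocks.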
|
272
|
Tang X, Zhang T, Cheng N, Wang H, Zheng CH, Xia J, Zhang T. usDSM: a novel method for deleterious synonymous mutation prediction using undersampling scheme. Brief Bioinform 2021; 22:6236069. [PMID: 33866367 DOI: 10.1093/bib/bbab123] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/08/2021] [Revised: 03/12/2021] [Accepted: 03/15/2021] [Indexed: 12/11/2022] Open
Abstract
Although synonymous mutations do not alter the encoded amino acids, they may impact protein function by interfering with the regulation of RNA splicing or altering transcript splicing. New progress on next-generation sequencing technologies has put the exploration of synonymous mutations at the forefront of precision medicine. Several approaches have been proposed for predicting the deleterious synonymous mutations specifically, but their performance is limited by imbalance of the positive and negative samples. In this study, we firstly expanded the number of samples greatly from various data sources and compared six undersampling strategies to solve the problem of the imbalanced datasets. The results suggested that cluster centroid is the most effective scheme. Secondly, we presented a computational model, undersampling scheme based method for deleterious synonymous mutation (usDSM) prediction, using 14-dimensional biology features and random forest classifier to detect the deleterious synonymous mutation. The results on the test datasets indicated that the proposed usDSM model can attain superior performance in comparison with other state-of-the-art machine learning methods. Lastly, we found that the deep learning model did not play a substantial role in deleterious synonymous mutation prediction through a lot of experiments, although it achieves superior results in other fields. In conclusion, we hope our work will contribute to the future development of computational methods for a more accurate prediction of the deleterious effect of human synonymous mutation. The web server of usDSM is freely accessible at http://usdsm.xialab.info/.
Affiliation(s)
- Xi Tang
- GMU-GIBH Joint School of Life Sciences, Guangzhou Medical University and the Institutes of Physical Science and Information Technology, Anhui University, China
| | - Tao Zhang
- School of Computer Science and Technology, Anhui University, China
| | - Na Cheng
- Institutes of Physical Science and Information Technology, Anhui University, China
| | - Huadong Wang
- School of Computer Science and Technology, Anhui University, China
| | - Chun-Hou Zheng
- School of Computer Science and Technology, Anhui University, China
| | - Junfeng Xia
- Institutes of Physical Science and Information Technology, Anhui University, China
| | - Tiejun Zhang
- GMU-GIBH Joint School of Life Sciences, Guangzhou Medical University, China
|
273
|
Fletcher RR, Nakeshimana A, Olubeko O. Addressing Fairness, Bias, and Appropriate Use of Artificial Intelligence and Machine Learning in Global Health. Front Artif Intell 2021; 3:561802. [PMID: 33981989 PMCID: PMC8107824 DOI: 10.3389/frai.2020.561802] [Citation(s) in RCA: 71] [Impact Index Per Article: 17.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/13/2020] [Accepted: 12/10/2020] [Indexed: 11/16/2022] Open
Abstract
In Low- and Middle- Income Countries (LMICs), machine learning (ML) and artificial intelligence (AI) offer attractive solutions to address the shortage of health care resources and improve the capacity of the local health care infrastructure. However, AI and ML should also be used cautiously, due to potential issues of fairness and algorithmic bias that may arise if not applied properly. Furthermore, populations in LMICs can be particularly vulnerable to bias and fairness in AI algorithms, due to a lack of technical capacity, existing social bias against minority groups, and a lack of legal protections. In order to address the need for better guidance within the context of global health, we describe three basic criteria (Appropriateness, Fairness, and Bias) that can be used to help evaluate the use of machine learning and AI systems: 1) APPROPRIATENESS is the process of deciding how the algorithm should be used in the local context, and properly matching the machine learning model to the target population; 2) BIAS is a systematic tendency in a model to favor one demographic group vs another, which can be mitigated but can lead to unfairness; and 3) FAIRNESS involves examining the impact on various demographic groups and choosing one of several mathematical definitions of group fairness that will adequately satisfy the desired set of legal, cultural, and ethical requirements. Finally, we illustrate how these principles can be applied using a case study of machine learning applied to the diagnosis and screening of pulmonary disease in Pune, India. We hope that these methods and principles can help guide researchers and organizations working in global health who are considering the use of machine learning and artificial intelligence.
Affiliation(s)
- Richard Ribón Fletcher
- Massachusetts Institute of Technology, Cambridge, MA, United States
- University of Massachusetts Medical School, Worcester, MA, United States
| | | | | |
|
274
|
Yu H, Sun H, Li J, Shi L, Bao N, Li H, Qian W, Zhou S. Effective diagnostic model construction based on discriminative breast ultrasound image regions using deep feature extraction. Med Phys 2021; 48:2920-2928. [PMID: 33690962 DOI: 10.1002/mp.14832] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/18/2020] [Revised: 01/22/2021] [Accepted: 03/03/2021] [Indexed: 12/24/2022] Open
Abstract
PURPOSE This research aims to analyze the diagnostic contribution of different discriminative regions of the breast ultrasound image and develop a more effective diagnosis method taking advantage of the discriminative regions' complementarity. METHODS First, the discriminative regions of the original breast ultrasound image were defined as the inner region of the lesion, the marginal zone of the lesion, and the posterior echo region of the lesion. The pretrained Inception-V3 network was used to analyze the diagnostic contribution of these discriminative regions. Then, the network was applied to extract the deep features of the original image and the other three discriminative region images. Since there are many features, principal components analysis (PCA) was used to reduce the dimensionality of the extracted deep features. The selected deep features from different discriminative regions were fused with the original image features and sent to the stacking ensemble learning classifier for classification experiments. In this study, 479 cases of breast ultrasound images, including 356 benign lesions and 123 malignant ones, were collected retrospectively and randomly divided into the training and validation set. RESULTS Experimental results show that by using Inception-V3, the diagnostic performance of each discriminative region is different, and the diagnostic accuracy and the area under the ROC curve (AUC) of the lesion marginal zone image (78.3%, 0.798) are higher than those of the lesion inner region image (73.3%, 0.763) and the posterior echo region image (71.7%, 0.688), but lower than those of the original image (80.0%, 0.817). Furthermore, the best classification performance was obtained when all the four types of deep features (from the original image and three discriminative region images) were fused, and the ensemble learning for classification evaluation was employed. Compared with the original image, the classification accuracy and AUC increased from 80.83%, 0.818 to 85.00%, 0.872, and the classification sensitivity and specificity varied from 0.710, 0.798 to 0.871, 0.787. CONCLUSIONS The inner region of the lesion, the marginal zone of the lesion, and the posterior echo region of the lesion play significant roles in the diagnosis of the breast ultrasound image. Deep feature fusion of these three kinds of images and the original image can effectively improve the accuracy of diagnosis.
Affiliation(s)
- Hailong Yu
- College of Medicine and Biological Information Engineering, Northeastern University, 195 Chuangxin Road, Shenyang, China
| | - Hang Sun
- College of Medicine and Biological Information Engineering, Northeastern University, 195 Chuangxin Road, Shenyang, China
| | - Jing Li
- Department of Radiology, Affiliated Hospital of Guizhou Medical University, 28 Guiyi Road, Guiyang, China
| | - Liying Shi
- Department of Radiology, Affiliated Hospital of Guizhou Medical University, 28 Guiyi Road, Guiyang, China
| | - Nan Bao
- College of Medicine and Biological Information Engineering, Northeastern University, 195 Chuangxin Road, Shenyang, China
| | - Hong Li
- College of Medicine and Biological Information Engineering, Northeastern University, 195 Chuangxin Road, Shenyang, China
| | - Wei Qian
- College of Engineering, University of Texas at El Paso, El Paso, TX, USA
| | - Shi Zhou
- Department of Radiology, Affiliated Hospital of Guizhou Medical University, 28 Guiyi Road, Guiyang, China
|
275
|
Ma JH, Feng Z, Wu JY, Zhang Y, Di W. Learning from imbalanced fetal outcomes of systemic lupus erythematosus in artificial neural networks. BMC Med Inform Decis Mak 2021; 21:127. [PMID: 33845834 PMCID: PMC8042715 DOI: 10.1186/s12911-021-01486-x] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/11/2021] [Accepted: 03/31/2021] [Indexed: 11/10/2022] Open
Abstract
OBJECTIVE To explore an effective artificial-neural-network-based algorithm that correctly identifies the minority of pregnant women with SLE who suffer fetal loss from the majority with live births, and to train a well-behaved model as a clinical decision assistant. METHODS We integrated the thoughts of comparative and focused study into the artificial neural network and presented an effective algorithm aimed at imbalanced learning in small datasets. RESULTS We collected 469 non-trivial pregnant patients with SLE, where 420 had live-birth outcomes and the other 49 patients ended in fetal loss. A well-trained imbalanced-learning model had a high sensitivity of 19/21 ([Formula: see text]) for the identification of patients with fetal loss outcomes. DISCUSSION The misprediction of the two patients was explainable. Algorithm improvements in the artificial neural network framework enhanced identification in imbalanced learning problems, and the external validation increased the reliability of the algorithm. CONCLUSION The well-trained model was fully qualified to assist healthcare providers in making timely and accurate decisions.
Affiliation(s)
- Jing-Hang Ma
- Department of Obstetrics and Gynecology, Ren Ji Hospital, School of Medicine, Shanghai Jiao Tong University, Shanghai, China
- Shanghai Key Laboratory of Gynecologic Oncology, Ren Ji Hospital, School of Medicine, Shanghai Jiao Tong University, Shanghai, China
- State Key Laboratory of Oncogenes and Related Genes, Shanghai Cancer Institute, Ren Ji Hospital, School of Medicine, Shanghai Jiao Tong University, Shanghai, China
- First Affiliated Hospital of Wenzhou Medical University, Wenzhou, China
| | - Zhen Feng
- First Affiliated Hospital of Wenzhou Medical University, Wenzhou, China
| | - Jia-Yue Wu
- Department of Obstetrics and Gynecology, Ren Ji Hospital, School of Medicine, Shanghai Jiao Tong University, Shanghai, China
- Shanghai Key Laboratory of Gynecologic Oncology, Ren Ji Hospital, School of Medicine, Shanghai Jiao Tong University, Shanghai, China
- State Key Laboratory of Oncogenes and Related Genes, Shanghai Cancer Institute, Ren Ji Hospital, School of Medicine, Shanghai Jiao Tong University, Shanghai, China
| | - Yu Zhang
- Shanghai Key Laboratory of Gynecologic Oncology, Ren Ji Hospital, School of Medicine, Shanghai Jiao Tong University, Shanghai, China
- State Key Laboratory of Oncogenes and Related Genes, Shanghai Cancer Institute, Ren Ji Hospital, School of Medicine, Shanghai Jiao Tong University, Shanghai, China
| | - Wen Di
- Department of Obstetrics and Gynecology, Ren Ji Hospital, School of Medicine, Shanghai Jiao Tong University, Shanghai, China.
- Shanghai Key Laboratory of Gynecologic Oncology, Ren Ji Hospital, School of Medicine, Shanghai Jiao Tong University, Shanghai, China.
- State Key Laboratory of Oncogenes and Related Genes, Shanghai Cancer Institute, Ren Ji Hospital, School of Medicine, Shanghai Jiao Tong University, Shanghai, China.
|
276
|
Singh A, Ranjan RK, Tiwari A. Credit Card Fraud Detection under Extreme Imbalanced Data: A Comparative Study of Data-level Algorithms. J EXP THEOR ARTIF IN 2021. [DOI: 10.1080/0952813x.2021.1907795] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/21/2022]
Affiliation(s)
- Amit Singh
- Indian Computer Emergency Response Team, Ministry of Electronics and Information Technology, New Delhi, India
| | | | - Abhishek Tiwari
- Department of Computer Science, Central University of Haryana, Mahendergarh, India
|
277
|
Chen B, Xia S, Chen Z, Wang B, Wang G. RSMOTE: A self-adaptive robust SMOTE for imbalanced problems with label noise. Inf Sci (N Y) 2021. [DOI: 10.1016/j.ins.2020.10.013] [Citation(s) in RCA: 17] [Impact Index Per Article: 4.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/23/2022]
|
278
|
|
279
|
Mehmood Z, Asghar S. Customizing SVM as a base learner with AdaBoost ensemble to learn from multi-class problems: A hybrid approach AdaBoost-MSVM. Knowl Based Syst 2021. [DOI: 10.1016/j.knosys.2021.106845] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/22/2022]
|
280
|
Bi Y, Xue B, Zhang M. Genetic Programming With a New Representation to Automatically Learn Features and Evolve Ensembles for Image Classification. IEEE TRANSACTIONS ON CYBERNETICS 2021; 51:1769-1783. [PMID: 32011275 DOI: 10.1109/tcyb.2020.2964566] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/10/2023]
Abstract
Image classification is a popular task in machine learning and computer vision, but it is very challenging due to high variation across images. Using ensemble methods for solving image classification can achieve higher classification performance than using a single classification algorithm. However, to obtain a good ensemble, the component (base) classifiers in an ensemble should be accurate and diverse. To solve image classification effectively, feature extraction is necessary to transform raw pixels into high-level informative features. However, this process often requires domain knowledge. This article proposes an evolutionary approach based on genetic programming to automatically and simultaneously learn informative features and evolve effective ensembles for image classification. The new approach takes raw images as inputs and returns predictions of class labels based on the evolved classifiers. To achieve this, a new individual representation, a new function set, and a new terminal set are developed to allow the new approach to effectively find the best solution. More importantly, the solutions of the new approach can extract informative features from raw images and can automatically address the diversity issue of the ensembles. In addition, the new approach can automatically select and optimize the parameters for the classification algorithms in the ensemble. The performance of the new approach is examined on 13 different image classification datasets of varying difficulty and compared with a large number of effective methods. The results show that the new approach achieves better classification accuracy on most datasets than the competitive methods. Further analysis demonstrates that the new approach can evolve solutions with high accuracy and diversity.
Collapse
|
281
|
Chen Z, Duan J, Kang L, Qiu G. A hybrid data-level ensemble to enable learning from highly imbalanced dataset. Inf Sci (N Y) 2021. [DOI: 10.1016/j.ins.2020.12.023] [Citation(s) in RCA: 17] [Impact Index Per Article: 4.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/15/2022]
|
282
|
Moyano JM, Reyes O, Fardoun HM, Ventura S. Performing multi-target regression via gene expression programming-based ensemble models. Neurocomputing 2021. [DOI: 10.1016/j.neucom.2020.12.060] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/22/2022]
|
283
|
Starovoitov VV, Golub YI. About the confusion-matrix-based assessment of the results of imbalanced data classification. INFORMATICS 2021. [DOI: 10.37661/1816-0301-2021-18-1-61-71] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/24/2022] Open
Abstract
In real applications of classifiers, data imbalance often occurs: one class contains far more elements than another. The article examines measures for evaluating classification results on this type of data. The paper answers three questions: which term is the more accurate translation of the phrase "confusion matrix", how data are best represented in this matrix, and which functions are best used to evaluate classification results from such a matrix. The paper demonstrates on real data that the popular accuracy function cannot correctly estimate classification errors for imbalanced data. Nor can values of this function be compared when one is calculated from a matrix of absolute counts and the other from a matrix normalized by class. If the data are imbalanced, the accuracy calculated from the confusion matrix with normalized values will usually be lower, since it is calculated by a different formula. The same conclusion holds for most of the classification accuracy functions used in the literature to evaluate classification results. It is shown that confusion matrices are better represented with absolute counts of objects per class than with relative ones, since absolute counts convey the amount of data tested for each class and their imbalance. When constructing classifiers, it is recommended to evaluate errors with functions that do not depend on the data imbalance, which gives grounds to expect more correct classification results on real data.
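The contrast between accuracy on absolute counts and accuracy on a class-normalized confusion matrix can be illustrated with a minimal sketch. The counts below are invented for illustration and are not taken from the paper, and the function names are hypothetical. When each true-class row is normalized to sum to 1, the resulting accuracy is the mean of sensitivity and specificity.

```python
def accuracy(tp, fn, fp, tn):
    """Plain accuracy computed from absolute counts."""
    return (tp + tn) / (tp + fn + fp + tn)


def normalized_accuracy(tp, fn, fp, tn):
    """Accuracy after normalizing each true-class row to sum to 1.

    This equals the mean of sensitivity and specificity,
    often called balanced accuracy.
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


# Hypothetical imbalanced test set: 10 positives, 990 negatives.
counts = dict(tp=2, fn=8, fp=10, tn=980)
acc = accuracy(**counts)
bal = normalized_accuracy(**counts)
print(f"accuracy on absolute counts:   {acc:.3f}")
print(f"accuracy on normalized matrix: {bal:.3f}")
```

In this made-up case the classifier misses 8 of 10 minority samples, yet plain accuracy stays high because the majority class dominates the total; the normalized value is much lower.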
Affiliation(s)
- V. V. Starovoitov
  - The United Institute of Informatics Problems of the National Academy of Sciences of Belarus
- Yu. I. Golub
  - The United Institute of Informatics Problems of the National Academy of Sciences of Belarus
|
284
|
Wang X, Zhai M, Ren Z, Ren H, Li M, Quan D, Chen L, Qiu L. Exploratory study on classification of diabetes mellitus through a combined Random Forest Classifier. BMC Med Inform Decis Mak 2021; 21:105. [PMID: 33743696 PMCID: PMC7980612 DOI: 10.1186/s12911-021-01471-4] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/26/2020] [Accepted: 03/11/2021] [Indexed: 12/23/2022] Open
Abstract
BACKGROUND Diabetes Mellitus (DM) has become the third chronic non-communicable disease that hits patients after tumors, cardiovascular and cerebrovascular diseases, and has become one of the major public health problems in the world. Therefore, it is of great importance to identify individuals at high risk for DM in order to establish prevention strategies for DM. METHODS To address the high-dimensional feature space and high feature redundancy of medical data, as well as the data imbalance often encountered, this study explored different supervised classifiers, combined with SVM-SMOTE and two feature dimensionality reduction methods (logistic stepwise regression and LASSO), to classify diabetes survey sample data with unbalanced categories and complex related factors. The classification results of 4 supervised classifiers under 4 data processing methods were analyzed and discussed. Five indicators, namely Accuracy, Precision, Recall, F1-Score and AUC, were selected as the key indicators to evaluate the performance of the classification model. RESULTS The Random Forest Classifier combined with SVM-SMOTE resampling and LASSO feature screening (Accuracy = 0.890, Precision = 0.869, Recall = 0.919, F1-Score = 0.893, AUC = 0.948) proved the best way to identify those at high risk of DM. In addition, the combined algorithm helps enhance the classification performance for predicting people at high risk of DM. Age, region, heart rate, hypertension, hyperlipidemia and BMI are the six most critical characteristic variables affecting diabetes. CONCLUSIONS The Random Forest Classifier combined with SVM-SMOTE and the LASSO feature reduction method performs best in identifying people at high risk of DM. The combined method proposed in the study would be a good tool for early screening of DM.
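As a quick consistency check on the reported figures, assuming the standard definition of the F1-score as the harmonic mean of precision and recall (the abstract does not state the formula it used), the reported precision and recall reproduce the reported F1-score:

```latex
F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
    = \frac{2 \times 0.869 \times 0.919}{0.869 + 0.919}
    = \frac{1.597}{1.788}
    \approx 0.893
```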
Affiliation(s)
- Xuchun Wang
  - Department of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, Shanxi, China
- Mengmeng Zhai
  - Department of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, Shanxi, China
- Zeping Ren
  - Shanxi Centre for Disease Control and Prevention, Taiyuan, 030012, Shanxi, China
- Hao Ren
  - Department of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, Shanxi, China
- Meichen Li
  - Department of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, Shanxi, China
- Dichen Quan
  - Department of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, Shanxi, China
- Limin Chen
  - Shanxi Provincial People's Hospital, Taiyuan City, Shanxi Province, China
- Lixia Qiu
  - Department of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, Shanxi, China
|
285
|
ImbTreeEntropy and ImbTreeAUC: Novel R Packages for Decision Tree Learning on the Imbalanced Datasets. ELECTRONICS 2021. [DOI: 10.3390/electronics10060657] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/16/2022]
Abstract
This paper presents two R packages, ImbTreeEntropy and ImbTreeAUC, to handle imbalanced data problems. ImbTreeEntropy functionality includes application of generalized entropy functions, such as Rényi, Tsallis, Sharma–Mittal, Sharma–Taneja and Kapur, to measure impurity of a node. ImbTreeAUC provides non-standard measures to choose an optimal split point for an attribute (as well as the optimal attribute for splitting) by employing local, semi-global and global AUC (Area Under the ROC curve) measures. Both packages are applicable for binary and multiclass problems and they support cost-sensitive learning, by defining a misclassification cost matrix, and weighted-sensitive learning. The packages accept all types of attributes, including continuous, ordered and nominal, where the latter type is simplified for multiclass problems to reduce the computational overheads. Both applications enable optimization of the thresholds where posterior probabilities determine final class labels in a way that misclassification costs are minimized. Model overfitting can be managed either during the growing phase or at the end using post-pruning. The packages are mainly implemented in R; however, some computationally demanding functions are written in plain C++. In order to speed up learning time, parallel processing is supported as well.
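The idea of choosing a probability threshold so that misclassification costs are minimized can be sketched as follows. This is a generic illustration in Python, not the API of ImbTreeEntropy or ImbTreeAUC (which are R packages); the function names, the cost values, and the data are all assumptions made for the example.

```python
import math


def total_cost(probs, labels, threshold, cost_fn, cost_fp):
    """Total misclassification cost when predicting positive if p >= threshold."""
    cost = 0.0
    for p, y in zip(probs, labels):
        pred = 1 if p >= threshold else 0
        if pred == 0 and y == 1:
            cost += cost_fn  # missed positive (minority) sample
        elif pred == 1 and y == 0:
            cost += cost_fp  # false alarm
    return cost


def best_threshold(probs, labels, cost_fn=5.0, cost_fp=1.0):
    """Return the (threshold, cost) pair with the lowest total cost.

    Candidates are the observed probabilities plus +inf
    (the latter predicts every sample as negative).
    """
    candidates = sorted(set(probs)) + [math.inf]
    return min(
        ((t, total_cost(probs, labels, t, cost_fn, cost_fp)) for t in candidates),
        key=lambda pair: pair[1],
    )


# Hypothetical posterior probabilities and true labels (1 = minority class).
probs = [0.1, 0.2, 0.35, 0.4, 0.8]
labels = [0, 0, 1, 0, 1]

print("cost at default 0.5:", total_cost(probs, labels, 0.5, 5.0, 1.0))
print("best (threshold, cost):", best_threshold(probs, labels))
```

With a missed positive costing five times a false alarm, the search moves the threshold below 0.5, accepting one false alarm to recover the positive sample scored at 0.35.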
|
286
|
Peacock CJ, Lamont C, Sheen DA, Shen VK, Kreplak L, Frampton JP. Predicting the Mixing Behavior of Aqueous Solutions Using a Machine Learning Framework. ACS APPLIED MATERIALS & INTERFACES 2021; 13:11449-11460. [PMID: 33645207 DOI: 10.1021/acsami.0c21036] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/12/2023]
Abstract
The most direct approach to determining if two aqueous solutions will phase-separate upon mixing is to exhaustively screen them in a pair-wise fashion. This is a time-consuming process that involves preparation of numerous stock solutions, precise transfer of highly concentrated and often viscous solutions, exhaustive agitation to ensure thorough mixing, and time-sensitive monitoring to observe the presence of emulsion characteristics indicative of phase separation. Here, we examined the pair-wise mixing behavior of 68 water-soluble compounds by observing the formation of microscopic phase boundaries and droplets of 2278 unique 2-component solutions. A series of machine learning classifiers (artificial neural network, random forest, k-nearest neighbors, and support vector classifier) were then trained on physicochemical property data associated with the 68 compounds and used to predict their miscibility upon mixing. Miscibility predictions were then compared to the experimental observations. The random forest classifier was the most successful classifier of those tested, displaying an average receiver operator characteristic area under the curve of 0.74. The random forest classifier was validated by removing either one or two compounds from the input data, training the classifier on the remaining data and then predicting the miscibility of solutions involving the removed compound(s) using the classifier. The accuracy, specificity, and sensitivity of the random forest classifier were 0.74, 0.80, and 0.51, respectively, when one of the two compounds to be examined was not represented in the training data. When asked to predict the miscibility of two compounds, neither of which were represented in the training data, the accuracy, specificity, and sensitivity values for the random forest classifier were 0.70, 0.82 and 0.29, respectively. 
Thus, there is potential for this machine learning approach to improve the design of screening experiments to accelerate the discovery of aqueous two-phase systems for numerous scientific and industrial applications.
Affiliation(s)
- Chris J Peacock
  - Department of Physics and Atmospheric Science, Dalhousie University, Halifax, Nova Scotia B3H 4R2, Canada
- Connor Lamont
  - Department of Chemistry, Dalhousie University, Halifax, Nova Scotia B3H 4R2, Canada
- David A Sheen
  - Chemical Informatics Group, National Institute of Standards and Technology, Gaithersburg, Maryland 20899, United States
- Vincent K Shen
  - Chemical Informatics Group, National Institute of Standards and Technology, Gaithersburg, Maryland 20899, United States
- Laurent Kreplak
  - Department of Physics and Atmospheric Science, Dalhousie University, Halifax, Nova Scotia B3H 4R2, Canada
  - School of Biomedical Engineering, Dalhousie University, Halifax, Nova Scotia B3H 4R2, Canada
- John P Frampton
  - School of Biomedical Engineering, Dalhousie University, Halifax, Nova Scotia B3H 4R2, Canada
  - Department of Biochemistry and Molecular Biology, Dalhousie University, Halifax, Nova Scotia B3H 4R2, Canada
|
287
|
Shaw SS, Ahmed S, Malakar S, Garcia-Hernandez L, Abraham A, Sarkar R. Hybridization of ring theory-based evolutionary algorithm and particle swarm optimization to solve class imbalance problem. COMPLEX INTELL SYST 2021. [DOI: 10.1007/s40747-021-00314-z] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/01/2023]
Abstract
Many real-life datasets are imbalanced in nature, which implies that the number of samples present in one class (minority class) is far smaller than the number of samples found in the other class (majority class). Hence, if we directly fit a standard classifier to these datasets, it often overlooks the minority class samples while estimating the class-separating hyperplane(s) and, as a result, misclassifies the minority class samples. To solve this problem, over the years, many researchers have followed different approaches. However, the selection of truly representative samples from the majority class is still considered an open research problem. A better solution for this problem would be helpful in many applications like fraud detection, disease prediction and text classification. Also, recent studies show that the problem requires analyzing not only the disproportion between classes but also other difficulties rooted in the nature of different data, and thereby needs a more flexible, self-adaptable, computationally efficient and real-time method for selecting majority class samples without losing much important data. Keeping this in mind, we have proposed a hybrid model constituting Particle Swarm Optimization (PSO), a popular swarm intelligence-based meta-heuristic algorithm, and Ring Theory (RT)-based Evolutionary Algorithm (RTEA), a recently proposed physics-based meta-heuristic algorithm. We have named the algorithm RT-based PSO, or RTPSO for short. RTPSO can select the most representative samples from the majority class as it takes advantage of the efficient exploration and exploitation phases of its parent algorithms to strengthen the search process. We have used the AdaBoost classifier to observe the final classification results of our model. The effectiveness of our proposed method has been evaluated on 15 standard real-life datasets having low to extreme imbalance ratios.
The performance of the RTPSO has been compared with PSO, RTEA and other standard undersampling methods. The obtained results demonstrate the superiority of RTPSO over state-of-the-art class imbalance problem-solvers considered here for comparison. The source code of this work is available in https://github.com/Sayansurya/RTPSO_Class_imbalance.
|
288
|
Classifying multiclass imbalanced data using generalized class-specific extreme learning machine. PROGRESS IN ARTIFICIAL INTELLIGENCE 2021. [DOI: 10.1007/s13748-021-00236-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/22/2022]
|
289
|
|
290
|
Shin J, Yoon S, Kim Y, Kim T, Go B, Cha Y. Effects of class imbalance on resampling and ensemble learning for improved prediction of cyanobacteria blooms. ECOL INFORM 2021. [DOI: 10.1016/j.ecoinf.2020.101202] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/16/2022]
|
291
|
An automatic sampling ratio detection method based on genetic algorithm for imbalanced data classification. Knowl Based Syst 2021. [DOI: 10.1016/j.knosys.2021.106800] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022]
|
292
|
Lu Y, Cheung YM, Tang YY. Self-Adaptive Multiprototype-Based Competitive Learning Approach: A k-Means-Type Algorithm for Imbalanced Data Clustering. IEEE TRANSACTIONS ON CYBERNETICS 2021; 51:1598-1612. [PMID: 31150353 DOI: 10.1109/tcyb.2019.2916196] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/09/2023]
Abstract
The class imbalance problem has been extensively studied in recent years, but imbalanced data clustering in an unsupervised environment, that is, where the number of samples among clusters is imbalanced, has yet to be well studied. This paper, therefore, studies the imbalanced data clustering problem within the framework of k-means-type competitive learning. We introduce a new method called self-adaptive multiprototype-based competitive learning (SMCL) for imbalanced clusters. It uses multiple subclusters to represent each cluster with an automatic adjustment of the number of subclusters. Then, the subclusters are merged into the final clusters based on a novel separation measure. We also propose a new internal clustering validation measure to determine the number of final clusters during the merging process for imbalanced clusters. The advantages of SMCL are threefold: 1) it inherits the advantages of competitive learning and meanwhile is applicable to imbalanced data clustering; 2) the self-adaptive multiprototype mechanism uses a proper number of subclusters to represent each cluster of arbitrary shape; and 3) it automatically determines the number of clusters for imbalanced clusters. SMCL is compared with the existing counterparts for imbalanced clustering on synthetic and real datasets. The experimental results show the efficacy of SMCL for imbalanced clusters.
|
293
|
Pei W, Xue B, Shang L, Zhang M. Genetic programming for development of cost-sensitive classifiers for binary high-dimensional unbalanced classification. Appl Soft Comput 2021. [DOI: 10.1016/j.asoc.2020.106989] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/22/2022]
|
294
|
Le HL, Landa-Silva D, Galar M, Garcia S, Triguero I. EUSC: A clustering-based surrogate model to accelerate evolutionary undersampling in imbalanced classification. Appl Soft Comput 2021. [DOI: 10.1016/j.asoc.2020.107033] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/22/2022]
|
295
|
Sainin MS, Alfred R, Ahmad F. ENSEMBLE META CLASSIFIER WITH SAMPLING AND FEATURE SELECTION FOR DATA WITH MULTICLASS IMBALANCE PROBLEM. JOURNAL OF INFORMATION AND COMMUNICATION TECHNOLOGY 2021. [DOI: 10.32890/jict2021.20.2.1] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/22/2022]
Abstract
Ensemble learning by combining several single classifiers or another ensemble classifier is one of the procedures to solve the imbalance problem in multiclass data. However, this approach still faces the question of how the ensemble methods obtain their higher performance. In this paper, an investigation was carried out on the design of the meta classifier ensemble with sampling and feature selection for multiclass imbalanced data. The specific objectives were: 1) to improve the ensemble classifier through a data-level approach (sampling and feature selection); 2) to perform experiments on sampling, feature selection, and the ensemble classifier model; and 3) to evaluate the performance of the ensemble classifier. To fulfil the objectives, a preliminary data collection of Malaysian plants’ leaf images was prepared and experimented on, and the results were compared. The ensemble design was also tested with three other benchmark datasets with high imbalance ratios. It was found that the design using sampling, feature selection, and an ensemble classifier method via AdaboostM1 with random forest (also an ensemble classifier) provided improved performance throughout the investigation. The result of this study is important to the on-going problem of multiclass imbalance, where a specific structure and its performance can be improved in terms of processing time and accuracy.
Affiliation(s)
- Rayner Alfred
  - Faculty of Computing and Informatics, Universiti Malaysia Sabah, Malaysia
- Faudziah Ahmad
  - School of Computing, Universiti Utara Malaysia, Malaysia
|
296
|
An adaptive boosting algorithm based on weighted feature selection and category classification confidence. APPL INTELL 2021. [DOI: 10.1007/s10489-020-02184-3] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/26/2022]
|
297
|
Artificial Intelligence and the Medical Physicist: Welcome to the Machine. APPLIED SCIENCES-BASEL 2021. [DOI: 10.3390/app11041691] [Citation(s) in RCA: 31] [Impact Index Per Article: 7.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/14/2022]
Abstract
Artificial intelligence (AI) is a branch of computer science dedicated to giving machines or computers the ability to perform human-like cognitive functions, such as learning, problem-solving, and decision making. Since it is showing performance superior to that of well-trained human beings in many areas, such as image classification, object detection, speech recognition, and decision-making, AI is expected to change profoundly every area of science, including healthcare and the clinical application of physics to healthcare, referred to as medical physics. As a result, the Italian Association of Medical Physics (AIFM) has created the “AI for Medical Physics” (AI4MP) group with the aims of coordinating the efforts, facilitating the communication, and sharing the knowledge on AI of the medical physicists (MPs) in Italy. The purpose of this review is to summarize the main applications of AI in medical physics, describe the skills of the MPs in research and clinical applications of AI, and define the major challenges of AI in healthcare.
|
298
|
Abstract
In this Internet age, there are increasingly many threats to the security and safety of users daily. One such threat is malicious software, otherwise known as malware (ransomware, Trojans, viruses, etc.). The effect of this threat can lead to loss or malicious replacement of important information (such as bank account details, etc.). Malware creators have been able to bypass traditional methods of malware detection, which can be time-consuming and unreliable for unknown malware. This motivates the need for intelligent ways to detect malware, especially new malware which has not been evaluated or studied before. Machine learning provides an intelligent way to detect malware and comprises two stages: feature extraction and classification. This study suggests an ensemble learning-based method for malware detection. The base stage classification is done by a stacked ensemble of fully-connected and one-dimensional convolutional neural networks (CNNs), whereas the end-stage classification is done by a machine learning algorithm. For a meta-learner, we analyzed and compared 15 machine learning classifiers. For comparison, five machine learning algorithms were used: naïve Bayes, decision tree, random forest, gradient boosting, and AdaBoosting. The results of experiments made on the Windows Portable Executable (PE) malware dataset are presented. The best results were obtained by an ensemble of seven neural networks and the ExtraTrees classifier as a final-stage classifier.
|
299
|
EnCNN-UPMWS: Waste Classification by a CNN Ensemble Using the UPM Weighting Strategy. ELECTRONICS 2021. [DOI: 10.3390/electronics10040427] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/17/2022]
Abstract
The accurate and effective classification of household solid waste (HSW) is an indispensable component in the current procedure of waste disposal. In this paper, a novel ensemble learning model called EnCNN-UPMWS, which is based on convolutional neural networks (CNNs) and an unequal precision measurement weighting strategy (UPMWS), is proposed for the classification of HSW via waste images. First, three state-of-the-art CNNs, namely GoogLeNet, ResNet-50, and MobileNetV2, are used as ingredient classifiers to separately predict and obtain three predicted probability vectors, which are significant elements that affect the prediction performance by providing complementary information about the patterns to be classified. Then, the UPMWS is introduced to determine the weight coefficients of the ensemble models. The actual one-hot encoding labels of the validation set and the predicted probability vectors from the CNN ensemble are creatively used to calculate the weights for each classifier during the training phase, which can bring the aggregated prediction vector closer to the target label and improve the performance of the ensemble model. The proposed model was applied to two datasets, namely TrashNet (an open-access dataset) and FourTrash, which was constructed by collecting a total of 47,332 common HSW images containing four types of waste (wet waste, recyclables, harmful waste, and dry waste). The experimental results demonstrate the effectiveness of the proposed method in terms of its accuracy and F1-scores. Moreover, it was found that the UPMWS can simply and effectively enhance the performance of the ensemble learning model, and has potential applications in similar tasks of classification via ensemble learning.
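The general pattern described here, deriving per-classifier weights from validation-set predictions and one-hot labels and then fusing probability vectors by a weighted sum, can be sketched as below. The abstract does not give the UPMWS formula, so this sketch substitutes a simple inverse-mean-squared-error weighting as an assumption; all function names and numbers are hypothetical.

```python
def mean_squared_error(prob_vectors, onehot_labels):
    """Mean over samples of the squared distance between prediction and label."""
    total = 0.0
    for vec, lab in zip(prob_vectors, onehot_labels):
        total += sum((p - y) ** 2 for p, y in zip(vec, lab))
    return total / len(onehot_labels)


def inverse_error_weights(val_predictions, onehot_labels, eps=1e-12):
    """One weight per classifier, proportional to 1 / validation error.

    Assumed stand-in for the paper's weighting strategy, not its actual formula.
    """
    errors = [mean_squared_error(preds, onehot_labels) for preds in val_predictions]
    inverse = [1.0 / (e + eps) for e in errors]
    norm = sum(inverse)
    return [w / norm for w in inverse]


def aggregate(prob_vectors, weights):
    """Weighted sum of the classifiers' probability vectors for one sample."""
    n_classes = len(prob_vectors[0])
    return [
        sum(w * vec[c] for w, vec in zip(weights, prob_vectors))
        for c in range(n_classes)
    ]


# Hypothetical validation set: two samples, two classes, two classifiers.
val_labels = [[1, 0], [0, 1]]
val_predictions = [
    [[0.9, 0.1], [0.2, 0.8]],  # classifier A: close to the labels
    [[0.6, 0.4], [0.5, 0.5]],  # classifier B: less certain
]
weights = inverse_error_weights(val_predictions, val_labels)

# Fuse the two classifiers' predictions for one new sample.
fused = aggregate([[0.7, 0.3], [0.4, 0.6]], weights)
print("weights:", weights)
print("fused prediction:", fused)
```

Because the weights are normalized to sum to 1, the fused vector remains a valid probability distribution whenever the inputs are.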
|
300
|
Ruiz-Pérez I, López-Valenciano A, Hernández-Sánchez S, Puerta-Callejón JM, De Ste Croix M, Sainz de Baranda P, Ayala F. A Field-Based Approach to Determine Soft Tissue Injury Risk in Elite Futsal Using Novel Machine Learning Techniques. Front Psychol 2021; 12:610210. [PMID: 33613389 PMCID: PMC7892460 DOI: 10.3389/fpsyg.2021.610210] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/25/2020] [Accepted: 01/14/2021] [Indexed: 12/19/2022] Open
Abstract
Lower extremity non-contact soft tissue (LE-ST) injuries are prevalent in elite futsal. The purpose of this study was to develop robust screening models based on pre-season measures obtained from questionnaires and field-based tests to prospectively predict LE-ST injuries after having applied a range of supervised Machine Learning techniques. One hundred and thirty-nine elite futsal players underwent a pre-season screening evaluation that included individual characteristics; measures related to sleep quality, athlete burnout, psychological characteristics related to sport performance and self-reported perception of chronic ankle instability. A number of neuromuscular performance measures obtained through three field-based tests [isometric hip strength, dynamic postural control (Y-Balance) and lower extremity joints range of motion (ROM-Sport battery)] were also recorded. Injury incidence was monitored over one competitive season. There were 25 LE-ST injuries. Only those groups of measures from two of the field-based tests (ROM-Sport battery and Y-Balance), as independent data sets, were able to build robust models [area under the receiver operating characteristic curve (AUC) score ≥0.7] to identify elite futsal players at risk of sustaining a LE-ST injury. Unlike the measures obtained from the five questionnaires selected, the neuromuscular performance measures did build robust prediction models (AUC score ≥0.7). The inclusion in the same data set of the measures recorded from all the questionnaires and field-based tests did not result in models with significantly higher performance scores. The model generated by the UnderBagging technique with a cost-sensitive SMO as the base classifier and using only four ROM measures reported the best prediction performance scores (AUC = 0.767, true positive rate = 65.9% and true negative rate = 62%). 
The models developed might help coaches, physical trainers and medical practitioners in the decision-making process for injury prevention in futsal.
Affiliation(s)
- Iñaki Ruiz-Pérez
  - Department of Sport Sciences, Sports Research Centre, Miguel Hernández University of Elche, Elche, Spain
- Sergio Hernández-Sánchez
  - Department of Pathology and Surgery, Physiotherapy Area, Miguel Hernandez University of Elche, Alicante, Spain
- Mark De Ste Croix
  - School of Sport and Exercise, University of Gloucestershire, Gloucester, United Kingdom
- Pilar Sainz de Baranda
  - Department of Physical Activity and Sport, Faculty of Sports Sciences, University of Murcia, Murcia, Spain
- Francisco Ayala
  - Ramón y Cajal Postdoctoral Fellowship, Department of Physical Activity and Sport, Faculty of Sports Sciences, University of Murcia, Murcia, Spain
|