1
Kasprzak J, Westphalen CB, Frey S, Schmitt Y, Heinemann V, Fey T, Nasseh D. Supporting the decision to perform molecular profiling for cancer patients based on routinely collected data through the use of machine learning. Clin Exp Med 2024; 24:73. [PMID: 38598013 PMCID: PMC11006770 DOI: 10.1007/s10238-024-01336-w] [Received: 07/24/2023] [Accepted: 03/21/2024] [Indexed: 04/11/2024]
Abstract
BACKGROUND Personalized medicine offers targeted therapy options for cancer treatment. However, the decision whether to include a patient into next-generation sequencing (NGS) testing is not standardized. This may result in some patients receiving unnecessary testing while others who could benefit from it are not tested. Typically, patients who have exhausted conventional treatment options are of interest for consideration in molecularly targeted therapy. To assist clinicians in decision-making, we developed a decision support tool using routine data from a precision oncology program. METHODS We trained a machine learning model on clinical data to determine whether molecular profiling should be performed for a patient. To validate the model, the model's predictions were compared with decisions made by a molecular tumor board (MTB) using multiple patient case vignettes with their characteristics. RESULTS The prediction model included 440 patients with molecular profiling and 13,587 patients without testing. High area under the curve (AUC) scores indicated the importance of engineered features in deciding on molecular profiling. Patient age, physical condition, tumor type, metastases, and previous therapies were the most important features. During the validation MTB experts made the same decision of recommending a patient for molecular profiling only in 10 out of 15 of their previous cases but there was agreement between the experts and the model in 9 out of 15 cases. CONCLUSION Based on a historical cohort, our predictive model has the potential to assist clinicians in deciding whether to perform molecular profiling.
Affiliation(s)
- Julia Kasprzak
  - Comprehensive Cancer Center (CCC Munich LMU), LMU University Hospital Munich, Pettenkoferstraße 8a, Munich, Germany
- C Benedikt Westphalen
  - Comprehensive Cancer Center (CCC Munich LMU), LMU University Hospital Munich, Pettenkoferstraße 8a, Munich, Germany
- Simon Frey
  - Roche Pharma AG, Grenzach-Wyhlen, Germany
- Volker Heinemann
  - Comprehensive Cancer Center (CCC Munich LMU), LMU University Hospital Munich, Pettenkoferstraße 8a, Munich, Germany
  - German Cancer Research Center (DKFZ), German Cancer Consortium (DKTK, Partner Site Munich), Heidelberg, Germany
- Theres Fey
  - Comprehensive Cancer Center (CCC Munich LMU), LMU University Hospital Munich, Pettenkoferstraße 8a, Munich, Germany
- Daniel Nasseh
  - Comprehensive Cancer Center (CCC Munich LMU), LMU University Hospital Munich, Pettenkoferstraße 8a, Munich, Germany
2
John M, Shaiba H. Identification of self-care problem in children using machine learning. Heliyon 2024; 10:e26977. [PMID: 38463780 PMCID: PMC10923687 DOI: 10.1016/j.heliyon.2024.e26977] [Received: 09/16/2023] [Revised: 02/14/2024] [Accepted: 02/22/2024] [Indexed: 03/12/2024] Open
Abstract
Identifying self-care problems in children is challenging for medical professionals because the task is complex and time-consuming, and the worldwide shortage of occupational therapists makes it harder still. Machine learning methods have helped reduce the complexity of problems in diverse fields. This paper employs machine learning models to identify whether a child suffers from self-care problems using the SCADI dataset. The dataset exhibited high dimensionality and class imbalance. The dataset was first reduced to a lower dimensionality. Because an imbalanced dataset is likely to degrade the performance of machine learning models, the SMOTE oversampling method was used to reduce the wide variation in the class distribution. The classification methods used were Naïve Bayes, J48, and random forest. The random forest classifier trained on SMOTE-balanced data achieved the best classification performance, with a balanced accuracy of 99%. The classification model outperformed the existing expert systems.
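The SMOTE step named in this abstract can be illustrated with a minimal sketch. This is not the authors' code (the abstract names the method but gives no implementation); the helper `smote_like_oversample`, the toy points, and `k=2` are assumptions chosen for illustration. Each synthetic point is placed at a random position on the segment between a minority sample and one of its nearest minority neighbours.

```python
import math
import random


def smote_like_oversample(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic minority points by SMOTE-style interpolation.

    Hypothetical helper for illustration only; not the SCADI study's code.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of `base`, excluding `base` itself.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        partner = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(b + gap * (q - b) for b, q in zip(base, partner)))
    return synthetic


# Toy minority class with two features (assumed data).
minority_points = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.3), (1.1, 1.1)]
new_points = smote_like_oversample(minority_points, n_new=6)
```

Because every synthetic point is a convex combination of two existing minority samples, the new points stay inside the region already occupied by the minority class.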
Affiliation(s)
- Maya John
  - Artificial Intelligence and Data Analytics (AIDA) Lab, College of Computer and Information Sciences, Prince Sultan University, Riyadh, Saudi Arabia
- Hadil Shaiba
  - Department of Computer Science, College of Computer and Information Sciences, Princess Nourah bint Abdulrahman University, P.O. Box 84428, Riyadh 11671, Saudi Arabia
3
Liu Q, Chen Y, Xie P, Luo Y, Wang B, Meng Y, Zhong J, Mei J, Zou W. Development of a predictive machine learning model for pathogen profiles in patients with secondary immunodeficiency. BMC Med Inform Decis Mak 2024; 24:48. [PMID: 38350899 PMCID: PMC10863296 DOI: 10.1186/s12911-024-02447-w] [Received: 10/16/2023] [Accepted: 01/30/2024] [Indexed: 02/15/2024] Open
Abstract
BACKGROUND Secondary immunodeficiency can arise from various clinical conditions that include HIV infection, chronic diseases, malignancy and long-term use of immunosuppressives, which makes the suffering patients susceptible to all types of pathogenic infections. Other than HIV infection, the possible pathogen profiles in other aetiology-induced secondary immunodeficiency are largely unknown. METHODS Medical records of the patients with secondary immunodeficiency caused by various aetiologies were collected from the First Affiliated Hospital of Nanchang University, China. Based on these records, models were developed with the machine learning method to predict the potential infectious pathogens that may inflict the patients with secondary immunodeficiency caused by various disease conditions other than HIV infection. RESULTS Several metrics were used to evaluate the models' performance. A consistent conclusion can be drawn from all the metrics that Gradient Boosting Machine had the best performance with the highest accuracy at 91.01%, exceeding other models by 13.48, 7.14, and 4.49% respectively. CONCLUSIONS The models developed in our study enable the prediction of potential infectious pathogens that may affect the patients with secondary immunodeficiency caused by various aetiologies except for HIV infection, which will help clinicians make a timely decision on antibiotic use before microorganism culture results return.
Affiliation(s)
- Qianning Liu
  - School of Statistics, Jiangxi University of Finance and Economics, Nanchang, 330013, Jiangxi, China
- Yifan Chen
  - School of Statistics, Jiangxi University of Finance and Economics, Nanchang, 330013, Jiangxi, China
- Peng Xie
  - Department of Infectious Diseases, The First Affiliated Hospital, Jiangxi Medical College, Nanchang University, Nanchang, 330006, Jiangxi, China
- Ying Luo
  - Department of Infectious Diseases, The First Affiliated Hospital, Jiangxi Medical College, Nanchang University, Nanchang, 330006, Jiangxi, China
  - Department of Infectious Diseases, Third People's Hospital of Jiujiang, Jiujiang, 332000, Jiangxi, China
- Buxuan Wang
  - School of Statistics, Jiangxi University of Finance and Economics, Nanchang, 330013, Jiangxi, China
- Yuanxi Meng
  - The First Clinical Medical College, Jiangxi Medical College, Nanchang University, Nanchang, 330006, Jiangxi, China
- Jiaqian Zhong
  - The First Clinical Medical College, Jiangxi Medical College, Nanchang University, Nanchang, 330006, Jiangxi, China
- Jiaqi Mei
  - The First Clinical Medical College, Jiangxi Medical College, Nanchang University, Nanchang, 330006, Jiangxi, China
- Wei Zou
  - Department of Infectious Diseases, The First Affiliated Hospital, Jiangxi Medical College, Nanchang University, Nanchang, 330006, Jiangxi, China
4
Severinsen I, Yu W, Walmsley T, Young B. COVERT: A classless approach to generating balanced datasets for process modelling. ISA Trans 2024; 144:1-10. [PMID: 37951753 DOI: 10.1016/j.isatra.2023.10.031] [Received: 09/27/2022] [Revised: 09/04/2023] [Accepted: 10/27/2023] [Indexed: 11/14/2023]
Abstract
In this work, a classless oversampling technique, Covert, was developed to improve historical datasets from industrial processing plants to aid process modelling. Using kernel density estimation and nearest neighbour algorithms, sparse regions are identified and resampled, developing a more balanced dataset. When applied to a real dataset from a geothermal power plant, Covert outperforms current best practice (Smote) in uniformly populating the input feature space and generating credible data in the output variable. When used to develop a data-driven model Covert improved model accuracy by 20% when predicting outside the original data's feature space. Smote, however, reduced model accuracy by 6% in the same feature space. Developing reliable models of industrial processes continues to be a significant hurdle in developing a digital twin. Using Covert, existing imbalanced historical data can be used to extend the range of applicability of any process model.
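The idea of locating sparse regions with kernel density estimation can be sketched in one dimension. This is an illustration of density-based sparsity detection only, not the Covert algorithm, whose details the abstract does not give; the function names, the bandwidth of 1.0, and the temperature readings are all assumptions.

```python
import math


def gaussian_kde(points, x, bandwidth):
    """Gaussian kernel density estimate at x from one-dimensional points."""
    norm = 1.0 / (len(points) * bandwidth * math.sqrt(2.0 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points)


def sparsest_samples(points, bandwidth, n):
    """Return the n samples lying in the lowest-density regions."""
    return sorted(points, key=lambda p: gaussian_kde(points, p, bandwidth))[:n]


# Assumed toy data: a dense cluster near 50 and two rare operating points.
temperatures = [49.8, 50.0, 50.1, 50.2, 50.3, 49.9, 50.0, 62.0, 71.5]
sparse = sparsest_samples(temperatures, bandwidth=1.0, n=2)
```

Samples flagged this way mark the regions a resampling method would need to populate in order to balance the input feature space.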
Affiliation(s)
- Isaac Severinsen
  - Department of Chemical and Materials Engineering, University of Auckland, 5 Grafton Road, Auckland, 1010, New Zealand
- Wei Yu
  - Department of Chemical and Materials Engineering, University of Auckland, 5 Grafton Road, Auckland, 1010, New Zealand
- Timothy Walmsley
  - Ahuora - Centre for Smart Energy Systems, School of Engineering, The University of Waikato, Gate 8, Hillcrest Road, Hamilton, 3240, New Zealand
- Brent Young
  - Department of Chemical and Materials Engineering, University of Auckland, 5 Grafton Road, Auckland, 1010, New Zealand
5
Zhang W, Guan X, Jiao S, Wang G, Wang X. Development and validation of an artificial intelligence prediction model and a survival risk stratification for lung metastasis in colorectal cancer from highly imbalanced data: A multicenter retrospective study. Eur J Surg Oncol 2023; 49:107107. [PMID: 37883884 DOI: 10.1016/j.ejso.2023.107107] [Received: 07/14/2023] [Revised: 09/08/2023] [Accepted: 10/02/2023] [Indexed: 10/28/2023]
Abstract
BACKGROUND To assist clinicians with diagnosis and optimal treatment decision-making, we attempted to develop and validate an artificial intelligence prediction model for lung metastasis (LM) in colorectal cancer (CRC) patients. METHODS The clinicopathological characteristics of 46037 CRC patients from the Surveillance, Epidemiology, and End Results (SEER) database and 2779 CRC patients from a multi-center external validation set were collected retrospectively. After feature selection by univariate and multivariate analyses, six machine learning (ML) models, including logistic regression, K-nearest neighbor, support vector machine, decision tree, random forest, and balanced random forest (BRF), were developed and validated for LM prediction. In addition, LM patients stratified by risk score were used for survival analysis. RESULTS LM rates were extremely low, at 2.59% and 4.50% in the development and validation sets, respectively. As the imbalanced learning strategy, the BRF model performed best compared with the other models and the clinical predictor, with an area under the receiver operating characteristic curve (AUC) of 0.874 and an average precision (AP) of 0.184. Patients with LM in the high-risk group had significantly poorer survival (P<0.001) and failed to benefit from resection (P = 0.125). CONCLUSIONS In summary, we have utilized the BRF algorithm to develop an effective, non-invasive, and practical model for predicting LM in CRC patients based on highly imbalanced datasets. In addition, we have implemented a novel approach to stratify the survival risk of CRC patients with LM based on the output of the model.
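A balanced random forest is commonly described as growing each tree on a bootstrap sample that draws equally from both classes. The sampling step can be sketched as follows, assuming binary labels with 1 as the rare class; `balanced_bootstrap` and the toy labels are hypothetical and do not come from the study.

```python
import random


def balanced_bootstrap(labels, seed=0):
    """Draw a class-balanced bootstrap sample of row indices.

    The minority class is sampled with replacement, and the same number of
    rows is then drawn with replacement from the majority class.
    """
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == 1]
    majority = [i for i, y in enumerate(labels) if y == 0]
    n = len(minority)
    return rng.choices(minority, k=n) + rng.choices(majority, k=n)


# Assumed toy labels with 3% positives, close to the imbalance reported above.
labels = [1] * 3 + [0] * 97
sample = balanced_bootstrap(labels)
positives_in_sample = sum(labels[i] for i in sample)
```

Each tree then sees the rare class in half of its training rows, while different trees still see different majority-class rows.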
Affiliation(s)
- Weiyuan Zhang
  - Department of Colorectal Cancer Surgery, the Second Affiliated Hospital of Harbin Medical University, Harbin, 150000, China
- Xu Guan
  - Department of Colorectal Surgery, National Cancer Center/National Clinical Research Center for Cancer/Cancer Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, 100000, China; Department of Colorectal Surgery, Shanxi Province Cancer Hospital/Hospital Affiliated to Cancer Hospital, Chinese Academy of Medical Sciences/Cancer Hospital Affiliated to Shanxi Medical University, Taiyuan, 030000, China
- Shuai Jiao
  - Department of Colorectal Surgery, Shanxi Province Cancer Hospital/Hospital Affiliated to Cancer Hospital, Chinese Academy of Medical Sciences/Cancer Hospital Affiliated to Shanxi Medical University, Taiyuan, 030000, China
- Guiyu Wang
  - Department of Colorectal Cancer Surgery, the Second Affiliated Hospital of Harbin Medical University, Harbin, 150000, China
- Xishan Wang
  - Department of Colorectal Cancer Surgery, the Second Affiliated Hospital of Harbin Medical University, Harbin, 150000, China; Department of Colorectal Surgery, National Cancer Center/National Clinical Research Center for Cancer/Cancer Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, 100000, China; Department of Colorectal Surgery, Shanxi Province Cancer Hospital/Hospital Affiliated to Cancer Hospital, Chinese Academy of Medical Sciences/Cancer Hospital Affiliated to Shanxi Medical University, Taiyuan, 030000, China
6
Liu X, Lu J, Chen X, Fong YHC, Ma X, Zhang F. Attention based spatio-temporal graph convolutional network with focal loss for crash risk evaluation on urban road traffic network based on multi-source risks. Accid Anal Prev 2023; 192:107262. [PMID: 37598458 DOI: 10.1016/j.aap.2023.107262] [Received: 10/29/2022] [Revised: 04/17/2023] [Accepted: 08/06/2023] [Indexed: 08/22/2023]
Abstract
The urban road transportation has presented a high probability of crash occurrence, and the aim of the present study is to evaluate the crash risk for urban road networks. However, the irregular structure of urban road networks, the high-dimensional spatio-temporal correlations among multi-source risks (i.e., the contributing risks from traffic flow, meteorological conditions, road design, and so forth), and the issue of data imbalance have brought challenges to this topic. To solve these issues, an Attention based Spatio-Temporal Graph Convolutional Network (ASTGCN) model with focal loss function is used for the first time to evaluate crash risk on an urban road network. This work can be summarized as (1) adopting the spatio-temporal graph convolution structure to capture the spatio-temporal properties and characterize the multi-source risks; (2) utilizing an attention mechanism network to address the critical contributing risks during crash risk evaluation; (3) introducing the focal loss function to improve the model performance impacted by the imbalanced data; and (4) investigating the different contributions of multi-source risks to model performance. The evaluation performance is tested in a real-world urban road traffic network. The raw data consists of 1239 crash records with corresponding datasets of traffic flow characteristics, meteorological conditions, road attributes and the topological structure of the road network. At the same time, three baseline models Artificial Neural Network (ANN), Random Forest (RF), and Deep Spatio-Temporal Graph Convolutional Network (DSTGCN) are compared to the proposed ASTGCN on the same datasets. Overall, the results show that ASTGCN outperforms the baseline models in several evaluation metrics. ASTGCN with focal loss function further improves performance by tackling the issues of dataset imbalance. Additionally, it is also found that the traffic flow risk is most crucial to model performance. 
The findings of the present study indicate that the proposed model can efficiently evaluate dynamic crash risk for urban road networks, which will benefit the safety management of urban road transportation.
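The focal loss mentioned in this abstract down-weights examples the model already classifies well, so that rare positives dominate training. A minimal sketch of the standard binary form follows; the defaults `gamma=2.0` and `alpha=0.25` are common choices and are assumptions here, not values reported by the study.

```python
import math


def binary_focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Focal loss for one example: -alpha_t * (1 - p_t)**gamma * log(p_t).

    p is the predicted probability of the positive class and y is 0 or 1.
    gamma and alpha defaults are common choices, not values from the study.
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy_negative = binary_focal_loss(p=0.05, y=0)  # confident and correct
hard_positive = binary_focal_loss(p=0.05, y=1)  # a missed positive
```

With `gamma=0` the modulating factor disappears and the expression reduces to class-weighted cross-entropy, which is a convenient sanity check.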
Affiliation(s)
- Xian Liu
  - Jiangsu Key Laboratory of Urban ITS, Southeast University, Nanjing 211189, China; Jiangsu Province Collaborative Innovation Center of Modern Urban Traffic Technologies, Southeast University, Nanjing 211189, China; School of Transportation, Southeast University, Nanjing 211189, China
- Jian Lu
  - Jiangsu Key Laboratory of Urban ITS, Southeast University, Nanjing 211189, China; Jiangsu Province Collaborative Innovation Center of Modern Urban Traffic Technologies, Southeast University, Nanjing 211189, China; School of Transportation, Southeast University, Nanjing 211189, China
- Xiang Chen
  - School of Computing, University of Leeds, Leeds LS2 9JT, UK
- Xiaochi Ma
  - Jiangsu Key Laboratory of Urban ITS, Southeast University, Nanjing 211189, China; Jiangsu Province Collaborative Innovation Center of Modern Urban Traffic Technologies, Southeast University, Nanjing 211189, China; School of Transportation, Southeast University, Nanjing 211189, China
- Fang Zhang
  - Jiangsu Key Laboratory of Urban ITS, Southeast University, Nanjing 211189, China; Jiangsu Province Collaborative Innovation Center of Modern Urban Traffic Technologies, Southeast University, Nanjing 211189, China; School of Transportation, Southeast University, Nanjing 211189, China
7
Li Y, Yang Z, Xing L, Yuan C, Liu F, Wu D, Yang H. Crash injury severity prediction considering data imbalance: A Wasserstein generative adversarial network with gradient penalty approach. Accid Anal Prev 2023; 192:107271. [PMID: 37659275 DOI: 10.1016/j.aap.2023.107271] [Received: 11/28/2022] [Revised: 07/29/2023] [Accepted: 08/24/2023] [Indexed: 09/04/2023]
Abstract
Predicting the injury severity of each road crash event is necessary. However, predicting crash injury severity from imbalanced data frequently results in an ineffective classifier. Because severe injuries are rare in road traffic crashes, crash data are extremely imbalanced across injury severity classes, which makes training prediction models challenging. To achieve interclass balance, minority class samples can be generated using data augmentation techniques. To address the imbalance in crash injury severity data, this study applies a novel deep learning method, the Wasserstein generative adversarial network with gradient penalty (WGAN-GP), to a massive amount of crash data; the method generates synthetic injury severity data linked to traffic crashes to rebalance the dataset. To evaluate the effectiveness of the WGAN-GP model, we systematically compare the performance of various commonly used sampling techniques (random under-sampling, random over-sampling, synthetic minority over-sampling technique, and adaptive synthetic sampling) with respect to dataset balance and crash injury severity prediction. After rebalancing the dataset, this study classifies crash injury severity using logistic regression, multilayer perceptron, random forest, AdaBoost, and XGBoost. The AUC, specificity, and sensitivity are employed as evaluation indicators to compare prediction performance. Results demonstrate that sampling techniques can considerably improve the prediction performance for minority classes in an imbalanced dataset, and the combination of XGBoost and WGAN-GP performs best, with an AUC of 0.794 and a sensitivity of 0.698. Finally, the interpretability of the model is improved by the explainable machine learning technique SHAP (SHapley Additive exPlanations), allowing for a deeper understanding of the effects of each variable on crash injury severity. The findings of this study shed light on the prediction of crash injury severity with data imbalance using data-driven approaches.
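For orientation, the critic objective usually associated with WGAN-GP can be written as below. This is the general formulation of the technique, not an equation taken from this abstract; the symbols ($D$ for the critic, $\mathbb{P}_r$ and $\mathbb{P}_g$ for the real and generated distributions, $\lambda$ for the penalty weight) are notation introduced here for illustration.

```latex
L = \mathbb{E}_{\tilde{x} \sim \mathbb{P}_g}\left[ D(\tilde{x}) \right]
  - \mathbb{E}_{x \sim \mathbb{P}_r}\left[ D(x) \right]
  + \lambda \, \mathbb{E}_{\hat{x} \sim \mathbb{P}_{\hat{x}}}
    \left[ \left( \lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1 \right)^2 \right],
\qquad
\hat{x} = \epsilon x + (1 - \epsilon)\tilde{x}, \quad \epsilon \sim U[0, 1]
```

The last term is the gradient penalty: it pushes the norm of the critic's gradient toward 1 at points interpolated between real and generated samples.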
Affiliation(s)
- Ye Li
  - School of Traffic and Transportation Engineering, Central South University, Changsha, Hunan 410075, China; Hunan Key Laboratory of Smart Roadway and Cooperative Vehicle-Infrastructure Systems, Changsha University of Science & Technology, Changsha, 410114 Hunan, China
- Zhanhao Yang
  - School of Traffic and Transportation Engineering, Central South University, Changsha, Hunan 410075, China
- Lu Xing
  - School of Traffic and Transportation Engineering, Changsha University of Science and Technology, Changsha, Hunan 410114, China
- Chen Yuan
  - School of Traffic and Transportation Engineering, Central South University, Changsha, Hunan 410075, China; Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong, China
- Fei Liu
  - School of Traffic and Transportation Engineering, Central South University, Changsha, Hunan 410075, China
- Dan Wu
  - School of Traffic and Transportation Engineering, Central South University, Changsha, Hunan 410075, China
- Haifei Yang
  - School of Civil and Transportation Engineering, Hohai University, Nanjing, Jiangsu 210098, China
8
Li X, Luo G, Wang W, Wang K, Li S. Curriculum label distribution learning for imbalanced medical image segmentation. Med Image Anal 2023; 89:102911. [PMID: 37542795 DOI: 10.1016/j.media.2023.102911] [Received: 06/28/2022] [Revised: 04/27/2023] [Accepted: 07/25/2023] [Indexed: 08/07/2023]
Abstract
Label distribution learning (LDL) has the potential to resolve boundary ambiguity in semantic segmentation tasks. However, existing LDL-based segmentation methods suffer from severe label distribution imbalance: the ambiguous label distributions contain a small fraction of the data, while the unambiguous label distributions occupy the majority of the data. The imbalanced label distributions induce model-biased distribution learning and make it challenging to accurately predict ambiguous pixels. In this paper, we propose a curriculum label distribution learning (CLDL) framework to address the above data imbalance problem by performing a novel task-oriented curriculum learning strategy. Firstly, region label distribution learning (R-LDL) is proposed to construct more balanced label distributions and improve the imbalanced model learning. Secondly, a novel learning curriculum (TCL) is proposed to enable easy-to-hard learning in LDL-based segmentation by decomposing the segmentation task into multiple label distribution estimation tasks. Thirdly, the prior perceiving module (PPM) is proposed to effectively connect easy and hard learning stages based on the priors generated from easier stages. Benefiting from the balanced label distribution construction and prior perception, the proposed CLDL effectively conducts curriculum learning-based LDL and significantly improves the imbalanced learning. We evaluated the proposed CLDL using the publicly available BRATS2018 and MM-WHS2017 datasets. The experimental results demonstrate that our method significantly improves different segmentation metrics compared to many state-of-the-art methods. The code will be available.
Affiliation(s)
- Xiangyu Li
  - School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
- Gongning Luo
  - School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
- Wei Wang
  - School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China
- Kuanquan Wang
  - School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
- Shuo Li
  - Department of Biomedical Engineering, Case Western Reserve University, Cleveland, OH 44106, USA
9
Zeinolabedini Rezaabad M, Lacey H, Marshall L, Johnson F. Influence of resampling techniques on Bayesian network performance in predicting increased algal activity. Water Res 2023; 244:120558. [PMID: 37666153 DOI: 10.1016/j.watres.2023.120558] [Received: 03/23/2023] [Revised: 08/10/2023] [Accepted: 08/30/2023] [Indexed: 09/06/2023]
Abstract
Early warning of increased algal activity is important to mitigate potential impacts on aquatic life and human health. While many methods have been developed to predict increased algal activity, an ongoing issue is that severe algal blooms often occur with low frequency in water bodies. This results in imbalanced data sets available for model specification, leading to poor predictions of the frequency of increased algal activity. One approach to address this is to resample data sets of increased algal activity to increase the prevalence of higher than normal algal activity in calibration data and ultimately improve model predictions. This study aims to investigate the use of resampling techniques to address the imbalanced dataset and determine if such methods can improve the prediction of increased algal activity. Three techniques were investigated, Kmeans under-sampling (US_Kmeans), synthetic minority over-sampling technique (SMOTE), and 'SMOTE and cluster-based under-sampling technique' (SCUT). The resampling methods were applied to a Bayesian network (BN) model of Lake Burragorang in New South Wales, Australia. The model was developed to predict chlorophyll-a (chl-a) using a range of water quality parameters as predictors. The original data and each of the balanced datasets were used for BN structures and parameter learning. The results showed that the best graphical structure was obtained by adding synthetic data from SMOTE with the highest true positive rate (TPR) and area under the curve (AUC). When compared using a fixed graphical structure for the BN, all resampling techniques increased the ability of the BN to detect events with higher probability of increased algal activity. The resampling model results can also be used to better understand the most important influences on high chl-a concentrations and suggest future data collection and model development priorities.
Affiliation(s)
- Maryam Zeinolabedini Rezaabad
  - Water Research Centre, School of Civil and Environmental Engineering, University of New South Wales, Kensington, New South Wales, Australia; ARC Training Centre Data Analytics for Resources and Environments, School of Life and Environmental Sciences, The University of Sydney, Camperdown, New South Wales, Australia
- Lucy Marshall
  - Water Research Centre, School of Civil and Environmental Engineering, University of New South Wales, Kensington, New South Wales, Australia; ARC Training Centre Data Analytics for Resources and Environments, School of Life and Environmental Sciences, The University of Sydney, Camperdown, New South Wales, Australia; Faculty of Science and Engineering, Macquarie University, North Ryde, New South Wales, Australia
- Fiona Johnson
  - Water Research Centre, School of Civil and Environmental Engineering, University of New South Wales, Kensington, New South Wales, Australia; ARC Training Centre Data Analytics for Resources and Environments, School of Life and Environmental Sciences, The University of Sydney, Camperdown, New South Wales, Australia
10
Ghavidel A, Pazos P. Machine learning (ML) techniques to predict breast cancer in imbalanced datasets: a systematic review. J Cancer Surviv 2023:10.1007/s11764-023-01465-3. [PMID: 37749361 DOI: 10.1007/s11764-023-01465-3] [Received: 08/02/2023] [Accepted: 09/09/2023] [Indexed: 09/27/2023]
Abstract
Knowledge discovery in databases (KDD) is crucial in analyzing data to extract valuable insights. In medical outcome prediction, KDD is increasingly applied, particularly in diseases with high incidence, mortality, and costs, like cancer. ML techniques can develop more accurate predictive models for cancer patients' clinical outcomes, aiding informed healthcare decision-making. However, cancer prediction modeling faces challenges because of the unbalanced nature of the datasets, where there is a small minority category of patients with a cancer diagnosis compared to a majority category of cancer-free patients. Imbalanced datasets pose statistical hurdles like bias and overfitting when developing accurate prediction models. This systematic review focuses on breast cancer prediction articles published from 2008 to 2023. The objective is to examine ML methods used in three critical steps of KDD: preprocessing, data mining, and interpretation which address the imbalanced data problem in breast cancer prediction. This work synthesizes prior research in ML methods for breast cancer prediction. The findings help identify effective preprocessing strategies, including balancing and feature selection methods, robust predictive models, and evaluation metrics of those models. The study aims to inform healthcare providers and researchers about effective techniques for accurate breast cancer prediction.
Affiliation(s)
- Arman Ghavidel
  - Engineering Management and Systems Engineering, Old Dominion University, Norfolk, VA, USA
- Pilar Pazos
  - Engineering Management and Systems Engineering, Old Dominion University, Norfolk, VA, USA
11
Ebrahimi A, Wiil UK, Baskaran R, Peimankar A, Andersen K, Nielsen AS. AUD-DSS: a decision support system for early detection of patients with alcohol use disorder. BMC Bioinformatics 2023; 24:329. [PMID: 37658294 PMCID: PMC10474761 DOI: 10.1186/s12859-023-05450-6] [Received: 01/20/2023] [Accepted: 08/21/2023] [Indexed: 09/03/2023] Open
Abstract
BACKGROUND Alcohol use disorder (AUD) causes significant morbidity, mortality, and injuries. According to reports, approximately 5% of all registered deaths in Denmark could be due to AUD. The problem is compounded by the late identification of patients with AUD, a situation that can cause enormous problems, from psychological to physical to economic problems. Many individuals suffering from AUD never undergo specialist treatment during their addiction due to obstacles such as taboo and the poor performance of current screening tools. Therefore, there is a lack of rapid intervention. This can be mitigated by the early detection of patients with AUD. A clinical decision support system (DSS) powered by machine learning (ML) methods can be used to diagnose patients' AUD status earlier. METHODS This study proposes an effective AUD prediction model (AUDPM), which can be used in a DSS. The proposed model consists of four distinct components: (1) imputation to address missing values using the k-nearest neighbours approach, (2) recursive feature elimination with cross validation to select the most relevant subset of features, (3) a hybrid synthetic minority oversampling technique-edited nearest neighbour approach to remove noise and balance the distribution of the training data, and (4) an ML model for the early detection of patients with AUD. Two data sources, including a questionnaire and electronic health records of 2571 patients, were collected from Odense University Hospital in the Region of Southern Denmark for the AUD-Dataset. Then, the AUD-Dataset was used to build ML models. The results of different ML models, such as support vector machine, K-nearest neighbour, decision tree, random forest, and extreme gradient boosting, were compared. Finally, a combination of all these models in an ensemble learning approach was selected for the AUDPM. 
RESULTS The results revealed that the proposed ensemble AUDPM outperformed other single models and our previous study results, achieving 0.96, 0.94, 0.95, and 0.97 precision, recall, F1-score, and accuracy, respectively. In addition, we designed and developed an AUD-DSS prototype. CONCLUSION It was shown that our proposed AUDPM achieved high classification performance. In addition, we identified clinical factors related to the early detection of patients with AUD. The designed AUD-DSS is intended to be integrated into the existing Danish health care system to provide novel information to clinical staff if a patient shows signs of harmful alcohol use; in other words, it gives staff a good reason for having a conversation with patients for whom a conversation is relevant.
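The first component of the pipeline described above, k-nearest-neighbour imputation, can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function name `knn_impute`, the choice of `k=2`, the distance definition, and the toy rows are all assumptions made for illustration, and library imputers may scale distances differently.

```python
import math


def knn_impute(rows, k=2):
    """Fill None entries with the mean of that feature over the k nearest rows that observe it.

    Assumes every feature is observed in at least one other row.
    """
    def distance(a, b):
        # Root-mean-square difference over the features observed in both rows
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    imputed = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is None:
                # Candidate donors: other rows in which feature j is observed
                donors = [r for m, r in enumerate(rows) if m != i and r[j] is not None]
                donors.sort(key=lambda r: distance(row, r))
                nearest = donors[:k]
                imputed[i][j] = sum(r[j] for r in nearest) / len(nearest)
    return imputed


# Hypothetical toy data: the last row is missing its second feature.
rows = [
    [1.0, 10.0],
    [1.1, 12.0],
    [5.0, 50.0],
    [1.05, None],
]
filled = knn_impute(rows, k=2)
```

In this toy case the missing value is filled from the two rows closest on the first feature, so the distant third row does not influence the result.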
Affiliation(s)
- Ali Ebrahimi
  - SDU Health Informatics and Technology, The Maersk Mc-Kinney Moller Institute, University of Southern Denmark, Odense, Denmark
- Uffe Kock Wiil
  - SDU Health Informatics and Technology, The Maersk Mc-Kinney Moller Institute, University of Southern Denmark, Odense, Denmark
- Ruben Baskaran
  - SDU Health Informatics and Technology, The Maersk Mc-Kinney Moller Institute, University of Southern Denmark, Odense, Denmark
- Abdolrahman Peimankar
  - SDU Health Informatics and Technology, The Maersk Mc-Kinney Moller Institute, University of Southern Denmark, Odense, Denmark
- Kjeld Andersen
  - Unit for Clinical Alcohol Research, Clinical Institute, University of Southern Denmark, Odense, Denmark
- Anette Søgaard Nielsen
  - Unit for Clinical Alcohol Research, Clinical Institute, University of Southern Denmark, Odense, Denmark
|
12
|
Liu S, Roemer F, Ge Y, Bedrick EJ, Li ZM, Guermazi A, Sharma L, Eaton C, Hochberg MC, Hunter DJ, Nevitt MC, Wirth W, Kent Kwoh C, Sun X. Comparison of evaluation metrics of deep learning for imbalanced imaging data in osteoarthritis studies. Osteoarthritis Cartilage 2023; 31:1242-1248. [PMID: 37209993 PMCID: PMC10524686 DOI: 10.1016/j.joca.2023.05.006] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/14/2022] [Revised: 04/14/2023] [Accepted: 05/12/2023] [Indexed: 05/22/2023]
Abstract
PURPOSE To compare the evaluation metrics for deep learning methods that were developed using imbalanced imaging data in osteoarthritis studies. MATERIALS AND METHODS This retrospective study utilized 2996 sagittal intermediate-weighted fat-suppressed knee MRIs with MRI Osteoarthritis Knee Score readings from 2467 participants in the Osteoarthritis Initiative study. We obtained probabilities of the presence of bone marrow lesions (BMLs) from MRIs in the testing dataset at the sub-region (15 sub-regions), compartment, and whole-knee levels based on the trained deep learning models. We compared different evaluation metrics (e.g., receiver operating characteristic (ROC) and precision-recall (PR) curves) in the testing dataset with various class ratios (presence of BMLs vs. absence of BMLs) at these three data levels to assess the model's performance. RESULTS In a subregion with an extremely high imbalance ratio, the model achieved a ROC-AUC of 0.84, a PR-AUC of 0.10, a sensitivity of 0, and a specificity of 1. CONCLUSION The commonly used ROC curve is not sufficiently informative, especially in the case of imbalanced data. We provide the following practical suggestions based on our data analysis: 1) ROC-AUC is recommended for balanced data, 2) PR-AUC should be used for moderately imbalanced data (i.e., when the proportion of the minor class is above 5% and less than 50%), and 3) for severely imbalanced data (i.e., when the proportion of the minor class is below 5%), it is not practical to apply a deep learning model, even with the application of techniques addressing imbalanced data issues.
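The gap between ROC-AUC and PR-AUC under heavy imbalance can be reproduced on a toy example. The following is a minimal pure-Python sketch, not the authors' evaluation code: the function names `roc_auc` and `average_precision` and the synthetic labels and scores (2 positives among 100 samples) are assumptions for illustration, and PR-AUC is approximated here by average precision.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision at the rank of each positive (assumes no tied scores)."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits = 0
    total = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


# Synthetic data: 98 negatives scored 0..97, 2 positives ranked 11th and 22nd.
labels = [0] * 98 + [1, 1]
scores = [float(i) for i in range(98)] + [87.5, 77.5]

auc = roc_auc(labels, scores)           # about 0.85
ap = average_precision(labels, scores)  # about 0.09
```

Both positives outrank most negatives, so ROC-AUC looks strong, yet ten or more negatives sit above each positive, so precision at those ranks is low. This mirrors the pattern in the abstract, where a sub-region reached a ROC-AUC of 0.84 with a PR-AUC of 0.10.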
Affiliation(s)
- Shen Liu
  - Department of Epidemiology and Biostatistics, University of Arizona, 1295 N. Martin Ave., Tucson, AZ 85724, USA
- Frank Roemer
  - Department of Radiology, University of Erlangen - Nuremberg, Erlangen, Germany; Department of Radiology, Boston University School of Medicine, MA, USA
- Yong Ge
  - Department of Management Information Systems, University of Arizona, AZ, USA
- Edward J Bedrick
  - Department of Epidemiology and Biostatistics, University of Arizona, 1295 N. Martin Ave., Tucson, AZ 85724, USA
- Zong-Ming Li
  - University of Arizona Arthritis Center, University of Arizona College of Medicine, Tucson, AZ, USA
- Ali Guermazi
  - Department of Radiology, Boston University School of Medicine, MA, USA
- Leena Sharma
  - Feinberg School of Medicine, Northwestern University, IL, USA
- Charles Eaton
  - Kent Memorial Hospital, and Department of Family Medicine, Warren Alpert Medical School, and Department of Epidemiology, School of Public Health, Brown University, RI, USA
- Marc C Hochberg
  - School of Medicine, University of Maryland, and Medical Care Clinical Center, VA Maryland Health Care System, Baltimore, MD, USA
- David J Hunter
  - Sydney Musculoskeletal Health, Kolling Institute, Faculty of Medicine and Health, The University of Sydney, Sydney, 2065 NSW, Australia, and Rheumatology Department, Royal North Shore Hospital, St Leonards, NSW 2065 Australia
- Michael C Nevitt
  - Department of Epidemiology and Biostatistics, University of California San Francisco, CA, USA
- Wolfgang Wirth
  - Department of Imaging & Functional Musculoskeletal Research, Institute of Anatomy & Cell Biology, Paracelsus Medical University Salzburg & Nuremberg, Salzburg, Austria, and Ludwig Boltzmann Inst. for Arthritis and Rehabilitation, Paracelsus Medical University Salzburg & Nuremberg, Salzburg, Austria, and Chondrometrics GmbH, Ainring, Germany
- C Kent Kwoh
  - University of Arizona Arthritis Center, University of Arizona College of Medicine, Tucson, AZ, USA
- Xiaoxiao Sun
  - Department of Epidemiology and Biostatistics, University of Arizona, 1295 N. Martin Ave., Tucson, AZ 85724, USA
|
13
|
Wang Z, An T, Wang W, Fan S, Chen L, Tian X. Qualitative and quantitative detection of aflatoxins B1 in maize kernels with fluorescence hyperspectral imaging based on the combination method of boosting and stacking. Spectrochim Acta A Mol Biomol Spectrosc 2023; 296:122679. [PMID: 37011441 DOI: 10.1016/j.saa.2023.122679] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/26/2023] [Revised: 03/17/2023] [Accepted: 03/26/2023] [Indexed: 06/19/2023]
Abstract
Aflatoxin B1 (AFB1) is the most widespread, toxic, and harmful toxin. In this study, a fluorescence hyperspectral imaging (HSI) system was employed for AFB1 detection, and an under-sampling stacking (USS) algorithm was developed for imbalanced data. The USS method combined with ANOVA-based selection of feature wavelengths achieved the best performance, with an accuracy of 0.98 at the 20 or 50 μg/kg threshold using endosperm-side spectra. For the quantitative analysis, a specified function was used to compress the AFB1 content, and a combination of boosting and stacking was used for regression. The best results, with a correlation coefficient of prediction (Rp) of 0.86, were obtained with support vector regression (SVR)-Boosting, Adaptive Boosting (AdaBoost), and extremely randomized trees (Extra-Trees)-Boosting as the base learners and the K-nearest neighbors (KNN) algorithm as the meta learner. These results provide a basis for developing AFB1 detection and estimation technologies.
Affiliation(s)
- Zheli Wang
  - College of Information and Electrical Engineering, China Agricultural University, Beijing 100083, China; Research Center of Intelligent Equipment, Beijing Academy of Agriculture and Forestry Sciences, Beijing 100097, China
- Ting An
  - Research Center of Intelligent Equipment, Beijing Academy of Agriculture and Forestry Sciences, Beijing 100097, China
- Wenchao Wang
  - Research Center of Intelligent Equipment, Beijing Academy of Agriculture and Forestry Sciences, Beijing 100097, China
- Shuxiang Fan
  - Research Center of Intelligent Equipment, Beijing Academy of Agriculture and Forestry Sciences, Beijing 100097, China
- Liping Chen
  - College of Information and Electrical Engineering, China Agricultural University, Beijing 100083, China; Research Center of Intelligent Equipment, Beijing Academy of Agriculture and Forestry Sciences, Beijing 100097, China
- Xi Tian
  - Research Center of Intelligent Equipment, Beijing Academy of Agriculture and Forestry Sciences, Beijing 100097, China
|
14
|
Hassanzadeh R, Farhadian M, Rafieemehr H. Hospital mortality prediction in traumatic injuries patients: comparing different SMOTE-based machine learning algorithms. BMC Med Res Methodol 2023; 23:101. [PMID: 37087425 PMCID: PMC10122327 DOI: 10.1186/s12874-023-01920-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/06/2022] [Accepted: 04/13/2023] [Indexed: 04/24/2023] Open
Abstract
BACKGROUND Trauma is one of the most critical public health issues worldwide, leading to death and disability and affecting all age groups. Therefore, there is great interest in models for predicting mortality in trauma patients admitted to the ICU. The main objective of the present study is to develop and evaluate SMOTE-based machine-learning tools for predicting hospital mortality in trauma patients with imbalanced data. METHODS This retrospective cohort study was conducted on 126 trauma patients admitted to an intensive care unit at Besat hospital in Hamadan Province, western Iran, from March 2020 to March 2021. Data were extracted from the patients' medical records. Because the data were imbalanced, SMOTE techniques, namely SMOTE, Borderline-SMOTE1, Borderline-SMOTE2, SMOTE-NC, and SVM-SMOTE, were used for primary preprocessing. Then, the Decision Tree (DT), Random Forest (RF), Naive Bayes (NB), Artificial Neural Network (ANN), Support Vector Machine (SVM), and Extreme Gradient Boosting (XGBoost) methods were used to predict hospital mortality in patients with traumatic injuries. Performance was evaluated by sensitivity, specificity, Positive Predictive Value (PPV), Negative Predictive Value (NPV), accuracy, Area Under the Curve (AUC), Geometric Mean (G-means), F1 score, and the P-value of McNemar's test. RESULTS Of the 126 patients admitted to an ICU, 117 (92.9%) survived and 9 (7.1%) died. The mean follow-up time from the date of trauma to the date of outcome was 3.98 ± 4.65 days. The ML algorithms performed poorly on the imbalanced data, whereas the SMOTE-based ML algorithms performed significantly better. The mean area under the ROC curve (AUC) of all SMOTE-based models was more than 91%. F1-score and G-means before balancing the dataset were below 70% for all ML models except ANN. In contrast, F1-score and G-means for the balanced datasets reached more than 90% for all SMOTE-based models. Among all SMOTE-based ML methods, RF and ANN based on SMOTE and XGBoost based on SMOTE-NC achieved the highest values for all evaluation criteria. CONCLUSIONS This study has shown that SMOTE-based ML algorithms predict outcomes in traumatic injuries better than ML algorithms trained without SMOTE. They have the potential to assist ICU physicians in making clinical decisions.
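The idea shared by the SMOTE variants named above is to create synthetic minority samples by interpolating between a minority sample and one of its nearest minority neighbours. A minimal pure-Python sketch follows, assuming numeric features only. It is not the study's implementation: `smote_sketch`, its parameters, and the toy points are hypothetical, and Borderline-SMOTE, SMOTE-NC, and SVM-SMOTE differ in how seed points and neighbours are chosen or how categorical features are handled.

```python
import math
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points on segments between minority samples and their neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the base point (the base itself is excluded)
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Hypothetical minority-class samples with two numeric features.
minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5)]
new_points = smote_sketch(minority, n_new=6)
```

Because every synthetic point lies on a segment between two real minority samples, the new points stay inside the region already occupied by the minority class. Oversampling of this kind is normally applied to the training portion only, so that evaluation data contain no synthetic cases.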
Affiliation(s)
- Roghayyeh Hassanzadeh
  - Department of Biostatistics, School of Public Health, Hamadan University of Medical Sciences, Hamadan, Iran
- Maryam Farhadian
  - Research Center for Health Sciences, Department of Biostatistics, School of Public Health, Hamadan University of Medical Sciences, Hamadan, Iran
- Hassan Rafieemehr
  - Department of Medical Laboratory Sciences, School of Paramedicine, Hamadan University of Medical Sciences, Hamadan, Iran
|
15
|
Wang X, Ren H, Ren J, Song W, Qiao Y, Ren Z, Zhao Y, Linghu L, Cui Y, Zhao Z, Chen L, Qiu L. Machine learning-enabled risk prediction of chronic obstructive pulmonary disease with unbalanced data. Comput Methods Programs Biomed 2023; 230:107340. [PMID: 36640604 DOI: 10.1016/j.cmpb.2023.107340] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/09/2022] [Revised: 11/25/2022] [Accepted: 01/04/2023] [Indexed: 06/17/2023]
Abstract
BACKGROUND AND OBJECTIVE Because the early symptoms of chronic obstructive pulmonary disease (COPD) are not obvious, patients are not easily identified, which delays prevention and treatment. In the present study, machine learning (ML) methods were employed to construct a risk prediction model for COPD to improve its prediction efficiency. METHODS We collected data from a sample of 5807 cases with a complete COPD diagnosis from the 2019 COPD Surveillance Program in Shanxi Province and extracted 34 potentially relevant variables from the dataset. First, we used feature selection methods (i.e., Generalized elastic net, Lasso, and Adaptive lasso) to select ten variables. Next, we built supervised classifiers for class-imbalanced data by combining cost-sensitive learning and SMOTE resampling with each of the ML methods (Logistic Regression, SVM, Random Forest, XGBoost, LightGBM, NGBoost, and Stacking). Finally, we assessed their performance. RESULTS Frequent cough at or before age 14 and nine other variables were significant parameters for COPD. The Stacking heterogeneous ensemble model showed relatively good performance on the unbalanced datasets. Logistic Regression with class weighting achieved the best classification performance after balancing when the composite indicators (AUC, F1-Score, and G-mean) were used as criteria for model comparison. The F1-Score and G-mean values for the top three ML models were 0.290/0.660 for Logistic Regression with class weighting, 0.288/0.649 for Stacking with synthetic minority oversampling technique (SMOTE), and 0.285/0.648 for LightGBM with SMOTE. CONCLUSIONS By combining feature selection methods, unbalanced data processing methods, and machine learning methods with data from disease surveillance questionnaires and physical measurements to identify people at risk of COPD, this paper concluded that machine learning models based on survey questionnaires could provide automated identification of patients at risk of COPD and a simple, scientific aid for its early identification.
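Class weighting and the F1-score and G-mean criteria mentioned above reduce to short formulas. The sketch below assumes the common "balanced" heuristic, in which class c receives weight N / (K * n_c) for N samples, K classes, and n_c samples in class c. The abstract does not state which weighting scheme was used, and the function names and all counts here are hypothetical.

```python
import math


def balanced_class_weights(counts):
    """Weight each class by N / (K * n_c), so rarer classes receive larger weights."""
    total = sum(counts.values())
    k = len(counts)
    return {c: total / (k * n) for c, n in counts.items()}


def f1_and_gmean(tp, fp, fn, tn):
    """Return (F1-score, G-mean) from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)        # sensitivity
    specificity = tn / (tn + fp)
    f1 = 2 * precision * recall / (precision + recall)
    gmean = math.sqrt(recall * specificity)
    return f1, gmean


# Hypothetical counts, not taken from the study.
weights = balanced_class_weights({"no_copd": 900, "copd": 100})
f1, gmean = f1_and_gmean(tp=60, fp=240, fn=40, tn=660)
```

With these made-up counts the minority class gets a weight of 5.0, and the classifier scores an F1 of 0.30 alongside a G-mean of about 0.66. F1 depends on precision, which falls when false positives outnumber a small positive class, whereas G-mean depends only on sensitivity and specificity, so the two can diverge on imbalanced data.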
Affiliation(s)
- Xuchun Wang
  - Department of Health Statistics, School of Public Health, Shanxi Medical University, 56 South XinJian Road, Taiyuan, Shanxi 030001, China
- Hao Ren
  - Department of Health Statistics, School of Public Health, Shanxi Medical University, 56 South XinJian Road, Taiyuan, Shanxi 030001, China
- Jiahui Ren
  - Department of Health Statistics, School of Public Health, Shanxi Medical University, 56 South XinJian Road, Taiyuan, Shanxi 030001, China
- Wenzhu Song
  - Department of Health Statistics, School of Public Health, Shanxi Medical University, 56 South XinJian Road, Taiyuan, Shanxi 030001, China
- Yuchao Qiao
  - Department of Health Statistics, School of Public Health, Shanxi Medical University, 56 South XinJian Road, Taiyuan, Shanxi 030001, China
- Zeping Ren
  - Shanxi Centre for Disease Control and Prevention, Taiyuan, Shanxi 030012, China
- Ying Zhao
  - Shanxi Centre for Disease Control and Prevention, Taiyuan, Shanxi 030012, China
- Liqin Linghu
  - Department of Health Statistics, School of Public Health, Shanxi Medical University, 56 South XinJian Road, Taiyuan, Shanxi 030001, China; Shanxi Centre for Disease Control and Prevention, Taiyuan, Shanxi 030012, China
- Yu Cui
  - Department of Health Statistics, School of Public Health, Shanxi Medical University, 56 South XinJian Road, Taiyuan, Shanxi 030001, China
- Zhiyang Zhao
  - Department of Health Statistics, School of Public Health, Shanxi Medical University, 56 South XinJian Road, Taiyuan, Shanxi 030001, China
- Limin Chen
  - The Fifth Hospital (Shanxi People's Hospital) of Shanxi Medical University, No. 29, Shuangtaji Street, Taiyuan, Shanxi 030012, China
- Lixia Qiu
  - Department of Health Statistics, School of Public Health, Shanxi Medical University, 56 South XinJian Road, Taiyuan, Shanxi 030001, China
|
16
|
Sato H, Kimura Y, Ohba M, Ara Y, Wakabayashi S, Watanabe H. Prediction of Prednisolone Dose Correction Using Machine Learning. J Healthc Inform Res 2023; 7:84-103. [PMID: 36910914 DOI: 10.1007/s41666-023-00128-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/21/2022] [Revised: 11/20/2022] [Accepted: 02/03/2023] [Indexed: 02/17/2023]
Abstract
Wrong dose, a common prescription error, can cause serious patient harm, especially in the case of high-risk drugs like oral corticosteroids. This study aims to build a machine learning model to predict dose-related prescription modifications for oral prednisolone tablets (i.e., highly imbalanced data with very few positive cases). Prescription data were obtained from the electronic medical records at a single institute. Cluster analysis classified the clinical departments into six clusters with similar patterns of prednisolone prescription. Two patterns of training datasets were created, with and without preprocessing by the SMOTE method. Five ML models (SVM, KNN, GB, RF, and BRF) and logistic regression (LR) models were constructed in Python. The models were internally validated by five-fold stratified cross-validation and then validated on a 30% holdout test dataset. In total, 82,553 prescription records for prednisolone tablets, containing 135 dose-corrected positive cases, were obtained. In the original dataset (without SMOTE), only the BRF model showed good performance (in the test dataset, ROC-AUC: 0.917, recall: 0.951). In the training dataset preprocessed by SMOTE, performance improved for all models. The best-performing models with SMOTE were SVM (in the test dataset, ROC-AUC: 0.820, recall: 0.659) and BRF (ROC-AUC: 0.814, recall: 0.634). Although the prescribing data for dose-related correction are highly imbalanced, the following techniques allowed us to build high-performance prediction models: data preprocessing by SMOTE, stratified cross-validation, and a BRF classifier designed for imbalanced data. ML is useful in complicated dose audits such as those for oral prednisolone. Supplementary Information The online version contains supplementary material available at 10.1007/s41666-023-00128-3.
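With only 135 positives among 82,553 records, a purely random split can leave some folds with very few positive cases; stratification avoids this by distributing each class across the folds separately. The following pure-Python sketch shows five-fold stratified assignment using the class counts from the abstract. The function `stratified_folds` is hypothetical and is not the authors' code, and for simplicity it ignores the 30% holdout.

```python
import random
from collections import Counter


def stratified_folds(labels, n_folds=5, seed=0):
    """Assign each sample a fold index so that every class is spread evenly across folds."""
    rng = random.Random(seed)
    folds = [None] * len(labels)
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, idx in enumerate(indices):
            folds[idx] = position % n_folds  # round-robin within the class
    return folds


# Class counts from the abstract: 135 dose-corrected cases among 82,553 prescriptions.
labels = [1] * 135 + [0] * (82553 - 135)
folds = stratified_folds(labels)
positives_per_fold = Counter(f for f, y in zip(folds, labels) if y == 1)
```

Each of the five folds receives exactly 27 positive cases, so every validation fold has the same minority-class share as the full dataset.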
|
17
|
Wang Z, Stavrakis S, Yao B. Hierarchical deep learning with Generative Adversarial Network for automatic cardiac diagnosis from ECG signals. Comput Biol Med 2023; 155:106641. [PMID: 36773553 DOI: 10.1016/j.compbiomed.2023.106641] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 10/27/2022] [Revised: 01/11/2023] [Accepted: 02/05/2023] [Indexed: 02/10/2023]
Abstract
Cardiac disease is the leading cause of death in the US. Accurate heart disease detection is critical to timely medical treatment to save patients' lives. Routine use of the electrocardiogram (ECG) is the most common method for physicians to assess the cardiac electrical activities and detect possible abnormal conditions. Fully utilizing the ECG data for reliable heart disease detection depends on developing effective analytical models. In this paper, we propose a two-level hierarchical deep learning framework with Generative Adversarial Network (GAN) for ECG signal analysis. The first-level model is composed of a Memory-Augmented Deep AutoEncoder with GAN (MadeGAN), which aims to differentiate abnormal signals from normal ECGs for anomaly detection. The second-level learning aims at robust multi-class classification for different arrhythmia identification, which is achieved by integrating the transfer learning technique to transfer knowledge from the first-level learning with the multi-branching architecture to handle the data-lacking and imbalanced data issues. We evaluate the performance of the proposed framework using real-world ECG data from the MIT-BIH arrhythmia database. Experimental results show that our proposed model outperforms existing methods that are commonly used in current practice.
Affiliation(s)
- Zekai Wang
  - Department of Industrial & Systems Engineering, The University of Tennessee, Knoxville, TN, 37996, USA
- Stavros Stavrakis
  - University of Oklahoma Health Sciences Center, Oklahoma City, OK 73104, USA
- Bing Yao
  - Department of Industrial & Systems Engineering, The University of Tennessee, Knoxville, TN, 37996, USA
|
18
|
Mafarja M, Thaher T, Al-Betar MA, Too J, Awadallah MA, Abu Doush I, Turabieh H. Classification framework for faulty-software using enhanced exploratory whale optimizer-based feature selection scheme and random forest ensemble learning. Appl Intell 2023; 53:1-43. [PMID: 36785593 PMCID: PMC9909674 DOI: 10.1007/s10489-022-04427-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 12/23/2022] [Indexed: 02/11/2023]
Abstract
Software Fault Prediction (SFP) is an important process for detecting faulty components of software, such as faulty classes or modules, early in the software development life cycle. In this paper, a machine learning framework is proposed for SFP. Initially, pre-processing and re-sampling techniques are applied to make the SFP datasets ready to be used by ML techniques. Thereafter, seven classifiers are compared, namely K-Nearest Neighbors (KNN), Naive Bayes (NB), Linear Discriminant Analysis (LDA), Linear Regression (LR), Decision Tree (DT), Support Vector Machine (SVM), and Random Forest (RF). The RF classifier outperforms all other classifiers in terms of eliminating irrelevant/redundant features. The performance of RF is improved further using a dimensionality reduction method called binary whale optimization algorithm (BWOA) to eliminate the irrelevant/redundant features. Finally, the performance of BWOA is enhanced by hybridizing the exploration strategies of the grey wolf optimizer (GWO) and Harris hawks optimization (HHO) algorithms. The proposed method is called SBEWOA. The SFP datasets are sixteen datasets selected from the PROMISE repository, covering software projects of different sizes and complexity. A comparative evaluation against nine well-established feature selection methods shows that the proposed SBEWOA produces significantly better results for several of the evaluated datasets. The algorithms' performance is compared in terms of accuracy, the number of features, and fitness function, and the differences are supported by the two-tailed P-values of the Wilcoxon signed-rank test. In conclusion, the proposed method is an efficient alternative ML method for SFP that can be used for similar problems in the software engineering domain.
Affiliation(s)
- Majdi Mafarja
  - Department of Computer Science, Birzeit University, Birzeit, Palestine
- Thaer Thaher
  - Department of Computer Systems Engineering, Arab American University, Jenin, Palestine
  - Information Technology Engineering, Al-Quds University, Abu Dies, Jerusalem, Palestine
- Mohammed Azmi Al-Betar
  - Artificial Intelligence Research Center (AIRC), College of Engineering and Information Technology, Ajman University, Ajman, United Arab EmiratesDeepSinghML2017, Irbid, Jordan
- Jingwei Too
  - Faculty of Electrical Engineering, Universiti Teknikal Malaysia Melaka, Hang Tuah Jaya, 76100 Durian Tunggal Melaka, Malaysia
- Mohammed A. Awadallah
  - Department of Computer Science, Al-Aqsa University, P.O. Box 4051, Gaza, Palestine
  - Artificial Intelligence Research Center (AIRC), Ajman University, Ajman, United Arab Emirates
- Iyad Abu Doush
  - Department of Computing, College of Engineering and Applied Sciences, American University of Kuwait, Salmiya, Kuwait
  - Computer Science Department, Yarmouk University, Irbid, Jordan
- Hamza Turabieh
  - Department of Health Management and Informatics, University of Missouri, Columbia, 5 Hospital Drive, Columbia, MO 65212 USA
|
19
|
Abstract
Generative adversarial networks (GANs) are one of the most powerful generative models, but they always require a large and balanced dataset to train. Traditional GANs are not suited to generating minority-class images in a highly imbalanced dataset. Balancing GAN (BAGAN) was proposed to mitigate this problem, but it is unstable when images in different classes look similar, e.g., flowers and cells. In this work, we propose a supervised autoencoder with an intermediate embedding model to disperse the labeled latent vectors. With the enhanced autoencoder initialization, we also build an architecture of BAGAN with gradient penalty (BAGAN-GP). Our proposed model overcomes the instability of the original BAGAN and converges faster to high-quality generations. Our model achieves high performance on imbalanced, scaled-down versions of MNIST Fashion and CIFAR-10, and on one small-scale medical image dataset. https://github.com/GH920/improved-bagan-gp.
|
20
|
Werner de Vargas V, Schneider Aranda JA, dos Santos Costa R, da Silva Pereira PR, Victória Barbosa JL. Imbalanced data preprocessing techniques for machine learning: a systematic mapping study. Knowl Inf Syst 2023; 65:31-57. [PMID: 36405957 PMCID: PMC9645765 DOI: 10.1007/s10115-022-01772-8] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 10/08/2021] [Revised: 09/27/2022] [Accepted: 10/02/2022] [Indexed: 11/10/2022]
Abstract
Machine Learning (ML) algorithms have been increasingly replacing people in several application domains, most of which suffer from data imbalance. To solve this problem, published studies implement data preprocessing techniques, cost-sensitive learning, and ensemble learning. These solutions reduce the naturally occurring bias of ML towards the majority samples. This study uses a systematic mapping methodology to assess 9927 papers related to sampling techniques for ML in imbalanced data applications from 7 digital libraries. A filtering process selected 35 representative papers from various domains, such as health, finance, and engineering. As a result of a thorough quantitative analysis of these papers, this study proposes two taxonomies illustrating sampling techniques and ML models. The results indicate that oversampling and classical ML are the most common preprocessing techniques and models, respectively. However, solutions with neural networks and ensemble ML models have the best performance, with potentially better results through hybrid sampling techniques. Finally, none of the 35 works apply simulation-based synthetic oversampling, indicating a path for future preprocessing solutions.
Affiliation(s)
- Vitor Werner de Vargas
  - Applied Computing Graduate Program, University of Vale do Rio dos Sinos, São Leopoldo, Rio Grande do Sul 93022-750 Brazil
- Jorge Arthur Schneider Aranda
  - Applied Computing Graduate Program, University of Vale do Rio dos Sinos, São Leopoldo, Rio Grande do Sul 93022-750 Brazil
- Ricardo dos Santos Costa
  - Electrical Engineering Graduate Program, University of Vale do Rio dos Sinos, São Leopoldo, Rio Grande do Sul 93022-750 Brazil
- Paulo Ricardo da Silva Pereira
  - Electrical Engineering Graduate Program, University of Vale do Rio dos Sinos, São Leopoldo, Rio Grande do Sul 93022-750 Brazil
- Jorge Luis Victória Barbosa
  - Applied Computing Graduate Program, University of Vale do Rio dos Sinos, São Leopoldo, Rio Grande do Sul 93022-750 Brazil; Electrical Engineering Graduate Program, University of Vale do Rio dos Sinos, São Leopoldo, Rio Grande do Sul 93022-750 Brazil
|
21
|
Sowjanya AM, Mrudula O. Effective treatment of imbalanced datasets in health care using modified SMOTE coupled with stacked deep learning algorithms. Appl Nanosci 2023; 13:1829-1840. [PMID: 35132368 PMCID: PMC8811587 DOI: 10.1007/s13204-021-02063-4] [Citation(s) in RCA: 7] [Impact Index Per Article: 7.0] [Received: 07/21/2021] [Accepted: 08/28/2021] [Indexed: 12/03/2022]
Abstract
One of the prominent uses of predictive analytics is in health care, where more accurate predictions are sought from proper analysis of cumulative datasets. Often the datasets are quite imbalanced, and sampling techniques like Synthetic Minority Oversampling Technique (SMOTE) give only moderate accuracy in such cases. To overcome this problem, a two-step approach is proposed. In the first step, SMOTE is modified to reduce the class imbalance, yielding Distance-based SMOTE (D-SMOTE) and Bi-phasic SMOTE (BP-SMOTE), which were then coupled with selective classifiers for prediction. An increase in accuracy is noted for both BP-SMOTE and D-SMOTE compared to basic SMOTE. In the second step, machine learning, deep learning, and ensemble algorithms were used to develop a Stacking Ensemble Framework, in which Stacking showed a significant increase in accuracy compared to individual machine learning algorithms like Decision Tree, Naïve Bayes, and Neural Networks and ensemble techniques like Voting, Bagging, and Boosting. Two methods were developed by combining deep learning with the Stacking approach, namely Stacked CNN and Stacked RNN, which yielded a significantly higher accuracy of 96-97% compared to individual algorithms. The Framingham dataset is used for data sampling, the Wisconsin Hospital data of the Breast Cancer study is used for Stacked CNN, and the Novel Coronavirus 2019 dataset, relating to forecasting COVID-19 cases, is used for Stacked RNN.
Affiliation(s)
- A. Mary Sowjanya
  - Department of CS & SE, Andhra University College of Engineering (A), Visakhapatnam, Andhra Pradesh, India
- Owk Mrudula
  - Department of CS & SE, Andhra University College of Engineering (A), Visakhapatnam, Andhra Pradesh, India

22
Duong HTH, Tran LTM, To HQ, Van Nguyen K. Academic performance warning system based on data driven for higher education. Neural Comput Appl 2023; 35:5819-5837. [PMID: 36408289 PMCID: PMC9640845 DOI: 10.1007/s00521-022-07997-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/21/2022] [Accepted: 10/25/2022] [Indexed: 11/09/2022]
Abstract
Academic probation at universities has become a matter of pressing concern in recent years, as many students face severe consequences of academic probation. We carried out research to find ways to reduce the number of students placed on academic probation. Our research used massive data sources from the education sector and modern machine learning techniques to build an academic warning system. Our system is based on academic performance, which directly reflects students' academic probation status at the university. Through the research process, we provided a dataset that has been extracted and developed from raw data sources, including a wealth of information about students, subjects, and scores. Using feature generation techniques and feature selection strategies, we built a dataset with many features that are extremely useful in predicting students' academic warning status. Remarkably, the contributed dataset is flexible and scalable because we provide detailed calculation formulas whose inputs are available at any university or college in Vietnam. That allows any university to reuse or reconstruct a similar dataset based on its raw academic database. Moreover, we explored various combinations of data, unbalanced data handling techniques, and model selection techniques to propose suitable machine learning algorithms for building the best possible warning system. As a result, a two-stage academic performance warning system for higher education was proposed, with an F2-score of more than 74% at the beginning of the semester using a Support Vector Machine and more than 92% before the final examination using LightGBM.
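The F2-score reported above is the F-beta measure with beta = 2, which weights recall more heavily than precision. The sketch below shows the standard formula on made-up confusion counts; the function name `fbeta` and the counts are illustrative assumptions, not values from the paper.

```python
def fbeta(tp, fp, fn, beta=2.0):
    """F-beta score from confusion counts (assumes tp > 0 so no division by zero)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical warning-system outcome: 80 at-risk students flagged correctly,
# 40 flagged unnecessarily, 20 at-risk students missed.
f2 = fbeta(tp=80, fp=40, fn=20, beta=2.0)
f1 = fbeta(tp=80, fp=40, fn=20, beta=1.0)
```

With these counts recall (0.80) exceeds precision (about 0.67), so the F2-score comes out higher than the F1-score; a warning system that prefers to miss few at-risk students is rewarded by F2.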
Affiliation(s)
- Hanh Thi-Hong Duong
  - Faculty of Information Science and Engineering, University of Information Technology, Ho Chi Minh City, Vietnam
  - Vietnam National University, Ho Chi Minh City, Vietnam
- Linh Thi-My Tran
  - Faculty of Information Science and Engineering, University of Information Technology, Ho Chi Minh City, Vietnam
  - Vietnam National University, Ho Chi Minh City, Vietnam
- Huy Quoc To
  - Faculty of Information Science and Engineering, University of Information Technology, Ho Chi Minh City, Vietnam
  - Vietnam National University, Ho Chi Minh City, Vietnam
- Kiet Van Nguyen
  - Faculty of Information Science and Engineering, University of Information Technology, Ho Chi Minh City, Vietnam
  - Vietnam National University, Ho Chi Minh City, Vietnam

23
Zhang X, Liu K, Yuan B, Wang H, Chen S, Xue Y, Chen W, Liu M, Hu Y. A hybrid adaptive approach for instance transfer learning with dynamic and imbalanced data. INT J INTELL SYST 2022; 37:11582-11599. [PMID: 36816520 PMCID: PMC9936919 DOI: 10.1002/int.23055] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/02/2022] [Accepted: 08/16/2022] [Indexed: 11/06/2022]
Abstract
Machine learning has demonstrated success in clinical risk prediction modeling with complex electronic health record data. However, the evolving nature of clinical practices can dynamically change the underlying data distribution over time, leading to model performance drift. Adopting an outdated model is potentially risky and may result in unintentional losses. In this paper, we propose a novel Hybrid Adaptive Boosting approach (HA-Boost) for transfer learning. HA-Boost is characterized by domain similarity-based and class imbalance-based adaptation mechanisms, which simultaneously address two critical limitations of the classical TrAdaBoost algorithm. We validated HA-Boost in predicting hospital-acquired acute kidney injury using real-world longitudinal electronic health records data. The experimental results demonstrate that HA-Boost stably outperforms the competing baselines in terms of both AUROC and AUPRC across a 7-year time span. This study has confirmed the effectiveness of transfer learning as a superior model updating approach in dynamic environments.
Affiliation(s)
- Xiangzhou Zhang
  - Big Data Decision Institute, Jinan University, Guangzhou, China
- Kang Liu
  - Big Data Decision Institute, Jinan University, Guangzhou, China
  - School of Management, Jinan University, Guangzhou, China
- Borong Yuan
  - Big Data Decision Institute, Jinan University, Guangzhou, China
  - College of Information Science and Technology, Jinan University, Guangzhou, China
- Hongnian Wang
  - Big Data Decision Institute, Jinan University, Guangzhou, China
  - School of Management, Jinan University, Guangzhou, China
- Shaoyong Chen
  - Big Data Decision Institute, Jinan University, Guangzhou, China
  - College of Information Science and Technology, Jinan University, Guangzhou, China
- Yunfei Xue
  - Big Data Decision Institute, Jinan University, Guangzhou, China
  - College of Information Science and Technology, Jinan University, Guangzhou, China
- Weiqi Chen
  - Big Data Decision Institute, Jinan University, Guangzhou, China
- Mei Liu
  - Division of Medical Informatics, University of Kansas Medical Center, Kansas City, KS, United States of America
- Yong Hu
  - Big Data Decision Institute, Jinan University, Guangzhou, China

24
Roumani YF. Sports analytics in the NFL: classifying the winner of the superbowl. Ann Oper Res 2022; 325:715-730. [PMID: 36467004 PMCID: PMC9684891 DOI: 10.1007/s10479-022-05063-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 11/02/2022] [Indexed: 06/03/2023]
Abstract
Sport teams' managers, coaches and players are always looking for new ways to win and stay competitive. The sports analytics field can help teams in gaining a competitive advantage by analyzing historical data and formulating strategies and making data driven decisions regarding game plans, play selection and player recruitment. This work focuses on the application of sports analytics in the National Football League. We compare the classification performance of several methods (C4.5, Neural Network and Random Forest) in classifying the winner of the Superbowl using data collected during the regular season. We split the data into a training set and test set and use the synthetic minority oversampling technique to address the data imbalance issue in the training set. The classification performance is compared on the test set using several measures. According to the findings, the Random Forest classifier had the highest recall, AUC, accuracy and specificity as the oversampling percentage was increased. Our results can be used to develop a decision support tool to assist team managers and coaches in developing strategies that would increase the team's chances of winning.
Affiliation(s)
- Yazan F. Roumani
  - Department of Decision and Information Sciences, Oakland University, 342 Elliot Hall, Rochester, MI 48309 USA

25
Tang J, Wang X, Wan H, Lin C, Shao Z, Chang Y, Wang H, Wu Y, Zhang T, Du Y. Joint modeling strategy for using electronic medical records data to build machine learning models: an example of intracerebral hemorrhage. BMC Med Inform Decis Mak 2022; 22:278. [PMID: 36284327 PMCID: PMC9594939 DOI: 10.1186/s12911-022-02018-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/08/2022] [Accepted: 10/10/2022] [Indexed: 11/28/2022]
Abstract
Background Outliers and class imbalance in medical data could affect the accuracy of machine learning models. For physicians who want to apply predictive models, how to use the data at hand to build a model and what model to choose are very thorny problems. Therefore, it is necessary to consider outliers, imbalanced data, model selection, and parameter tuning when modeling. Methods This study used a joint modeling strategy consisting of: outlier detection and removal, data balancing, model fitting and prediction, performance evaluation. We collected medical record data for all ICH patients with admissions in 2017–2019 from Sichuan Province. Clinical and radiological variables were used to construct models to predict mortality outcomes 90 days after discharge. We used stacking ensemble learning to combine logistic regression (LR), random forest (RF), artificial neural network (ANN), support vector machine (SVM), and k-nearest neighbors (KNN) models. Accuracy, sensitivity, specificity, AUC, precision, and F1 score were used to evaluate model performance. Finally, we compared all 84 combinations of the joint modeling strategy, including training set with and without cross-validated committees filter (CVCF), five resampling techniques (random under-sampling (RUS), random over-sampling (ROS), adaptive synthetic sampling (ADASYN), Borderline synthetic minority oversampling technique (Borderline SMOTE), synthetic minority oversampling technique and edited nearest neighbor (SMOTEENN)) and no resampling, seven models (LR, RF, ANN, SVM, KNN, Stacking, AdaBoost). Results Among 4207 patients with ICH, 2909 (69.15%) survived 90 days after discharge, and 1298 (30.85%) died within 90 days after discharge. The performance of all models improved with removing outliers by CVCF except sensitivity. For data balancing processing, the performance of training set without resampling was better than that of training set with resampling in terms of accuracy, specificity, and precision. And the AUC of ROS was the best. For seven models, the average accuracy, specificity, AUC, and precision of RF were the highest. Stacking performed best in F1 score. Among all 84 combinations of joint modeling strategy, eight combinations performed best in terms of accuracy (0.816). For sensitivity, the best performance was SMOTEENN + Stacking (0.662). For specificity, the best performance was CVCF + KNN (0.987). Stacking and AdaBoost had the best performances in AUC (0.756) and F1 score (0.602), respectively. For precision, the best performance was CVCF + SVM (0.938). Conclusion This study proposed a joint modeling strategy including outlier detection and removal, data balancing, model fitting and prediction, performance evaluation, in order to provide a reference for physicians and researchers who want to build their own models. This study illustrated the importance of outlier detection and removal for machine learning and showed that ensemble learning might be a good modeling strategy. Due to the low imbalanced ratio (IR, the ratio of majority class and minority class) in this study, we did not find any improvement in models with resampling in terms of accuracy, specificity, and precision, while ROS performed best on AUC. Supplementary Information The online version contains supplementary material available at 10.1186/s12911-022-02018-x.
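The counts in this abstract can be checked with a few lines of arithmetic. The sketch below recomputes the class percentages, the imbalance ratio as defined in the abstract (majority over minority), and the size of the strategy grid. The variable names are illustrative, and reading the grid as "2 filter settings × 6 resampling settings × 7 models" is an interpretation of the abstract's description rather than something it states as a formula.

```python
from itertools import product

survived, died = 2909, 1298          # counts reported in the abstract
total = survived + died              # 4207 patients

pct_survived = round(100 * survived / total, 2)
pct_died = round(100 * died / total, 2)
imbalance_ratio = survived / died    # majority class over minority class

# Assumed reading of the 84 combinations described in the abstract.
filters = ["CVCF", "no CVCF"]
resamplers = ["RUS", "ROS", "ADASYN", "Borderline SMOTE", "SMOTEENN", "none"]
models = ["LR", "RF", "ANN", "SVM", "KNN", "Stacking", "AdaBoost"]
combinations = list(product(filters, resamplers, models))
```

The imbalance ratio comes out at roughly 2.24, which is consistent with the authors' remark that the imbalance in this cohort is low.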
Affiliation(s)
- Jianxiang Tang
  - Department of Epidemiology and Health Statistics, West China School of Public Health and West China Fourth Hospital, Sichuan University, Chengdu, Sichuan, People's Republic of China
- Xiaoyu Wang
  - Department of Neurosurgery, West China Hospital of Sichuan University, Chengdu, Sichuan, People's Republic of China
- Hongli Wan
  - Department of Epidemiology and Health Statistics, West China School of Public Health and West China Fourth Hospital, Sichuan University, Chengdu, Sichuan, People's Republic of China
  - Health Emergency Management Research Center, West China-PUMC C.C. Chen Institute of Health, Sichuan University, Chengdu, Sichuan, People's Republic of China
- Chunying Lin
  - Department of Epidemiology and Health Statistics, West China School of Public Health and West China Fourth Hospital, Sichuan University, Chengdu, Sichuan, People's Republic of China
  - Health Emergency Management Research Center, West China-PUMC C.C. Chen Institute of Health, Sichuan University, Chengdu, Sichuan, People's Republic of China
- Zilun Shao
  - Department of Epidemiology and Health Statistics, West China School of Public Health and West China Fourth Hospital, Sichuan University, Chengdu, Sichuan, People's Republic of China
  - Health Emergency Management Research Center, West China-PUMC C.C. Chen Institute of Health, Sichuan University, Chengdu, Sichuan, People's Republic of China
- Yang Chang
  - Department of Epidemiology and Health Statistics, West China School of Public Health and West China Fourth Hospital, Sichuan University, Chengdu, Sichuan, People's Republic of China
  - Health Emergency Management Research Center, West China-PUMC C.C. Chen Institute of Health, Sichuan University, Chengdu, Sichuan, People's Republic of China
- Hexuan Wang
  - Department of Epidemiology and Health Statistics, West China School of Public Health and West China Fourth Hospital, Sichuan University, Chengdu, Sichuan, People's Republic of China
  - Health Emergency Management Research Center, West China-PUMC C.C. Chen Institute of Health, Sichuan University, Chengdu, Sichuan, People's Republic of China
- Yi Wu
  - Department of Epidemiology and Health Statistics, West China School of Public Health and West China Fourth Hospital, Sichuan University, Chengdu, Sichuan, People's Republic of China
  - Health Emergency Management Research Center, West China-PUMC C.C. Chen Institute of Health, Sichuan University, Chengdu, Sichuan, People's Republic of China
- Tao Zhang
  - Department of Epidemiology and Health Statistics, West China School of Public Health and West China Fourth Hospital, Sichuan University, Chengdu, Sichuan, People's Republic of China
  - Health Emergency Management Research Center, West China-PUMC C.C. Chen Institute of Health, Sichuan University, Chengdu, Sichuan, People's Republic of China
- Yu Du
  - Health Emergency Management Research Center, West China-PUMC C.C. Chen Institute of Health, Sichuan University, Chengdu, Sichuan, People's Republic of China
  - Department of Emergency and Critical Care Medicine, West China School of Public Health, West China Fourth Hospital, Sichuan University, Chengdu, Sichuan, People's Republic of China

26
Abstract
Mobile payment systems are becoming more popular due to the increase in the number of smartphones, which, in turn, attracts the interest of fraudsters. Extant research has therefore developed various fraud detection methods using supervised machine learning. However, sufficient labeled data are rarely available and their detection performance is negatively affected by the extreme class imbalance in financial fraud data. The purpose of this study is to propose an XGBoost-based fraud detection framework while considering the financial consequences of fraud detection systems. The framework was empirically validated on a large dataset of more than 6 million mobile transactions. To demonstrate the effectiveness of the proposed framework, we conducted a comparative evaluation of existing machine learning methods designed for modeling imbalanced data and outlier detection. The results suggest that in terms of standard classification measures, the proposed semi-supervised ensemble model integrating multiple unsupervised outlier detection algorithms and an XGBoost classifier achieves the best results, while the highest cost savings can be achieved by combining random under-sampling and XGBoost methods. This study has therefore financial implications for organizations to make appropriate decisions regarding the implementation of effective fraud detection systems.
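Random under-sampling, which the abstract pairs with XGBoost for the highest cost savings, simply discards majority-class rows until the classes are the same size. The sketch below shows that step alone on toy labels; it does not include the XGBoost classifier, the outlier detectors, or any cost model from the study, and the helper name `random_undersample` and the data are illustrative assumptions.

```python
import random


def random_undersample(rows, labels, seed=0):
    """Keep a random subset of each class, sized to the smallest class (hypothetical helper)."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    n_min = min(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        # Sample without replacement so no row is duplicated.
        for row in rng.sample(members, n_min):
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


# Toy data: 95 legitimate transactions (label 0) and 5 fraudulent ones (label 1).
rows = [{"id": i} for i in range(100)]
labels = [0] * 95 + [1] * 5
balanced_rows, balanced_labels = random_undersample(rows, labels)
```

The trade-off is that most majority-class rows are thrown away, so in practice under-sampling is applied to the training split only and the test split keeps its original class distribution.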
Affiliation(s)
- Petr Hajek
  - Science and Research Centre, Faculty of Economics and Administration, University of Pardubice, Studentska 84, Pardubice, 532 10 Czech Republic
- Mohammad Zoynul Abedin
  - Department of Finance, Performance & Marketing, Teesside University International Business School, Teesside University, Middlesbrough, TS1 3BX Tees Valley UK

27
Wibowo P, Fatichah C. Pruning-based oversampling technique with smoothed bootstrap resampling for imbalanced clinical dataset of Covid-19. J King Saud Univ Comput Inf Sci 2022; 34:7830-7839. [PMID: 38620726 PMCID: PMC8482553 DOI: 10.1016/j.jksuci.2021.09.021] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 05/01/2021] [Revised: 07/28/2021] [Accepted: 09/25/2021] [Indexed: 11/25/2022]
Abstract
The Coronavirus Disease (COVID-19) was declared a pandemic by the World Health Organization (WHO), and it has not ended so far. As the infection rate of COVID-19 increases, a computational approach is needed to predict which patients are infected, in order to shorten diagnosis time and minimize human error compared to conventional diagnosis. However, negative samples outnumber positive samples, and this data imbalance affects classification performance and biases the model evaluation results. This study proposes a new oversampling technique, TRIM-SBR, to generate minority class data for diagnosing patients infected with COVID-19. Developing an oversampling technique remains challenging because of the data's generalization issue. The proposed method is based on pruning: it looks for specific minority areas while retaining data generalization, producing minority data seeds that serve as benchmarks for creating new synthesized data using bootstrap resampling techniques. Accuracy, Specificity, Sensitivity, F-measure, and AUC are used to evaluate classifier performance in data imbalance cases. The results show that the TRIM-SBR method provides the best performance compared to other oversampling techniques.
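As a rough illustration of the resampling half of this idea, the sketch below implements a generic smoothed bootstrap: draw seed points with replacement and add small Gaussian noise to each draw. It is not TRIM-SBR; the pruning step that selects the minority seeds is omitted, and the function name, the `bandwidth` value, and the seed points are all illustrative assumptions.

```python
import random


def smoothed_bootstrap(seeds, n_new, bandwidth=0.05, seed=0):
    """Generic smoothed bootstrap (hypothetical helper, not the paper's algorithm).

    Each new point is a seed drawn with replacement plus independent
    Gaussian noise of standard deviation `bandwidth` on every feature.
    """
    rng = random.Random(seed)
    new_points = []
    for _ in range(n_new):
        base = rng.choice(seeds)
        new_points.append(tuple(x + rng.gauss(0.0, bandwidth) for x in base))
    return new_points


# Made-up minority seeds with three features each.
seeds = [(0.2, 1.5, 3.1), (0.3, 1.4, 2.9), (0.25, 1.6, 3.0)]
augmented = smoothed_bootstrap(seeds, n_new=10)
```

Unlike a plain bootstrap, which only duplicates existing rows, the added noise yields points that differ slightly from every original sample; the bandwidth controls how far they may stray.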
Affiliation(s)
- Prasetyo Wibowo
  - Department of Informatics, Faculty of Intelligent Electrical and Informatics Technology Institut Teknologi Sepuluh Nopember, Surabaya, Indonesia
- Chastine Fatichah
  - Department of Informatics, Faculty of Intelligent Electrical and Informatics Technology Institut Teknologi Sepuluh Nopember, Surabaya, Indonesia

28
Khodabandelu S, Basirat Z, Khaleghi S, Khafri S, Montazery Kordy H, Golsorkhtabaramiri M. Developing machine learning-based models to predict intrauterine insemination (IUI) success by address modeling challenges in imbalanced data and providing modification solutions for them. BMC Med Inform Decis Mak 2022; 22:228. [PMID: 36050710 PMCID: PMC9434923 DOI: 10.1186/s12911-022-01974-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/06/2022] [Accepted: 08/24/2022] [Indexed: 12/03/2022]
Abstract
Background This study sought to provide machine learning-based classification models to predict the success of intrauterine insemination (IUI) therapy. Additionally, we sought to illustrate the effect of fitting models on balanced data versus the original imbalanced data, using two different types of resampling methods. Finally, we compared models fitted with all features against models fitted with optimized feature sets obtained by various feature selection techniques. Methods The data for the cross-sectional study were collected from 546 infertile couples with IUI at the Fatemehzahra Infertility Research Center, Babol, North of Iran. Logistic regression (LR), support vector classification (SVC), random forest (RF), Extreme Gradient Boosting (XGBoost), and stacking generalization (Stack) were used as the machine learning classifiers to predict IUI success, implemented in Python v3.7. We employed the Smote-Tomek (Stomek) and Smote-ENN (SENN) resampling methods to address the imbalance problem in the original dataset. Furthermore, to increase the performance of the models, mutual information classification (MIC-FS), genetic algorithm (GA-FS), and random forest (RF-FS) were used to select the ideal feature sets for model development. Results In this study, 28% of patients undergoing IUI treatment obtained a successful pregnancy. Also, the average age of women and men was 24.98 and 29.85 years, respectively. The calibration plots for IUI success prediction showed that, among the feature selection methods, RF-FS, and among the datasets used to fit the models, the dataset balanced with the Stomek method, gave better-calibrated predictions than the other methods. Finally, the Brier scores for the LR, SVC, RF, XGBoost, and Stack models fitted using the Stomek dataset and the feature set chosen by the random forest technique were 0.202, 0.183, 0.158, 0.129, and 0.134, respectively. The most predictive factors for IUI success were duration of infertility, male and female age, sperm concentration, and sperm motility grading score. Conclusion The results of this study with the XGBoost prediction model can be used to foretell the individual success of IUI for each couple before initiating therapy. Supplementary Information The online version contains supplementary material available at 10.1186/s12911-022-01974-8.
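The Brier score quoted above is the mean squared difference between predicted probabilities and observed 0/1 outcomes, so lower is better. The sketch below computes it on made-up predictions and also evaluates an uninformative forecast that always predicts a 28% success probability on a set with 28% successes. The function name and the data are illustrative assumptions, and the abstract does not say whether the evaluation set had that prevalence, so the baseline is only a point of orientation.

```python
def brier_score(y_true, y_prob):
    """Mean squared error between predicted probabilities and binary outcomes."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


# Small made-up example: four couples, predicted success probabilities.
example = brier_score([1, 0, 0, 1], [0.9, 0.2, 0.1, 0.6])

# Uninformative forecast: always predict the base rate on a set with 28% successes.
y_true = [1] * 28 + [0] * 72
baseline = brier_score(y_true, [0.28] * 100)
```

For a constant forecast equal to the base rate p, the Brier score reduces to p(1 - p), which is 0.2016 for p = 0.28.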
Affiliation(s)
- Sajad Khodabandelu
  - Student Research Committee, Babol University of Medical Sciences, Babol, Iran
- Zahra Basirat
  - Infertility and Reproductive Health Research Center, Health Research Institute, Babol University of Medical Sciences, Babol, Iran
- Sara Khaleghi
  - Student Research Committee, Babol University of Medical Sciences, Babol, Iran
- Soraya Khafri
  - Infertility and Reproductive Health Research Center, Health Research Institute, Babol University of Medical Sciences, Babol, Iran
- Hussain Montazery Kordy
  - Faculty of Electrical and Computer Engineering, Babol Noshirvani University of Technology, Babol, Iran
- Masoumeh Golsorkhtabaramiri
  - Infertility and Reproductive Health Research Center, Health Research Institute, Babol University of Medical Sciences, Babol, Iran

29
Xu X, Wang C, Gui B, Yuan X, Li C, Zhao Y, Martyniuk CJ, Su L. Application of machine learning to predict the inhibitory activity of organic chemicals on thyroid stimulating hormone receptor. Environ Res 2022; 212:113175. [PMID: 35351457 DOI: 10.1016/j.envres.2022.113175] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 12/27/2021] [Revised: 03/04/2022] [Accepted: 03/22/2022] [Indexed: 06/14/2023]
Abstract
With the promotion of carbon neutrality, it is also important to synchronously promote the assessment and sustainable management of chemicals so as to protect public health. Humans and animals are possibly exposed to endocrine disruptors that have inhibitory effects on thyroid stimulating hormone receptor (TSHR). As such, it is important to identify chemicals that inhibit TSHR and to develop models to predict their inhibitory activity. In this study, 5952 compounds derived from a cyclic adenosine monophosphate (cAMP) analysis, a key signaling pathway in thyrocytes, were used to establish a binary classification model comparing methods that included random forest (RF), extreme gradient boosting (XGB), and logistic regression (LR). The prediction model based on RF showed the highest identification accuracy for revealing chemicals that may inhibit TSHR. For the RF model, recall was 0.89, balanced accuracy was 0.85, and the area under the receiver operating characteristic (ROC) curve (AUC) was 0.92, indicating that the model had very high predictive capacity. The lowest CDocker energy (CE) and CDocker interaction energy (CIE) for chemicals and TSHR were determined and were subsequently introduced into the predictive model as descriptors. A regression model, extreme gradient boosting regression (XGBR), was successfully established, yielding R² = 0.65 for predicting the inhibitory activity of active compounds. Parameters that included dissociation characteristics, molecular structure, and binding energy were all key factors in the predictive model. We demonstrate that QSAR models are useful approaches, not only for identifying chemicals that inhibit TSHR, but also for predicting the inhibitory activity of active compounds.
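Balanced accuracy is the mean of recall (sensitivity) and specificity, so the two figures reported for the RF model imply a specificity of about 2 × 0.85 − 0.89 = 0.81, up to rounding of the published values. The sketch below shows the calculation; the function name and the confusion counts are hypothetical, chosen only to reproduce the reported rates.

```python
def balanced_accuracy(tp, fn, tn, fp):
    """Average of recall on the positive class and recall on the negative class."""
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (recall + specificity) / 2


# Specificity implied by the reported recall and balanced accuracy.
implied_specificity = 2 * 0.85 - 0.89

# Hypothetical counts that match recall = 0.89 and specificity = 0.81.
ba = balanced_accuracy(tp=89, fn=11, tn=81, fp=19)
```

Unlike plain accuracy, this measure gives both classes equal weight regardless of how many inactive compounds the dataset contains.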
Affiliation(s)
- Xiaotian Xu
  - State Environmental Protection Key Laboratory of Wetland Ecology and Vegetation Restoration, School of Environment, Northeast Normal University, Changchun, 130117, PR China
- Chen Wang
  - State Environmental Protection Key Laboratory of Wetland Ecology and Vegetation Restoration, School of Environment, Northeast Normal University, Changchun, 130117, PR China
- Bingxin Gui
  - State Environmental Protection Key Laboratory of Wetland Ecology and Vegetation Restoration, School of Environment, Northeast Normal University, Changchun, 130117, PR China
- Xiangyi Yuan
  - State Environmental Protection Key Laboratory of Wetland Ecology and Vegetation Restoration, School of Environment, Northeast Normal University, Changchun, 130117, PR China
- Chao Li
  - State Environmental Protection Key Laboratory of Wetland Ecology and Vegetation Restoration, School of Environment, Northeast Normal University, Changchun, 130117, PR China
- Yuanhui Zhao
  - State Environmental Protection Key Laboratory of Wetland Ecology and Vegetation Restoration, School of Environment, Northeast Normal University, Changchun, 130117, PR China
- Christopher J Martyniuk
  - Center for Environmental and Human Toxicology, Department of Physiological Sciences, College of Veterinary Medicine, UF Genetics Institute, Interdisciplinary Program in Biomedical Sciences Neuroscience, University of Florida, Gainesville, FL, 32611, USA
- Limin Su
  - State Environmental Protection Key Laboratory of Wetland Ecology and Vegetation Restoration, School of Environment, Northeast Normal University, Changchun, 130117, PR China

30
Liu X, Guo L, Wang H, Guo J, Yang S, Duan L. Research on imbalance machine learning methods for MR[Formula: see text]WI soft tissue sarcoma data. BMC Med Imaging 2022; 22:149. [PMID: 36028803 PMCID: PMC9417078 DOI: 10.1186/s12880-022-00876-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/19/2022] [Accepted: 08/08/2022] [Indexed: 11/10/2022]
Abstract
BACKGROUND Soft tissue sarcoma is a rare and highly heterogeneous tumor in clinical practice. Pathological grading of soft tissue sarcoma is a key factor in patient prognosis and treatment planning, but the clinical data of soft tissue sarcoma are imbalanced. In this paper, we propose an effective solution to find the optimal imbalance machine learning model for classifying soft tissue sarcoma data. METHODS A large number of features are first obtained from [Formula: see text]WI images using radiomics methods. Then, we explore methods of feature selection, sampling and classification, build 17 imbalance machine learning models based on the above features, and perform extensive experiments to classify imbalanced soft tissue sarcoma data. We also used another dataset splitting method, which could improve the classification performance and verify the validity of the models. RESULTS The experimental results show that the combination of the extremely randomized trees (ERT) classification algorithm with SMOTETomek (STT) sampling and the recursive feature elimination technique (RFE) performs best compared to other methods. The accuracy of RFE+STT+ERT is 81.57%, which is close to the accuracy of biopsy, and the accuracy is 95.69% when using the other dataset splitting method. CONCLUSION Accurate and noninvasive preoperative prediction of the pathological grade of soft tissue sarcoma is essential. Our proposed machine learning method (RFE+STT+ERT) can make a positive contribution to solving the imbalanced data classification problem, which can favorably support the development of personalized treatment plans for soft tissue sarcoma patients.
Affiliation(s)
- Xuanxuan Liu
  - College of Computer Science and Technology, Qingdao University, Qingdao, 266071 China
- Li Guo
  - College of Computer Science and Technology, Qingdao University, Qingdao, 266071 China
- Hexiang Wang
  - Department of Radiology, The Affiliated Hospital of Qingdao University, Qingdao, China
- Jia Guo
  - Department of Radiology, The Affiliated Hospital of Qingdao University, Qingdao, China
- Shifeng Yang
  - Department of Radiology, Shandong Provincial Hospital Affiliated to Shandong First Medical University, Jinan, China
- Lisha Duan
  - Department of Radiology, The Third Hospital of Hebei Medical University, Shijiazhuang, Qingdao, China

31
Boo Y, Choi Y. Comparison of mortality prediction models for road traffic accidents: an ensemble technique for imbalanced data. BMC Public Health 2022; 22:1476. [PMID: 35918672 PMCID: PMC9344638 DOI: 10.1186/s12889-022-13719-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/01/2021] [Accepted: 06/27/2022] [Indexed: 11/10/2022]
Abstract
BACKGROUND Injuries caused by road traffic accidents (RTAs) are classified under the International Classification of Diseases-10 as 'S00-T99' and represent imbalanced samples, with a mortality rate of only 1.2% among all RTA victims. To predict the characteristics of external causes of RTA injuries and mortality, we compared performances based on differences in the correction and classification techniques for imbalanced samples. METHODS The present study used data spanning a 5-year period (2013-2017) from the Korean National Hospital Discharge In-depth Injury Survey (KNHDS), a national-level survey conducted by the Korea Disease Control and Prevention Agency. A total of eight variables were used in the prediction, including patient, accident, and injury/disease characteristics. As the data were imbalanced, a sample consisting of only severe injuries was constructed and compared against the total sample. Considering the characteristics of the samples, preprocessing was performed in the study. The samples were standardized first, considering that they contained many variables with different units. Among the ensemble techniques for classification, the present study used Random Forest, Extra-Trees, and XGBoost. Four different over- and under-sampling techniques were used to compare the performance of algorithms using "accuracy", "precision", "recall", "F1", and "MCC". RESULTS The results showed that among the prediction techniques, XGBoost had the best performance. While the synthetic minority oversampling technique (SMOTE), a type of over-sampling, also demonstrated a certain level of performance, under-sampling was superior. Overall, prediction by the XGBoost model with samples using SMOTE produced the best results. CONCLUSION This study presented the results of an empirical comparison of the validity of sampling techniques and classification algorithms that affect the accuracy of imbalanced samples by combining two techniques. The findings could be used as reference data in classification analyses of imbalanced data in the medical field.
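To illustrate why the abstract reports precision, recall, F1, and MCC alongside accuracy, the following is a minimal sketch of those metrics computed from a binary confusion matrix. It is not the authors' code; the function name `imbalance_metrics` and the 1,000-case toy sample are assumptions chosen only to mirror the 1.2% mortality rate mentioned above.

```python
import math


def imbalance_metrics(y_true, y_pred):
    """Return accuracy, precision, recall, F1 and MCC for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0

    # Matthews correlation coefficient; defined as 0 here when the denominator vanishes.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "mcc": mcc}


# Hypothetical sample: 12 deaths among 1,000 victims (1.2%), and a model
# that always predicts "survived".
y_true = [1] * 12 + [0] * 988
y_pred = [0] * 1000
metrics = imbalance_metrics(y_true, y_pred)
```

With this toy input the accuracy is high even though no death is detected, while recall, F1, and MCC are all zero, which is the behaviour that motivates imbalance-aware metrics.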
Affiliation(s)
- Yookyung Boo
  - Department of Health Administration, Dankook University, Cheonan, 31116, South Korea
- Youngjin Choi
  - Department of Healthcare Management, Eulji University, Seongnam, 13135, South Korea
|
32
|
Bonannella C, Hengl T, Heisig J, Parente L, Wright MN, Herold M, de Bruin S. Forest tree species distribution for Europe 2000-2020: mapping potential and realized distributions using spatiotemporal machine learning. PeerJ 2022; 10:e13728. [PMID: 35910765 PMCID: PMC9332400 DOI: 10.7717/peerj.13728] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 01/11/2022] [Accepted: 06/22/2022] [Indexed: 01/17/2023]
Abstract
This article describes a data-driven framework based on spatiotemporal machine learning to produce distribution maps for 16 tree species (Abies alba Mill., Castanea sativa Mill., Corylus avellana L., Fagus sylvatica L., Olea europaea L., Picea abies L. H. Karst., Pinus halepensis Mill., Pinus nigra J. F. Arnold, Pinus pinea L., Pinus sylvestris L., Prunus avium L., Quercus cerris L., Quercus ilex L., Quercus robur L., Quercus suber L. and Salix caprea L.) at high spatial resolution (30 m). Tree occurrence data for a total of three million points were used to train different algorithms: random forest, gradient-boosted trees, generalized linear models, k-nearest neighbors, CART and an artificial neural network. A stack of 305 coarse- and high-resolution covariates representing spectral reflectance, different biophysical conditions and biotic competition was used as predictors for realized distributions, while potential distribution was modelled with environmental predictors only. Logloss and computing time were used to select the three best algorithms to tune and train an ensemble model based on stacking with a logistic regressor as a meta-learner. An ensemble model was trained for each species: probability and model uncertainty maps of realized distribution were produced for each species using a time window of 4 years for a total of six distribution maps per species, while for potential distributions only one map per species was produced. Results of spatial cross validation show that the ensemble model consistently outperformed or performed as well as the best individual model in both potential and realized distribution tasks, with potential distribution models achieving higher predictive performances (TSS = 0.898, R2 logloss = 0.857) than realized distribution ones on average (TSS = 0.874, R2 logloss = 0.839). Ensemble models for Q. suber achieved the best performances in both potential (TSS = 0.968, R2 logloss = 0.952) and realized (TSS = 0.959, R2 logloss = 0.949) distribution, while P. sylvestris (TSS = 0.731, 0.785, R2 logloss = 0.585, 0.670, respectively, for potential and realized distribution) and P. nigra (TSS = 0.658, 0.686, R2 logloss = 0.623, 0.664) achieved the worst. Importance of predictor variables differed across species and models, with the green band for summer and the Normalized Difference Vegetation Index (NDVI) for fall for realized distribution and the diffuse irradiation and precipitation of the driest quarter (BIO17) being the most frequent and important for potential distribution. On average, fine-resolution models outperformed coarse-resolution models (250 m) for realized distribution (TSS = +6.5%, R2 logloss = +7.5%). The framework shows how combining continuous and consistent Earth Observation time series data with state-of-the-art machine learning can be used to derive dynamic distribution maps. The produced predictions can be used to quantify temporal trends of potential forest degradation and species composition change.
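The abstract reports the True Skill Statistic (TSS) without defining it. As a minimal sketch, assuming the standard definition TSS = sensitivity + specificity - 1 for presence/absence predictions, it can be computed from confusion-matrix counts as follows. The function name and the example counts are illustrative and do not come from the paper.

```python
def true_skill_statistic(tp, fn, tn, fp):
    """TSS = sensitivity + specificity - 1, ranging from -1 to +1."""
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("TSS needs at least one presence and one absence record")
    sensitivity = tp / (tp + fn)  # share of true presences predicted as present
    specificity = tn / (tn + fp)  # share of true absences predicted as absent
    return sensitivity + specificity - 1.0


# Hypothetical counts: 90 of 100 presences and 95 of 100 absences classified correctly.
example_tss = true_skill_statistic(tp=90, fn=10, tn=95, fp=5)
```

Because both terms are rates within their own class, TSS does not depend on how common presences are in the evaluation data, unlike overall accuracy.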
Affiliation(s)
- Carmelo Bonannella
  - Laboratory of Geo-Information Science and Remote Sensing, Wageningen University and Research, Wageningen, The Netherlands
  - OpenGeoHub, Wageningen, The Netherlands
- Johannes Heisig
  - Institute for Geoinformatics, University of Münster, Münster, Germany
- Marvin N. Wright
  - Leibniz Institute for Prevention Research and Epidemiology – BIPS, Bremen, Germany
  - University of Bremen, Bremen, Germany
- Martin Herold
  - Laboratory of Geo-Information Science and Remote Sensing, Wageningen University and Research, Wageningen, The Netherlands
  - Section 1.4 Remote Sensing and Geoinformatics, GFZ German Research Centre for Geosciences, Potsdam, Germany
- Sytze de Bruin
  - Laboratory of Geo-Information Science and Remote Sensing, Wageningen University and Research, Wageningen, The Netherlands
|
33
|
Zhang S, Mi T, Wu Q, Luo Y, Grieneisen ML, Shi G, Yang F, Zhan Y. A data-augmentation approach to deriving long-term surface SO2 across Northern China: Implications for interpretable machine learning. Sci Total Environ 2022; 827:154278. [PMID: 35248628 DOI: 10.1016/j.scitotenv.2022.154278] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 11/22/2021] [Revised: 02/22/2022] [Accepted: 02/27/2022] [Indexed: 06/14/2023]
Abstract
Until recently, Northern China was one of the most SO2 polluted regions in the world. The lack of long-term and spatially resolved surface SO2 data hinders retrospective evaluation of relevant environmental policies and human health effects. This study aims to derive the spatiotemporal distribution of surface SO2 across Northern China during 2005-2019. As "concept drift" causes substantial estimation bias in back-extrapolation, we propose a new approach named the robust back-extrapolation via data augmentation approach (RBE-DA) to model the long-term surface SO2. The results show that the population-weighted regional SO2 ([SO2]pw) increased from 2005 to 2007 and decreased steadily afterwards. The [SO2]pw decreased by 80.4% from 74.2 ± 28.4 μg/m3 in 2007 to 14.6 ± 4.8 μg/m3 in 2019. The predicted spatial distributions for each year show that the SO2 pollution was severe (more than 20 μg/m3) in most areas of Northern China until 2017. By using model interpretation methods, we visually reveal the mechanism of estimation bias in the back-extrapolation. Specifically, the training data is severely imbalanced with respect to the satellite-retrieved SO2 column densities (i.e., it is short on high-value samples), so the benchmark model is unable to extrapolate the effects of this important predictor. This study provides long-term surface SO2 data for post hoc evaluation and human exposure assessment in Northern China, while demonstrating that the interpretable machine learning approach is critical for model diagnostics and refinement. Leveraging satellite retrievals, the RBE-DA approach can be applied worldwide to back-extrapolate various measures of air quality.
Affiliation(s)
- Shifu Zhang
  - Department of Environmental Science and Engineering, Sichuan University, Chengdu, Sichuan 610065, China
- Tan Mi
  - Department of Environmental Science and Engineering, Sichuan University, Chengdu, Sichuan 610065, China
- Qinhuizi Wu
  - Department of Environmental Science and Engineering, Sichuan University, Chengdu, Sichuan 610065, China
- Yuzhou Luo
  - Department of Land, Air, and Water Resources, University of California, Davis, CA 95616, United States
- Michael L Grieneisen
  - Department of Land, Air, and Water Resources, University of California, Davis, CA 95616, United States
- Guangming Shi
  - Department of Environmental Science and Engineering, Sichuan University, Chengdu, Sichuan 610065, China; National Engineering Research Center for Flue Gas Desulfurization, Chengdu, Sichuan 610065, China
- Fumo Yang
  - Department of Environmental Science and Engineering, Sichuan University, Chengdu, Sichuan 610065, China; National Engineering Research Center for Flue Gas Desulfurization, Chengdu, Sichuan 610065, China
- Yu Zhan
  - Department of Environmental Science and Engineering, Sichuan University, Chengdu, Sichuan 610065, China; National Engineering Research Center for Flue Gas Desulfurization, Chengdu, Sichuan 610065, China
|
34
|
Liu R, Wang M, Zheng T, Zhang R, Li N, Chen Z, Yan H, Shi Q. An artificial intelligence-based risk prediction model of myocardial infarction. BMC Bioinformatics 2022; 23:217. [PMID: 35672659 PMCID: PMC9175344 DOI: 10.1186/s12859-022-04761-4] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 04/01/2022] [Accepted: 05/30/2022] [Indexed: 02/08/2023]
Abstract
BACKGROUND Myocardial infarction (MI) can lead to malignant arrhythmia, heart failure, and sudden death. Clinical studies have shown that early identification of and timely intervention for acute MI can significantly reduce mortality. The traditional MI risk assessment models are subjective, and the data that go into them are difficult to obtain. Generally, the assessment is only conducted among high-risk patient groups. OBJECTIVE To construct an artificial intelligence-based risk prediction model of MI for continuous and active monitoring of inpatients, especially those in noncardiovascular departments, and early warning of MI. METHODS The imbalanced data contain 59 features, which were constructed into a specific dataset through proportional division, upsampling, downsampling, easy ensemble, and w-easy ensemble. Then, the dataset was traversed using supervised machine learning, with recursive feature elimination as the top-layer algorithm and random forest, gradient boosting decision tree (GBDT), logistic regression, and support vector machine as the bottom-layer algorithms, to select the best model out of many through a variety of evaluation indices. RESULTS GBDT was the best bottom-layer algorithm, and downsampling was the best dataset construction method. In the validation set, the F1 score and accuracy of the 24-feature downsampling GBDT model were both 0.84. In the test set, the F1 score and accuracy of the 24-feature downsampling GBDT model were both 0.83, and the area under the curve was 0.91. CONCLUSION Compared with traditional models, artificial intelligence-based machine learning models have better accuracy and real-time performance and can reduce the occurrence of in-hospital MI from a data-driven perspective, thereby increasing the cure rate of patients and improving their prognosis.
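The abstract names downsampling as the best dataset construction method but does not describe the procedure. A minimal sketch of one common variant, random undersampling of every class to the size of the smallest class, might look like the following. The function name, the fixed seed, and the 5-in-100 toy data are assumptions for illustration, not details from the paper.

```python
import random


def random_undersample(rows, labels, seed=0):
    """Randomly drop samples so every class keeps as many rows as the rarest class."""
    rng = random.Random(seed)  # local generator keeps the result reproducible

    # Group the rows by their class label.
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)

    n_min = min(len(members) for members in by_class.values())

    out_rows, out_labels = [], []
    for label, members in by_class.items():
        for row in rng.sample(members, n_min):  # sampling without replacement
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


# Hypothetical data: 5 positive cases among 100 records.
rows = list(range(100))
labels = [1] * 5 + [0] * 95
balanced_rows, balanced_labels = random_undersample(rows, labels)
```

Resampling of this kind is normally applied to the training split only, so that validation and test sets keep the original class ratio.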
Affiliation(s)
- Ran Liu
  - MOE Key Lab for Neuroinformation, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, 610054 Sichuan China
  - Engineering Research Center of Medical Information Technology, Ministry of Education, West China Hospital of Sichuan University, Chengdu, 610041 Sichuan China
- Miye Wang
  - Engineering Research Center of Medical Information Technology, Ministry of Education, West China Hospital of Sichuan University, Chengdu, 610041 Sichuan China
- Tao Zheng
  - Engineering Research Center of Medical Information Technology, Ministry of Education, West China Hospital of Sichuan University, Chengdu, 610041 Sichuan China
- Rui Zhang
  - Engineering Research Center of Medical Information Technology, Ministry of Education, West China Hospital of Sichuan University, Chengdu, 610041 Sichuan China
- Nan Li
  - Engineering Research Center of Medical Information Technology, Ministry of Education, West China Hospital of Sichuan University, Chengdu, 610041 Sichuan China
- Zhongxiu Chen
  - Department of Cardiology, West China Hospital of Sichuan University, Chengdu, 610041 Sichuan China
- Hongmei Yan
  - MOE Key Lab for Neuroinformation, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, 610054 Sichuan China
- Qingke Shi
  - Engineering Research Center of Medical Information Technology, Ministry of Education, West China Hospital of Sichuan University, Chengdu, 610041 Sichuan China
|
35
|
Liu S, Yao W. Prediction of lung cancer using gene expression and deep learning with KL divergence gene selection. BMC Bioinformatics 2022; 23:175. [PMID: 35549644 PMCID: PMC9103042 DOI: 10.1186/s12859-022-04689-9] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 06/19/2021] [Accepted: 04/13/2022] [Indexed: 11/24/2022]
Abstract
Background Lung cancer is one of the cancers with the highest mortality rate in China. With the rapid development of high-throughput sequencing technology and the growing application of deep learning methods, deep neural networks based on gene expression have recently become an active research direction in lung cancer diagnosis, providing an effective route to early diagnosis. Thus, building a deep neural network model is of great significance for the early diagnosis of lung cancer. However, the main challenges in mining gene expression datasets are the curse of dimensionality and imbalanced data. Existing methods cannot address the problems of high dimensionality and imbalanced data because the number of measured variables (genes) overwhelms the small number of samples, which results in poor performance in early diagnosis of lung cancer. Method Given that gene expression datasets are small, high-dimensional, and imbalanced, this paper proposes a gene selection method based on KL divergence, which selects genes with higher KL divergence as model features. A deep neural network model is then built using Focal Loss as the loss function, and k-fold cross-validation with k = 5 is used to verify and select the best model. Result The deep learning model based on KL divergence gene selection proposed in this paper has an AUC of 0.99 on the validation set. The generalization performance of the model is high. Conclusion The deep neural network model based on KL divergence gene selection proposed in this paper is shown to be an accurate and effective method for lung cancer prediction.
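The abstract names Focal Loss as the loss function without stating its form. As a minimal sketch, assuming the commonly used binary formulation FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t), a per-sample loss can be written as below. The defaults gamma = 2 and alpha = 0.25 are widely used values and are assumptions here; the paper's actual settings are not given in the abstract.

```python
import math


def binary_focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-12):
    """Focal loss for one sample: p is the predicted probability of class 1, y is 0 or 1."""
    p = min(max(p, eps), 1.0 - eps)             # clamp to avoid log(0)
    p_t = p if y == 1 else 1.0 - p              # probability assigned to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
    # The (1 - p_t)**gamma factor shrinks the loss of well-classified samples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy = binary_focal_loss(0.9, 1)  # confident and correct: heavily down-weighted
hard = binary_focal_loss(0.1, 1)  # confident and wrong: keeps most of its loss
```

With gamma set to 0 the modulating factor disappears and the expression reduces to alpha-weighted cross-entropy, which is one way to check an implementation.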
Affiliation(s)
- Suli Liu
  - College of Public Health, Zhengzhou University, Zhengzhou, 450001, China
- Wu Yao
  - College of Public Health, Zhengzhou University, Zhengzhou, 450001, China
|
36
|
Tarimo CS, Bhuyan SS, Zhao Y, Ren W, Mohammed A, Li Q, Gardner M, Mahande MJ, Wang Y, Wu J. Prediction of low Apgar score at five minutes following labor induction intervention in vaginal deliveries: machine learning approach for imbalanced data at a tertiary hospital in North Tanzania. BMC Pregnancy Childbirth 2022; 22:275. [PMID: 35365129 PMCID: PMC8976377 DOI: 10.1186/s12884-022-04534-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/18/2021] [Accepted: 02/28/2022] [Indexed: 11/18/2022]
Abstract
Background Prediction of low Apgar score for vaginal deliveries following labor induction intervention is critical for improving neonatal health outcomes. We set out to investigate important attributes and train popular machine learning (ML) algorithms to correctly classify neonates with low Apgar scores from an imbalanced learning perspective. Methods We analyzed 7716 induced vaginal deliveries from the electronic birth registry of the Kilimanjaro Christian Medical Centre (KCMC), of which 733 (9.5%) were neonates with a low (< 7) Apgar score. The ‘extra-tree classifier’ was used to assess feature importance. We used Area Under Curve (AUC), recall, precision, F-score, Matthews Correlation Coefficient (MCC), balanced accuracy (BA), bookmaker informedness (BM), and markedness (MK) to evaluate the performance of the six selected machine learning classifiers. To address class imbalance, we examined three widely used resampling techniques: the Synthetic Minority Oversampling Technique (SMOTE), Random Oversampling Examples (ROS), and Random Undersampling (RUS). We applied Decision Curve Analysis (DCA) to evaluate the net benefit of the selected classifiers. Results Birth weight, maternal age, and gestational age were found to be important predictors of low Apgar score following induced vaginal delivery. The SMOTE, ROS, and RUS techniques were more effective at improving recall than other metrics in all the models under investigation. A slight improvement was observed in the F1 score, BA, and BM. DCA revealed potential benefits of applying the Boosting method for predicting low Apgar scores among the tested models. Conclusion There is an opportunity for more algorithms to be tested to come up with theoretical guidance on more effective rebalancing techniques suitable for this particular imbalance ratio. Future research should prioritize a debate on which performance indicators to rely on when dealing with imbalanced or skewed data.
Supplementary Information The online version contains supplementary material available at 10.1186/s12884-022-04534-0.
Affiliation(s)
- Clifford Silver Tarimo
  - Department of Epidemiology and Health Statistics, College of Public Health, Zhengzhou University, 100 Kexue Avenue, Zhengzhou, 450001, Henan, China; Department of Science and Laboratory Technology, Dar es Salaam Institute of Technology, P.O. Box 2958, Dar es Salaam, Tanzania
- Soumitra S Bhuyan
  - Rutgers University-New Brunswick, Edward J. Bloustein, School of Planning and Public Policy, New Brunswick, USA
- Yizhen Zhao
  - Luoyang Orthopedic Traumatological Hospital of Henan Province, Luoyang, China
- Weicun Ren
  - Department of Epidemiology and Health Statistics, College of Public Health, Zhengzhou University, 100 Kexue Avenue, Zhengzhou, 450001, Henan, China; College of Sanquan, Xinxiang Medical University, Xinxiang, People's Republic of China
- Akram Mohammed
  - Center for Biomedical Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
- Quanman Li
  - Department of Epidemiology and Health Statistics, College of Public Health, Zhengzhou University, 100 Kexue Avenue, Zhengzhou, 450001, Henan, China
- Marilyn Gardner
  - Department of Public Health, Western Kentucky University, 1906 College Heights Blvd, Bowling Green, KY, 42101, USA
- Michael Johnson Mahande
  - Institute of Public Health, Kilimanjaro Christian Medical University College, P.O. Box 2240, Moshi, Tanzania
- Yuhui Wang
  - Centre for Financial and Corporate Integrity, Coventry University, Coventry, UK
- Jian Wu
  - Department of Epidemiology and Health Statistics, College of Public Health, Zhengzhou University, 100 Kexue Avenue, Zhengzhou, 450001, Henan, China; Henan Province Engineering Research Center of Health Economics & Health Technology Assessment, Henan Province, China
|
37
|
Chu C, Wang S, Rudd JC, Ibrahim AMH, Xue Q, Devkota RN, Baker JA, Baker S, Simoneaux B, Opena G, Dong H, Liu X, Jessup KE, Chen MS, Hui K, Metz R, Johnson CD, Zhang ZS, Liu S. A new strategy for using historical imbalanced yield data to conduct genome-wide association studies and develop genomic prediction models for wheat breeding. Mol Breed 2022; 42:18. [PMID: 37309459 PMCID: PMC10248704 DOI: 10.1007/s11032-022-01287-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/31/2021] [Accepted: 03/02/2022] [Indexed: 06/14/2023]
Abstract
Using imbalanced historical yield data to predict performance and select new lines is an arduous breeding task. Genome-wide association studies (GWAS) and high-throughput genotyping based on sequencing techniques can increase prediction accuracy. An association mapping panel of 227 Texas elite (TXE) wheat breeding lines was used for GWAS and as a training population to develop prediction models for grain yield selection. An imbalanced set of yield data collected from 102 environments (year-by-location) over 10 years, through testing yield in 40-66 lines each year at 6-14 locations with 38-41 lines repeated in the test in any two consecutive years, was used. Based on correlations among data from different environments within two adjacent years and heritability estimated in each environment, yield data from 87 environments were selected and assigned to two correlation-based groups. The yield best linear unbiased estimation (BLUE) from each group, along with reaction to greenbug and Hessian fly in each line, was used for GWAS to reveal genomic regions associated with yield and insect resistance. A total of 74 genomic regions were associated with grain yield and two of them were commonly detected in both correlation-based groups. Greenbug resistance in TXE lines was mainly controlled by Gb3 on chromosome 7DL in addition to two novel regions on 3DL and 6DS, and Hessian fly resistance was conferred by the region on 1AS. Genomic prediction models developed in two correlation-based groups were validated using a set of 105 new advanced breeding lines and the model from correlation-based group G2 was more reliable for prediction. This research not only identified genomic regions associated with yield and insect resistance but also established the method of using historical imbalanced breeding data to develop a genomic prediction model for crop improvement. Supplementary Information The online version contains supplementary material available at 10.1007/s11032-022-01287-8.
Affiliation(s)
- Chenggen Chu
  - Texas A&M AgriLife Research Center, Amarillo, TX 79106 USA
  - Sugarbeet & Potato Research Unit, Edward T. Schafer Agricultural Research Center, USDA-ARS, Fargo, ND 58102 USA
- Shichen Wang
  - Genomics and Bioinformatics Service Center, Texas A&M AgriLife Research, College Station, TX 77843 USA
- Jackie C. Rudd
  - Texas A&M AgriLife Research Center, Amarillo, TX 79106 USA
- Amir M. H. Ibrahim
  - Soil and Crop Sciences Department, Texas A&M University, College Station, TX 77843 USA
- Qingwu Xue
  - Texas A&M AgriLife Research Center, Amarillo, TX 79106 USA
- Jason A. Baker
  - Texas A&M AgriLife Research Center, Amarillo, TX 79106 USA
- Shannon Baker
  - Texas A&M AgriLife Research Center, Amarillo, TX 79106 USA
- Bryan Simoneaux
  - Soil and Crop Sciences Department, Texas A&M University, College Station, TX 77843 USA
- Geraldine Opena
  - Soil and Crop Sciences Department, Texas A&M University, College Station, TX 77843 USA
- Haixiao Dong
  - Soil and Crop Sciences Department, Washington State University, Pullman, WA 99164 USA
- Xiaoxiao Liu
  - Texas A&M AgriLife Research Center, Amarillo, TX 79106 USA
- Kirk E. Jessup
  - Texas A&M AgriLife Research Center, Amarillo, TX 79106 USA
- Ming-Shun Chen
  - Hard Winter Wheat Genetics Research Unit, USDA-ARS, Manhattan, KS 66506 USA
- Kele Hui
  - Texas A&M AgriLife Research Center, Amarillo, TX 79106 USA
- Richard Metz
  - Genomics and Bioinformatics Service Center, Texas A&M AgriLife Research, College Station, TX 77843 USA
- Charles D. Johnson
  - Genomics and Bioinformatics Service Center, Texas A&M AgriLife Research, College Station, TX 77843 USA
- Zhiwu S. Zhang
  - Soil and Crop Sciences Department, Washington State University, Pullman, WA 99164 USA
- Shuyu Liu
  - Texas A&M AgriLife Research Center, Amarillo, TX 79106 USA
|
38
|
Sadeghi S, Khalili D, Ramezankhani A, Mansournia MA, Parsaeian M. Diabetes mellitus risk prediction in the presence of class imbalance using flexible machine learning methods. BMC Med Inform Decis Mak 2022; 22:36. [PMID: 35139846 PMCID: PMC8830137 DOI: 10.1186/s12911-022-01775-z] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 10/10/2021] [Accepted: 02/07/2022] [Indexed: 12/24/2022]
Abstract
Background Early detection and prediction of type 2 diabetes mellitus incidence from baseline measurements could reduce associated complications in the future. The low incidence rate of diabetes in comparison with non-diabetes makes accurate prediction of the minority diabetes class more challenging. Methods The performance of a deep neural network (DNN), extreme gradient boosting (XGBoost), and random forest (RF) is compared in predicting the minority diabetes class in Tehran Lipid and Glucose Study (TLGS) cohort data. Changing the threshold, cost-sensitive learning, and over- and under-sampling strategies are compared as solutions to class imbalance for improving the algorithms' performance. Results DNN, with the highest accuracy in predicting diabetes (54.8%), outperformed XGBoost and RF in terms of AUROC, g-mean, and f1-measure on the original imbalanced data. Changing the threshold based on the maximum of the f1-measure improved g-mean and f1-measure in all three algorithms. Repeated edited nearest neighbors (RENN) under-sampling in DNN and cost-sensitive learning in tree-based algorithms were the best solutions to tackle the imbalance issue. RENN increased ROC and Precision-Recall AUCs, g-mean and f1-measure from 0.857, 0.603, 0.713, 0.575 to 0.862, 0.608, 0.773, 0.583, respectively, in DNN. Weighting improved g-mean and f1-measure from 0.667, 0.554 to 0.776, 0.588 in XGBoost, and from 0.659, 0.543 to 0.775, 0.566 in RF, respectively. Also, ROC and Precision-Recall AUCs in RF increased from 0.840, 0.578 to 0.846, 0.591, respectively. Conclusion G-mean showed the largest increase across all imbalance solutions. Weighting and changing the threshold are efficient strategies that handle class imbalance faster than resampling methods. Among sampling strategies, under-sampling methods performed better than the others.
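The threshold-changing strategy mentioned in the abstract can be sketched as a search over candidate cut-offs for the one that maximizes F1, with g-mean reported alongside. This is an illustrative sketch only: the function names, the use of the observed scores as candidate thresholds, and the six-record toy data are assumptions rather than the study's procedure.

```python
import math


def f1_and_gmean(y_true, scores, threshold):
    """Return (F1, g-mean) when scores >= threshold are predicted as class 1."""
    tp = fp = tn = fn = 0
    for t, s in zip(y_true, scores):
        pred = 1 if s >= threshold else 0
        if t == 1 and pred == 1:
            tp += 1
        elif t == 0 and pred == 1:
            fp += 1
        elif t == 0 and pred == 0:
            tn += 1
        else:
            fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0       # sensitivity
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    gmean = math.sqrt(recall * specificity)             # geometric mean of the two rates
    return f1, gmean


def best_f1_threshold(y_true, scores):
    """Pick the observed score that gives the highest F1 when used as the cut-off."""
    return max(sorted(set(scores)), key=lambda th: f1_and_gmean(y_true, scores, th)[0])


# Hypothetical predicted probabilities for four negatives and two positives.
y_true = [0, 0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.45, 0.4, 0.6]
chosen = best_f1_threshold(y_true, scores)
default_f1, default_gmean = f1_and_gmean(y_true, scores, 0.5)
tuned_f1, tuned_gmean = f1_and_gmean(y_true, scores, chosen)
```

In practice the threshold would be chosen on a validation split and then applied unchanged to the test data, otherwise the reported F1 is optimistically biased.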
Affiliation(s)
- Somayeh Sadeghi
  - Department of Epidemiology and Biostatistics, School of Public Health, Tehran University of Medical Sciences, P.O. Box 14155-6446, Tehran, Iran
- Davood Khalili
  - Prevention of Metabolic Disorders Research Center, Research Institute for Endocrine Sciences, Shahid Beheshti University of Medical Sciences, Tehran, Iran; Department of Biostatistics and Epidemiology, Research Institute for Endocrine Sciences, Shahid Beheshti University of Medical Sciences, Tehran, Iran
- Azra Ramezankhani
  - Prevention of Metabolic Disorders Research Center, Research Institute for Endocrine Sciences, Shahid Beheshti University of Medical Sciences, Tehran, Iran
- Mohammad Ali Mansournia
  - Department of Epidemiology and Biostatistics, School of Public Health, Tehran University of Medical Sciences, P.O. Box 14155-6446, Tehran, Iran
- Mahboubeh Parsaeian
  - Department of Epidemiology and Biostatistics, School of Public Health, Tehran University of Medical Sciences, P.O. Box 14155-6446, Tehran, Iran
|
39
|
Chen W, Han X, Wang J, Cao Y, Jia X, Zheng Y, Zhou J, Zeng W, Wang L, Shi H, Feng J. Deep diagnostic agent forest (DDAF): A deep learning pathogen recognition system for pneumonia based on CT. Comput Biol Med 2021; 141:105143. [PMID: 34953357 DOI: 10.1016/j.compbiomed.2021.105143] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 10/19/2021] [Revised: 12/05/2021] [Accepted: 12/12/2021] [Indexed: 11/03/2022]
Abstract
BACKGROUND Even though antibiotic agents are widely used, pneumonia is still one of the most common causes of death around the world. Some severe, fast-spreading pneumonias can even have a huge impact on the global economy and life security. In order to give optimal medication regimens and prevent infectious pneumonia from spreading, recognition of pathogens is important. METHOD In this single-institution retrospective study, 2,353 patients with their CT volumes are included, each of whom was infected by one of 12 known kinds of pathogens. We propose Deep Diagnostic Agent Forest (DDAF) to recognize the pathogen of a patient based on the patient's CT volume, which is a challenging multiclass classification problem with large intraclass variations, small interclass variations, and very imbalanced data. RESULTS The model achieves 0.899 ± 0.004 multi-way area under the receiver operating characteristic curve (AUC) for level-I pathogen recognition, which covers five rough groups of pathogens, and 0.851 ± 0.003 AUC for level-II recognition, which covers 12 fine-level pathogens. The model also outperforms the average result of seven human readers in level-I recognition and outperforms all readers in level-II recognition, who can only reach an average result of 7.71 ± 4.10% accuracy. CONCLUSION A deep learning model can help recognize pathogens using CTs only, which might help accelerate the process of etiological diagnosis.
Collapse
Affiliation(s)
- Weixiang Chen
- Department of Automation, Beijing National Research Center for Information Science and Technology, Tsinghua University, Beijing, China
| | - Xiaoyu Han
- Department of Radiology, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China; Department of Laboratory Medicine, Liyuan Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China
| | - Jian Wang
- Department of Clinical Laboratory, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China; Research Center for Tissue Engineering and Regenerative Medicine, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China
| | - Yukun Cao
- Department of Radiology, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China; Department of Laboratory Medicine, Liyuan Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China
| | - Xi Jia
- Department of Radiology, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China; Department of Laboratory Medicine, Liyuan Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China
| | - Yuting Zheng
- Department of Radiology, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China; Department of Laboratory Medicine, Liyuan Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China
| | - Jie Zhou
- Department of Automation, Beijing National Research Center for Information Science and Technology, Tsinghua University, Beijing, China
| | - Wenjuan Zeng
- Department of Clinical Laboratory, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China; Research Center for Tissue Engineering and Regenerative Medicine, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China.
| | - Lin Wang
- Department of Clinical Laboratory, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China; Research Center for Tissue Engineering and Regenerative Medicine, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China.
| | - Heshui Shi
- Department of Radiology, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China; Department of Laboratory Medicine, Liyuan Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China.
| | - Jianjiang Feng
- Department of Automation, Beijing National Research Center for Information Science and Technology, Tsinghua University, Beijing, China.
| |
|
40
|
Wu J, Shen J, Xu M, Shao M. A novel combined dynamic ensemble selection model for imbalanced data to detect COVID-19 from complete blood count. Comput Methods Programs Biomed 2021; 211:106444. [PMID: 34614451 PMCID: PMC8479386 DOI: 10.1016/j.cmpb.2021.106444] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 06/27/2021] [Accepted: 09/22/2021] [Indexed: 06/01/2023]
Abstract
BACKGROUND As blood testing is radiation-free, low-cost and simple to operate, some researchers use machine learning to detect COVID-19 from blood test data. However, few studies take into consideration the imbalanced data distribution, which can impair the performance of a classifier. METHOD A novel combined dynamic ensemble selection (DES) method is proposed for imbalanced data to detect COVID-19 from complete blood count. This method combines data preprocessing and improved DES. First, we use the hybrid synthetic minority over-sampling technique and edited nearest neighbor (SMOTE-ENN) to balance the data and remove noise. Second, to improve the performance of DES, a novel hybrid multiple clustering and bagging classifier generation (HMCBCG) method is proposed to reinforce the diversity and local regional competence of candidate classifiers. RESULTS The experimental results based on three popular DES methods show that HMCBCG performs better than bagging alone. HMCBCG+KNE obtains the best performance for COVID-19 screening with 99.81% accuracy, 99.86% F1, 99.78% G-mean and 99.81% AUC. CONCLUSION Compared to other advanced methods, our combined DES model can improve the accuracy, G-mean, F1 and AUC of COVID-19 screening.
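The over-sampling half of SMOTE-ENN mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `smote_like_oversample`, the sample values, and the parameter defaults are all hypothetical, and the ENN cleaning step and the HMCBCG classifier generation are not shown. The sketch only demonstrates the core SMOTE idea of interpolating between a minority sample and one of its nearest minority neighbours.

```python
import math
import random


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (SMOTE's core idea)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by Euclidean distance to the base point.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # Pick a random position on the segment between base and neighbour.
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Hypothetical two-feature minority class (illustrative values only).
minority_samples = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5)]
new_points = smote_like_oversample(minority_samples, n_new=6)
```

Because every synthetic point lies on a segment between two real minority samples, the new points stay inside the region already occupied by the minority class rather than duplicating existing rows.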
Affiliation(s)
- Jiachao Wu
- College of Management and Economics, Tianjin University, Tianjin, 300072, China
| | - Jiang Shen
- College of Management and Economics, Tianjin University, Tianjin, 300072, China
| | - Man Xu
- Business School, Nankai University, Tianjin, 300071, China
| | - Minglai Shao
- School of New Media and Communication, Tianjin University, Tianjin, 300072, China.
| |
|
41
|
Sayed GI, Soliman MM, Hassanien AE. A novel melanoma prediction model for imbalanced data using optimized SqueezeNet by bald eagle search optimization. Comput Biol Med 2021; 136:104712. [PMID: 34388470 DOI: 10.1016/j.compbiomed.2021.104712] [Citation(s) in RCA: 24] [Impact Index Per Article: 8.0] [Received: 04/23/2021] [Revised: 07/28/2021] [Accepted: 07/28/2021] [Indexed: 10/20/2022]
Abstract
Skin lesion classification plays a crucial role in diagnosing various gene and related local medical cases in the field of dermoscopy. In this paper, a new model for the classification of skin lesions as either normal or melanoma is presented. The proposed melanoma prediction model was evaluated on a large publicly available dataset called ISIC 2020. The main challenge of this dataset is severe class imbalance. This paper proposes an approach to overcome this problem using a random over-sampling method followed by data augmentation. Moreover, a new hybrid version of a convolutional neural network architecture and bald eagle search (BES) optimization is proposed. The BES algorithm is used to find the optimal values of the hyperparameters of a SqueezeNet architecture. The proposed melanoma skin cancer prediction model obtained an overall accuracy of 98.37%, specificity of 96.47%, sensitivity of 100%, f-score of 98.40%, and area under the curve of 99%. The experimental results showed the robustness and efficiency of the proposed model compared with VGG19, GoogleNet, and ResNet50. Additionally, the results showed that the proposed model was very competitive compared with the state of the art.
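The random over-sampling step described in this abstract can be sketched as follows. This is an illustrative assumption, not the paper's pipeline: the function `random_oversample`, the image identifiers, and the class counts are hypothetical, and the subsequent data augmentation and BES hyperparameter search are not shown.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen samples of each smaller class until every
    class has as many samples as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for cls, count in counts.items():
        # Candidates for duplication: the original samples of this class.
        pool = [s for s, y in zip(samples, labels) if y == cls]
        for _ in range(target - count):
            out_samples.append(rng.choice(pool))
            out_labels.append(cls)
    return out_samples, out_labels


# Hypothetical image identifiers with a 5:1 class imbalance.
image_ids = ["img0", "img1", "img2", "img3", "img4", "img5"]
classes = ["normal", "normal", "normal", "normal", "normal", "melanoma"]
balanced_ids, balanced_classes = random_oversample(image_ids, classes)
```

Plain duplication adds no new information, which is presumably why the abstract pairs it with augmentation so that repeated minority images are not pixel-identical during training.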
Affiliation(s)
| | - Mona M Soliman
- Faculty of Computers and Artificial Intelligence, Cairo University, Giza, Egypt
| | - Aboul Ella Hassanien
- Faculty of Computers and Artificial Intelligence, Cairo University, Giza, Egypt.
| |
|
42
|
Abstract
Challenges posed by imbalanced data are encountered in many real-world applications. One of the possible approaches to improve the classifier performance on imbalanced data is oversampling. In this paper, we propose the new selective oversampling approach (SOA) that first isolates the most representative samples from minority classes by using an outlier detection technique and then utilizes these samples for synthetic oversampling. We show that the proposed approach improves the performance of two state-of-the-art oversampling methods, namely, the synthetic minority oversampling technique and adaptive synthetic sampling. The prediction performance is evaluated on four synthetic datasets and four real-world datasets, and the proposed SOA methods always achieved the same or better performance than other considered existing oversampling methods.
Affiliation(s)
- Peter Gnip
- Department of Computers and Informatics, Technical University of Košice, Slovak Republic
| | - Liberios Vokorokos
- Department of Computers and Informatics, Technical University of Košice, Slovak Republic
| | - Peter Drotár
- Department of Computers and Informatics, Technical University of Košice, Slovak Republic
| |
|
43
|
Xiao Y, Wu J, Lin Z. Cancer diagnosis using generative adversarial networks based on deep learning from imbalanced data. Comput Biol Med 2021; 135:104540. [PMID: 34153791 DOI: 10.1016/j.compbiomed.2021.104540] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Received: 12/22/2020] [Revised: 05/14/2021] [Accepted: 05/26/2021] [Indexed: 11/19/2022]
Abstract
BACKGROUND AND OBJECTIVE Cancer is a serious global disease due to its high mortality, and the key to effective treatment is accurate diagnosis. However, limited by sampling difficulty and actual sample size in clinical practice, data imbalance is a common problem in cancer diagnosis, while most conventional classification methods assume balanced data distribution. Therefore, addressing the imbalanced learning problem to improve the predictive performance of cancer diagnosis is significant. METHODS In the study, we dissect the data imbalance prevalent in cancer gene expression data and present an improved deep learning based Wasserstein generative adversarial network (WGAN) model, which provides a reliable training progress indicator and deeply explores the characteristics of data. The WGAN generates new samples from the minority class and solves the imbalance problem at the data level. RESULTS We analyze three publicly available data sets on RNA-seq of three kinds of cancer using the proposed WGAN and compare the results with those from two commonly adopted sampling methods. According to the results, through addressing the data imbalance problem, the balanced data distribution and the expanding sample size increase the prediction accuracy in all three data sets. CONCLUSIONS Therefore, the proposed WGAN method is superior in solving the imbalanced learning problem of gene expression data, providing significantly better prediction performance in cancer diagnosis.
Affiliation(s)
- Yawen Xiao
- Department of Automation, Shanghai Jiao Tong University, Shanghai, 200240, China.
| | - Jun Wu
- The Center for Bioinformatics and Computational Biology, East China Normal University, Shanghai, 200241, China.
| | - Zongli Lin
- Department of Electrical and Computer Engineering, University of Virginia, Charlottesville, VA, 22904-4743, USA.
| |
|
44
|
Chan JH, Li C. Learning from imbalanced COVID-19 chest X-ray (CXR) medical imaging data. Methods 2021; 202:31-39. [PMID: 34090971 DOI: 10.1016/j.ymeth.2021.06.002] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 02/02/2021] [Revised: 05/07/2021] [Accepted: 06/01/2021] [Indexed: 12/24/2022] Open
Abstract
The trendy task of digital medical image analysis has been continually evolving. It has been an area of prominent and growing importance from both research and deployment perspectives. Nonetheless, it is necessary to realize that the use of algorithms and methodology, as well as the source of medical image data, must be strictly scrutinized. As the COVID-19 pandemic has been gripping much of the world recently, much effort has gone into developing affordable testing for the masses, and it has been shown that established and widely available chest X-ray (CXR) images may be used as a screening criterion for assistive diagnosis purposes. Thanks to the dedicated work of various individuals and organizations, CXR images of COVID-19 subjects are publicly available for analytic usage. We have also provided a publicly available CXR dataset on the Kaggle platform. As a case study, this paper presents a systematic approach to learning from a typically imbalanced set of CXR images, which consists of a limited number of publicly available COVID-19 images. Our results show that we are able to outperform the top finishers in a related Kaggle multi-class CXR challenge. The proposed methodology should be able to help guide medical personnel in obtaining a robust diagnosis model to discern COVID-19 from other conditions confidently.
Affiliation(s)
- Jonathan H Chan
- Innovative Cognitive Computing (IC2) Research Center, School of Information Technology, King Mongkut's University of Technology Thonburi, Bangkok, Thailand.
| | - Chenqi Li
- Division of Engineering Science, Faculty of Applied Science and Engineering, University of Toronto, Toronto, Canada.
| |
|
45
|
Chiong R, Budhi GS, Dhakal S, Chiong F. A textual-based featuring approach for depression detection using machine learning classifiers and social media texts. Comput Biol Med 2021; 135:104499. [PMID: 34174760 DOI: 10.1016/j.compbiomed.2021.104499] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 02/12/2021] [Revised: 05/12/2021] [Accepted: 05/14/2021] [Indexed: 10/21/2022]
Abstract
Depression is one of the leading causes of suicide worldwide. However, a large percentage of cases of depression go undiagnosed and, thus, untreated. Previous studies have found that messages posted by individuals with major depressive disorder on social media platforms can be analysed to predict if they are suffering, or likely to suffer, from depression. This study aims to determine whether machine learning could be effectively used to detect signs of depression in social media users by analysing their social media posts, especially when those messages do not explicitly contain specific keywords such as 'depression' or 'diagnosis'. To this end, we investigate several text preprocessing and textual-based featuring methods along with machine learning classifiers, including single and ensemble models, to propose a generalised approach for depression detection using social media texts. We first use two public, labelled Twitter datasets to train and test the machine learning models, and then another three non-Twitter depression-class-only datasets (sourced from Facebook, Reddit, and an electronic diary) to test the performance of our trained models against other social media sources. Experimental results indicate that the proposed approach is able to effectively detect depression via social media texts even when the training datasets do not contain specific keywords (such as 'depression' and 'diagnose'), as well as when unrelated datasets are used for testing.
|
46
|
Ma JH, Feng Z, Wu JY, Zhang Y, Di W. Learning from imbalanced fetal outcomes of systemic lupus erythematosus in artificial neural networks. BMC Med Inform Decis Mak 2021; 21:127. [PMID: 33845834 PMCID: PMC8042715 DOI: 10.1186/s12911-021-01486-x] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 01/11/2021] [Accepted: 03/31/2021] [Indexed: 11/10/2022] Open
Abstract
OBJECTIVE To explore an effective algorithm based on artificial neural networks that correctly identifies the minority of pregnant women with SLE who suffer fetal loss among the majority with live births, and to train a well-behaved model as a clinical decision assistant. METHODS We integrated the ideas of comparative and focused study into the artificial neural network and presented an effective algorithm aimed at imbalanced learning in small datasets. RESULTS We collected 469 non-trivial pregnant patients with SLE, of whom 420 had live-birth outcomes and the other 49 ended in fetal loss. A well-trained imbalanced-learning model had a high sensitivity of 19/21 (approximately 90.5%) for the identification of patients with fetal loss outcomes. DISCUSSION The misprediction of the two patients was explainable. Algorithm improvements in the artificial neural network framework enhanced identification in imbalanced learning problems, and the external validation increased the reliability of the algorithm. CONCLUSION The well-trained model was fully qualified to assist healthcare providers in making timely and accurate decisions.
Affiliation(s)
- Jing-Hang Ma
- Department of Obstetrics and Gynecology, Ren Ji Hospital, School of Medicine, Shanghai Jiao Tong University, Shanghai, China
- Shanghai Key Laboratory of Gynecologic Oncology, Ren Ji Hospital, School of Medicine, Shanghai Jiao Tong University, Shanghai, China
- State Key Laboratory of Oncogenes and Related Genes, Shanghai Cancer Institute, Ren Ji Hospital, School of Medicine, Shanghai Jiao Tong University, Shanghai, China
- First Affiliated Hospital of Wenzhou Medical University, Wenzhou, China
| | - Zhen Feng
- First Affiliated Hospital of Wenzhou Medical University, Wenzhou, China
| | - Jia-Yue Wu
- Department of Obstetrics and Gynecology, Ren Ji Hospital, School of Medicine, Shanghai Jiao Tong University, Shanghai, China
- Shanghai Key Laboratory of Gynecologic Oncology, Ren Ji Hospital, School of Medicine, Shanghai Jiao Tong University, Shanghai, China
- State Key Laboratory of Oncogenes and Related Genes, Shanghai Cancer Institute, Ren Ji Hospital, School of Medicine, Shanghai Jiao Tong University, Shanghai, China
| | - Yu Zhang
- Shanghai Key Laboratory of Gynecologic Oncology, Ren Ji Hospital, School of Medicine, Shanghai Jiao Tong University, Shanghai, China
- State Key Laboratory of Oncogenes and Related Genes, Shanghai Cancer Institute, Ren Ji Hospital, School of Medicine, Shanghai Jiao Tong University, Shanghai, China
| | - Wen Di
- Department of Obstetrics and Gynecology, Ren Ji Hospital, School of Medicine, Shanghai Jiao Tong University, Shanghai, China.
- Shanghai Key Laboratory of Gynecologic Oncology, Ren Ji Hospital, School of Medicine, Shanghai Jiao Tong University, Shanghai, China.
- State Key Laboratory of Oncogenes and Related Genes, Shanghai Cancer Institute, Ren Ji Hospital, School of Medicine, Shanghai Jiao Tong University, Shanghai, China.
| |
|
47
|
Srinivasan R, Subalalitha CN. Sentimental analysis from imbalanced code-mixed data using machine learning approaches. Distrib Parallel Databases 2021; 41:37-52. [PMID: 33776212 PMCID: PMC7980744 DOI: 10.1007/s10619-021-07331-4] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Accepted: 03/03/2021] [Indexed: 05/29/2023]
Abstract
Knowledge discovery from various perspectives has become a crucial asset in almost all fields. Sentimental analysis is a classification task used to classify sentences based on the meaning of their context. This paper addresses the class imbalance problem, which is one of the important issues in sentimental analysis. Few works have focused on sentimental analysis with imbalanced class label distribution. The paper also focuses on another aspect of the problem, which involves a concept called "Code Mixing". Code-mixed data consists of text alternating between two or more languages. Class imbalance is a commonly noted phenomenon in code-mixed data. Existing works have focused more on analyzing sentiments in monolingual data than in code-mixed data. This paper addresses all these issues and proposes a solution for analyzing sentiments in class-imbalanced code-mixed data using a sampling technique combined with the Levenshtein distance metric. Furthermore, this paper compares the performance of various machine learning approaches, namely Random Forest Classifier, Logistic Regression, XGBoost classifier, Support Vector Machine and Naïve Bayes Classifier, using F1-score.
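The Levenshtein distance named in this abstract counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. The sketch below is a standard dynamic-programming version given for illustration only; how the paper combines the distance with its sampling technique is not specified in the abstract, and the example words are arbitrary.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the edit distance between a and b using two rolling rows."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

In code-mixed text, a small edit distance is one plausible way to group spelling variants of the same transliterated word before sampling or feature extraction.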
Affiliation(s)
- R. Srinivasan
- Department of Computer Science and Engineering, SRM Institute of Science and Technology, Kattankulathur, 603 203 India
| | - C. N. Subalalitha
- Department of Computer Science and Engineering, SRM Institute of Science and Technology, Kattankulathur, 603 203 India
| |
|
48
|
Wang X, Zhai M, Ren Z, Ren H, Li M, Quan D, Chen L, Qiu L. Exploratory study on classification of diabetes mellitus through a combined Random Forest Classifier. BMC Med Inform Decis Mak 2021; 21:105. [PMID: 33743696 PMCID: PMC7980612 DOI: 10.1186/s12911-021-01471-4] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Received: 10/26/2020] [Accepted: 03/11/2021] [Indexed: 12/23/2022] Open
Abstract
BACKGROUND Diabetes Mellitus (DM) has become the third major chronic non-communicable disease affecting patients, after tumors and cardiovascular and cerebrovascular diseases, and is one of the major public health problems in the world. Therefore, it is of great importance to identify individuals at high risk for DM in order to establish prevention strategies. METHODS To address the high-dimensional feature space and high feature redundancy of medical data, as well as the data imbalance often encountered, this study explored different supervised classifiers combined with SVM-SMOTE and two feature dimensionality reduction methods (logistic stepwise regression and LASSO) to classify diabetes survey sample data with unbalanced categories and complex related factors. The classification results of 4 supervised classifiers based on 4 data processing methods were analyzed and discussed. Five indicators, namely Accuracy, Precision, Recall, F1-Score and AUC, were selected as the key indicators for evaluating the performance of the classification model. RESULTS The Random Forest Classifier combined with SVM-SMOTE resampling and LASSO feature screening (Accuracy = 0.890, Precision = 0.869, Recall = 0.919, F1-Score = 0.893, AUC = 0.948) proved the best way to identify those at high risk of DM. In addition, the combined algorithm enhances classification performance for the prediction of people at high risk of DM. Age, region, heart rate, hypertension, hyperlipidemia and BMI are the six most critical variables affecting diabetes. CONCLUSIONS The Random Forest Classifier combined with SVM-SMOTE and LASSO feature reduction performs best in identifying people at high risk of DM. The combined method proposed in the study would be a good tool for early screening of DM.
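The F1-Score reported in this abstract is the harmonic mean of precision and recall, so the three reported values can be checked against each other. The short sketch below does that arithmetic; the function name `f1_score` is introduced here for illustration and is not taken from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Values as reported in the abstract for Random Forest + SVM-SMOTE + LASSO.
reported_precision = 0.869
reported_recall = 0.919
f1 = f1_score(reported_precision, reported_recall)
print(round(f1, 3))  # 0.893
```

The result matches the reported F1-Score of 0.893, which indicates that the three figures are internally consistent.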
Affiliation(s)
- Xuchun Wang
- Department of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, Shanxi, China
| | - Mengmeng Zhai
- Department of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, Shanxi, China
| | - Zeping Ren
- Shanxi Centre for Disease Control and Prevention, Taiyuan, 030012, Shanxi, China
| | - Hao Ren
- Department of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, Shanxi, China
| | - Meichen Li
- Department of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, Shanxi, China
| | - Dichen Quan
- Department of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, Shanxi, China
| | - Limin Chen
- Shanxi Provincial People's Hospital, Taiyuan City, Shanxi Province, China.
| | - Lixia Qiu
- Department of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, Shanxi, China.
| |
|
49
|
Zunair H, Hamza AB. Synthesis of COVID-19 chest X-rays using unpaired image-to-image translation. Soc Netw Anal Min 2021; 11:23. [PMID: 33643491 PMCID: PMC7903408 DOI: 10.1007/s13278-021-00731-5] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Received: 10/13/2020] [Revised: 01/05/2021] [Accepted: 02/04/2021] [Indexed: 12/28/2022]
Abstract
Motivated by the lack of publicly available datasets of chest radiographs of positive patients with coronavirus disease 2019 (COVID-19), we build the first-of-its-kind open dataset of synthetic COVID-19 chest X-ray images of high fidelity using an unsupervised domain adaptation approach by leveraging class conditioning and adversarial training. Our contributions are twofold. First, we show considerable performance improvements on COVID-19 detection using various deep learning architectures when employing synthetic images as an additional training set. Second, we show how our image synthesis method can serve as a data anonymization tool by achieving comparable detection performance when trained only on synthetic data. In addition, the proposed data generation framework offers a viable solution to COVID-19 detection in particular, and to medical image classification tasks in general. Our publicly available benchmark dataset (https://github.com/hasibzunair/synthetic-covid-cxr-dataset) consists of 21,295 synthetic COVID-19 chest X-ray images. The insights gleaned from this dataset can be used for preventive actions in the fight against the COVID-19 pandemic.
Affiliation(s)
- Hasib Zunair
- Concordia Institute for Information Systems Engineering, Concordia University, Montreal, QC Canada
| | - A Ben Hamza
- Concordia Institute for Information Systems Engineering, Concordia University, Montreal, QC Canada
| |
|
50
|
Sampath V, Maurtua I, Aguilar Martín JJ, Gutierrez A. A survey on generative adversarial networks for imbalance problems in computer vision tasks. J Big Data 2021; 8:27. [PMID: 33552840 PMCID: PMC7845583 DOI: 10.1186/s40537-021-00414-0] [Citation(s) in RCA: 22] [Impact Index Per Article: 7.3] [Received: 07/30/2020] [Accepted: 01/16/2021] [Indexed: 05/21/2023]
Abstract
Any computer vision application development starts off by acquiring images and data, followed by preprocessing and pattern recognition steps to perform a task. When the acquired images are highly imbalanced and not adequate, the desired task may not be achievable. Unfortunately, the occurrence of imbalance problems in acquired image datasets is inevitable in certain complex real-world problems such as anomaly detection, emotion recognition, medical image analysis, fraud detection, metallic surface defect detection, and disaster prediction. The performance of computer vision algorithms can significantly deteriorate when the training dataset is imbalanced. In recent years, Generative Adversarial Neural Networks (GANs) have gained immense attention from researchers across a variety of application domains due to their capability to model complex real-world image data. Importantly, GANs can not only be used to generate synthetic images; their adversarial learning idea has also shown good potential for restoring balance in imbalanced datasets. In this paper, we examine the most recent developments of GAN-based techniques for addressing imbalance problems in image data. The real-world challenges and implementations of synthetic image generation based on GANs are extensively covered in this survey. Our survey first introduces various imbalance problems in computer vision tasks and their existing solutions, and then examines key concepts such as deep generative image models and GANs. After that, we propose a taxonomy to summarize GAN-based techniques for addressing imbalance problems in computer vision tasks into three major categories: 1. image-level imbalances in classification, 2. object-level imbalances in object detection, and 3. pixel-level imbalances in segmentation tasks. We elaborate on the imbalance problems of each group and provide GAN-based solutions for each. Readers will understand how GAN-based techniques can handle the problem of imbalances and boost the performance of computer vision algorithms.
Affiliation(s)
- Vignesh Sampath
- Autonomous and Intelligent Systems Unit, Tekniker, Member of Basque Research and Technology Alliance, Eibar, Spain
- Design and Manufacturing Engineering Department, Universidad de Zaragoza, 3 María de Luna Street, Torres Quevedo Bld, 50018 Zaragoza, Spain
| | - Iñaki Maurtua
- Autonomous and Intelligent Systems Unit, Tekniker, Member of Basque Research and Technology Alliance, Eibar, Spain
| | - Juan José Aguilar Martín
- Design and Manufacturing Engineering Department, Universidad de Zaragoza, 3 María de Luna Street, Torres Quevedo Bld, 50018 Zaragoza, Spain
| | - Aitor Gutierrez
- Autonomous and Intelligent Systems Unit, Tekniker, Member of Basque Research and Technology Alliance, Eibar, Spain
| |
|