1
Ahsan MM, Ali MS, Siddique Z. Enhancing and improving the performance of imbalanced class data using novel GBO and SSG: A comparative analysis. Neural Netw 2024; 173:106157. [PMID: 38335796 DOI: 10.1016/j.neunet.2024.106157] [Received: 09/18/2023] [Revised: 01/01/2024] [Accepted: 02/01/2024] [Indexed: 02/12/2024]
Abstract
The class imbalance problem (CIP) in a dataset is a major challenge that significantly affects the performance of machine learning (ML) models, resulting in biased predictions. Numerous techniques have been proposed to address CIP, including, but not limited to, oversampling, undersampling, and cost-sensitive approaches. Because they can generate synthetic data, oversampling techniques such as the Synthetic Minority Oversampling Technique (SMOTE) are the methodology most widely used by researchers. However, one of SMOTE's potential disadvantages is that newly created minor samples overlap with major samples, which increases the probability that ML models are biased toward the major classes. Generative adversarial networks (GANs) have recently garnered much attention due to their ability to create realistic samples. However, GANs are hard to train even though they have much potential. Considering these opportunities, this work proposes two novel techniques, GAN-based Oversampling (GBO) and Support Vector Machine-SMOTE-GAN (SSG), to overcome the limitations of the existing approaches. The preliminary results show that SSG and GBO performed better on the nine imbalanced benchmark datasets than several existing SMOTE-based approaches. Additionally, the proposed SSG and GBO methods can classify the minor class with more than 90% accuracy when tested with 20%, 30%, and 40% of the data held out as test data. The study also revealed that the minor samples generated by SSG follow Gaussian distributions, which is often difficult to achieve using original SMOTE and SVM-SMOTE.
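The interpolation step behind SMOTE-style oversampling can be sketched in a few lines. This is a minimal illustration, not the GBO or SSG method proposed in the paper: the function name `smote_like_sample`, the toy `minority` points, and the choice of `k` are all assumptions made for the example.

```python
import random

def smote_like_sample(minority, k=3, rng=None):
    """Create one synthetic point by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    rng = rng or random.Random(0)
    base = rng.choice(minority)
    # Rank the other minority points by squared Euclidean distance to the base point.
    others = [p for p in minority if p is not base]
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
    neighbour = rng.choice(others[:k])
    gap = rng.random()  # position along the segment, in [0, 1)
    return [a + gap * (b - a) for a, b in zip(base, neighbour)]

# Hypothetical two-feature minority class.
minority = [[1.0, 2.0], [1.5, 1.8], [1.2, 2.4], [0.9, 2.1]]
synthetic = [smote_like_sample(minority, rng=random.Random(seed)) for seed in range(5)]
```

Because every synthetic point lies on a segment between two existing minority points, it can land inside a region dominated by the majority class, which is the overlap problem the abstract describes.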
Affiliation(s)
- Md Manjurul Ahsan
  - School of Industrial and Systems Engineering, University of Oklahoma, Norman, OK 73019, USA
- Md Shahin Ali
  - Department of Biomedical Engineering, Islamic University, Kushtia 7003, Bangladesh
- Zahed Siddique
  - School of Aerospace and Mechanical Engineering, University of Oklahoma, Norman, OK 73019, USA
2
Chen CC, Ting WC, Lee HC, Chang CC, Lin TC, Yang SF. A Cost-Effective Model for Predicting Recurrent Gastric Cancer Using Clinical Features. Diagnostics (Basel) 2024; 14:842. [PMID: 38667487 PMCID: PMC11049390 DOI: 10.3390/diagnostics14080842] [Received: 03/11/2024] [Revised: 04/14/2024] [Accepted: 04/15/2024] [Indexed: 04/28/2024] Open
Abstract
This study used artificial intelligence techniques to identify clinical cancer biomarkers for recurrent gastric cancer survivors. From a hospital-based cancer registry database in Taiwan, datasets on the incidence of recurrence and clinical risk features were collected for 2476 gastric cancer survivors. We benchmarked Random Forest against MLP, C4.5, AdaBoost, and Bagging algorithms, and leveraged the synthetic minority oversampling technique (SMOTE) for imbalanced dataset issues, cost-sensitive learning for risk assessment, and SHapley Additive exPlanations (SHAP) for feature importance analysis. Our proposed Random Forest outperformed the other models with an accuracy of 87.9%, a recall of 90.5%, a precision of 86%, and an F1 of 88.2% on the recurrent category under 10-fold cross-validation on a balanced dataset. The top five clinical features of recurrent gastric cancer were stage, number of involved regional lymph nodes, Helicobacter pylori, BMI (body mass index), and gender; these features significantly affect the prediction model's output and merit attention in subsequent causal effect analysis. Using an artificial intelligence model, the risk factors for recurrent gastric cancer could be identified and cost-effectively ranked according to their feature importance. These clinical features can also help physicians screen for high-risk patients among gastric cancer survivors.
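The reported metrics can be cross-checked with the standard F1 definition. The sketch below takes the 86% figure as the precision and 90.5% as the recall, which is an assumption about how the abstract's numbers map onto the formula; the function name `f1_score` is illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Assumed mapping: 86% read as precision, 90.5% as recall.
f1 = f1_score(precision=0.86, recall=0.905)
print(f"F1 = {f1:.3f}")  # prints "F1 = 0.882"
```

Under that reading, the result matches the F1 of 88.2% reported for the recurrent category.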
Affiliation(s)
- Chun-Chia Chen
  - Institute of Medicine, Chung Shan Medical University, Taichung 40201, Taiwan
  - Division of Plastic Surgery, Department of Surgery, Chi Mei Medical Center, Tainan 704, Taiwan
  - Division of Colorectal Surgery, Department of Surgery, Chung Shan Medical University Hospital, Taichung 40201, Taiwan
- Wen-Chien Ting
  - Division of Colorectal Surgery, Department of Surgery, Chung Shan Medical University Hospital, Taichung 40201, Taiwan
  - School of Medicine, Chung Shan Medical University, Taichung 40201, Taiwan
- Hsi-Chieh Lee
  - Department of Computer Science and Information Engineering, National Quemoy University, Kinmen County 892, Taiwan
- Chi-Chang Chang
  - School of Medical Informatics, Chung Shan Medical University & IT Office, Chung Shan Medical University Hospital, Taichung 40201, Taiwan
  - Department of Information Management, Ming Chuan University, Taoyuan City 33300, Taiwan
- Tsung-Chieh Lin
  - Department of Computer Science and Information Engineering, National Quemoy University, Kinmen County 892, Taiwan
- Shun-Fa Yang
  - Institute of Medicine, Chung Shan Medical University, Taichung 40201, Taiwan
3
Khan MM, Alkhathami M. Anomaly detection in IoT-based healthcare: machine learning for enhanced security. Sci Rep 2024; 14:5872. [PMID: 38467709 PMCID: PMC10928137 DOI: 10.1038/s41598-024-56126-x] [Received: 12/03/2023] [Accepted: 02/29/2024] [Indexed: 03/13/2024] Open
Abstract
Internet of Things (IoT) integration in healthcare improves patient care while also making healthcare delivery systems more effective and economical. To fully realize the advantages of IoT in healthcare, it is imperative to overcome issues with data security, interoperability, and ethical considerations. IoT sensors periodically measure patients' health-related data and share it with a server for further evaluation. At the server, different machine learning algorithms are applied, which help in the early diagnosis of diseases and issue alerts when vital signs are out of the normal range. Different cyber attacks can be launched on IoT devices, which can compromise the security and privacy of applications such as health care. In this paper, we utilize the publicly available Canadian Institute for Cybersecurity (CIC) IoT dataset to model machine learning techniques for efficient detection of anomalous network traffic. The dataset consists of 33 types of IoT attacks, which are divided into 7 main categories. In the current study, the dataset is pre-processed, and a balanced representation of classes is used to generate unbiased supervised machine learning models (Random Forest, Adaptive Boosting, Logistic Regression, Perceptron, Deep Neural Network). These models are analyzed further by eliminating highly correlated features, reducing dimensionality, minimizing overfitting, and speeding up training times. Random Forest was found to perform optimally across binary and multiclass classification of IoT attacks, with an approximate accuracy of 99.55% in both the reduced and the full feature space. This improvement was complemented by a reduction in computational response time, which is essential for real-time attack detection and response.
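Eliminating highly correlated features, as mentioned above, is often done with a simple greedy filter on pairwise Pearson correlation. The following is a minimal sketch of that general idea, not the procedure used in the paper: the feature names, the values, and the 0.9 threshold are all hypothetical, and the code assumes no feature is constant.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def drop_correlated(features, threshold=0.9):
    """Keep a feature only if |r| with every already-kept feature is below the threshold."""
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical traffic features; byte_count is nearly proportional to packet_count.
features = {
    "packet_count": [10, 20, 30, 40, 50],
    "byte_count": [105, 198, 310, 402, 495],
    "duration": [5, 1, 4, 2, 3],
}
selected = drop_correlated(features)  # ["packet_count", "duration"]
```

The greedy order matters: whichever of two correlated features appears first is the one retained.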
Affiliation(s)
- Maryam Mahsal Khan
  - Department of Computer Science, CECOS University of IT and Emerging Sciences, Peshawar, 25000, Pakistan
- Mohammed Alkhathami
  - Information Systems Department, College of Computer and Information Sciences, Imam Mohammad Ibn Saud Islamic University (IMSIU), Riyadh, 11432, Saudi Arabia
4
Li J, Dai Y, Mu Z, Wang Z, Meng J, Meng T, Wang J. Choice of refractive surgery types for myopia assisted by machine learning based on doctors' surgical selection data. BMC Med Inform Decis Mak 2024; 24:41. [PMID: 38331788 PMCID: PMC10854042 DOI: 10.1186/s12911-024-02451-0] [Received: 07/05/2023] [Accepted: 02/02/2024] [Indexed: 02/10/2024] Open
Abstract
In recent years, corneal refractive surgery has been widely used in clinics as an effective means to restore vision and improve quality of life. Choosing a myopia-refractive surgery requires comprehensive consideration of the differences in equipment and technology as well as the specifics of individual patients, and the choice depends heavily on the experience of ophthalmologists. In our study, we used machine learning to learn from ophthalmologists' decision-making experience and to assist them in choosing a corneal refractive surgery for a new case. Our study was based on the clinical data of 7,081 patients who underwent corneal refractive surgery between 2000 and 2017 at the Department of Ophthalmology, Peking Union Medical College Hospital, Chinese Academy of Medical Sciences. Due to the long data period, there were data losses and errors in this dataset. First, we cleaned the data and deleted samples with missing key data. Then, patients were divided into three groups according to the type of surgery, after which we used SMOTE to eliminate the imbalance between groups. Six statistical machine learning models, including NBM, RF, AdaBoost, XGBoost, BP neural network, and DBN, were selected, and ten-fold cross-validation and grid search were used to determine the optimal hyperparameters for better performance. When tested on the dataset, the multi-class RF model showed the best performance, with agreement with ophthalmologist decisions as high as 0.8775 and Macro F1 as high as 0.8019. Furthermore, the results of the feature importance analysis based on the SHAP technique were consistent with ophthalmologists' practical experience. Our research will assist ophthalmologists in choosing appropriate types of refractive surgery and will have beneficial clinical effects.
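Macro F1, one of the two metrics reported above, averages the per-class F1 scores without weighting by class size, so a poorly predicted small class pulls the score down. A minimal sketch follows; the three labels "A", "B", "C" and the toy predictions are hypothetical stand-ins for the three surgery groups, not data from the study.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in sorted(set(y_true)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class is never hit.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical labels for three surgery groups.
y_true = ["A", "A", "A", "A", "B", "B", "C", "C"]
y_pred = ["A", "A", "A", "B", "B", "B", "C", "A"]
score = macro_f1(y_true, y_pred)  # per-class F1: A = 0.75, B = 0.8, C = 2/3
```

Here the plain agreement rate is 6/8 = 0.75 while the macro F1 is about 0.739, which illustrates why the two reported figures can differ.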
Affiliation(s)
- Jiajing Li
  - School of Artificial Intelligence, China University of Mining and Technology (Beijing), Beijing, China
  - Wangganzhicha Information Technology Inc., Nanjing, Jiangsu Province, China
- Yuanyuan Dai
  - School of Artificial Intelligence, China University of Mining and Technology (Beijing), Beijing, China
- Zhicheng Mu
  - School of Artificial Intelligence, China University of Mining and Technology (Beijing), Beijing, China
- Zhonghai Wang
  - Department of Ophthalmology, Peking Union Medical College Hospital, Chinese Academy of Medical Sciences, Beijing, China
  - Key Laboratory of Ocular Fundus Diseases, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing, China
- Juan Meng
  - Community Health Service Center of Douhudi Town, Gongan County, Jingzhou, Hubei Province, China
- Tao Meng
  - Wangganzhicha Information Technology Inc., Nanjing, Jiangsu Province, China
- Jimin Wang
  - Department of Information Management, Peking University, Beijing, China
5
Mohseni-Takalloo S, Mohseni H, Mozaffari-Khosravi H, Mirzaei M, Hosseinzadeh M. The effect of data balancing approaches on the prediction of metabolic syndrome using non-invasive parameters based on random forest. BMC Bioinformatics 2024; 25:18. [PMID: 38212697 PMCID: PMC10782700 DOI: 10.1186/s12859-024-05633-9] [Received: 04/10/2023] [Accepted: 01/02/2024] [Indexed: 01/13/2024] Open
Abstract
BACKGROUND: Metabolic syndrome (MetS) is a cluster of metabolic abnormalities (including obesity, insulin resistance, hypertension, and dyslipidemia) that can be used to identify populations at risk for diabetes and cardiovascular diseases, the main causes of morbidity and mortality worldwide. A simple approach for diagnosing MetS without biochemical tests would be highly valuable. The present study aimed to predict MetS using non-invasive features based on a random forest learning algorithm. In addition, to deal with the data imbalance that naturally exists in this type of data, the effect of two different data balancing approaches, the Synthetic Minority Over-sampling Technique (SMOTE) and Random Splitting data balancing (SplitBal), on model performance is investigated. RESULTS: The most important determinant for MetS prediction was waist circumference. When a random forest learning algorithm was applied to the imbalanced data, the trained models reached accuracies of 86.9% and 79.4% and sensitivities of 37.1% and 38.2% in men and women, respectively. The best results were obtained with the SplitBal data balancing technique: although the accuracy of the trained models decreased by 7.8% and 11.3%, their sensitivity improved significantly to 82.3% and 73.7% in men and women, respectively. CONCLUSIONS: The random forest learning method, along with data balancing techniques, especially SplitBal, could create MetS prediction models with promising results that can be applied as a useful prognostic tool in health screening programs.
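The abstract does not describe how SplitBal works internally. One common reading of splitting-based balancing, sketched below as an assumption rather than as the authors' procedure, is to partition the majority class at random into bins of roughly the minority-class size and pair each bin with the full minority class, so that one classifier can be trained per balanced bin and their predictions combined. The function name `balanced_bins` and the toy samples are hypothetical.

```python
import random

def balanced_bins(majority, minority, rng=None):
    """Split the majority class into random bins of about the minority size
    and pair each bin with all minority samples."""
    rng = rng or random.Random(0)
    shuffled = majority[:]
    rng.shuffle(shuffled)
    # Number of bins is the imbalance ratio, rounded, and at least one.
    n_bins = max(1, round(len(majority) / len(minority)))
    chunks = [shuffled[i::n_bins] for i in range(n_bins)]
    return [chunk + minority for chunk in chunks]

# Hypothetical samples: 12 majority (label 0) and 4 minority (label 1).
majority = [("neg", i) for i in range(12)]
minority = [("pos", i) for i in range(4)]
bins = balanced_bins(majority, minority)  # 3 bins of 4 majority + 4 minority each
```

Unlike oversampling, this approach creates no synthetic samples and discards no majority samples; each majority sample is used in exactly one bin.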
Affiliation(s)
- Sahar Mohseni-Takalloo
  - School of Public Health, Bam University of Medical Sciences, Bam, Iran
  - Research Center for Food Hygiene and Safety, School of Public Health, Shahid Sadoughi University of Medical Sciences, Yazd, Iran
  - Department of Nutrition, School of Public Health, Shahid Sadoughi University of Medical Sciences, Yazd, Iran
- Hadis Mohseni
  - Computer Engineering Department, Shahid Bahonar University of Kerman, Kerman, Iran
- Hassan Mozaffari-Khosravi
  - Research Center for Food Hygiene and Safety, School of Public Health, Shahid Sadoughi University of Medical Sciences, Yazd, Iran
  - Department of Nutrition, School of Public Health, Shahid Sadoughi University of Medical Sciences, Yazd, Iran
- Masoud Mirzaei
  - Yazd Cardiovascular Research Centre, Non-Communicable Diseases Research Institute, Shahid Sadoughi University of Medical Sciences, Yazd, Iran
- Mahdieh Hosseinzadeh
  - Research Center for Food Hygiene and Safety, School of Public Health, Shahid Sadoughi University of Medical Sciences, Yazd, Iran
  - Department of Nutrition, School of Public Health, Shahid Sadoughi University of Medical Sciences, Yazd, Iran
6
Museru ML, Nazari R, Giglou AN, Opare K, Karimi M. Advancing flood damage modeling for coastal Alabama residential properties: A multivariable machine learning approach. Sci Total Environ 2024; 907:167872. [PMID: 37852490 DOI: 10.1016/j.scitotenv.2023.167872] [Received: 07/06/2023] [Revised: 10/13/2023] [Accepted: 10/14/2023] [Indexed: 10/20/2023]
Abstract
Flooding is a global threat and predicting flood risk accurately is vital for effective mitigation and increasing society's awareness of the negative impacts of floods. Over the years, researchers have worked on physical and data-driven models to predict flood damage, striving to improve accuracy and understanding. However, the challenge lies in the scarcity and limitedness of comprehensive datasets needed to develop these models. This study aims to enhance the National Flood Insurance Program (NFIP) claims dataset from Hurricane Katrina in coastal Alabama to make it adequate for multi-variable flood damage assessment. The NFIP claims dataset was combined with the Alabama property dataset, simulated flood hazard information, and property location characteristics. Oversampling techniques are employed to address data imbalance in the datasets. Subsequently, several ensemble machine learning approaches, including random forest, extra tree, extreme gradient boosting, and categorical boosting, are utilized to develop multi-variable flood damage models. The validation of these models demonstrates that extreme gradient boosting performs best, achieving satisfactory results in identifying damaged properties with precision (0.89), recall (0.90), and F1-score (0.90), as well as determining relative damage with R-squared (0.59), root mean squared error (0.21), and Spearman correlation (0.70). Utilizing data oversampling techniques improves the model performance of imbalanced flood damage datasets. Despite the dataset's limitations and data augmentation techniques employed, the model's output explanation based on SHapley Additive exPlanations (SHAP) is constructive as it aligns with the study's expectations regarding the interaction of different features to produce the final results.
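The two regression metrics quoted for relative damage, R-squared and root mean squared error, follow directly from their textbook definitions. The sketch below uses made-up damage ratios purely for illustration; the variable names and values are assumptions, not data from the study.

```python
from math import sqrt

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical relative damage ratios (0 = no damage, 1 = total loss).
observed = [0.10, 0.40, 0.35, 0.80]
predicted = [0.20, 0.30, 0.45, 0.70]
error = rmse(observed, predicted)      # every prediction is off by 0.1
fit = r_squared(observed, predicted)
```

RMSE is expressed in the units of the target, so a value of 0.21 on a 0 to 1 damage ratio is a sizeable typical error, which is consistent with the moderate R-squared of 0.59 reported above.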
Affiliation(s)
- Mujungu Lawrence Museru
  - Sustainable Smart Cities Research Center, University of Alabama at Birmingham (UAB), Birmingham, AL, USA
  - Department of Civil, Construction, and Environmental Engineering, University of Alabama-Birmingham, Birmingham, AL 35294-4440, USA
- Rouzbeh Nazari
  - Sustainable Smart Cities Research Center, University of Alabama at Birmingham (UAB), Birmingham, AL, USA
  - Department of Civil, Construction, and Environmental Engineering, University of Alabama-Birmingham, Birmingham, AL 35294-4440, USA
  - Department of Environmental Health Science, School of Public Health, Ryals Public Health Building, University of Alabama at Birmingham, 1665 University Boulevard, Birmingham, AL 35294-0022, USA
- Abolfazl N Giglou
  - Sustainable Smart Cities Research Center, University of Alabama at Birmingham (UAB), Birmingham, AL, USA
  - Department of Civil, Construction, and Environmental Engineering, University of Alabama-Birmingham, Birmingham, AL 35294-4440, USA
- Kofi Opare
  - Sustainable Smart Cities Research Center, University of Alabama at Birmingham (UAB), Birmingham, AL, USA
  - Department of Civil, Construction, and Environmental Engineering, University of Alabama-Birmingham, Birmingham, AL 35294-4440, USA
- Maryam Karimi
  - Sustainable Smart Cities Research Center, University of Alabama at Birmingham (UAB), Birmingham, AL, USA
  - Department of Civil, Construction, and Environmental Engineering, University of Alabama-Birmingham, Birmingham, AL 35294-4440, USA
  - Department of Environmental Health Science, School of Public Health, Ryals Public Health Building, University of Alabama at Birmingham, 1665 University Boulevard, Birmingham, AL 35294-0022, USA
7
Nath A, Chaube R. Mining Chemogenomic Spaces for Prediction of Drug-Target Interactions. Methods Mol Biol 2024; 2714:155-169. [PMID: 37676598 DOI: 10.1007/978-1-0716-3441-7_9] [Indexed: 09/08/2023]
Abstract
The drug discovery pipeline consists of a number of processes, and drug-target interaction determination is one of its salient steps. Computational prediction of drug-target interactions can help reduce the search space of experimental wet lab-based verification steps, thus considerably reducing the time and other resources dedicated to the drug discovery pipeline. While machine learning-based methods are more widespread for drug-target interaction prediction, network-centric methods are also evolving. In this chapter, we focus on the process of drug-target interaction prediction from the perspective of machine learning algorithms and the various stages involved in developing an accurate predictor.
Affiliation(s)
- Abhigyan Nath
  - Department of Biochemistry, Pt. Jawahar Lal Nehru Memorial Medical College, Raipur, India
- Radha Chaube
  - Department of Zoology, Institute of Science, Banaras Hindu University, Varanasi, India
8
Xu Y, Park Y, Park JD, Sun B. Predicting Nurse Turnover for Highly Imbalanced Data Using the Synthetic Minority Over-Sampling Technique and Machine Learning Algorithms. Healthcare (Basel) 2023; 11:3173. [PMID: 38132063 PMCID: PMC10742910 DOI: 10.3390/healthcare11243173] [Received: 10/29/2023] [Revised: 12/11/2023] [Accepted: 12/13/2023] [Indexed: 12/23/2023] Open
Abstract
Predicting nurse turnover is a growing challenge within the healthcare sector, profoundly impacting healthcare quality and the nursing profession. This study employs the Synthetic Minority Over-sampling Technique (SMOTE) to address class imbalance issues in the 2018 National Sample Survey of Registered Nurses dataset and predict nurse turnover using machine learning algorithms. Four machine learning algorithms, namely logistic regression, random forests, decision tree, and extreme gradient boosting, were applied to the SMOTE-enhanced dataset. The data were split into 80% training and 20% validation sets. Eighteen carefully selected variables from the database served as predictive features, and the machine learning model identified age, working hours, electronic health record/electronic medical record, individual income, and job type as important features concerning nurse turnover. The study includes a performance comparison based on accuracy, precision, recall (sensitivity), F1-score, and AUC. In summary, the results demonstrate that SMOTE-enhanced random forests exhibit the most robust predictive power in both the classical approach (with all 18 predictive variables) and an optimized approach (utilizing eight key predictive variables). Extreme gradient boosting, decision tree, and logistic regression follow in performance. Notably, age emerges as the most influential factor in nurse turnover, with working hours, electronic health record/electronic medical record usability, individual income, and region also playing significant roles. This research offers valuable insights for healthcare researchers and stakeholders, aiding in selecting suitable machine learning algorithms for nurse turnover prediction.
Affiliation(s)
- Yuan Xu
  - School of Maritime Economics and Management, Collaborative Innovation Center for Transport Studies, Dalian Maritime University, 1 Linghai Road, Dalian 116026, China
- Yongshin Park
  - Department of Marketing, Operations, and Analytics, Bill Munday School of Business, St. Edward’s University, 3001 South Congress, Austin, TX 78704, USA
- Ju Dong Park
  - Department of Maritime Police and Production System, Gyeongsang National University, Tongyeong-si 53064, Gyeongsangnam-do, Republic of Korea
- Bora Sun
  - School of Nursing, The University of Texas Austin, 1710 Red River St., Austin, TX 78712, USA
9
Semary NA, Ahmed W, Amin K, Pławiak P, Hammad M. Improving sentiment classification using a RoBERTa-based hybrid model. Front Hum Neurosci 2023; 17:1292010. [PMID: 38130432 PMCID: PMC10733963 DOI: 10.3389/fnhum.2023.1292010] [Received: 09/10/2023] [Accepted: 11/23/2023] [Indexed: 12/23/2023] Open
Abstract
Introduction: Several attempts have been made to enhance the performance of text-based sentiment analysis, most prominently through classifiers and word embedding models. This work aims to develop a hybrid deep learning approach that combines the advantages of transformer models and sequence models while eliminating the shortcomings of sequence models. Methods: In this paper, we present a hybrid model based on the transformer model and deep learning models to enhance the sentiment classification process. Robustly optimized BERT (RoBERTa) was selected to produce the representative vectors of the input sentences, and the Long Short-Term Memory (LSTM) model in conjunction with the Convolutional Neural Network (CNN) model was used to improve the suggested model's ability to comprehend the semantics and context of each input sentence. We tested the proposed model on two datasets with different topics. The first dataset is a Twitter review of US airlines and the second is the IMDb movie reviews dataset. We propose using word embeddings in conjunction with the SMOTE technique to overcome the challenge of imbalanced classes in the Twitter dataset. Results: With an accuracy of 96.28% on the IMDb reviews dataset and 94.2% on the Twitter reviews dataset, the suggested hybrid model outperforms the standard methods. Discussion: These results indicate that the proposed hybrid RoBERTa-(CNN+LSTM) method is an effective model for sentiment classification.
Affiliation(s)
- Noura A. Semary
  - Department of Information Technology, Faculty of Computers and Information, Menoufia University, Shibin El Kom, Egypt
- Wesam Ahmed
  - Department of Information Technology, Faculty of Computers and Information, Menoufia University, Shibin El Kom, Egypt
  - Department of Information Technology, Faculty of Computers and Artificial Intelligence, South Valley University, Hurghada, Egypt
- Khalid Amin
  - Department of Information Technology, Faculty of Computers and Information, Menoufia University, Shibin El Kom, Egypt
- Paweł Pławiak
  - Department of Computer Science, Faculty of Computer Science and Telecommunications, Cracow University of Technology, Krakow, Poland
  - Institute of Theoretical and Applied Informatics, Polish Academy of Sciences, Gliwice, Poland
- Mohamed Hammad
  - Department of Information Technology, Faculty of Computers and Information, Menoufia University, Shibin El Kom, Egypt
  - EIAS Data Science Lab, College of Computer and Information Sciences, Prince Sultan University, Riyadh, Saudi Arabia
10
Chen J, Qi TD, Vu J, Wen Y. A deep learning approach for inpatient length of stay and mortality prediction. J Biomed Inform 2023; 147:104526. [PMID: 37852346 DOI: 10.1016/j.jbi.2023.104526] [Received: 06/05/2023] [Revised: 10/11/2023] [Accepted: 10/15/2023] [Indexed: 10/20/2023]
Abstract
PURPOSE: Accurate prediction of the Length of Stay (LoS) and mortality in the Intensive Care Unit (ICU) is crucial for effective hospital management, and it can assist clinicians with real-time demand capacity (RTDC) administration, thereby improving healthcare quality and service levels. METHODS: This paper proposes a novel one-dimensional (1D) multi-scale convolutional neural network architecture, namely 1D-MSNet, to predict inpatients' LoS and mortality in the ICU. First, a 1D multi-scale convolution framework is proposed to enlarge the convolutional receptive fields and enhance the richness of the convolutional features. Following the convolutional layers, an atrous causal spatial pyramid pooling (SPP) module is incorporated into the network to extract high-level features. The optimized Focal Loss (FL) function is combined with the synthetic minority over-sampling technique (SMOTE) to mitigate the imbalanced-class issue. RESULTS: On the MIMIC-IV v1.0 benchmark dataset, the proposed approach achieves the best R-Square and RMSE values of 0.57 and 3.61 for the LoS prediction, and the highest test accuracy of 97.73% for the mortality prediction. CONCLUSION: The proposed approach outperforms other state-of-the-art methods, and it can effectively perform the LoS and mortality prediction tasks.
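The abstract does not specify the "optimized" Focal Loss variant, so the sketch below shows only the standard binary focal loss, which down-weights well-classified examples through the factor (1 - p_t)^gamma. The default values gamma = 2 and alpha = 0.25 are commonly used settings, chosen here as assumptions rather than taken from the paper.

```python
from math import log

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Standard binary focal loss for one example:
    -alpha_t * (1 - p_t)**gamma * log(p_t),
    where p is the predicted probability of class 1 and y is the true label."""
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * log(p_t)

easy = focal_loss(p=0.95, y=1)  # confidently correct positive, near-zero loss
hard = focal_loss(p=0.10, y=1)  # badly misclassified positive, large loss
```

With gamma set to 0 the modulating factor disappears and the expression reduces to alpha-weighted cross-entropy, which makes the role of gamma easy to see.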
Affiliation(s)
- Junde Chen
  - Fowler School of Engineering, Chapman University, Orange 92866, CA, USA
- Trudi Di Qi
  - Fowler School of Engineering, Chapman University, Orange 92866, CA, USA
- Jacqueline Vu
  - Fowler School of Engineering, Chapman University, Orange 92866, CA, USA
- Yuxin Wen
  - Fowler School of Engineering, Chapman University, Orange 92866, CA, USA
11
Tang X, Wu Z, Liu W, Tian J, Liu L. Exploring effective ways to increase reliable positive samples for machine learning-based urban waterlogging susceptibility assessments. J Environ Manage 2023; 344:118682. [PMID: 37567005 DOI: 10.1016/j.jenvman.2023.118682] [Received: 01/08/2023] [Revised: 07/10/2023] [Accepted: 07/25/2023] [Indexed: 08/13/2023]
Abstract
Machine learning (ML)-based urban waterlogging susceptibility studies suffer from class imbalance, as fewer positive samples are generally available than potential negative samples. Few studies have considered optimizing the results by improving the quality of training samples. To address this issue, we explored effective approaches to reliably increase the number of positive samples for such studies. The Synthetic Minority Over-Sampling Technique (SMOTE) and the Optimized Seed Spread Algorithm (OSSA), representative of oversampling approaches (synthesizing new samples based on the feature space) and physical approaches (simulating the potential inundated area based on the mechanisms of water flow), respectively, were employed to increase the number of positive samples. Waterlogging in Shenzhen was selected as a case study using eight selected spatial variables. An elaborate experiment was conducted to compare the quality of the added samples based on the classifiers' performance and the accuracy of waterlogging susceptibility maps (WSMs). The results indicated that (1) classifiers trained with SMOTE-generated samples performed worse than those trained on the original samples, while the use of OSSA improved the trained classifiers, and (2) the accuracy of WSMs was not improved with SMOTE but increased markedly with OSSA. These results may be driven by the diversity of information and features of the added samples. This study indicates that SMOTE fails to synthesize reliable samples when applied to waterlogging analysis in Shenzhen, whereas an effective solution for generating reliable positive samples is to use OSSA, which simulates the potential submerged regions based on the mechanisms of disaster occurrence and spread.
Affiliation(s)
- Xianzhe Tang
  - Guangdong Province Key Laboratory for Land Use and Consolidation, South China Agricultural University, Guangzhou 510642, China
  - College of Natural Resources and Environment, Joint Institute for Environment & Education, South China Agricultural University, Guangzhou 510642, China
- Zhanyu Wu
  - College of Natural Resources and Environment, Joint Institute for Environment & Education, South China Agricultural University, Guangzhou 510642, China
- Wei Liu
  - State Key Laboratory of Resources and Environmental Information System, Institute of Geographic Sciences and Natural Resources Research, Chinese Academy of Sciences, Beijing 100101, China
  - University of Chinese Academy of Sciences, Beijing 100049, China
- Juwei Tian
  - Guangdong Province Key Laboratory for Land Use and Consolidation, South China Agricultural University, Guangzhou 510642, China
- Luo Liu
  - Guangdong Province Key Laboratory for Land Use and Consolidation, South China Agricultural University, Guangzhou 510642, China
| |
Collapse
|
12
|
Karamti H, Alharthi R, Anizi AA, Alhebshi RM, Eshmawi AA, Alsubai S, Umer M. Improving Prediction of Cervical Cancer Using KNN Imputed SMOTE Features and Multi-Model Ensemble Learning Approach. Cancers (Basel) 2023; 15:4412. [PMID: 37686692 PMCID: PMC10486648 DOI: 10.3390/cancers15174412] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 07/16/2023] [Revised: 08/02/2023] [Accepted: 08/09/2023] [Indexed: 09/10/2023] Open
Abstract
Objective: Cervical cancer ranks among the top causes of death among females in developing countries. The most important procedures that should be followed to guarantee the minimizing of cervical cancer's aftereffects are early identification and treatment under the finest medical guidance. One of the best methods to find this sort of malignancy is by looking at a Pap smear image. For automated detection of cervical cancer, the available datasets often have missing values, which can significantly affect the performance of machine learning models. Methods: To address these challenges, this study proposes an automated system for predicting cervical cancer that efficiently handles missing values with SMOTE features to achieve high accuracy. The proposed system employs a stacked ensemble voting classifier model that combines three machine learning models, along with KNN Imputer and SMOTE up-sampled features for handling missing values. Results: The proposed model achieves 99.99% accuracy, 99.99% precision, 99.99% recall, and 99.99% F1 score when using KNN imputed SMOTE features. The study compares the performance of the proposed model with multiple other machine learning algorithms under four scenarios: with missing values removed, with KNN imputation, with SMOTE features, and with KNN imputed SMOTE features. The study validates the efficacy of the proposed model against existing state-of-the-art approaches. Conclusions: This study investigates the issue of missing values and class imbalance in the data collected for cervical cancer detection and might aid medical practitioners in timely detection and providing cervical cancer patients with better care.
Affiliation(s)
- Hanen Karamti
- Department of Computer Sciences, College of Computer and Information Sciences, Princess Nourah bint Abdulrahman University, P.O. Box 84428, Riyadh 11671, Saudi Arabia
- Raed Alharthi
- Department of Computer Science and Engineering, University of Hafr Al-Batin, Hafar Al-Batin 39524, Saudi Arabia
- Amira Al Anizi
- Department of Computer Sciences, College of Computer and Information Sciences, Princess Nourah bint Abdulrahman University, P.O. Box 84428, Riyadh 11671, Saudi Arabia
- Reemah M. Alhebshi
- Department of Computer Science, Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah 21589, Saudi Arabia
- Ala’ Abdulmajid Eshmawi
- Department of Cybersecurity, College of Computer Science and Engineering, University of Jeddah, Jeddah 23218, Saudi Arabia
- Shtwai Alsubai
- Department of Computer Science, College of Computer Engineering and Sciences, Prince Sattam bin Abdulaziz University, P.O. Box 151, Al-Kharj 11942, Saudi Arabia
- Muhammad Umer
- Department of Computer Science & Information Technology, The Islamia University of Bahawalpur, Bahawalpur 63100, Pakistan
|
13
|
Angaitkar P, Janghel RR, Sahu TP. DL-TCNN: Deep Learning-based Temporal Convolutional Neural Network for prediction of conformational B-cell epitopes. 3 Biotech 2023; 13:297. [PMID: 37575599 PMCID: PMC10412510 DOI: 10.1007/s13205-023-03716-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/11/2023] [Accepted: 07/24/2023] [Indexed: 08/15/2023] Open
Abstract
Prediction of conformational B-cell epitopes (CBCE) is an essential phase for vaccine design, drug invention, and accurate disease diagnosis. Many laboratorial and computational approaches have been developed to predict CBCE. However, laboratorial experiments are costly and time consuming, leading to the popularity of Machine Learning (ML)-based computational methods. Although ML methods have succeeded in many domains, achieving higher accuracy in CBCE prediction remains a challenge. To overcome this drawback and consider the limitations of ML methods, this paper proposes a novel DL-based framework for CBCE prediction, leveraging the capabilities of deep learning in the medical domain. The proposed model is named Deep Learning-based Temporal Convolutional Neural Network (DL-TCNN), which hybridizes empirical hyper-tuned 1D-CNN and TCN. TCN is an architecture that employs causal convolutions and dilations, adapting well to sequential input with extensive receptive fields. To train the proposed model, physicochemical features are firstly extracted from antigen sequences. Next, the Synthetic Minority Oversampling Technique (SMOTE) is applied to address the class imbalance problem. Finally, the proposed DL-TCNN is employed for the prediction of CBCE. The model's performance is evaluated and validated on a benchmark antigen-antibody dataset. The DL-TCNN achieves 94.44% accuracy, and 0.989 AUC score for the training dataset, 78.53% accuracy, and 0.661 AUC score for the validation dataset; and 85.10% accuracy, 0.855 AUC score for the testing dataset. The proposed model outperforms all the existing CBCE methods.
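The causal convolutions and dilations mentioned in the abstract can be illustrated with a small pure-Python sketch. It shows only the generic TCN building block, not the DL-TCNN architecture itself; the function names, kernel weights, and dilation schedule below are assumptions chosen for illustration.

```python
def causal_dilated_conv1d(x, weights, dilation=1):
    """Compute y[t] = sum_k weights[k] * x[t - k * dilation].

    Positions before the start of the sequence are treated as zero, so the
    output at time t never depends on inputs later than t (causality)."""
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation
            if idx >= 0:
                acc += w * x[idx]
        out.append(acc)
    return out


def receptive_field(kernel_size, dilations):
    """Receptive field of stacked causal layers, one layer per dilation."""
    return 1 + (kernel_size - 1) * sum(dilations)


if __name__ == "__main__":
    signal = [1, 2, 3, 4]
    print(causal_dilated_conv1d(signal, weights=[1, 1], dilation=2))
    print(receptive_field(kernel_size=2, dilations=[1, 2, 4]))
```

Doubling the dilation at each layer is a common way to grow the receptive field quickly with depth, which is what lets such networks cover long sequential inputs with few layers.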
Affiliation(s)
- Pratik Angaitkar
- Department of Information Technology, National Institute of Technology, Raipur, G.E. Road, Raipur, C.G. 492010 India
- Rekh Ram Janghel
- Department of Information Technology, National Institute of Technology, Raipur, G.E. Road, Raipur, C.G. 492010 India
- Tirath Prasad Sahu
- Department of Information Technology, National Institute of Technology, Raipur, G.E. Road, Raipur, C.G. 492010 India
|
14
|
Zhou T, Jiao H. Exploration of the Stacking Ensemble Machine Learning Algorithm for Cheating Detection in Large-Scale Assessment. Educ Psychol Meas 2023; 83:831-854. [PMID: 37398846 PMCID: PMC10311957 DOI: 10.1177/00131644221117193] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 07/04/2023]
Abstract
Cheating detection in large-scale assessment has received considerable attention in the extant literature. However, none of the previous studies in this line of research investigated the stacking ensemble machine learning algorithm for cheating detection. Furthermore, no study addressed the issue of class imbalance using resampling. This study explored the application of the stacking ensemble machine learning algorithm to analyze the item response, response time, and augmented data of test-takers to detect cheating behaviors. The performance of the stacking method was compared with that of two other ensemble methods (bagging and boosting) as well as six base non-ensemble machine learning algorithms. Issues related to class imbalance and input features were addressed. The study results indicated that stacking, resampling, and feature sets including augmented summary data generally performed better than their counterparts in cheating detection. Compared with the other machine learning algorithms investigated in this study, the meta-model from stacking using discriminant analysis based on the top two base models (Gradient Boosting and Random Forest) generally performed the best among all the study conditions when item responses and the augmented summary statistics were used as the input features with an under-sampling ratio of 10:1.
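Random under-sampling to a fixed ratio can be sketched as follows. The sketch assumes that "10:1" means ten majority (non-cheating) cases for every minority (cheating) case, which the abstract does not spell out; the function name and the label encoding are likewise illustrative assumptions.

```python
import random


def undersample_majority(labels, ratio=10, minority_label=1, seed=0):
    """Return sorted row indices that keep every minority case and at most
    ratio * n_minority randomly chosen majority cases (hypothetical helper)."""
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == minority_label]
    majority = [i for i, y in enumerate(labels) if y != minority_label]
    n_keep = min(len(majority), ratio * len(minority))
    kept_majority = rng.sample(majority, n_keep)
    return sorted(minority + kept_majority)


if __name__ == "__main__":
    # Toy labels: 5 flagged test-takers among 205 (made-up numbers)
    labels = [1] * 5 + [0] * 200
    idx = undersample_majority(labels, ratio=10)
    print(len(idx), sum(labels[i] for i in idx))
```

Only the training split would normally be resampled this way, so that evaluation still reflects the original class proportions.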
Affiliation(s)
- Todd Zhou
- Winston Churchill High School, Potomac, MD, USA
- Hong Jiao
- University of Maryland, College Park, USA
|
15
|
Ma F, Li H. Online painting image clustering for the mental health of college art students based on improved CNN and SMOTE. PeerJ Comput Sci 2023; 9:e1462. [PMID: 37547389 PMCID: PMC10403178 DOI: 10.7717/peerj-cs.1462] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/29/2023] [Accepted: 06/06/2023] [Indexed: 08/08/2023]
Abstract
In modern education, mental health problems have become the focus and difficulty of students' education. Painting therapy has been integrated into the school's art education as an effective mental health intervention. Deep learning can automatically learn the image features and abstract the low-level image features into high-level features. However, traditional image classification models are prone to lose background information, resulting in poor adaptability of the classification model. Therefore, this article extracts the lost colour of painting images based on K-means clustering and proposes a painting style classification model based on an improved convolutional neural network (CNN), where a modified Synthetic Minority Oversampling Technique (SMOTE) is proposed to amplify the data. Then, the CNN network structure is optimized by adjusting the network's vertical depth and horizontal width. Finally, a new activation function, PPReLU, is proposed to suppress the excessive value of the positive part. The experimental results show that the proposed model has the highest accuracy in classifying painting image styles by comparing it with state-of-the-art methods, whose accuracy is up to 91.55%, which is 8.7% higher than that of traditional CNN.
Affiliation(s)
- Fake Ma
- Henan Economy and Trade Vocational College, Zhengzhou, China
- Huwei Li
- Henan Economy and Trade Vocational College, Zhengzhou, China
|
16
|
Welvaars K, Oosterhoff JHF, van den Bekerom MPJ, Doornberg JN, van Haarst EP. Implications of resampling data to address the class imbalance problem (IRCIP): an evaluation of impact on performance between classification algorithms in medical data. JAMIA Open 2023; 6:ooad033. [PMID: 37266187 PMCID: PMC10232287 DOI: 10.1093/jamiaopen/ooad033] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 02/03/2023] [Revised: 04/04/2023] [Accepted: 05/11/2023] [Indexed: 06/03/2023] Open
Abstract
Objective When correcting for the "class imbalance" problem in medical data, the effects of resampling applied on classifier algorithms remain unclear. We examined the effect on performance over several combinations of classifiers and resampling ratios. Materials and Methods Multiple classification algorithms were trained on 7 resampled datasets: no correction, random undersampling, 4 ratios of Synthetic Minority Oversampling Technique (SMOTE), and random oversampling with the Adaptive Synthetic algorithm (ADASYN). Performance was evaluated in Area Under the Curve (AUC), precision, recall, Brier score, and calibration metrics. A case study on prediction modeling for 30-day unplanned readmissions in previously admitted Urology patients was presented. Results For most algorithms, using resampled data showed a significant increase in AUC and precision, ranging from 0.74 (CI: 0.69-0.79) to 0.93 (CI: 0.92-0.94), and 0.35 (CI: 0.12-0.58) to 0.86 (CI: 0.81-0.92) respectively. All classification algorithms showed significant increases in recall, and significant decreases in Brier score with distorted calibration overestimating positives. Discussion Imbalance correction resulted in an overall improved performance, yet poorly calibrated models. There can still be clinical utility due to a strong discriminating performance, specifically when predicting only low and high risk cases is clinically more relevant. Conclusion Resampling data resulted in increased performances in classification algorithms, yet produced an overestimation of positive predictions. Based on the findings from our case study, a thoughtful predefinition of the clinical prediction task may guide the use of resampling techniques in future studies aiming to improve clinical decision support tools.
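Two of the quantities discussed in the abstract, the Brier score and the tendency to overestimate positives, can be computed with a short sketch. The toy outcomes and probabilities below are made up for illustration and are not the study's data; `calibration_gap` is a hypothetical name for the simple difference between mean predicted risk and observed prevalence.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def calibration_gap(y_true, y_prob):
    """Mean predicted risk minus observed prevalence.

    A positive value means the model overestimates positives on average."""
    n = len(y_true)
    return sum(y_prob) / n - sum(y_true) / n


if __name__ == "__main__":
    outcomes = [0] * 9 + [1]          # 10% observed prevalence (toy data)
    matched = [0.1] * 10              # predictions matching the prevalence
    inflated = [0.5] * 10             # predictions shifted toward positives
    print(brier_score(outcomes, matched), calibration_gap(outcomes, matched))
    print(brier_score(outcomes, inflated), calibration_gap(outcomes, inflated))
```

Reporting a calibration measure alongside discrimination metrics such as AUC makes it visible when resampling has shifted predicted risks upward.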
Affiliation(s)
- Koen Welvaars
- Corresponding Author: Koen Welvaars, MSc, Data Science Team, OLVG, Jan Tooropstraat 164, 1061 AE Amsterdam, the Netherlands;
|
17
|
Işık Ü, Güven A, Batbat T. Evaluation of Emotions from Brain Signals on 3D VAD Space via Artificial Intelligence Techniques. Diagnostics (Basel) 2023; 13:2141. [PMID: 37443535 DOI: 10.3390/diagnostics13132141] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/19/2023] [Revised: 06/12/2023] [Accepted: 06/14/2023] [Indexed: 07/15/2023] Open
Abstract
Recent achievements have made emotion studies a rising field contributing to many areas, such as health technologies, brain-computer interfaces, and psychology. Emotional states can be evaluated in the valence, arousal, and dominance (VAD) domains. Most work uses only VA because it is easier to differentiate; very few studies use the full VAD space, as this study does. Similarly, segment comparisons of emotion analysis with handcrafted features also use VA space. We therefore focused primarily on VAD space to evaluate emotions and segmentations. The DEAP dataset is used in this study. A comprehensive analytical approach is implemented with two sub-studies: first, segmentation (Segments I-VIII), and second, binary cross-comparisons and evaluations of eight emotional states, in addition to comparisons of selected segments (III, IV, and V), class separation levels (5, 4-6, and 3-7), and unbalanced and balanced data with SMOTE. In both sub-studies, Wavelet Transform is applied to electroencephalography signals to separate the brain waves into their bands (α, β, γ, and θ bands), twenty-four attributes are extracted, and Sequential Minimal Optimization, K-Nearest Neighbors, Fuzzy Unordered Rule Induction Algorithm, Random Forest, Optimized Forest, Bagging, Random Committee, and Random Subspace are used for classification. In our study, we have obtained high accuracy results, which can be seen in the figures in the second part. The best accuracy for unbalanced data is obtained for the Low Arousal-Low Valence-High Dominance and High Arousal-High Valence-Low Dominance emotion comparisons (Segment III and 4.5-5.5 class separation), where an accuracy rate of 98.94% is obtained with the IBk classifier. Data-balanced results mostly seem to outperform unbalanced results.
Affiliation(s)
- Ümran Işık
- Biomedical Engineering Graduate Program, Graduate School of Natural and Applied Sciences, Erciyes University, 38039 Kayseri, Türkiye
- Ayşegül Güven
- Department of Biomedical Engineering, Engineering Faculty, Erciyes University, 38039 Kayseri, Türkiye
- Turgay Batbat
- Department of Biomedical Engineering, Engineering Faculty, Erciyes University, 38039 Kayseri, Türkiye
|
18
|
Saminathan S, Malathy C. Ensemble-based classification approach for PM2.5 concentration forecasting using meteorological data. Front Big Data 2023; 6:1175259. [PMID: 37360751 PMCID: PMC10289837 DOI: 10.3389/fdata.2023.1175259] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/27/2023] [Accepted: 05/09/2023] [Indexed: 06/28/2023] Open
Abstract
Air pollution is a serious challenge to humankind as it poses many health threats. It can be measured using the air quality index (AQI). Air pollution is the result of contamination of both outdoor and indoor environments. The AQI is being monitored by various institutions globally. The measured air quality data are kept mostly for public use. Using the previously calculated AQI values, the future values of AQI can be predicted, or the class/category value of the numeric value can be obtained. This forecast can be performed with more accuracy using supervised machine learning methods. In this study, multiple machine-learning approaches were used to classify PM2.5 values. The values for the pollutant PM2.5 were classified into different groups using machine learning algorithms such as logistic regression, support vector machines, random forest, extreme gradient boosting, and their grid search equivalents, along with the deep learning method multilayer perceptron. After performing multiclass classification using these algorithms, the parameters accuracy and per-class accuracy were used to compare the methods. As the dataset used was imbalanced, a SMOTE-based approach for balancing the dataset was used. Compared to all other classifiers that use the original dataset, the accuracy of the random forest multiclass classifier with SMOTE-based dataset balancing was found to provide better accuracy.
Affiliation(s)
- S. Saminathan
- Department of Computing Technologies, Faculty of Engineering and Technology, SRM Institute of Science and Technology, Kattankulathur, Tamil Nadu, India
- C. Malathy
- Department of Networking and Communications, Faculty of Engineering and Technology, SRM Institute of Science and Technology, Kattankulathur, Tamil Nadu, India
|
19
|
Chhabra D, Juneja M, Chutani G. An efficient ensemble based machine learning approach for predicting Chronic Kidney Disease. Curr Med Imaging 2023:CMIR-EPUB-131580. [PMID: 37157217 DOI: 10.2174/1573405620666230508104538] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/22/2022] [Revised: 02/01/2023] [Accepted: 03/16/2023] [Indexed: 05/10/2023]
Abstract
BACKGROUND Chronic kidney disease (CKD) is a long-term risk to one's health that can result in kidney failure. CKD is one of today's most serious diseases, and early detection can aid in proper treatment. Machine learning techniques have proven to be reliable in the early medical diagnosis. OBJECTIVE The paper aims to perform CKD prediction using machine learning classification approaches. The dataset used for the present study for detecting CKD was obtained from the machine learning repository at the University of California, Irvine (UCI). METHOD In this study, twelve machine learning-based classification algorithms with full features were used. Since the CKD dataset had a class imbalance issue, the Synthetic Minority Over-Sampling technique (SMOTE) was used to alleviate the problem of class imbalance and review the performance based on machine learning classification models using the K fold cross-validation technique. The proposed work compares the results of twelve classifiers with and without the SMOTE technique, and then the top three classifiers with the highest accuracy, Support Vector Machine, Random Forest, and Adaptive Boosting classification algorithms were selected to use the ensemble technique to improve performance. RESULTS The accuracy achieved using a stacking classifier as an ensemble technique with cross-validation is 99.5%. CONCLUSION The study provides an ensemble learning approach in which the top three best-performing classifiers in terms of cross-validation results are stacked in an ensemble model after balancing the dataset using SMOTE. This proposed technique could be applied to other diseases in the future, making disease detection less intrusive and cost-effective.
Affiliation(s)
- Divyanshi Chhabra
- University Institute of Engineering and Technology, Panjab University, Chandigarh 160025, India
- Mamta Juneja
- University Institute of Engineering and Technology, Panjab University, Chandigarh 160025, India
- Gautam Chutani
- University Institute of Engineering and Technology, Panjab University, Chandigarh 160025, India
|
20
|
Fatlawi HK, Kiss A. An Elastic Self-Adjusting Technique for Rare-Class Synthetic Oversampling Based on Cluster Distortion Minimization in Data Stream. Sensors (Basel) 2023; 23:s23042061. [PMID: 36850659 PMCID: PMC9963940 DOI: 10.3390/s23042061] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 01/24/2023] [Revised: 02/08/2023] [Accepted: 02/10/2023] [Indexed: 06/12/2023]
Abstract
Adaptive machine learning has increasing importance due to its ability to classify a data stream and handle the changes in the data distribution. Various resources, such as wearable sensors and medical devices, can generate a data stream with an imbalanced distribution of classes. Many popular oversampling techniques have been designed for imbalanced batch data rather than a continuous stream. This work proposes a self-adjusting window to improve the adaptive classification of an imbalanced data stream based on minimizing cluster distortion. It includes two models; the first chooses only the previous data instances that preserve the coherence of the current chunk's samples. The second model relaxes the strict filter by excluding the examples of the last chunk. Both models include generating synthetic points for oversampling rather than the actual data points. The evaluation of the proposed models using the Siena EEG dataset showed their ability to improve the performance of several adaptive classifiers. The best results have been obtained using Adaptive Random Forest in which Sensitivity reached 96.83% and Precision reached 99.96%.
Affiliation(s)
- Hayder K. Fatlawi
- Department of Information Systems, ELTE Eötvös Loránd University, 1117 Budapest, Hungary
- Center of Information Technology Research and Development, University of Kufa, Najaf 540011, Iraq
- Attila Kiss
- Department of Information Systems, ELTE Eötvös Loránd University, 1117 Budapest, Hungary
- Department of Informatics, J. Selye University, 94501 Komárno, Slovakia
|
21
|
Mafarja M, Thaher T, Al-Betar MA, Too J, Awadallah MA, Abu Doush I, Turabieh H. Classification framework for faulty-software using enhanced exploratory whale optimizer-based feature selection scheme and random forest ensemble learning. APPL INTELL 2023; 53:1-43. [PMID: 36785593 PMCID: PMC9909674 DOI: 10.1007/s10489-022-04427-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 12/23/2022] [Indexed: 02/11/2023]
Abstract
Software Fault Prediction (SFP) is an important process for detecting faulty software components, such as faulty classes or modules, early in the software development life cycle. In this paper, a machine learning framework is proposed for SFP. Initially, pre-processing and re-sampling techniques are applied to make the SFP datasets ready to be used by ML techniques. Thereafter, seven classifiers are compared, namely K-Nearest Neighbors (KNN), Naive Bayes (NB), Linear Discriminant Analysis (LDA), Linear Regression (LR), Decision Tree (DT), Support Vector Machine (SVM), and Random Forest (RF). The RF classifier outperforms all other classifiers in terms of eliminating irrelevant/redundant features. The performance of RF is improved further using a dimensionality reduction method called binary whale optimization algorithm (BWOA) to eliminate the irrelevant/redundant features. Finally, the performance of BWOA is enhanced by hybridizing the exploration strategies of the grey wolf optimizer (GWO) and Harris hawks optimization (HHO) algorithms. The proposed method is called SBEWOA. The SFP datasets are sixteen datasets from the PROMISE repository, covering software projects of different sizes and complexity. The comparative evaluation against nine well-established feature selection methods shows that the proposed SBEWOA produces significantly superior results for several instances of the evaluated datasets. The algorithms' performance is compared in terms of accuracy, the number of features, and fitness function. This is also supported by the 2-tailed P-values of the Wilcoxon signed-rank statistical test. In conclusion, the proposed method is an efficient alternative ML method for SFP that can be used for similar problems in the software engineering domain.
Affiliation(s)
- Majdi Mafarja
- Department of Computer Science, Birzeit University, Birzeit, Palestine
- Thaer Thaher
- Department of Computer Systems Engineering, Arab American University, Jenin, Palestine
- Information Technology Engineering, Al-Quds University, Abu Dies, Jerusalem, Palestine
- Mohammed Azmi Al-Betar
- Artificial Intelligence Research Center (AIRC), College of Engineering and Information Technology, Ajman University, Ajman, United Arab Emirates
- Irbid, Jordan
- Jingwei Too
- Faculty of Electrical Engineering, Universiti Teknikal Malaysia Melaka, Hang Tuah Jaya, 76100 Durian Tunggal Melaka, Malaysia
- Mohammed A. Awadallah
- Department of Computer Science, Al-Aqsa University, P.O. Box 4051, Gaza, Palestine
- Artificial Intelligence Research Center (AIRC), Ajman University, Ajman, United Arab Emirates
- Iyad Abu Doush
- Department of Computing, College of Engineering and Applied Sciences, American University of Kuwait, Salmiya, Kuwait
- Computer Science Department, Yarmouk University, Irbid, Jordan
- Hamza Turabieh
- Department of Health Management and Informatics, University of Missouri, Columbia, 5 Hospital Drive, Columbia, MO 65212 USA
|
22
|
Azlim Khan AK, Ahamed Hassain Malim NH. Comparative Studies on Resampling Techniques in Machine Learning and Deep Learning Models for Drug-Target Interaction Prediction. Molecules 2023; 28. [PMID: 36838652 DOI: 10.3390/molecules28041663] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/19/2022] [Revised: 01/23/2023] [Accepted: 01/24/2023] [Indexed: 02/12/2023] Open
Abstract
The prediction of drug-target interactions (DTIs) is a vital step in drug discovery. The success of machine learning and deep learning methods in accurately predicting DTIs plays a huge role in drug discovery. However, when dealing with learning algorithms, the datasets used are usually highly dimensional and extremely imbalanced. To solve this issue, the dataset must be resampled accordingly. In this paper, we have compared several data resampling techniques to overcome class imbalance in machine learning methods as well as to study the effectiveness of deep learning methods in overcoming class imbalance in DTI prediction in terms of binary classification using ten (10) cancer-related activity classes from BindingDB. It is found that the use of Random Undersampling (RUS) in predicting DTIs severely affects the performance of a model, especially when the dataset is highly imbalanced, thus, rendering RUS unreliable. It is also found that SVM-SMOTE can be used as a go-to resampling method when paired with the Random Forest and Gaussian Naïve Bayes classifiers, whereby a high F1 score is recorded for all activity classes that are severely and moderately imbalanced. Additionally, the deep learning method called Multilayer Perceptron recorded high F1 scores for all activity classes even when no resampling method was applied.
|
23
|
Din NU, Zhang L, Yang Y. Automated Battery Making Fault Classification Using Over-Sampled Image Data CNN Features. Sensors (Basel) 2023; 23:1927. [PMID: 36850526 PMCID: PMC9965985 DOI: 10.3390/s23041927] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/14/2023] [Revised: 01/26/2023] [Accepted: 02/03/2023] [Indexed: 06/18/2023]
Abstract
Due to the tremendous expectations placed on batteries to produce a reliable and secure product, fault detection has become a critical part of the manufacturing process. Manually, it takes much labor and effort to test each battery individually for manufacturing faults including burning, welding that is too high, missing welds, shifting, welding holes, and so forth. Additionally, manual battery fault detection takes too much time and is extremely expensive. We solved this issue by using image processing and machine learning techniques to automatically detect faults in the battery manufacturing process. Our approach will reduce the need for human intervention, save time, and be easy to implement. A CMOS camera was used to collect a large number of images belonging to eight common battery manufacturing faults. The welding area of the batteries' positive and negative terminals was captured from different distances, between 40 and 50 cm. Before deploying the learning models, first, we used the CNN for feature extraction from the image data. To over-sample the dataset, we used the Synthetic Minority Over-sampling Technique (SMOTE) since the dataset was highly imbalanced, resulting in over-fitting of the learning model. Several machine learning and deep learning models were deployed on the CNN-extracted features and over-sampled data. Random forest achieved a significant 84% accuracy with our proposed approach. Additionally, we applied K-fold cross-validation with the proposed approach to validate the significance of the approach, and the logistic regression achieved an 81.897% mean accuracy score and a +/- 0.0255 standard deviation.
|
24
|
Chandrashekar K, Setlur AS, Sabhapathi C A, Raiker SS, Singh S, Niranjan V. Decision Support System and Web-Application Using Supervised Machine Learning Algorithms for Easy Cancer Classifications. Cancer Inform 2023; 22:11769351221147244. [PMID: 36714384 PMCID: PMC9880585 DOI: 10.1177/11769351221147244] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/29/2022] [Accepted: 12/06/2022] [Indexed: 01/24/2023] Open
Abstract
Using a decision support system (DSS) that classifies various cancers provides support to clinicians and researchers in making better decisions that can aid in early cancer diagnosis, thereby reducing the chances of incorrect disease diagnosis. Thus, this work aimed at designing a classification model that can predict accurately for 5 different cancer types comprising 20 cancer exomes, using the mutations identified from whole exome cancer analysis. Initially, a basic model was designed using supervised machine learning classification algorithms such as K-nearest neighbor (KNN), support vector machine (SVM), decision tree, naïve Bayes and random forest (RF), among which decision tree and random forest performed better in terms of preliminary model accuracy. However, output predictions were incorrect because of low training scores. Thus, 16 essential features were then selected for model improvement using 2 approaches. All imbalanced datasets were balanced using SMOTE. In the first approach, all features from the 20 cancer exome datasets were trained and models were designed using decision tree and random forest. With balanced datasets, the decision tree model showed an accuracy of 77%, while with the RF model the accuracy improved to 82% and all 5 cancer types were predicted correctly. The area under the curve for the RF model was closer to 1 than that for the decision tree model. In the second approach, 15 datasets were used for training and 5 for testing. However, only 2 cancer types were predicted correctly. To cross validate the RF model, a Matthews correlation coefficient (MCC) test was performed. For method 1, the MCC test and MCC cross validation values were found to be 0.7796 and 0.9356, respectively. Likewise, for the second approach, the MCC was observed to be 0.9365, corroborating the accuracy of the designed model. The model was successfully deployed using Streamlit as a web application for easy use. This study presents insights for allowing easy cancer classifications.
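For readers unfamiliar with the Matthews correlation coefficient mentioned in this abstract, the following is a minimal sketch of its binary form, computed from confusion-matrix counts. It is an illustration only and not the authors' code: the study is multi-class, and the function name `mcc` and the example counts are assumptions made here.

```python
import math

def mcc(tp, tn, fp, fn):
    """Binary Matthews correlation coefficient from confusion-matrix counts.

    Ranges from -1 (total disagreement) through 0 (chance level)
    to +1 (perfect prediction).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention used here (an assumption): return 0 when any marginal is empty.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

# Hypothetical counts, chosen only to exercise the formula.
score = mcc(tp=45, tn=40, fp=10, fn=5)
```

Unlike plain accuracy, the coefficient uses all four cells of the confusion matrix, which is why it is often reported for imbalanced data.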
Affiliation(s)
- Vidya Niranjan
- Department of Biotechnology, R V College of Engineering, Bengaluru, Karnataka 560059, India.
|
25
|
Sachdeva RK, Bathla P, Rani P, Solanki V, Ahuja R. A systematic method for diagnosis of hepatitis disease using machine learning. Innov Syst Softw Eng 2023; 19:71-80. [PMID: 36628173 PMCID: PMC9818056 DOI: 10.1007/s11334-022-00509-8] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 06/12/2022] [Accepted: 11/22/2022] [Indexed: 06/17/2023]
Abstract
Hepatitis is among the deadliest diseases on the planet. Machine learning approaches can contribute toward diagnosing hepatitis disease based on a few characteristics. On the UCI dataset, authors assessed distinct classifiers' performance in order to develop a systematic strategy for hepatitis disease diagnosis. The classifiers used are support vector machine, logistic regression (LR), K-nearest neighbor, and random forest. The classifiers were employed without class balancing and in conjunction with class balancing using SMOTE strategy. Both studies, classification without class balancing and with class balancing, were compared in terms of different performance parameters. After adopting class balancing, the efficiency of classifiers improved significantly. LR with SMOTE provided the highest level of accuracy (93.18%).
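The class-balancing step described in this abstract relies on SMOTE, whose core idea can be sketched in a few lines: a synthetic minority sample is placed at a random position on the segment between a minority sample and one of its nearest minority-class neighbours. The sketch below is a simplified illustration in plain Python, not the authors' pipeline; the function name `smote_sketch`, the toy data, and the parameter defaults are assumptions.

```python
import random

def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

# Toy minority class with two features (hypothetical values).
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote_sketch(minority, n_new=6)
```

Because every synthetic point lies between two real minority samples, the new points stay inside the region the minority class already occupies.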
Affiliation(s)
- Ravi Kumar Sachdeva
- Chitkara University Institute of Engineering and Technology, Chitkara University, Rajpura, Punjab, India
- Pooja Rani
- MMICTBM, Maharishi Markandeshwar (Deemed to be University), Mullana, Ambala, Haryana, India
- Vikas Solanki
- Chitkara University Institute of Engineering and Technology, Chitkara University, Rajpura, Punjab, India
- Rakesh Ahuja
- Chitkara University Institute of Engineering and Technology, Chitkara University, Rajpura, Punjab, India
|
26
|
Lu M, Wang M, Zhang Q, Yu M, He C, Zhang Y, Li Y. A vision transformer for lightning intensity estimation using 3D weather radar. Sci Total Environ 2022; 853:158496. [PMID: 36063932 DOI: 10.1016/j.scitotenv.2022.158496] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/04/2022] [Revised: 08/11/2022] [Accepted: 08/30/2022] [Indexed: 06/15/2023]
Abstract
Lightning has strong destructive powers; its blast wave, high temperature, and high voltage can pose a great threat to human production, life, and personal safety. The destructive power of high-intensity lightning is much greater than that of low-intensity lightning. The estimation of lightning intensity can provide an important reference for determining the lightning protection level and lightning disaster risk assessment. Lightning is a type of small-scale severe convective weather phenomenon. Weather radar is one of the best monitoring systems that can frequently sample the detailed three-dimensional (3D) structures of convective storms, with a small spatial scale and short lifetime at high temporal and spatial resolutions. Therefore, it is possible to extract the 3D spatial feature strongly correlated with lightning from 3D weather radar for estimating lightning intensity. This paper proposes a Vision Transformer model for lightning intensity estimation that can automatically estimate lightning intensity from 3D weather radar data. In an experiment, we transferred the task of estimating lightning intensity into a multicategory classification task. A framework was designed to produce lightning feature samples for model input from 3D weather radar and lightning location data. Then, the Synthetic Minority Over-Sampling Technique (SMOTE) algorithm was used to balance and optimize the sample distribution. Finally, samples were input into the proposed lightning intensity estimation model based on Vision Transformer for training and evaluation. Experimental results show that the proposed model based on Vision Transformers performs well with lightning intensity estimation.
Affiliation(s)
- Mingyue Lu
- Collaborative Innovation Center on Forecast and Evaluation of Meteorological Disasters, Nanjing University of Information Science & Technology, Nanjing 210044, China; Geographic Science College, Nanjing University of Information Science & Technology, Nanjing 210044, China
- Menglong Wang
- Collaborative Innovation Center on Forecast and Evaluation of Meteorological Disasters, Nanjing University of Information Science & Technology, Nanjing 210044, China; Geographic Science College, Nanjing University of Information Science & Technology, Nanjing 210044, China
- Qian Zhang
- School of Management Engineering, Xi'an University of Finance and Economics, Xi'an 710100, China
- Manzhu Yu
- Department of Geography, The Pennsylvania State University, University Park, PA 16802, USA
- Caifen He
- Ningbo Zhenhai District Meteorological Bureau, Ningbo 315012, China
- Yadong Zhang
- Collaborative Innovation Center on Forecast and Evaluation of Meteorological Disasters, Nanjing University of Information Science & Technology, Nanjing 210044, China; Geographic Science College, Nanjing University of Information Science & Technology, Nanjing 210044, China
- Yuchen Li
- Collaborative Innovation Center on Forecast and Evaluation of Meteorological Disasters, Nanjing University of Information Science & Technology, Nanjing 210044, China; Geographic Science College, Nanjing University of Information Science & Technology, Nanjing 210044, China
|
27
|
Karim M, Saad Missen MM, Umer M, Fida A, Eshmawi AA, Mohamed A, Ashraf I. Comprehension of polarity of articles by citation sentiment analysis using TF-IDF and ML classifiers. PeerJ Comput Sci 2022; 8:e1107. [PMID: 37346319 PMCID: PMC10280177 DOI: 10.7717/peerj-cs.1107] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 07/15/2022] [Accepted: 08/29/2022] [Indexed: 06/23/2023]
Abstract
Sentiment analysis has been researched extensively during the last few years; however, the sentiment analysis of citations in a research article is an unexplored research area. Sentiment analysis of citations can provide new applications in bibliometrics and provide insights for a better understanding of scientific knowledge. Citation count, as it is used today to measure the quality of a paper, does not portray the quality of a scientific article, as the article may be cited to indicate its weakness. Determining the polarity of a citation is therefore an important task to quantify the quality of the cited article and ascertain its impact and ranking. This article presents an approach to determine the polarity of the cited article using term frequency-inverse document frequency (TF-IDF) and machine learning classifiers. To analyze the influence of an imbalanced dataset, several experiments are performed with and without the synthetic minority oversampling technique (SMOTE) and with uni-gram and bi-gram TF-IDF features. Results indicate that the proposed methodology achieves a high accuracy of 99.0% with the extra tree classifier when trained on the SMOTE-oversampled dataset and bi-gram features.
Affiliation(s)
- Musarat Karim
- Department of Computer Science & Information Technology, Islamia University, Bahawalpur, Bahawalpur, Pakistan
- Malik Muhammad Saad Missen
- Department of Computer Science & Information Technology, Islamia University, Bahawalpur, Bahawalpur, Pakistan
- Muhammad Umer
- Department of Computer Science & Information Technology, Islamia University, Bahawalpur, Bahawalpur, Pakistan
- Alisha Fida
- Department of Computer Science & Information Technology, Islamia University, Bahawalpur, Bahawalpur, Pakistan
- Ala’ Abdulmajid Eshmawi
- University of Jeddah, Department of Cybersecurity, College of Computer Science and Engineering, Jeddah, Saudi Arabia
- Abdullah Mohamed
- University Research Centre, Future University in Egypt, Cairo, Egypt
- Imran Ashraf
- Information and Communication Engineering, Yeungnam University, Gyeongsan, Korea
|
28
|
Shah SMA, Usman SM, Khalid S, Rehman IU, Anwar A, Hussain S, Ullah SS, Elmannai H, Algarni AD, Manzoor W. An Ensemble Model for Consumer Emotion Prediction Using EEG Signals for Neuromarketing Applications. Sensors (Basel) 2022; 22:9744. [PMID: 36560113 PMCID: PMC9782208 DOI: 10.3390/s22249744] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/01/2022] [Revised: 11/21/2022] [Accepted: 11/26/2022] [Indexed: 06/17/2023]
Abstract
Traditional advertising techniques seek to govern the consumer's opinion toward a product, which may not reflect their actual behavior at the time of purchase. It is probable that advertisers misjudge consumer behavior because predicted opinions do not always correspond to consumers' actual purchase behaviors. Neuromarketing is the new paradigm of understanding customer buyer behavior and decision making, as well as the prediction of their gestures for product utilization through an unconscious process. Existing methods do not focus on effective preprocessing and classification techniques of electroencephalogram (EEG) signals, so in this study, an effective method for preprocessing and classification of EEG signals is proposed. The proposed method involves effective preprocessing of EEG signals by removing noise and a synthetic minority oversampling technique (SMOTE) to deal with the class imbalance problem. The dataset employed in this study is a publicly available neuromarketing dataset. Automated features were extracted by using a long short-term memory network (LSTM) and then concatenated with handcrafted features like power spectral density (PSD) and discrete wavelet transform (DWT) to create a complete feature set. The classification was done by using the proposed hybrid classifier that optimizes the weights of two machine learning classifiers and one deep learning classifier and classifies the data between like and dislike. The machine learning classifiers include the support vector machine (SVM), random forest (RF), and deep learning classifier (DNN). The proposed hybrid model outperforms other classifiers like RF, SVM, and DNN and achieves an accuracy of 96.89%. In the proposed method, accuracy, sensitivity, specificity, precision, and F1 score were computed to evaluate and compare the proposed method with recent state-of-the-art methods.
Affiliation(s)
- Syed Mohsin Ali Shah
- Department of Computer Science, Shaheed Zulfikar Ali Bhutto Institute of Science and Technology, Islamabad 44000, Pakistan
- Syed Muhammad Usman
- Department of Creative Technologies, Faculty of Computing and AI, Air University, Islamabad 44000, Pakistan
- Shehzad Khalid
- Department of Computer Engineering, Bahria University, Islamabad 44000, Pakistan
- Ikram Ur Rehman
- School of Computing and Engineering, The University of West London, London W5 5RF, UK
- Aamir Anwar
- School of Computing and Engineering, The University of West London, London W5 5RF, UK
- Saddam Hussain
- School of Digital Science, Universiti Brunei Darussalam, Jalan Tungku Link, Gadong BE1410, Brunei
- Syed Sajid Ullah
- Department of Information and Communication Technology, University of Agder (UiA), N-4898 Grimstad, Norway
- Hela Elmannai
- Department of Information Technology, College of Computer and Information Sciences, Princess Nourah bint Abdulrahman University, P.O. Box 84428, Riyadh 11671, Saudi Arabia
- Abeer D. Algarni
- Department of Information Technology, College of Computer and Information Sciences, Princess Nourah bint Abdulrahman University, P.O. Box 84428, Riyadh 11671, Saudi Arabia
- Waleed Manzoor
- Department of Computer Engineering, Bahria University, Islamabad 44000, Pakistan
|
29
|
Ali Z, Hayat MF, Shaukat K, Alam TM, Hameed IA, Luo S, Basheer S, Ayadi M, Ksibi A. A Proposed Framework for Early Prediction of Schistosomiasis. Diagnostics (Basel) 2022; 12. [PMID: 36553145 DOI: 10.3390/diagnostics12123138] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/07/2022] [Revised: 12/08/2022] [Accepted: 12/08/2022] [Indexed: 12/15/2022] Open
Abstract
Schistosomiasis is a neglected tropical disease that continues to be a leading cause of illness and mortality around the globe. The causative parasites attach to the skin through contaminated water and enter the human body. Failure to diagnose Schistosomiasis can result in various medical complications, such as ascites, portal hypertension, esophageal varices, splenomegaly, and growth retardation. Early prediction and identification of risk factors may aid in treating the disease before it becomes incurable. We aimed to create a framework by incorporating the most significant features to predict Schistosomiasis using machine learning techniques. A dataset of advanced Schistosomiasis containing recovery and death cases was employed, with data from a total of 4316 individuals included in this research. The dataset contains demographic, socioeconomic, and clinical factors with lab reports. Data preprocessing techniques (missing values imputation, outlier removal, data normalisation, and data transformation) have also been employed for better results. Feature selection techniques, including correlation-based feature selection, Information gain, gain ratio, ReliefF, and OneR, have been utilised to reduce the large number of features. Data resampling algorithms, including Random undersampling, Random oversampling, Cluster Centroid, Near miss, and SMOTE, are applied to address the data imbalance problem. We applied four machine learning algorithms to construct the model: Gradient Boosting, Light Gradient Boosting, Extreme Gradient Boosting and CatBoost. The performance of the proposed framework has been evaluated based on Accuracy, Precision, Recall and F1-Score. The results showed that the CatBoost model performed best, with the highest accuracy (87.1%) compared with Gradient Boosting (86%), Light Gradient Boosting (86.7%) and Extreme Gradient Boosting (86.9%). Our proposed framework will assist doctors and healthcare professionals in the early diagnosis of Schistosomiasis.
|
30
|
Wang H, Li H, Gao W, Xie J. PrUb-EL: A hybrid framework based on deep learning for identifying ubiquitination sites in Arabidopsis thaliana using ensemble learning strategy. Anal Biochem 2022; 658:114935. [PMID: 36206844 DOI: 10.1016/j.ab.2022.114935] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/08/2022] [Revised: 09/25/2022] [Accepted: 09/26/2022] [Indexed: 12/30/2022]
Abstract
Identification of ubiquitination sites is central to many biological experiments. Ubiquitination is a kind of post-translational protein modification (PTM). It is a key mechanism for increasing protein diversity and plays a vital role in regulating cell function. In recent years, many models have been developed to predict ubiquitination sites in humans, mice and yeast. However, few studies have predicted ubiquitination sites in Arabidopsis thaliana. In view of this, a deep network model named PrUb-EL is proposed to predict ubiquitination sites in Arabidopsis thaliana. Firstly, six features based on the protein sequence are extracted with amino acid index database (AAindex), dipeptide deviates from the expected mean (DDE), dipeptide composition (DPC), blocks substitution matrix (BLOSUM62), enhanced amino acid composition (EAAC) and binary encoding. Secondly, the synthetic minority over-sampling technique (SMOTE) is utilized to process the imbalanced data set. Then a new classifier named DG is presented, which includes Dense block, Residual block and Gated recurrent unit (GRU) block. Finally, each of six feature extraction methods is integrated into the DG model, and the ensemble learning strategy is used to gain the final prediction result. Experimental results show that PrUb-EL has good predictive ability with the accuracy (ACC) and area under the ROC curve (auROC) values of 91.00% and 97.70% using 5-fold cross-validation, respectively. Note that the values of ACC and auROC are 88.58% and 96.09% in the independent test, respectively. Compared with previous studies, our model has significantly improved performance thus it is an excellent method for identifying ubiquitination sites in Arabidopsis thaliana. The datasets and code used for the article are available at https://github.com/Tom-Wangy/PreUb-EL.git.
Affiliation(s)
- Houqiang Wang
- School of Mathematics and Statistics, Xidian University, Xi'an, 710071, PR China
- Hong Li
- School of Mathematics and Statistics, Xidian University, Xi'an, 710071, PR China
- Weifeng Gao
- School of Mathematics and Statistics, Xidian University, Xi'an, 710071, PR China
- Jin Xie
- School of Mathematics and Statistics, Xidian University, Xi'an, 710071, PR China
|
31
|
Gu X, Ding Y, Xiao P, He T. A GHKNN model based on the physicochemical property extraction method to identify SNARE proteins. Front Genet 2022; 13:935717. [PMID: 36506312 PMCID: PMC9727185 DOI: 10.3389/fgene.2022.935717] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/16/2022] [Accepted: 11/02/2022] [Indexed: 11/24/2022] Open
Abstract
SNARE proteins are of great importance, and loss of their function can lead to a variety of diseases. The SNARE protein is known as a membrane fusion protein, and it is crucial for mediating vesicle fusion. The identification of SNARE proteins must therefore be conducted with an accurate method. Through extensive experiments, we have developed a binary classification model based on the graph-regularized k-local hyperplane distance nearest neighbor model (GHKNN). The model uses the physicochemical property extraction method to extract protein sequence features and the SMOTE method to upsample protein sequence features. The combination achieves the most accurate performance for identifying all protein sequences. Finally, we compare the model based on GHKNN binary classification with other classifiers and measure them using four different metrics: SN, SP, ACC, and MCC. In experiments, the model performs significantly better than other classifiers.
Affiliation(s)
- Xingyue Gu
- State Key Laboratory of Bioelectronics, School of Biological Science and Medical Engineering, Southeast University, Nanjing, China
- Yijie Ding
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, Zhejiang, China; Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Pengfeng Xiao
- State Key Laboratory of Bioelectronics, School of Biological Science and Medical Engineering, Southeast University, Nanjing, China
- Tao He
- Beidahuang Industry Group General Hospital, Harbin, China
- Correspondence: Pengfeng Xiao; Tao He; Yijie Ding
|
32
|
Jiang L, Jiang J, Wang X, Zhang Y, Zheng B, Liu S, Zhang Y, Liu C, Wan Y, Xiang D, Lv Z. iUP-BERT: Identification of Umami Peptides Based on BERT Features. Foods 2022; 11. [PMID: 36429332 DOI: 10.3390/foods11223742] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/23/2022] [Revised: 11/14/2022] [Accepted: 11/16/2022] [Indexed: 11/23/2022] Open
Abstract
Umami is an important, widely used taste component of food seasoning. Umami peptides are specific structural peptides endowing foods with a favorable umami taste. Laboratory approaches used to identify umami peptides are time-consuming and labor-intensive, making them unsuitable for rapid screening. Here, we developed a novel peptide sequence-based umami peptide predictor, namely iUP-BERT, which was based on the deep learning pretrained neural network feature extraction method. After optimization, a single deep representation learning feature encoding method (BERT: bidirectional encoder representations from transformers) in conjunction with the synthetic minority over-sampling technique (SMOTE) and support vector machine (SVM) methods was adopted for model creation to generate predicted probabilistic scores of potential umami peptides. Further extensive empirical experiments on cross-validation and an independent test showed that iUP-BERT outperformed the existing methods, highlighting its effectiveness and robustness. Finally, an open-access iUP-BERT web server was built. To our knowledge, this is the first efficient sequence-based umami predictor created based on a single deep-learning pretrained neural network feature extraction method. By predicting umami peptides, iUP-BERT can help in further research to improve the palatability of dietary supplements in the future.
|
33
|
El Barakaz F, Boutkhoum O, Hanine M, El Moutaouakkil A, Rustam F, Din S, Ashraf I. Optimization of Imbalanced and Multidimensional Learning Under Bayes Minimum Risk and Savings Measure. Big Data 2022; 10:425-439. [PMID: 35723636 DOI: 10.1089/big.2021.0225] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/15/2023]
Abstract
The full potential of data analysis is crippled by imbalanced and high-dimensional data, which makes these topics significantly important. Consequently, substantial research efforts have been directed to obtain dimension reduction and resolve data imbalance, especially in the context of fraud detection analysis. This work aims to investigate the effectiveness of hybrid learning methods for alleviating the class imbalance and integrating dimensionality reduction techniques. In this regard, the current study examines different classification combinations to achieve optimal savings and improve classification performance. Against this background, several well-known machine learning models are selected such as logistic regression, random forest, CatBoost (CB), and XGBoost. These models are constructed and optimized based on Bayes minimum risk (BMR) associated with the oversampling method synthetic minority oversampling technique (SMOTE) and different feature selection (FS) techniques, both univariate and multivariate. To investigate the performance of the proposed approach, different possible scenarios are analyzed both with and without balancing, with and without FS, and optimization using BMR. With a major insight about the best method to use, BMR shows a good optimization when used with SMOTE, symmetrical uncertainty for FS, and CB as a boosted classifier, principally in terms of F1 score and savings metrics.
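The Bayes minimum risk (BMR) rule and the savings measure named in this abstract can be illustrated with a small sketch. It assumes a simple cost model in which correct decisions cost nothing and the two error types have fixed costs; the function names, the cost values, and the exact savings formula (relative cost reduction against a no-model baseline) are assumptions made for illustration and may differ from the definitions used in the paper.

```python
def bmr_predict(p_fraud, cost_fp, cost_fn):
    """Return 1 (flag as fraud) when flagging has the lower expected cost.

    Assumes correct decisions cost zero, so only error costs enter the risk.
    """
    risk_if_flagged = (1 - p_fraud) * cost_fp  # expected cost of a false alarm
    risk_if_passed = p_fraud * cost_fn         # expected cost of a missed fraud
    return 1 if risk_if_flagged < risk_if_passed else 0

def savings(cost_with_model, cost_without_model):
    """Relative cost reduction compared with a baseline that uses no model."""
    return (cost_without_model - cost_with_model) / cost_without_model

# A 10% fraud probability is still worth flagging when a miss is 20x costlier.
decision_expensive_miss = bmr_predict(p_fraud=0.1, cost_fp=5, cost_fn=100)
decision_cheap_miss = bmr_predict(p_fraud=0.1, cost_fp=5, cost_fn=10)
```

The point of the rule is that the decision threshold follows from the costs rather than being fixed at a probability of 0.5.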
Affiliation(s)
- Fatima El Barakaz
- Laroseri Laboratory, Faculty of Sciences, Chouaib Doukkali University, El Jadida, Morocco
- Omar Boutkhoum
- Laroseri Laboratory, Faculty of Sciences, Chouaib Doukkali University, El Jadida, Morocco
- Mohamed Hanine
- Department of Telecommunications, Networks and Informatics, LTI Laboratory, ENSA, Chouaib Doukkali University, El Jadida, Morocco
- Furqan Rustam
- Department of Software Engineering, School of Systems and Technology, University of Management and Technology, Lahore, Pakistan
- Sadia Din
- Department of Information and Communication Engineering, Yeungnam University, Gyeongsan-si, Republic of Korea
- Imran Ashraf
- Department of Information and Communication Engineering, Yeungnam University, Gyeongsan-si, Republic of Korea
|
34
|
Okey OD, Maidin SS, Adasme P, Lopes Rosa R, Saadi M, Carrillo Melgarejo D, Zegarra Rodríguez D. BoostedEnML: Efficient Technique for Detecting Cyberattacks in IoT Systems Using Boosted Ensemble Machine Learning. Sensors (Basel) 2022; 22:7409. [PMID: 36236506 PMCID: PMC9572777 DOI: 10.3390/s22197409] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/08/2022] [Revised: 09/19/2022] [Accepted: 09/22/2022] [Indexed: 06/16/2023]
Abstract
Following the recent advances in wireless communication leading to increased Internet of Things (IoT) systems, many security threats are currently ravaging IoT systems, causing harm to information. Considering the vast application areas of IoT systems, ensuring that cyberattacks are holistically detected to avoid harm is paramount. Machine learning (ML) algorithms have demonstrated high capacity in helping to mitigate attacks on IoT devices and other edge systems with reasonable accuracy. However, the dynamics of operation of intruders in IoT networks require improved intrusion detection system (IDS) models capable of detecting multiple attacks with a higher detection rate and lower computational resource requirement, which is one of the challenges of IoT systems. Many ensemble methods have been used with different ML classifiers, including decision trees and random forests, to propose IDS models for IoT environments. The boosting method is one of the approaches used to design an ensemble classifier. This paper proposes an efficient method for detecting cyberattacks and network intrusions based on boosted ML classifiers. Our proposed model is named BoostedEnML. First, we train six different ML classifiers (DT, RF, ET, LGBM, AD, and XGB) and obtain an ensemble using the stacking method and another with a majority voting approach. Two different datasets containing high-profile attacks, including distributed denial of service (DDoS), denial of service (DoS), botnets, infiltration, web attacks, Heartbleed, and portscan, were used to train, evaluate, and test the IDS model. To ensure that we obtained a holistic and efficient model, we performed data balancing with synthetic minority oversampling technique (SMOTE) and adaptive synthetic (ADASYN) techniques; after that, we used stratified K-fold to split the data into training, validation, and testing sets. Based on the best two models, we construct our proposed BoostedEnML model using LightGBM and XGBoost, as the combination of the two classifiers gives a lightweight yet efficient model, which is part of the target of this research. Experimental results show that BoostedEnML outperformed existing ensemble models in terms of accuracy, precision, recall, F-score, and area under the curve (AUC), reaching 100% in each case on the selected datasets for multiclass classification.
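The stratified K-fold split mentioned in this abstract keeps the class ratio roughly equal across folds. A minimal sketch of that idea in plain Python follows; it is not the authors' code, it omits shuffling for brevity, and the function name `stratified_folds` and the toy labels are assumptions.

```python
from collections import defaultdict

def stratified_folds(labels, k):
    """Return a fold index per sample so each fold keeps roughly the class ratio."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [0] * len(labels)
    # Deal the samples of each class round-robin across the k folds.
    for indices in by_class.values():
        for pos, i in enumerate(indices):
            folds[i] = pos % k
    return folds

# Toy label vector with a 1:2 class ratio (hypothetical).
labels = ["attack"] * 4 + ["benign"] * 8
folds = stratified_folds(labels, k=4)
```

In this toy case each of the four folds receives one "attack" and two "benign" samples, mirroring the overall 1:2 ratio.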
Affiliation(s)
- Ogobuchi Daniel Okey
- Department of Systems Engineering and Automation, Federal University of Lavras, Lavras 37203-202, MG, Brazil
- Siti Sarah Maidin
- Faculty of Data Science and Information Technology (FDSIT), INTI International University, Nilai 71800, Malaysia
- Pablo Adasme
- Department of Electrical Engineering, University of Santiago de Chile, Santiago 9170124, Chile
- Renata Lopes Rosa
- Department of Computer Science, Federal University of Lavras, Lavras 37200-000, MG, Brazil
- Muhammad Saadi
- Department of Electrical Engineering, University of Central Punjab, Lahore 54000, Pakistan
- Dick Carrillo Melgarejo
- Department of Electrical Engineering, School of Energy Systems, Lappeenranta-Lahti University of Technology, FI-53851 Lappeenranta, Finland
|
35
|
Yan Y, Bao X, Chen B, Li Y, Yin J, Zhu G, Li Q. Interpretable machine learning framework reveals microbiome features of oral disease. Microbiol Res 2022; 265:127198. [PMID: 36126491 DOI: 10.1016/j.micres.2022.127198] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 03/02/2022] [Revised: 08/25/2022] [Accepted: 09/13/2022] [Indexed: 11/16/2022]
Abstract
BACKGROUND Although the oral microbiome plays an important role in the progression of oral diseases, the microbes closely related to these diseases remain largely uncharacterized. RESULTS We collected saliva samples from 140 individuals and performed 16S amplicon sequencing. An interpretable machine learning framework for imbalanced high-dimensional big data of clinical microbial samples was developed to identify 14 oral microbiome features associated with oral diseases. Microbiome risk scores (MRSs) with the identified features were constructed with SHapley Additive exPlanations (SHAP). Correlations of the MRSs with individual physiological indicators and lifestyle habits were calculated. CONCLUSION Our results reveal a set of oral microbiome features associated with oral diseases. Our study demonstrates the feasibility of preventing oral disease through lifestyle interventions and provides a reference method for the era of precision medicine aimed at individualized medicine.
Affiliation(s)
- Yueyang Yan
  - Key Laboratory for Zoonoses Research of the Ministry of Education, Institute of Zoonosis, College of Veterinary Medicine, Jilin University, Changchun 130062, China
- Xin Bao
  - Hospital of Stomatology, Jilin University, 1500 Qinghua Road, Changchun 130021, China
- Bohua Chen
  - Department of Stomatology, The Fifth Affiliated Hospital of Sun Yat-sen University, 52 Meihua East Road, Xiangzhou District, Zhuhai City, Guangdong Province, China
- Ying Li
  - Key Laboratory of Symbol Computation and Knowledge Engineering, Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun 130012, China
- Jigang Yin
  - Key Laboratory for Zoonoses Research of the Ministry of Education, Institute of Zoonosis, College of Veterinary Medicine, Jilin University, Changchun 130062, China
- Guan Zhu
  - Key Laboratory for Zoonoses Research of the Ministry of Education, Institute of Zoonosis, College of Veterinary Medicine, Jilin University, Changchun 130062, China
- Qiushi Li
  - Key Laboratory for Zoonoses Research of the Ministry of Education, Institute of Zoonosis, College of Veterinary Medicine, Jilin University, Changchun 130062, China
  - Department of Stomatology, The Fifth Affiliated Hospital of Sun Yat-sen University, 52 Meihua East Road, Xiangzhou District, Zhuhai City, Guangdong Province, China

36
Tang M, Meng C, Wu H, Zhu H, Yi J, Tang J, Wang Y. Fault Detection for Wind Turbine Blade Bolts Based on GSG Combined with CS-LightGBM. Sensors (Basel) 2022; 22:s22186763. [PMID: 36146110 PMCID: PMC9505918 DOI: 10.3390/s22186763] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 08/10/2022] [Revised: 08/24/2022] [Accepted: 08/25/2022] [Indexed: 05/27/2023]
Abstract
To address class imbalance in the operation-monitoring dataset for wind turbine blade bolts, a fault detection method based on Gaussian Mixture Model-Synthetic Minority Oversampling Technique-Gaussian Mixture Model (GSG) combined with Cost-Sensitive LightGBM (CS-LightGBM) was proposed. Since it is difficult to obtain fault samples of blade bolts, the GSG oversampling method was constructed to increase the fault samples in the blade bolt dataset. The method obtains the optimal number of clusters through the BIC criterion and uses a GMM with that number of clusters to cluster the fault samples in the blade bolt dataset. According to the density distribution of fault samples across clusters, we synthesized new fault samples within each cluster using SMOTE. This retains the distribution characteristics of the original fault class samples. Then, we used the GMM with the same initial cluster centers to cluster the fault class samples together with the new samples, and removed the synthetic fault class samples that were not clustered into the corresponding clusters. Finally, the synthetic data training set was used to train the CS-LightGBM fault detection model. Additionally, the hyperparameters of CS-LightGBM were optimized by the Bayesian optimization algorithm to obtain the optimal CS-LightGBM fault detection model. The experimental results show that, compared with six models including SMOTE-LightGBM, CS-LightGBM, and K-means-SMOTE-LightGBM, the proposed fault detection model achieves a better false alarm rate, missing alarm rate, and F1-score. The method can effectively detect faults in large wind turbine blade bolts.
Affiliation(s)
- Mingzhu Tang
  - School of Energy and Power Engineering, Changsha University of Science & Technology, Changsha 410114, China
- Caihua Meng
  - School of Energy and Power Engineering, Changsha University of Science & Technology, Changsha 410114, China
- Huawei Wu
  - Hubei Key Laboratory of Power System Design and Test for Electrical Vehicle, Hubei University of Arts and Science, Xiangyang 441053, China
- Hongqiu Zhu
  - School of Automation, Central South University, Changsha 410083, China
- Jiabiao Yi
  - School of Energy and Power Engineering, Changsha University of Science & Technology, Changsha 410114, China
- Jun Tang
  - School of Energy and Power Engineering, Changsha University of Science & Technology, Changsha 410114, China
- Yifan Wang
  - School of Energy and Power Engineering, Changsha University of Science & Technology, Changsha 410114, China

37
Prasetiyowati MI, Maulidevi NU, Surendro K. The accuracy of Random Forest performance can be improved by conducting a feature selection with a balancing strategy. PeerJ Comput Sci 2022; 8:e1041. [PMID: 35875646 PMCID: PMC9299283 DOI: 10.7717/peerj-cs.1041] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 12/29/2021] [Accepted: 06/22/2022] [Indexed: 06/12/2023]
Abstract
One of the significant purposes of building a model is to increase its accuracy within a shorter timeframe through the feature selection process. Feature selection is carried out by determining the importance of the available features in a dataset using Information Gain (IG). The process calculates the amount of information contained in each feature, and features with high values are selected to accelerate the algorithm. In selecting informative features, IG uses a threshold value (cut-off). Therefore, this research aims to determine the time and accuracy performance needed to improve feature selection by integrating IG, the Fast Fourier Transform (FFT), and Synthetic Minority Oversampling Technique (SMOTE) methods. The feature selection model is then applied to the Random Forest, a tree-based machine learning algorithm with random feature selection. A total of eight datasets consisting of three balanced and five imbalanced datasets were used to conduct this research. Furthermore, SMOTE was applied to the imbalanced datasets to balance the data. The result showed that the feature selection using Information Gain, FFT, and SMOTE improved the performance accuracy of Random Forest.
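The threshold-based use of Information Gain described in this abstract can be illustrated with a minimal sketch. It covers only the IG cut-off step for discrete features, not the FFT or SMOTE stages, and it is not the authors' implementation; the function names, the toy columns, and the 0.5 cut-off are all assumptions made for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(feature_values, labels):
    """Label entropy minus the weighted entropy after splitting on a discrete feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def select_features(columns, labels, threshold):
    """Keep the names of features whose information gain reaches the cut-off."""
    return [name for name, values in columns.items()
            if information_gain(values, labels) >= threshold]

# Hypothetical toy data: feature "a" mirrors the label, feature "b" is unrelated to it.
labels = [0, 0, 1, 1]
columns = {"a": [0, 0, 1, 1], "b": [0, 1, 0, 1]}
selected = select_features(columns, labels, threshold=0.5)
print(selected)
```

In this toy case feature "a" carries one full bit of information about the label and "b" carries none, so only "a" passes the cut-off. Continuous features would need to be discretized first, which this sketch does not attempt.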
Affiliation(s)
- Maria Irmina Prasetiyowati
  - Doctoral Program of Electrical Engineering and Informatics, School of Electrical Engineering and Informatics, Institut Teknologi Bandung, Bandung, Jawa Barat, Indonesia
- Nur Ulfa Maulidevi
  - Department of Electrical Engineering and Informatics, School of Electrical Engineering and Informatics, Institut Teknologi Bandung, Bandung, Jawa Barat, Indonesia
- Kridanto Surendro
  - Department of Electrical Engineering and Informatics, School of Electrical Engineering and Informatics, Institut Teknologi Bandung, Bandung, Jawa Barat, Indonesia

38
Kogut T, Tomczak A, Słowik A, Oberski T. Seabed Modelling by Means of Airborne Laser Bathymetry Data and Imbalanced Learning for Offshore Mapping. Sensors (Basel) 2022; 22:s22093121. [PMID: 35590809 PMCID: PMC9100212 DOI: 10.3390/s22093121] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 03/23/2022] [Revised: 04/15/2022] [Accepted: 04/18/2022] [Indexed: 11/16/2022]
Abstract
An important problem associated with the aerial mapping of the seabed is the precise classification of point clouds characterizing the water surface, bottom, and bottom objects. This study aimed to improve the accuracy of classification by addressing the asymmetric amount of data representing these three groups. A total of 53 Synthetic Minority Oversampling Technique (SMOTE) algorithms were adjusted and evaluated to balance the amount of data. The prepared data set was used to train the Multi-Layer Perceptron (MLP) neural network used for classifying the point cloud. Data balancing contributed to significantly increasing the accuracy of classification. The best overall classification accuracy achieved varied from 95.8% to 97.0%, depending on the oversampling algorithm used, and was significantly better than the classification accuracy obtained for unbalanced data and data with downsampling (89.6% and 93.5%, respectively). Some of the algorithms allow for 10% increased detection of points on the objects compared to unbalanced data or data with simple downsampling. The results suggest that the use of selected oversampling algorithms can aid in improving the point cloud classification and making the airborne laser bathymetry technique more appropriate for seabed mapping.
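For readers unfamiliar with the oversampling step that the evaluated variants build on, here is a minimal sketch of the basic SMOTE idea: a synthetic point is placed at a random position on the segment between a minority sample and one of its nearest minority neighbours. It is a plain-Python illustration, not any of the 53 algorithms evaluated in the study; the function name, the default `k`, and the toy points are assumptions.

```python
import math
import random

def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Neighbours are searched among the other minority points only.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        mate = rng.choice(neighbours)
        t = rng.random()  # position along the segment from base to mate
        synthetic.append(tuple(b + t * (m - b) for b, m in zip(base, mate)))
    return synthetic

# Hypothetical minority class: the four corners of the unit square.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote(minority, n_new=6)
print(len(new_points))
```

Because every synthetic point lies on a segment between two existing minority samples, the new points stay inside the region already spanned by the minority class. The sketch needs at least two minority samples and uses a fixed seed so that repeated runs give the same points.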
Affiliation(s)
- Tomasz Kogut
  - Department of Geodesy and Offshore Survey, Maritime University of Szczecin, Żołnierska 46, 71-250 Szczecin, Poland
- Arkadiusz Tomczak
  - Department of Geodesy and Offshore Survey, Maritime University of Szczecin, Żołnierska 46, 71-250 Szczecin, Poland
- Adam Słowik
  - Department of Computer Engineering, Koszalin University of Technology, Sniadeckich 2, 75-453 Koszalin, Poland
- Tomasz Oberski
  - Department of Geodesy and Geoinformatics, Koszalin University of Technology, Sniadeckich 2, 75-453 Koszalin, Poland

39
Kumari M, Subbarao N. A hybrid resampling algorithms SMOTE and ENN based deep learning models for identification of Marburg virus inhibitors. Future Med Chem 2022. [PMID: 35393862 DOI: 10.4155/fmc-2021-0290] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 01/01/2023]
Abstract
Background: Marburg virus (MARV) causes sporadic outbreaks of a zoonotic disease with lethal hemorrhagic fever in humans. We propose a deep learning model with resampling techniques to predict the inhibitory activity of unknown compounds against MARV in the virtual screening process. Methodology & results: We applied resampling techniques to solve the imbalanced data problem. The classifier model comparisons revealed that the hybrid model of synthetic minority oversampling technique - edited nearest neighbor and artificial neural network (SMOTE-ENN + ANN) achieved better classification performance with 95% overall accuracy. The trained SMOTE-ENN + ANN hybrid model predicted the following lead molecules for MARV: 25 out of 87,043 from ChemDiv, four out of 340 from the ChEMBL anti-viral library, three out of 918 from the Phytochemical database, seven out of 419 from Natural products from NCI divsetIV, and 214 out of 112,267 from the Natural compounds ZINC database. Conclusion: Our studies reveal that the proposed SMOTE-ENN + ANN hybrid model can improve overall accuracy more effectively and predict new lead molecules against MARV.
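The cleaning half of a SMOTE-ENN hybrid can be sketched as follows. Edited nearest neighbours (ENN) drops a sample when its label disagrees with the majority label of its k nearest neighbours, which removes points left in overlapping regions after oversampling. This is a plain-Python illustration under stated assumptions, not the pipeline used in the study: the rule is applied here to every class, `k=3` is an arbitrary choice, and the toy data are hypothetical.

```python
import math
from collections import Counter

def edited_nearest_neighbours(points, labels, k=3):
    """Drop every sample whose label differs from the majority label
    of its k nearest neighbours (Euclidean distance)."""
    kept_points, kept_labels = [], []
    for i, (p, label) in enumerate(zip(points, labels)):
        order = sorted((j for j in range(len(points)) if j != i),
                       key=lambda j: math.dist(p, points[j]))
        votes = Counter(labels[j] for j in order[:k])
        majority = votes.most_common(1)[0][0]
        if majority == label:
            kept_points.append(p)
            kept_labels.append(label)
    return kept_points, kept_labels

# Hypothetical toy set: two well-separated classes plus one class-1 point
# stranded inside class 0, as an oversampler might produce.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (0.5, 0.5)]
labels = [0, 0, 0, 0, 1, 1, 1, 1, 1]
clean_points, clean_labels = edited_nearest_neighbours(points, labels)
print(len(clean_points))
```

In the toy set only the stranded class-1 point at (0.5, 0.5) is removed, because all of its three nearest neighbours belong to class 0. In a full SMOTE-ENN pipeline this step would run after the synthetic minority samples have been added.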

40
Kim J, Mun S, Lee S, Jeong K, Baek Y. Prediction of metabolic and pre-metabolic syndromes using machine learning models with anthropometric, lifestyle, and biochemical factors from a middle-aged population in Korea. BMC Public Health 2022; 22:664. [PMID: 35387629 PMCID: PMC8985311 DOI: 10.1186/s12889-022-13131-x] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Received: 11/26/2021] [Accepted: 03/30/2022] [Indexed: 01/10/2023]
Abstract
Background Metabolic syndrome (MetS) is a complex condition that appears as a cluster of metabolic abnormalities, and is closely associated with the prevalence of various diseases. Early prediction of the risk of MetS in the middle-aged population provides greater benefits for cardiovascular disease-related health outcomes. This study aimed to apply the latest machine learning techniques to find the optimal MetS prediction model for the middle-aged Korean population. Methods We retrieved 20 data types from the Korean Medicine Daejeon Citizen Cohort, a cohort study on a community-based population of adults aged 30–55 years. The data included sex, age, anthropometric data, lifestyle-related data, and blood indicators of 1991 individuals. Participants satisfying two (pre-MetS) or ≥ 3 (MetS) of the five NCEP-ATP III criteria were included in the MetS group. MetS prediction used nine machine learning models based on the following algorithms: Decision tree, Gaussian Naïve Bayes, K-nearest neighbor, eXtreme gradient boosting (XGBoost), random forest, logistic regression, support vector machine, multi-layer perceptron, and 1D convolutional neural network. All analyses were performed by sequentially inputting the features in three steps according to their characteristics. The models’ performances were compared after applying the synthetic minority oversampling technique (SMOTE) to resolve data imbalance. Results MetS was detected in 33.85% of the subjects. Among the MetS prediction models, the tree-based random forest and XGBoost models showed the best performance, which improved with the number of features used. As a measure of the models’ performance, the area under the receiver operating characteristic curve (AUC) increased by up to 0.091 when the SMOTE was applied, with XGBoost showing the highest AUC of 0.851. Body mass index and waist-to-hip ratio were identified as the most important features in the MetS prediction models for this population.
Conclusions Tree-based machine learning models were useful in identifying MetS with high accuracy in middle-aged Koreans. Early diagnosis of MetS is important and requires a multidimensional approach that includes self-administered questionnaire, anthropometric, and biochemical measurements.
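Since the abstract compares models by AUC, a short worked example of how that number is obtained may help. The sketch below uses the pairwise interpretation: AUC is the fraction of (positive, negative) pairs in which the positive case receives the higher score, with ties counted as one half. The function name and the four toy scores are assumptions for illustration and have no connection to the study's data.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive is scored above
    a random negative (ties count one half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical predictions for four subjects (1 = condition present).
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)
print(auc)  # 3 of the 4 positive-negative pairs are ordered correctly -> 0.75
```

The quadratic pairwise loop is chosen for readability; it assumes both classes are present and is practical only for small samples.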
Affiliation(s)
- Junho Kim
  - KM Data Division, Korea Institute of Oriental Medicine, 1672 Yuseongdae-ro, Yuseong-gu, Daejeon, Republic of Korea
- Sujeong Mun
  - KM Data Division, Korea Institute of Oriental Medicine, 1672 Yuseongdae-ro, Yuseong-gu, Daejeon, Republic of Korea
- Siwoo Lee
  - KM Data Division, Korea Institute of Oriental Medicine, 1672 Yuseongdae-ro, Yuseong-gu, Daejeon, Republic of Korea
- Kyoungsik Jeong
  - KM Data Division, Korea Institute of Oriental Medicine, 1672 Yuseongdae-ro, Yuseong-gu, Daejeon, Republic of Korea
- Younghwa Baek
  - KM Data Division, Korea Institute of Oriental Medicine, 1672 Yuseongdae-ro, Yuseong-gu, Daejeon, Republic of Korea

41
Liu X, Fu L, Chun-Wei Lin J, Liu S. SRAS-net: Low-resolution chromosome image classification based on deep learning. IET Syst Biol 2022; 16:85-97. [PMID: 35373918 PMCID: PMC9290780 DOI: 10.1049/syb2.12042] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 01/12/2022] [Revised: 02/14/2022] [Accepted: 03/15/2022] [Indexed: 12/03/2022]
Abstract
Prenatal karyotype diagnosis is important to determine if the foetus has genetic diseases and some congenital diseases. Chromosome classification is an important part of karyotype analysis, and the task is tedious and lengthy. Chromosome classification methods based on deep learning have achieved good results, but if the quality of the chromosome image is not high, these methods cannot learn image features well, resulting in unsatisfactory classification results. Moreover, the existing methods generally have a poor effect on sex chromosome classification. Therefore, in this work, the authors propose to use a super‐resolution network, Self‐Attention Negative Feedback Network, and combine it with traditional neural networks to obtain an efficient chromosome classification method called SRAS‐net. The method first inputs the low‐resolution chromosome images into the super‐resolution network to generate high‐resolution chromosome images and then uses the traditional deep learning model to classify the chromosomes. To solve the problem of inaccurate sex chromosome classification, the authors also propose to use the SMOTE algorithm to generate a small number of sex chromosome samples to ensure a balanced number of samples while allowing the model to learn more sex chromosome features. Experimental results show that our method achieves 97.55% accuracy and is better than state‐of‐the‐art methods.
Affiliation(s)
- Xiangbin Liu
  - Hunan Provincial Key Laboratory of Intelligent Computing and Language Information Processing, Hunan Normal University, Changsha, China
  - College of Information Science and Engineering, Hunan Normal University, Changsha, China
  - Hunan Xiangjiang Artificial Intelligence Academy, Changsha, China
- Lijun Fu
  - Hunan Provincial Key Laboratory of Intelligent Computing and Language Information Processing, Hunan Normal University, Changsha, China
  - College of Information Science and Engineering, Hunan Normal University, Changsha, China
  - Hunan Xiangjiang Artificial Intelligence Academy, Changsha, China
- Jerry Chun-Wei Lin
  - Department of Computer Science, Electrical Engineering and Mathematical Sciences, Western Norway University of Applied Sciences, Bergen, Norway
- Shuai Liu
  - Hunan Provincial Key Laboratory of Intelligent Computing and Language Information Processing, Hunan Normal University, Changsha, China
  - College of Information Science and Engineering, Hunan Normal University, Changsha, China
  - Hunan Xiangjiang Artificial Intelligence Academy, Changsha, China

42
Anuntakarun S, Lertampaiporn S, Laomettachit T, Wattanapornprom W, Ruengjitchatchawalya M. mSRFR: a machine learning model using microalgal signature features for ncRNA classification. BioData Min 2022; 15:8. [PMID: 35313925 PMCID: PMC8935802 DOI: 10.1186/s13040-022-00291-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/31/2021] [Accepted: 02/06/2022] [Indexed: 11/10/2022]
Abstract
This work presents mSRFR (microalgae SMOTE Random Forest Relief model), a classification tool for noncoding RNAs (ncRNAs) in microalgae, including green algae, diatoms, golden algae, and cyanobacteria. First, the SMOTE technique was applied to address the challenge of imbalanced data due to the different numbers of microalgae ncRNAs from different species in the EBI RNA-central database. Then the top 20 significant features from a total of 106 features, including sequence-based, secondary structure, base-pair, and triplet sequence-structure features, were selected using the Relief feature selection method. Next, ten-fold cross-validation was applied to choose a classifier algorithm with the highest performance among Support Vector Machine, Random Forest, Decision Tree, Naïve Bayes, K-nearest Neighbor, and Neural Network, based on the receiver operating characteristic (ROC) area. The results showed that the Random Forest classifier achieved the highest ROC area of 0.992. Then, the Random Forest algorithm was selected and compared with other tools, including RNAcon, CPC, CPC2, CNCI, and CPPred. Our model achieved a high accuracy of about 97% and a low false-positive rate of about 2% in predicting the test dataset of microalgae. Furthermore, the top features from Relief revealed that the %GA dinucleotide is a signature feature of microalgal ncRNAs when compared to Escherichia coli, Saccharomyces cerevisiae, Arabidopsis thaliana, and Homo sapiens.
Affiliation(s)
- Songtham Anuntakarun
  - Bioinformatics and Systems Biology Program, School of Bioresources and Technology, King Mongkut's University of Technology Thonburi (KMUTT), Bangkok, 10150, Thailand
  - School of Information Technology, KMUTT, Bang Mod, Thung Khru, Bangkok, 10140, Thailand
- Supatcha Lertampaiporn
  - Biochemical Engineering and Systems Biology Research Group, National Center for Genetic Engineering and Biotechnology (BIOTEC), National Science and Technology Development Agency at King Mongkut's University of Technology Thonburi, Bang Khun Thian, Bangkok, 10150, Thailand
- Teeraphan Laomettachit
  - Bioinformatics and Systems Biology Program, School of Bioresources and Technology, King Mongkut's University of Technology Thonburi (KMUTT), Bangkok, 10150, Thailand
- Marasri Ruengjitchatchawalya
  - Bioinformatics and Systems Biology Program, School of Bioresources and Technology, King Mongkut's University of Technology Thonburi (KMUTT), Bangkok, 10150, Thailand
  - Biotechnology program, School of Bioresources and Technology, KMUTT, Bang Khun Thian, Bangkok, 10150, Thailand
  - Algal Biotechnology Research Group, Pilot Plant Development and Training Institute (PDTI), KMUTT, Bang Khun Thian, Bangkok, 10150, Thailand

43
Vu BN, Bi J, Wang W, Huff A, Kondragunta S, Liu Y. Application of geostationary satellite and high-resolution meteorology data in estimating hourly PM2.5 levels during the Camp Fire episode in California. Remote Sens Environ 2022; 271:112890. [PMID: 37033879 PMCID: PMC10081518 DOI: 10.1016/j.rse.2022.112890] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Indexed: 05/04/2023]
Abstract
Wildland fire smoke contains large amounts of PM2.5 that can traverse tens to hundreds of kilometers, resulting in significant deterioration of air quality and excess mortality and morbidity in downwind regions. Estimating PM2.5 levels while considering the impact of wildfire smoke has been challenging due to the lack of ground monitoring coverage near the smoke plumes. We aim to estimate total PM2.5 concentration during the Camp Fire episode, the deadliest wildland fire in California history. Our random forest (RF) model combines calibrated low-cost sensor data (PurpleAir) with regulatory monitor measurements (Air Quality System, AQS) to bolster ground observations, Geostationary Operational Environmental Satellite-16 (GOES-16)'s high temporal resolution to achieve hourly predictions, and oversampling techniques (Synthetic Minority Oversampling Technique, SMOTE) to reduce model underestimation at high PM2.5 levels. In addition, meteorological fields at 3 km resolution from the High-Resolution Rapid Refresh model and land use variables were also included in the model. Our AQS-only model achieved an out of bag (OOB) R² (RMSE) of 0.84 (12.00 μg/m³) and spatial and temporal cross-validation (CV) R² (RMSE) of 0.74 (16.28 μg/m³) and 0.73 (16.58 μg/m³), respectively. Our AQS + Weighted PurpleAir Model achieved OOB R² (RMSE) of 0.86 (9.52 μg/m³) and spatial and temporal CV R² (RMSE) of 0.75 (14.93 μg/m³) and 0.79 (11.89 μg/m³), respectively. Our AQS + Weighted PurpleAir + SMOTE Model achieved OOB R² (RMSE) of 0.92 (10.44 μg/m³) and spatial and temporal CV R² (RMSE) of 0.84 (12.36 μg/m³) and 0.85 (14.88 μg/m³), respectively. Hourly predictions from our model may aid in epidemiological investigations of intense and acute exposure to PM2.5 during the Camp Fire episode.
Affiliation(s)
- Bryan N. Vu
  - Gangarosa Department of Environmental Health, Rollins School of Public Health, Emory University, Atlanta, GA, United States
  - Department of Environmental Health, Harvard T.H. Chan School of Public Health, Harvard University, Boston, MA, United States
- Jianzhao Bi
  - Department of Environmental & Occupational Health Sciences, School of Public Health, University of Washington, Seattle, WA, United States
- Wenhao Wang
  - Gangarosa Department of Environmental Health, Rollins School of Public Health, Emory University, Atlanta, GA, United States
- Amy Huff
  - I.M. Systems Group, 5825 University Research Ct, Suite 3250, College Park, MD, United States
- Shobha Kondragunta
  - Satellite Meteorology and Climatology Division, STAR Center for Satellite Applications and Research, National Oceanic and Atmospheric Administration, Washington, DC, United States
- Yang Liu
  - Gangarosa Department of Environmental Health, Rollins School of Public Health, Emory University, Atlanta, GA, United States

44
Chen PN, Lee CC, Liang CM, Pao SI, Huang KH, Lin KF. General deep learning model for detecting diabetic retinopathy. BMC Bioinformatics 2021; 22:84. [PMID: 34749634 PMCID: PMC8576963 DOI: 10.1186/s12859-021-04005-x] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 02/04/2021] [Accepted: 02/08/2021] [Indexed: 01/04/2023]
Abstract
BACKGROUND Doctors can detect symptoms of diabetic retinopathy (DR) early by using retinal ophthalmoscopy, and they can improve diagnostic efficiency with the assistance of deep learning to select treatments and support personnel workflow. Conventionally, most deep learning methods for DR diagnosis categorize retinal ophthalmoscopy images into training and validation data sets according to the 80/20 rule, and they use the synthetic minority oversampling technique (SMOTE) in data processing (e.g., rotating, scaling, and translating training images) to increase the number of training samples. Oversampling training may lead to overfitting of the training model. Therefore, untrained or unverified images can yield erroneous predictions. Although the accuracy of prediction results is 90%-99%, this overfitting of training data may distort training module variables. RESULTS This study uses a 2-stage training method to solve the overfitting problem. In the training phase, Learning module 1 was used to distinguish DR from no-DR, and Learning module 2 was trained on SMOTE synthetic datasets to classify mild NPDR, moderate NPDR, severe NPDR, and proliferative DR. Both modules also used early stopping and data dividing methods to reduce the overfitting caused by oversampling. In the test phase, we used the DIARETDB0, DIARETDB1, eOphtha, MESSIDOR, and DRIVE datasets to evaluate the performance of the trained network. The prediction accuracy reached 85.38%, 84.27%, 85.75%, 86.73%, and 92.5%, respectively. CONCLUSIONS Based on the experiment, a general deep learning model for detecting DR was developed, and it could be used with all DR databases. We provided a simple method of addressing the imbalance of DR databases, and this method can be used with other medical images.
Affiliation(s)
- Ping-Nan Chen
  - Department of Biomedical Engineering, National Defense Medical Center, Taipei, 114, Taiwan, ROC
- Chia-Chiang Lee
  - Graduate Institute of Applied Science and Technology, National Taiwan University of Science and Technology, Taipei, 106, Taiwan, ROC
- Chang-Min Liang
  - Department of Ophthalmology, Tri-Service General Hospital, National Defense Medical Center, Taipei, 114, Taiwan, ROC
- Shu-I Pao
  - Department of Ophthalmology, Tri-Service General Hospital, National Defense Medical Center, Taipei, 114, Taiwan, ROC
- Ke-Hao Huang
  - Department of Ophthalmology, Tri-Service General Hospital, National Defense Medical Center, Taipei, 114, Taiwan, ROC
- Ke-Feng Lin
  - Graduate Institute of Applied Science and Technology, National Taiwan University of Science and Technology, Taipei, 106, Taiwan, ROC
  - Department of Medical Records, Tri-Service General Hospital, National Defense Medical Center, Taipei, 114, Taiwan, ROC

45
Qasim HM, Ata O, Ansari MA, Alomary MN, Alghamdi S, Almehmadi M. Hybrid Feature Selection Framework for the Parkinson Imbalanced Dataset Prediction Problem. Medicina (Kaunas) 2021; 57:1217. [PMID: 34833435 DOI: 10.3390/medicina57111217] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 09/12/2021] [Revised: 10/29/2021] [Accepted: 11/05/2021] [Indexed: 11/16/2022]
Abstract
Background and Objectives: Recently, many studies have focused on the early detection of Parkinson's disease (PD). This disease belongs to a group of neurological problems that immediately affect brain cells and influence the movement, hearing, and various cognitive functions. Medical data sets are often not equally distributed in their classes and this gives a bias in the classification of patients. We developed a hybrid feature selection framework that can deal with imbalanced datasets such as PD. The SMOTE algorithm was used to handle the unbalanced data, and Recursive Feature Elimination (RFE) and Principal Component Analysis (PCA) were used to remove contradictory features from the dataset and decrease the processing time. Materials and Methods: PD acoustic datasets and the characteristics of control subjects were used to construct classification models such as Bagging, K-nearest neighbour (KNN), multilayer perceptron, and the support vector machine (SVM). In the preprocessing stage, the synthetic minority over-sampling technique (SMOTE) was used together with two feature selection methods, RFE and PCA. The PD dataset comprises a large difference between the numbers of the infected and uninfected patients, which causes the classification bias problem. Therefore, SMOTE was used to resolve this problem. Results: For model evaluation, the train-test split technique was used for the experiment. All the models were tuned by grid search; the SVM model showed the highest accuracy of 98.2%, and the KNN model exhibited the highest specificity of 99%. Conclusions: The proposed method was compared with current methods of detecting Parkinson's disease and with methods for other medical diseases. Our system could treat data bias and reach high predictive performance for PD, which can help health organizations to properly prioritize assets.

46
Zhang Y, Jiang Z, Chen C, Wei Q, Gu H, Yu B. DeepStack-DTIs: Predicting Drug-Target Interactions Using LightGBM Feature Selection and Deep-Stacked Ensemble Classifier. Interdiscip Sci 2021; 14:311-330. [PMID: 34731411 DOI: 10.1007/s12539-021-00488-7] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Received: 06/26/2021] [Revised: 10/19/2021] [Accepted: 10/21/2021] [Indexed: 12/12/2022]
Abstract
Accurate prediction of drug-target interactions (DTIs), which is often used in drug discovery and drug repositioning, is regarded as a key challenge in drug science. In this paper, a new method called DeepStack-DTIs is proposed to predict DTIs. First, for the target protein, the pseudo-position specific score matrix (PsePSSM), pseudo amino acid composition (PseAAC) and SPIDER3 are used to extract different feature information. Meanwhile, the path-based fingerprint features of each drug are extracted. Then, the synthetic minority oversampling technique (SMOTE) and light gradient boosting machine (LightGBM) are used for data balancing and feature selection, respectively. Finally, the processed features are fed into a deep-stacked ensemble classifier composed of a gated recurrent unit (GRU), deep neural network (DNN), support vector machine (SVM), eXtreme gradient boosting (XGBoost) and logistic regression (LR) to predict DTIs. Under five-fold cross-validation, the proposed method achieves higher prediction accuracy than existing methods on the gold standard dataset. To evaluate the predictive power of DeepStack-DTIs, we validate the method on another dataset and predict the drug-target interaction network. The results indicate that DeepStack-DTIs has better predictive ability than the other methods and provides novel insights for the prediction of DTIs.
DeepStack-DTIs is a novel method for drug-target interaction prediction. PsePSSM, PseAAC, SPIDER3 and FP2 are fused to convert protein sequence and drug molecule information into numerical features. The SMOTE algorithm is used to balance the dataset, and the LightGBM feature selection algorithm removes redundant and irrelevant features to select the optimal feature subset. This optimal feature subset is fed into the deep-stacked ensemble classifier to predict drug-target interactions. The experimental results show that the DeepStack-DTIs method can significantly improve the prediction accuracy of drug-target interactions.
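The data-balancing step named in the abstract above (SMOTE) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `smote_sketch`, the toy points, and the parameter defaults are hypothetical, and the sketch covers only the core idea of interpolating between a minority sample and one of its nearest minority neighbours.

```python
import random

def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points, each lying on the segment between a
    minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sq_dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic

# Hypothetical toy data: four minority-class points in two dimensions
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_sketch(minority, n_new=6)
```

Because each synthetic point is a convex combination of two real minority points, it always stays inside the bounding box of the minority class. It can still fall in a region occupied by majority samples.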
Affiliation(s)
- Yan Zhang
  - College of Mechanical and Electrical Engineering, Qingdao University of Science and Technology, Qingdao, 266061, China; College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China; Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao, 266061, China
- Zhiwen Jiang
  - College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China; Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao, 266061, China
- Cheng Chen
  - School of Computer Science and Technology, Shandong University, Qingdao, 266237, China
- Qinqin Wei
  - College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China; Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao, 266061, China
- Haiming Gu
  - College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China; Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao, 266061, China
- Bin Yu
  - College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China; Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao, 266061, China; Key Laboratory of Computational Science and Application of Hainan Province, Haikou, 571158, China
47
Venkata Vara Prasad D, Senthil Kumar P, Venkataramana LY, Prasannamedha G, Harshana S, Jahnavi Srividya S, Harrinei K, Indraganti S. Automating water quality analysis using ML and auto ML techniques. Environ Res 2021; 202:111720. [PMID: 34297938 DOI: 10.1016/j.envres.2021.111720] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/03/2021] [Revised: 07/02/2021] [Accepted: 07/09/2021] [Indexed: 06/13/2023]
Abstract
The generation of untreated effluents, municipal refuse and factory wastes, and the dumping of compostable and non-compostable effluents, have heavily contaminated natural water bodies such as rivers, lakes and ponds. Therefore, water quality needs to be checked before use. This is a problem that can greatly benefit from Artificial Intelligence (AI). Traditional methods require human inspection and are time consuming. Automatic Machine Learning (AutoML) tools offer machine learning at the push of a button or, at a minimum, keep algorithm execution, data pipelines, and code largely out of sight, and they are expected to be a stepping stone toward normalising AI. However, AutoML is still a field under research. This work aims to identify the areas where an AutoML system falls short of, or outperforms, a traditional expert system built by data scientists. With this motive, the work compares AutoML with an expert Machine Learning (ML) architecture built by the authors for Water Quality Assessment. Both are used to evaluate the Water Quality Index, which gives the general water quality, and the Water Quality Class, which is assigned on the basis of the Water Quality Index. The results show that the accuracy of AutoML and TPOT was 1.4% higher than that of conventional ML techniques for binary-class water data. For multi-class water data, AutoML was 0.5% higher and TPOT was 0.6% higher than conventional ML techniques.
Affiliation(s)
- D Venkata Vara Prasad
  - Department of Computer Science and Engineering, Sri Sivasubramaniya Nadar College of Engineering, Chennai, 603110, India; Centre of Excellence in Water Research (CEWAR), Sri Sivasubramaniya Nadar College of Engineering, Chennai, 603110, India
- P Senthil Kumar
  - Sri Sivasubramaniya Nadar College of Engineering, Department of Chemical Engineering, Chennai, 603110, India; Centre of Excellence in Water Research (CEWAR), Sri Sivasubramaniya Nadar College of Engineering, Chennai, 603110, India
- Lokeswari Y Venkataramana
  - Department of Computer Science and Engineering, Sri Sivasubramaniya Nadar College of Engineering, Chennai, 603110, India; Centre of Excellence in Water Research (CEWAR), Sri Sivasubramaniya Nadar College of Engineering, Chennai, 603110, India
- G Prasannamedha
  - Sri Sivasubramaniya Nadar College of Engineering, Department of Chemical Engineering, Chennai, 603110, India; Centre of Excellence in Water Research (CEWAR), Sri Sivasubramaniya Nadar College of Engineering, Chennai, 603110, India
- S Harshana
  - Department of Computer Science and Engineering, Sri Sivasubramaniya Nadar College of Engineering, Chennai, 603110, India
- S Jahnavi Srividya
  - Department of Computer Science and Engineering, Sri Sivasubramaniya Nadar College of Engineering, Chennai, 603110, India
- K Harrinei
  - Department of Computer Science and Engineering, Sri Sivasubramaniya Nadar College of Engineering, Chennai, 603110, India
- Sravya Indraganti
  - Sri Sivasubramaniya Nadar College of Engineering, Department of Chemical Engineering, Chennai, 603110, India; Centre of Excellence in Water Research (CEWAR), Sri Sivasubramaniya Nadar College of Engineering, Chennai, 603110, India
48
Garrafa E, Vezzoli M, Ravanelli M, Farina D, Borghesi A, Calza S, Maroldi R. Early prediction of in-hospital death of COVID-19 patients: a machine-learning model based on age, blood analyses, and chest x-ray score. eLife 2021; 10:70640. [PMID: 34661530 PMCID: PMC8550757 DOI: 10.7554/elife.70640] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 05/24/2021] [Accepted: 10/17/2021] [Indexed: 12/15/2022]
Abstract
An early-warning model to predict in-hospital mortality on admission of COVID-19 patients at an emergency department (ED) was developed and validated using a machine-learning model. In total, 2782 patients were enrolled between March 2020 and December 2020, including 2106 patients (first wave) and 676 patients (second wave) in the COVID-19 outbreak in Italy. The first-wave patients were divided into two groups, with 1474 patients used to train the model and 632 to validate it. The 676 patients in the second wave were used to test the model. Age, 17 blood analytes, and Brescia chest X-ray score were the variables processed using a random forests classification algorithm to build and validate the model. Receiver operating characteristic (ROC) analysis was used to assess model performance. A web-based death-risk calculator (https://bdbiomed.shinyapps.io/covid19score/) was implemented and integrated within the Laboratory Information System of the hospital. The final score was constructed from age (the most powerful predictor), blood analytes (the strongest predictors were lactate dehydrogenase, D-dimer, neutrophil/lymphocyte ratio, C-reactive protein, lymphocyte %, ferritin std, and monocyte %), and Brescia chest X-ray score. The areas under the ROC curve obtained for the three groups (training, validating, and testing) were 0.98, 0.83, and 0.78, respectively. The model predicts in-hospital mortality on the basis of data that can be obtained in a short time, directly at the ED on admission. It functions as a web-based calculator, providing a risk score that is easy to interpret. It can be used in the triage process to support the decision on patient allocation.
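The area under the ROC curve reported above can be read as the probability that a randomly chosen patient who died receives a higher risk score than a randomly chosen survivor. The sketch below computes AUC that way and checks the cohort arithmetic from the abstract. The function name `roc_auc` and the toy labels and scores are hypothetical, and this is not the study's code.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs in which the
    positive has the higher score; ties count as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# Cohort sizes stated in the abstract
train, validate, test = 1474, 632, 676
first_wave = train + validate   # 2106 first-wave patients
total = first_wave + test       # 2782 patients overall

# Hypothetical toy scores (1 = died in hospital, 0 = survived)
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.6, 0.4, 0.7, 0.8, 0.2]
auc = roc_auc(labels, scores)   # 8 of 9 pairs ranked correctly
```

The gap between the training AUC (0.98) and the validation and test AUCs (0.83 and 0.78) is why performance is judged on the held-out groups and not on the training split.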
Affiliation(s)
- Emirena Garrafa
  - Department of Molecular and Translational Medicine, University of Brescia, Brescia, Italy; ASST Spedali Civili di Brescia, Department of Laboratory, Brescia, Italy
- Marika Vezzoli
  - Department of Molecular and Translational Medicine, University of Brescia, Brescia, Italy
- Marco Ravanelli
  - Department of Medical and Surgical Specialties, Radiological Sciences and Public Health, University of Brescia, Brescia, Italy; ASST Spedali Civili di Brescia, Department of Radiology, Brescia, Italy
- Davide Farina
  - Department of Medical and Surgical Specialties, Radiological Sciences and Public Health, University of Brescia, Brescia, Italy; ASST Spedali Civili di Brescia, Department of Radiology, Brescia, Italy
- Andrea Borghesi
  - Department of Medical and Surgical Specialties, Radiological Sciences and Public Health, University of Brescia, Brescia, Italy; ASST Spedali Civili di Brescia, Department of Radiology, Brescia, Italy
- Stefano Calza
  - Department of Molecular and Translational Medicine, University of Brescia, Brescia, Italy
- Roberto Maroldi
  - Department of Medical and Surgical Specialties, Radiological Sciences and Public Health, University of Brescia, Brescia, Italy; ASST Spedali Civili di Brescia, Department of Radiology, Brescia, Italy
49
Hatzidaki E, Iliopoulos A, Papasotiriou I. A Novel Method for Colorectal Cancer Screening Based on Circulating Tumor Cells and Machine Learning. Entropy (Basel) 2021; 23:e23101248. [PMID: 34681972 PMCID: PMC8534570 DOI: 10.3390/e23101248] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 08/12/2021] [Revised: 09/20/2021] [Accepted: 09/21/2021] [Indexed: 02/07/2023]
Abstract
Colorectal cancer (CRC) is one of the most common types of cancer, and it can have a high mortality rate if left untreated or undiagnosed. The fact that CRC becomes symptomatic at advanced stages highlights the importance of early screening. The reference screening method for CRC is colonoscopy, an invasive, time-consuming procedure that requires sedation or anesthesia and is recommended from a certain age and above. The aim of this study was to build a machine learning classifier that can distinguish cancer from non-cancer samples. For this, circulating tumor cells (CTCs) were enumerated using flow cytometry. Their numbers were used as a training set for building an optimized SVM classifier that was subsequently used on a blind set. The SVM classifier's accuracy on the blind samples was found to be 90.0%, sensitivity was 80.0%, specificity was 100.0%, precision was 100.0% and AUC was 0.98. Finally, in order to test the generalizability of our method, we also compared the performance of classifiers built with various machine learning models, using over-sampled datasets generated by the SMOTE algorithm. The results showed that SVM achieved the best performance according to the validation accuracy metric. Overall, our results demonstrate that CTCs enumerated by flow cytometry can provide significant information, which can be used in machine learning algorithms to successfully discriminate between healthy individuals and colorectal cancer patients. The clinical significance of this method could be the development of a simple, fast, non-invasive cancer screening tool based on blood CTC enumeration by flow cytometry and machine learning algorithms.
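The four blind-set figures quoted above all derive from one confusion matrix. The sketch below shows the formulas on a hypothetical blind set of 10 cancer and 10 non-cancer samples, chosen only because it reproduces the reported percentages. The abstract does not state the actual blind-set size, so the counts are an assumption.

```python
def binary_metrics(tp, fp, tn, fn):
    """Standard confusion-matrix metrics for a binary classifier."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),   # true-positive rate (recall)
        "specificity": tn / (tn + fp),   # true-negative rate
        "precision": tp / (tp + fp),     # positive predictive value
    }

# Hypothetical counts: 8 of 10 cancers detected, no false alarms
m = binary_metrics(tp=8, fp=0, tn=10, fn=2)
# accuracy 0.90, sensitivity 0.80, specificity 1.00, precision 1.00
```

With no false positives, specificity and precision are both 100%, while the two missed cancers alone account for the 80% sensitivity and the 90% accuracy.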
Affiliation(s)
- Eleana Hatzidaki
  - Research Genetic Cancer Centre SA (RGCC), 53100 Florina, Greece
- Aggelos Iliopoulos
  - Research Genetic Cancer Centre SA (RGCC), 53100 Florina, Greece
- Ioannis Papasotiriou (corresponding author)
  - Research Genetic Cancer Centre International GmbH, 6300 Zug, Switzerland
50
Aldraimli M, Soria D, Grishchuck D, Ingram S, Lyon R, Mistry A, Oliveira J, Samuel R, Shelley LEA, Osman S, Dwek MV, Azria D, Chang-Claude J, Gutiérrez-Enríquez S, De Santis MC, Rosenstein BS, De Ruysscher D, Sperk E, Symonds RP, Stobart H, Vega A, Veldeman L, Webb A, Talbot CJ, West CM, Rattay T, Chaussalet TJ. A data science approach for early-stage prediction of Patient's susceptibility to acute side effects of advanced radiotherapy. Comput Biol Med 2021; 135:104624. [PMID: 34247131 DOI: 10.1016/j.compbiomed.2021.104624] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/25/2021] [Revised: 06/24/2021] [Accepted: 06/28/2021] [Indexed: 11/20/2022]
Abstract
Predicting, by classification, the incidence of side effects of a given medical treatment is a common challenge in medical research. Machine Learning (ML) methods are widely used in the areas of risk prediction and classification. The primary objective of such algorithms is to use several features to predict dichotomous responses (e.g., disease positive/negative). Similar to statistical inference modelling, ML modelling is subject to the class imbalance problem and is affected by the majority class, increasing the false-negative rate. In this study, seventy-nine ML models were built and evaluated to classify approximately 2000 participants from 26 hospitals in eight different countries into two groups of radiotherapy (RT) side-effect incidence, based on recorded observations from the international study of RT-related toxicity "REQUITE". We also examined the effect of sampling techniques and cost-sensitive learning methods on the models when dealing with class imbalance. The combinations of such techniques used had a significant impact on the classification. They resulted in an improvement in incidence status prediction by shifting classifiers' attention to the minority group. The best classification model for RT acute toxicity prediction was identified based on domain experts' success criteria. The Area Under the Receiver Operating Characteristic curve of the models tested with an isolated dataset ranged from 0.50 to 0.77. The scale of improved results is promising and will guide further development of models to predict RT acute toxicities. One model was optimised and found to be beneficial for identifying patients who are at risk of developing acute early-stage RT toxicities as a result of undergoing breast RT, ensuring that relevant treatment interventions can be appropriately targeted. The design of the approach presented in this paper resulted in a preclinically valid prediction model. The study was developed by a multi-disciplinary collaboration of data scientists, medical physicists, oncologists and surgeons in the UK Radiotherapy Machine Learning Network.
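Cost-sensitive learning, mentioned above as one way of shifting a classifier's attention to the minority group, is often implemented by weighting each class inversely to its frequency. The sketch below uses the common "balanced" heuristic, weight = n_samples / (n_classes * class_count). The function names and the toy labels are hypothetical, and the abstract does not say which weighting scheme the study used.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}

def weighted_error(y_true, y_pred, weights):
    """Misclassification rate with each sample weighted by its class."""
    total = sum(weights[y] for y in y_true)
    wrong = sum(weights[y] for y, p in zip(y_true, y_pred) if y != p)
    return wrong / total

# Hypothetical toy labels: 8 patients without toxicity (0), 2 with toxicity (1)
y_true = [0] * 8 + [1] * 2
weights = balanced_class_weights(y_true)   # {0: 0.625, 1: 2.5}

# A classifier that always predicts the majority class
y_pred = [0] * 10
plain = sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)  # 0.2
costed = weighted_error(y_true, y_pred, weights)                   # 0.5
```

Under the weighted error, ignoring the minority class is no better than chance, which removes the incentive that otherwise inflates the false-negative rate.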
Affiliation(s)
- Mahmoud Aldraimli
  - The Health Innovation Ecosystem, University of Westminster, London, UK
- Daniele Soria
  - School of Computing, University of Kent, Canterbury, UK
- Samuel Ingram
  - Division of Cancer Sciences, Faculty of Biology, Medicine and Health, The University of Manchester, UK
- Robert Lyon
  - Department of Computer Science, Edge Hill University, Ormskirk, Lancashire, UK
- Anil Mistry
  - Guy's and St Thomas' NHS Foundation Trust, London, UK
- Robert Samuel
  - University of Leeds, Leeds Cancer Centre, St. James's University Hospital, Leeds, UK
- Leila E A Shelley
  - Edinburgh Cancer Centre, Western General Hospital, Crewe Road South, Edinburgh, UK
- Sarah Osman
  - Patrick G Johnston Centre for Cancer Research, Queen's University Belfast, Belfast, UK
- Miriam V Dwek
  - School of Life Sciences, University of Westminster, London, UK
- Jenny Chang-Claude
  - German Cancer Research Center (DKFZ) Division of Cancer Epidemiology, Unit of Genetic Epidemiology, Heidelberg, Germany
- Maria Carmen De Santis
  - Dept of Radiation Oncology 1, Fondazione IRCCS Istituto Nazionale dei Tumori, Milan, Italy
- Dirk De Ruysscher
  - Maastricht Radiation Oncology (MAASTRO Clinic) University Hospital Maastricht, the Netherlands
- Elena Sperk
  - Department of Radiation Oncology, University Medical Center Mannheim, Medical Faculty Mannheim, Heidelberg University, Germany
- Ana Vega
  - Fundación Publica Galega Medicina Xenomica, Santiago de Compostela, Spain
- Liv Veldeman
  - Department of Basic Medical Sciences, University Hospital Ghent, Belgium
- Adam Webb
  - Department of Genetics and Genome Biology, University of Leicester, UK
- Catharine M West
  - Institute of Cancer Sciences, Christie Hospital, Wilmslow Road, Manchester, UK
- Tim Rattay
  - Cancer Research Centre, University of Leicester, Leicester, UK