1
Xu J, Ruan X, Yang J, Hu B, Li S, Hu J. SME-MFP: A novel spatiotemporal neural network with multiangle initialization embedding toward multifunctional peptides prediction. Comput Biol Chem 2024; 109:108033. [PMID: 38412804 DOI: 10.1016/j.compbiolchem.2024.108033] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/11/2023] [Revised: 01/09/2024] [Accepted: 02/17/2024] [Indexed: 02/29/2024]
Abstract
As a promising alternative to conventional antibiotic drugs in the biomedical field, functional peptides have been widely used in disease treatment owing to their low toxicity, high absorption rate, and biological activity. Recently, several machine learning methods have been developed for functional peptide prediction. However, most existing research relies heavily on statistical features, and few studies consider multifunctional peptide identification. We therefore propose SME-MFP, a novel predictor for imbalanced multi-label functional peptide datasets. First, we employ physicochemical and evolutionary information to represent the initialization features of the peptide sequence from multiple perspectives. Second, the features are fused and fed into spatial feature extractors, where a residual connection and a multiscale convolutional neural network extract more discriminative features from peptide sequences of different lengths. We also design AFT-based temporal feature extractors to fully capture the global interactions of the sequences. Finally, we devise a new loss to replace the traditional cross-entropy loss and address the class imbalance problem. The results show that our framework not only enhances the model's ability to capture sequence features effectively, but also improves accuracy by 3.89% over existing methods on public peptide datasets.
Affiliation(s)
- Jing Xu
  - State Key Laboratory of Public Big Data, Guizhou University, Guiyang 550025, China
- Xiaoli Ruan
  - State Key Laboratory of Public Big Data, Guizhou University, Guiyang 550025, China
- Jing Yang
  - State Key Laboratory of Public Big Data, Guizhou University, Guiyang 550025, China
- Bingqi Hu
  - State Key Laboratory of Public Big Data, Guizhou University, Guiyang 550025, China
- Shaobo Li
  - State Key Laboratory of Public Big Data, Guizhou University, Guiyang 550025, China
- Jianjun Hu
  - Department of Computer Science and Engineering, University of South Carolina, Columbia 29208, USA
2
Liu CL, Lee MH, Hsueh SN, Chung CC, Lin CJ, Chang PH, Luo AC, Weng HC, Lee YH, Dai MJ, Tsai MJ. A bagging approach for improved predictive accuracy of intradialytic hypotension during hemodialysis treatment. Comput Biol Med 2024; 172:108244. [PMID: 38457931 DOI: 10.1016/j.compbiomed.2024.108244] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/19/2023] [Revised: 02/24/2024] [Accepted: 03/04/2024] [Indexed: 03/10/2024]
Abstract
The primary objective of this study is to enhance the prediction accuracy of intradialytic hypotension in patients undergoing hemodialysis. A significant challenge in this context is that the data derived from the monitoring devices exhibit extreme class imbalance. Traditional predictive models often display a bias towards the majority class, compromising the accuracy of minority class predictions. Therefore, we introduce a method called UnderXGBoost. This novel methodology combines under-sampling, bagging, and XGBoost to balance the dataset and improve predictive accuracy for the minority class. The method is characterized by its straightforward implementation and training efficiency. Empirical validation on a real-world dataset confirms the superior performance of UnderXGBoost compared to existing models in predicting intradialytic hypotension. Furthermore, our approach demonstrates versatility, allowing XGBoost to be substituted with other classifiers while still producing promising results. Sensitivity analysis was performed to assess the model's robustness, reinforce its reliability, and indicate its applicability to a broader range of medical scenarios facing similar challenges of data imbalance. Our model aims to enable medical professionals to provide preemptive treatments more effectively, thereby improving patient care and prognosis. This study contributes a novel and effective solution to a critical issue in medical prediction, thus broadening the application spectrum of predictive modeling in the healthcare domain.
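The combination of under-sampling and bagging described in this abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: a toy nearest-centroid learner stands in for XGBoost (the abstract notes that other classifiers can be substituted), the minority class is assumed to carry label 1, and the names `balanced_bag`, `fit_centroids`, and `n_bags` are hypothetical.

```python
import random
from statistics import mean

def fit_centroids(X, y):
    """Toy base learner (stand-in for XGBoost): one centroid per class."""
    centroids = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [mean(col) for col in zip(*rows)]
    return centroids

def predict_centroids(centroids, x):
    """Predict the class whose centroid is closest in squared distance."""
    def sqdist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: sqdist(centroids[label]))

def balanced_bag(X, y, n_bags=5, seed=0):
    """Train n_bags learners, each on all minority samples plus an
    equally sized random subset of the majority class."""
    rng = random.Random(seed)
    pos = [i for i, t in enumerate(y) if t == 1]  # minority (assumed label 1)
    neg = [i for i, t in enumerate(y) if t == 0]  # majority
    models = []
    for _ in range(n_bags):
        idx = pos + rng.sample(neg, len(pos))     # under-sample the majority
        models.append(fit_centroids([X[i] for i in idx], [y[i] for i in idx]))
    return models

def predict_bag(models, x):
    """Majority vote over the ensemble."""
    votes = [predict_centroids(m, x) for m in models]
    return int(sum(votes) * 2 > len(votes))

# Toy imbalanced data: 95 majority points near 0, 5 minority points near 3.
data_rng = random.Random(1)
X = [[data_rng.gauss(0, 1)] for _ in range(95)] + \
    [[data_rng.gauss(3, 0.5)] for _ in range(5)]
y = [0] * 95 + [1] * 5
models = balanced_bag(X, y)
```

Each bag sees a balanced sample, so no single learner is dominated by the majority class, while the ensemble as a whole still draws on many different majority samples.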
Affiliation(s)
- Chien-Liang Liu
  - Department of Industrial Engineering and Management, National Yang Ming Chiao Tung University, No. 1001, Daxue Rd. East Dist., Hsinchu, 30010, Taiwan, ROC
- Min-Hsuan Lee
  - Department of Industrial Engineering and Management, National Yang Ming Chiao Tung University, No. 1001, Daxue Rd. East Dist., Hsinchu, 30010, Taiwan, ROC
- Shan-Ni Hsueh
  - Department of Industrial Engineering and Management, National Yang Ming Chiao Tung University, No. 1001, Daxue Rd. East Dist., Hsinchu, 30010, Taiwan, ROC
- Chia-Chen Chung
  - Department of Industrial Engineering and Management, National Yang Ming Chiao Tung University, No. 1001, Daxue Rd. East Dist., Hsinchu, 30010, Taiwan, ROC
- Chun-Ju Lin
  - Industrial Technology Research Institute, 195, Sec. 4, Chung Hsing Rd., Chutung, Hsinchu County, 310401, Taiwan, ROC
- Po-Han Chang
  - Industrial Technology Research Institute, 195, Sec. 4, Chung Hsing Rd., Chutung, Hsinchu County, 310401, Taiwan, ROC
- An-Chun Luo
  - Industrial Technology Research Institute, 195, Sec. 4, Chung Hsing Rd., Chutung, Hsinchu County, 310401, Taiwan, ROC
- Hsuan-Chi Weng
  - Industrial Technology Research Institute, 195, Sec. 4, Chung Hsing Rd., Chutung, Hsinchu County, 310401, Taiwan, ROC
- Yu-Hsien Lee
  - Industrial Technology Research Institute, 195, Sec. 4, Chung Hsing Rd., Chutung, Hsinchu County, 310401, Taiwan, ROC
- Ming-Ji Dai
  - Industrial Technology Research Institute, 195, Sec. 4, Chung Hsing Rd., Chutung, Hsinchu County, 310401, Taiwan, ROC
- Min-Juei Tsai
  - Department of Nephrology, Chang-Hua Hospital, Ministry of Health and Welfare, Changhua, No. 80, Sec. 2, Zhongzheng Rd., Puxin Township, Changhua County, 513007, Taiwan, ROC
3
Cusworth S, Gkoutos GV, Acharjee A. A novel generative adversarial networks modelling for the class imbalance problem in high dimensional omics data. BMC Med Inform Decis Mak 2024; 24:90. [PMID: 38549123 PMCID: PMC10979623 DOI: 10.1186/s12911-024-02487-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/02/2023] [Accepted: 03/22/2024] [Indexed: 04/01/2024] Open
Abstract
Class imbalance remains a large problem in high-throughput omics analyses, causing bias towards the over-represented class when training machine learning-based classifiers. Oversampling is a common method used to balance classes, allowing for better generalization of the training data. More naive approaches can introduce other biases into the data, being especially sensitive to inaccuracies in the training data, a problem considering the characteristically noisy data obtained in healthcare. This is especially a problem with high-dimensional data. A generative adversarial network-based method is proposed for creating synthetic samples from small, high-dimensional data, to improve upon other more naive generative approaches. The method was compared with 'synthetic minority over-sampling technique' (SMOTE) and 'random oversampling' (RO). Generative methods were validated by training classifiers on the balanced data.
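The two baselines named in this abstract, random oversampling (RO) and SMOTE, differ in how they create new minority samples. The sketch below is a simplified pure-Python illustration of that difference, not the reference SMOTE implementation and not the GAN-based method the paper proposes. The function names, the toy data, and the neighbour count `k` are assumptions.

```python
import random

def random_oversample(minority, n_new, rng):
    """Random oversampling (RO): duplicate existing minority samples."""
    return [list(rng.choice(minority)) for _ in range(n_new)]

def smote_like(minority, n_new, k, rng):
    """SMOTE-style oversampling: interpolate between a minority sample
    and one of its k nearest minority neighbours."""
    def sqdist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted((m for m in minority if m is not base),
                            key=lambda m: sqdist(base, m))[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic

rng = random.Random(0)
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
duplicates = random_oversample(minority, 6, rng)   # exact copies
synthetic = smote_like(minority, 6, k=2, rng=rng)  # new interpolated points
```

Because the interpolated points are built directly from existing samples, any noise in those samples carries over into the synthetic data, which is the sensitivity the abstract attributes to more naive approaches.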
Affiliation(s)
- Samuel Cusworth
  - Institute of Applied Health Research, University of Birmingham, Birmingham, UK
  - NIHR Blood and Transplant Research Unit (BTRU) in Precision Transplant and Cellular Therapeutics, University of Birmingham, Birmingham, UK
- Georgios V Gkoutos
  - College of Medical and Dental Sciences, Institute of Cancer and Genomic Sciences, University of Birmingham, B15 2TT, Birmingham, UK
  - Institute of Translational Medicine, University Hospitals Birmingham NHS Foundation Trust, B15 2TT, Birmingham, UK
  - MRC Health Data Research UK (HDR), Midlands Site, UK
  - Centre for Health Data Research, University of Birmingham, B15 2TT, Birmingham, UK
  - NIHR Experimental Cancer Medicine Centre, B15 2TT, Birmingham, UK
- Animesh Acharjee
  - College of Medical and Dental Sciences, Institute of Cancer and Genomic Sciences, University of Birmingham, B15 2TT, Birmingham, UK
  - Institute of Translational Medicine, University Hospitals Birmingham NHS Foundation Trust, B15 2TT, Birmingham, UK
  - MRC Health Data Research UK (HDR), Midlands Site, UK
  - Centre for Health Data Research, University of Birmingham, B15 2TT, Birmingham, UK
4
Zhang S, Zhu C, Li H, Cai J, Yang L. Gradient-aware learning for joint biases: Label noise and class imbalance. Neural Netw 2024; 171:374-382. [PMID: 38134600 DOI: 10.1016/j.neunet.2023.12.028] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/09/2023] [Revised: 12/04/2023] [Accepted: 12/16/2023] [Indexed: 12/24/2023]
Abstract
Data biases such as class imbalance and label noise are pervasive in real-world large-scale datasets and pose major challenges to deep learning methods. Some previous works adopted loss re-weighting, sample re-weighting, or data-dependent regularization to mitigate the influence of these training biases. However, when class imbalance and label noise exist in the training set simultaneously, these methods usually pay more attention to the class imbalance problem and may overfit noisy labels, which leads to a great degradation in performance. In this paper, we propose a gradient-aware learning method for the combination of the two biases. During the training process, we update only a subset of crucial parameters regularly and rectify the update direction of the remaining redundant parameters. This update rule is applied in both the encoder and the classifier of the deep network to decouple label noise and class imbalance implicitly. The experimental results verify the effectiveness of the proposed method on synthetic and real-world data biases.
Affiliation(s)
- Shichuan Zhang
  - Zhejiang University, Hangzhou, Zhejiang 310027, China; School of Engineering, Westlake University, Hangzhou, Zhejiang 310030, China
- Chenglu Zhu
  - School of Engineering, Westlake University, Hangzhou, Zhejiang 310030, China
- Honglin Li
  - Zhejiang University, Hangzhou, Zhejiang 310027, China; School of Engineering, Westlake University, Hangzhou, Zhejiang 310030, China
- Jiatong Cai
  - School of Engineering, Westlake University, Hangzhou, Zhejiang 310030, China
- Lin Yang
  - School of Engineering, Westlake University, Hangzhou, Zhejiang 310030, China
5
Li Y, Wang Y, Lin G, Huang Y, Liu J, Lin Y, Wei D, Zhang Q, Ma K, Zhang Z, Lu G, Zheng Y. Triplet-branch network with contrastive prior-knowledge embedding for disease grading. Artif Intell Med 2024; 149:102801. [PMID: 38462290 DOI: 10.1016/j.artmed.2024.102801] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/22/2023] [Revised: 11/28/2023] [Accepted: 02/03/2024] [Indexed: 03/12/2024]
Abstract
Since different disease grades require different treatments from physicians (e.g., low-grade patients may recover with follow-up observation, whereas high-grade patients may need immediate surgery), the accuracy of disease grading is pivotal in clinical practice. In this paper, we propose a Triplet-Branch Network with ContRastive priOr-knoWledge embeddiNg (TBN-CROWN) for accurate disease grading, which enables physicians to choose appropriate treatments accordingly. Specifically, TBN-CROWN has three branches, which are implemented for representation learning, classifier learning, and grade-related prior-knowledge learning, respectively. The first two branches deal with the issue of class-imbalanced training samples, while the third embeds the grade-related prior knowledge via a novel auxiliary module, termed the contrastive embedding module. The proposed auxiliary module takes the features embedded by different branches as input and constructs positive and negative embeddings so that the model can exploit grade-related prior knowledge via contrastive learning. Extensive experiments on our private dataset and two publicly available disease grading datasets show that TBN-CROWN can effectively tackle the class-imbalance problem and yield satisfactory grading accuracy for various diseases, such as fatigue fracture, ulcerative colitis, and diabetic retinopathy.
Affiliation(s)
- Yuexiang Li
  - Medical AI ReSearch (MARS) Group, Center for Genomic and Personalized Medicine, Guangxi Key Laboratory for Genomic and Personalized Medicine, Guangxi Collaborative Innovation Center for Genomic and Personalized Medicine, Guangxi Medical University, Nanning, 530021, PR China
- Yanping Wang
  - Department of Diagnostic Radiology, Jinling Hospital, Medical School of Nanjing University, Nanjing, PR China
- Guang Lin
  - Department of Diagnostic Radiology, Jinling Hospital, Medical School of Nanjing University, Nanjing, PR China
- Yawen Huang
  - Jarvis Research Center, Tencent YouTu Lab, Shenzhen, 518057, PR China
- Jingxin Liu
  - School of AI and Advanced Computing, Xi'an Jiaotong-Liverpool University, Suzhou, PR China
- Yi Lin
  - Jarvis Research Center, Tencent YouTu Lab, Shenzhen, 518057, PR China
- Dong Wei
  - Jarvis Research Center, Tencent YouTu Lab, Shenzhen, 518057, PR China
- Qirui Zhang
  - Department of Diagnostic Radiology, Jinling Hospital, Medical School of Nanjing University, Nanjing, PR China
- Kai Ma
  - Jarvis Research Center, Tencent YouTu Lab, Shenzhen, 518057, PR China
- Zhiqiang Zhang
  - Department of Diagnostic Radiology, Jinling Hospital, Medical School of Nanjing University, Nanjing, PR China
- Guangming Lu
  - Department of Diagnostic Radiology, Jinling Hospital, Medical School of Nanjing University, Nanjing, PR China
- Yefeng Zheng
  - Jarvis Research Center, Tencent YouTu Lab, Shenzhen, 518057, PR China
6
Lee SJ, Oh HJ, Son YD, Kim JH, Kwon IJ, Kim B, Lee JH, Kim HK. Enhancing deep learning classification performance of tongue lesions in imbalanced data: mosaic-based soft labeling with curriculum learning. BMC Oral Health 2024; 24:161. [PMID: 38302981 PMCID: PMC10832072 DOI: 10.1186/s12903-024-03898-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/24/2023] [Accepted: 01/15/2024] [Indexed: 02/03/2024] Open
Abstract
BACKGROUND Oral potentially malignant disorders (OPMDs) are associated with an increased risk of cancer of the oral cavity including the tongue. The early detection of oral cavity cancers and OPMDs is critical for reducing cancer-specific morbidity and mortality. Recently, there have been studies to apply the rapidly advancing technology of deep learning for diagnosing oral cavity cancer and OPMDs. However, several challenging issues such as class imbalance must be resolved to effectively train a deep learning model for medical imaging classification tasks. The aim of this study is to evaluate a new technique of artificial intelligence to improve the classification performance in an imbalanced tongue lesion dataset. METHODS A total of 1,810 tongue images were used for the classification. The class-imbalanced dataset consisted of 372 instances of cancer, 141 instances of OPMDs, and 1,297 instances of noncancerous lesions. The EfficientNet model was used as the feature extraction model for classification. Mosaic data augmentation, soft labeling, and curriculum learning (CL) were employed to improve the classification performance of the convolutional neural network. RESULTS Utilizing a mosaic-augmented dataset in conjunction with CL, the final model achieved an accuracy rate of 0.9444, surpassing conventional oversampling and weight balancing methods. The relative precision improvement rate for the minority class OPMD was 21.2%, while the relative [Formula: see text] score improvement rate of OPMD was 4.9%. CONCLUSIONS The present study demonstrates that the integration of mosaic-based soft labeling and curriculum learning improves the classification performance of tongue lesions compared to previous methods, establishing a foundation for future research on effectively learning from imbalanced data.
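To make the degree of imbalance in this dataset concrete, the class counts reported in METHODS can be turned into inverse-frequency class weights, a common form of the weight balancing baseline the abstract compares against. The abstract does not state which weighting formula the authors used. The heuristic below, `n_samples / (n_classes * n_c)`, is an assumption chosen because it is widely used.

```python
# Class counts as reported in the abstract (1,810 images in total).
counts = {"cancer": 372, "OPMD": 141, "noncancerous": 1297}

total = sum(counts.values())
n_classes = len(counts)

# Inverse-frequency weights (assumed heuristic): rarer classes weigh more.
weights = {name: total / (n_classes * n) for name, n in counts.items()}

# Ratio between the largest and smallest class.
imbalance_ratio = max(counts.values()) / min(counts.values())

for name, w in weights.items():
    print(f"{name}: weight {w:.4f}")
print(f"imbalance ratio: {imbalance_ratio:.2f}")
```

Under this heuristic the OPMD class receives a weight of about 4.28 and the noncancerous class about 0.47, reflecting a roughly 9:1 ratio between the largest and smallest classes.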
Affiliation(s)
- Sung-Jae Lee
  - Department of Biomedical Engineering, College of IT Convergence, Gachon University, Seongnam, Republic of Korea
- Hyun Jun Oh
  - Oral Oncology Clinic, National Cancer Center, Goyang, Republic of Korea
- Young-Don Son
  - Department of Biomedical Engineering, College of IT Convergence, Gachon University, Seongnam, Republic of Korea
  - Neuroscience Research Institute, Gachon Advanced Institute for Health Science and Technology, Gachon University, Incheon, Republic of Korea
- Jong-Hoon Kim
  - Neuroscience Research Institute, Gachon Advanced Institute for Health Science and Technology, Gachon University, Incheon, Republic of Korea
  - Department of Psychiatry, Gachon University College of Medicine, Gil Medical Center, Incheon, Republic of Korea
- Ik-Jae Kwon
  - Department of Oral and Maxillofacial Surgery, Seoul National University Dental Hospital, Seoul, Republic of Korea
  - Dental Research Institute, Seoul National University, Seoul, Republic of Korea
- Bongju Kim
  - Dental Life Science Research Institute, Seoul National University Dental Hospital, Seoul, Republic of Korea
- Jong-Ho Lee
  - Oral Oncology Clinic, National Cancer Center, Goyang, Republic of Korea
  - Dental Life Science Research Institute, Seoul National University Dental Hospital, Seoul, Republic of Korea
- Hang-Keun Kim
  - Department of Biomedical Engineering, College of IT Convergence, Gachon University, Seongnam, Republic of Korea
  - Neuroscience Research Institute, Gachon Advanced Institute for Health Science and Technology, Gachon University, Incheon, Republic of Korea
7
Hong C, Liu M, Wojdyla DM, Hickey J, Pencina M, Henao R. Trans-Balance: Reducing demographic disparity for prediction models in the presence of class imbalance. J Biomed Inform 2024; 149:104532. [PMID: 38070817 PMCID: PMC10850917 DOI: 10.1016/j.jbi.2023.104532] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/26/2023] [Revised: 10/21/2023] [Accepted: 10/28/2023] [Indexed: 12/21/2023]
Abstract
INTRODUCTION Risk prediction, including early disease detection, prevention, and intervention, is essential to precision medicine. However, systematic bias in risk estimation caused by heterogeneity across different demographic groups can lead to inappropriate or misinformed treatment decisions. In addition, low incidence (class-imbalance) outcomes negatively impact the classification performance of many standard learning algorithms which further exacerbates the racial disparity issues. Therefore, it is crucial to improve the performance of statistical and machine learning models in underrepresented populations in the presence of heavy class imbalance. METHOD To address demographic disparity in the presence of class imbalance, we develop a novel framework, Trans-Balance, by leveraging recent advances in imbalance learning, transfer learning, and federated learning. We consider a practical setting where data from multiple sites are stored locally under privacy constraints. RESULTS We show that the proposed Trans-Balance framework improves upon existing approaches by explicitly accounting for heterogeneity across demographic subgroups and cohorts. We demonstrate the feasibility and validity of our methods through numerical experiments and a real application to a multi-cohort study with data from participants of four large, NIH-funded cohorts for stroke risk prediction. CONCLUSION Our findings indicate that the Trans-Balance approach significantly improves predictive performance, especially in scenarios marked by severe class imbalance and demographic disparity. Given its versatility and effectiveness, Trans-Balance offers a valuable contribution to enhancing risk prediction in biomedical research and related fields.
Affiliation(s)
- Chuan Hong
  - Duke University, Department of Biostatistics and Bioinformatics, Durham, NC, USA; Duke Clinical Research Institute, Durham, NC, USA
- Molei Liu
  - Columbia University, Department of Biostatistics, New York, NY, USA
- Jimmy Hickey
  - North Carolina State University, Department of Statistics, Raleigh, NC, USA
- Michael Pencina
  - Duke University, Department of Biostatistics and Bioinformatics, Durham, NC, USA; Duke Clinical Research Institute, Durham, NC, USA
- Ricardo Henao
  - Duke University, Department of Biostatistics and Bioinformatics, Durham, NC, USA; Duke Clinical Research Institute, Durham, NC, USA
8
Li X, Wu Q, Wang M, Wu K. Uncertainty-aware network for fine-grained and imbalanced reflux esophagitis grading. Comput Biol Med 2024; 168:107751. [PMID: 38016373 DOI: 10.1016/j.compbiomed.2023.107751] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/15/2023] [Revised: 10/22/2023] [Accepted: 11/20/2023] [Indexed: 11/30/2023]
Abstract
Computer-aided diagnosis (CAD) assists endoscopists in analyzing endoscopic images, reducing misdiagnosis rates and enabling timely treatment. A few studies have focused on CAD for gastroesophageal reflux disease, but CAD studies on reflux esophagitis (RE) are still inadequate. This paper presents a CAD study on RE using a dataset of over 3000 images collected from a hospital. We propose an uncertainty-aware network with handcrafted features, utilizing representation and classifier decoupling with metric learning to address class imbalance and achieve fine-grained RE classification. To enhance interpretability, the network estimates uncertainty through test-time augmentation. The experimental results demonstrate that the proposed network surpasses previous methods, achieving an accuracy of 90.2% and an F1 score of 90.1%.
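Uncertainty estimation through test-time augmentation, as mentioned in this abstract, can be sketched as follows. The abstract does not say which uncertainty measure the authors use. This example assumes the entropy of the class probabilities averaged over augmented views, and `toy_model`, the flip augmentation, and the one-dimensional "image" are hypothetical stand-ins.

```python
import math

def tta_predict(model, image, augmentations):
    """Average class probabilities over augmented views of one input and
    report the entropy of the averaged distribution as uncertainty."""
    probs = [model(aug(image)) for aug in augmentations]
    n_classes = len(probs[0])
    mean_probs = [sum(p[c] for p in probs) / len(probs)
                  for c in range(n_classes)]
    entropy = -sum(p * math.log(p) for p in mean_probs if p > 0)
    return mean_probs, entropy

def toy_model(image):
    """Hypothetical two-class classifier: softmax over the first and
    last values of a 1-D 'image'."""
    scores = [image[0], image[-1]]
    exps = [math.exp(s) for s in scores]
    return [e / sum(exps) for e in exps]

# Two views: the original input and a flipped copy.
augmentations = [lambda img: img, lambda img: img[::-1]]
mean_probs, entropy = tta_predict(toy_model, [2.0, 0.0, -2.0], augmentations)
```

In this toy case the prediction flips when the input is flipped, so the averaged probabilities are split evenly and the entropy reaches its two-class maximum of ln 2, flagging the input as uncertain.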
Affiliation(s)
- Xingcun Li
  - School of Management, Huazhong University of Science and Technology, Wuhan, 430074, China
- Qinghua Wu
  - School of Management, Huazhong University of Science and Technology, Wuhan, 430074, China
- Mi Wang
  - Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, 430030, China
- Kun Wu
  - Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, 430030, China
9
Alkhawaldeh IM, Albalkhi I, Naswhan AJ. Challenges and limitations of synthetic minority oversampling techniques in machine learning. World J Methodol 2023; 13:373-378. [PMID: 38229946 PMCID: PMC10789107 DOI: 10.5662/wjm.v13.i5.373] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/21/2023] [Revised: 09/30/2023] [Accepted: 11/03/2023] [Indexed: 12/20/2023] Open
Abstract
Oversampling is the most widely used approach for dealing with class-imbalanced datasets, as seen in the plethora of oversampling methods developed in the last two decades. In this editorial, we discuss issues with oversampling that stem from the possibility of overfitting and the generation of synthetic cases that might not accurately represent the minority class. These limitations should be considered when using oversampling techniques. We also propose several alternative strategies for dealing with imbalanced data, as well as a perspective on future work.
Affiliation(s)
- Ibrahem Albalkhi
  - Department of Neuroradiology, Alfaisal University, Great Ormond Street Hospital NHS Foundation Trust, London WC1N 3JH, United Kingdom
10
Wang X, Qiao Y, Cui Y, Ren H, Zhao Y, Linghu L, Ren J, Zhao Z, Chen L, Qiu L. An explainable artificial intelligence framework for risk prediction of COPD in smokers. BMC Public Health 2023; 23:2164. [PMID: 37932692 PMCID: PMC10626705 DOI: 10.1186/s12889-023-17011-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/12/2023] [Accepted: 10/17/2023] [Indexed: 11/08/2023] Open
Abstract
BACKGROUND Owing to the inconspicuous nature of early signs associated with Chronic Obstructive Pulmonary Disease (COPD), affected individuals often remain unidentified, leading to suboptimal opportunities for timely prevention and treatment. The purpose of this study was to create an explainable artificial intelligence framework combining data preprocessing methods, machine learning methods, and model interpretability methods to identify people at high risk of COPD in the smoking population and to provide a reasonable interpretation of model predictions. METHODS The data comprised questionnaire information, physical examination data and results of pulmonary function tests before and after bronchodilatation. First, the factorial analysis for mixed data (FAMD), Boruta and NRSBoundary-SMOTE resampling methods were used to address the missing data, high dimensionality and category imbalance problems. Then, seven classification models (CatBoost, NGBoost, XGBoost, LightGBM, random forest, SVM and logistic regression) were applied to model the risk level, and the best machine learning (ML) model's decisions were explained using the Shapley additive explanations (SHAP) method and partial dependence plot (PDP). RESULTS In the smoking population, age and 14 other variables were significant factors for predicting COPD. The CatBoost, random forest, and logistic regression models performed reasonably well in unbalanced datasets. CatBoost with NRSBoundary-SMOTE had the best classification performance in balanced datasets when composite indicators (the AUC, F1-score, and G-mean) were used as model comparison criteria. Age, COPD Assessment Test (CAT) score, gross annual income, body mass index (BMI), systolic blood pressure (SBP), diastolic blood pressure (DBP), anhelation, respiratory disease, central obesity, use of polluting fuel for household heating, region, use of polluting fuel for household cooking, and wheezing were important factors for predicting COPD in the smoking population.
CONCLUSION This study combined feature screening methods, unbalanced data processing methods, and advanced machine learning methods to enable early identification of COPD risk groups in the smoking population. COPD risk factors in the smoking population were identified using SHAP and PDP, with the goal of providing theoretical support for targeted screening strategies and smoking population self-management strategies.
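Two of the composite indicators named in RESULTS, the F1-score and the G-mean, can be computed directly from a binary confusion matrix. The sketch below uses hypothetical counts that do not come from the study. AUC is omitted because it needs ranked prediction scores rather than counts.

```python
import math

def composite_metrics(tp, fn, fp, tn):
    """F1-score and G-mean from binary confusion-matrix counts."""
    sensitivity = tp / (tp + fn)          # recall on the positive class
    specificity = tn / (tn + fp)          # recall on the negative class
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    g_mean = math.sqrt(sensitivity * specificity)
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    return {"accuracy": accuracy, "sensitivity": sensitivity,
            "specificity": specificity, "f1": f1, "g_mean": g_mean}

# Hypothetical imbalanced test set: 50 positives, 950 negatives.
metrics = composite_metrics(tp=30, fn=20, fp=45, tn=905)
```

With these counts, accuracy is 0.935 even though only 60% of positives are found, while F1 (0.48) and G-mean (about 0.756) expose the weaker minority-class performance. This is why such indicators are preferred for comparing models on imbalanced data.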
Affiliation(s)
- Xuchun Wang
  - Department of Health Statistics, School of Public Health, Shanxi Medical University, 56 South XinJian Road, Taiyuan, 030001, P.R. China
- Yuchao Qiao
  - Department of Health Statistics, School of Public Health, Shanxi Medical University, 56 South XinJian Road, Taiyuan, 030001, P.R. China
- Yu Cui
  - Department of Health Statistics, School of Public Health, Shanxi Medical University, 56 South XinJian Road, Taiyuan, 030001, P.R. China
- Hao Ren
  - Department of Health Statistics, School of Public Health, Shanxi Medical University, 56 South XinJian Road, Taiyuan, 030001, P.R. China
- Ying Zhao
  - Shanxi Centre for Disease Control and Prevention, Taiyuan, Shanxi, 030012, China
- Liqin Linghu
  - Department of Health Statistics, School of Public Health, Shanxi Medical University, 56 South XinJian Road, Taiyuan, 030001, P.R. China
  - Shanxi Centre for Disease Control and Prevention, Taiyuan, Shanxi, 030012, China
- Jiahui Ren
  - Department of Health Statistics, School of Public Health, Shanxi Medical University, 56 South XinJian Road, Taiyuan, 030001, P.R. China
- Zhiyang Zhao
  - Department of Health Statistics, School of Public Health, Shanxi Medical University, 56 South XinJian Road, Taiyuan, 030001, P.R. China
- Limin Chen
  - The Fifth Hospital (Shanxi People's Hospital) of Shanxi Medical University, Taiyuan, Shanxi, 030012, P.R. China
- Lixia Qiu
  - Department of Health Statistics, School of Public Health, Shanxi Medical University, 56 South XinJian Road, Taiyuan, 030001, P.R. China
11
Abbas Q, Malik KM, Saudagar AKJ, Khan MB. Context-aggregator: An approach of loss- and class imbalance-aware aggregation in federated learning. Comput Biol Med 2023; 163:107167. [PMID: 37421740 DOI: 10.1016/j.compbiomed.2023.107167] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/15/2022] [Revised: 05/26/2023] [Accepted: 06/08/2023] [Indexed: 07/10/2023]
Abstract
Federated Learning (FL) is an emerging distributed learning paradigm that offers data privacy to contributing nodes in a collaborative environment. In an FL setting, the individual datasets of different hospitals could be used to develop reliable screening, diagnosis, and treatment predictive models to tackle major challenges such as pandemics. FL can enable the development of very diverse medical imaging datasets and thus provide more reliable models for all participating nodes, including those with low-quality data. However, the traditional FL paradigm suffers from degraded generalization power due to poorly trained local models at the client nodes. The generalization power of the FL paradigm can be improved by considering the relative learning contribution of client nodes. Simple aggregation of learning parameters in the standard FL model faces a diversity issue and results in higher validation loss during the learning process. This issue can be resolved by considering the relative contribution of each client node participating in the learning process. Class imbalance at each site is another significant challenge that greatly impacts the performance of the aggregated learning model. This work proposes Context Aggregator FL, which addresses the loss-factor and class-imbalance issues by incorporating the relative contribution of the collaborating nodes through a Validation-Loss based Context Aggregator (CAVL) and a Class Imbalance based Context Aggregator (CACI). The proposed Context Aggregator is evaluated on several different Covid-19 imaging classification datasets present on participating nodes. The evaluation results show that Context Aggregator performs better than standard federated averaging learning algorithms and the FedProx algorithm on Covid-19 image classification problems.
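The idea of weighting client contributions during aggregation can be illustrated with a small sketch. The abstract does not give the CAVL or CACI formulas, so the weighting below (inverse validation loss) is purely an assumption used to contrast with plain sample-count averaging. All names and numbers are hypothetical.

```python
def aggregate(client_params, weights):
    """Weighted average of client parameter vectors."""
    total = sum(weights)
    dim = len(client_params[0])
    return [sum(w * p[i] for w, p in zip(weights, client_params)) / total
            for i in range(dim)]

# Three clients with equal data sizes; the third is poorly trained.
client_params = [[1.0, 2.0], [3.0, 4.0], [10.0, 10.0]]
n_samples = [100, 100, 100]
val_losses = [0.5, 0.5, 2.0]

# Standard federated averaging: weight each client by its sample count.
fedavg = aggregate(client_params, n_samples)

# Assumed loss-aware variant: weight each client by 1 / validation loss,
# so poorly trained clients contribute less to the global model.
loss_aware = aggregate(client_params, [1.0 / loss for loss in val_losses])
```

In this toy case the high-loss client pulls the plain average towards its outlying parameters, whereas the loss-aware weights reduce its influence. A class-imbalance-aware aggregator could follow the same pattern, with weights derived from each client's local class distribution.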
Affiliation(s)
- Qamar Abbas
  - Department of Computer Science, Faculty of Computing and Information Technology, International Islamic University, Islamabad, 44000, Pakistan
- Khalid Mahmood Malik
  - Department of Computer Science and Engineering, Oakland University, Rochester, MI, 48309-4401, USA.
- Abdul Khader Jilani Saudagar
  - Information Systems Department, College of Computer and Information Sciences, Imam Mohammad Ibn Saud Islamic University, Riyadh, 11432, Saudi Arabia
- Muhammad Badruddin Khan
  - Information Systems Department, College of Computer and Information Sciences, Imam Mohammad Ibn Saud Islamic University, Riyadh, 11432, Saudi Arabia
|
12
|
Thölke P, Mantilla-Ramos YJ, Abdelhedi H, Maschke C, Dehgan A, Harel Y, Kemtur A, Mekki Berrada L, Sahraoui M, Young T, Bellemare Pépin A, El Khantour C, Landry M, Pascarella A, Hadid V, Combrisson E, O'Byrne J, Jerbi K. Class imbalance should not throw you off balance: Choosing the right classifiers and performance metrics for brain decoding with imbalanced data. Neuroimage 2023:120253. [PMID: 37385392 DOI: 10.1016/j.neuroimage.2023.120253] [Received: 04/18/2023] [Revised: 06/05/2023] [Accepted: 06/26/2023] [Indexed: 07/01/2023]
Abstract
Machine learning (ML) is increasingly used in cognitive, computational and clinical neuroscience. The reliable and efficient application of ML requires a sound understanding of its subtleties and limitations. Training ML models on datasets with imbalanced classes is a particularly common problem, and it can have severe consequences if not adequately addressed. With the neuroscience ML user in mind, this paper provides a didactic assessment of the class imbalance problem and illustrates its impact through systematic manipulation of data imbalance ratios in (i) simulated data and (ii) brain data recorded with electroencephalography (EEG), magnetoencephalography (MEG) and functional magnetic resonance imaging (fMRI). Our results illustrate how the widely used Accuracy (Acc) metric, which measures the overall proportion of successful predictions, yields misleadingly high performance as class imbalance increases. Because Acc weights the per-class ratios of correct predictions proportionally to class size, it largely disregards the performance on the minority class. A binary classification model that learns to systematically vote for the majority class will yield an artificially high decoding accuracy that directly reflects the imbalance between the two classes, rather than any genuine generalizable ability to discriminate between them. We show that other evaluation metrics, such as the Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC) and the less common Balanced Accuracy (BAcc) metric, defined as the arithmetic mean of sensitivity and specificity, provide more reliable performance evaluations for imbalanced data. Our findings also highlight the robustness of Random Forest (RF), and the benefits of using stratified cross-validation and hyperparameter optimization to tackle data imbalance. 
Critically, for neuroscience ML applications that seek to minimize overall classification error, we recommend the routine use of BAcc, which in the specific case of balanced data is equivalent to using standard Acc, and readily extends to multi-class settings. Importantly, we present a list of recommendations for dealing with imbalanced data, as well as open-source code to allow the neuroscience community to replicate and extend our observations and explore alternative approaches to coping with imbalanced data.
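To make the contrast between Acc and BAcc concrete, here is a minimal sketch in plain NumPy. The function names and the 90/10 toy labels are illustrative assumptions and are not taken from the paper's open-source code.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Overall proportion of correct predictions."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of per-class recall.

    For two classes this is the arithmetic mean of sensitivity and specificity.
    """
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(recalls))

# Toy data: 90 majority-class samples, 10 minority-class samples,
# and a classifier that always votes for the majority class.
y_true = np.array([0] * 90 + [1] * 10)
y_pred = np.zeros(100, dtype=int)

acc = accuracy(y_true, y_pred)            # 0.9: mirrors the class ratio
bacc = balanced_accuracy(y_true, y_pred)  # 0.5: chance level for two classes
```

Because the per-class recalls are averaged without weighting by class size, the same function extends to multi-class labels, and on balanced data it returns the same value as standard accuracy.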
Affiliation(s)
- Philipp Thölke
  - Cognitive and Computational Neuroscience Laboratory (CoCo Lab), University of Montreal, 2900, boul. Edouard-Montpetit, Montreal, H3T 1J4, Quebec, Canada; Institute of Cognitive Science, Osnabrück University, Neuer Graben 29/Schloss, Osnabrück, 49074, Lower Saxony, Germany.
- Yorguin-Jose Mantilla-Ramos
  - Cognitive and Computational Neuroscience Laboratory (CoCo Lab), University of Montreal, 2900, boul. Edouard-Montpetit, Montreal, H3T 1J4, Quebec, Canada; Neuropsychology and Behavior Group (GRUNECO), Faculty of Medicine, Universidad de Antioquia, 53-108, Medellin, Aranjuez, Medellin, 050010, Colombia
- Hamza Abdelhedi
  - Cognitive and Computational Neuroscience Laboratory (CoCo Lab), University of Montreal, 2900, boul. Edouard-Montpetit, Montreal, H3T 1J4, Quebec, Canada
- Charlotte Maschke
  - Cognitive and Computational Neuroscience Laboratory (CoCo Lab), University of Montreal, 2900, boul. Edouard-Montpetit, Montreal, H3T 1J4, Quebec, Canada; Integrated Program in Neuroscience, McGill University, 1033 Pine Ave, Montreal, H3A 0G4, Canada
- Arthur Dehgan
  - Cognitive and Computational Neuroscience Laboratory (CoCo Lab), University of Montreal, 2900, boul. Edouard-Montpetit, Montreal, H3T 1J4, Quebec, Canada; Institut de Neurosciences de la Timone (INT), CNRS, Aix Marseille University, Marseille, 13005, France
- Yann Harel
  - Cognitive and Computational Neuroscience Laboratory (CoCo Lab), University of Montreal, 2900, boul. Edouard-Montpetit, Montreal, H3T 1J4, Quebec, Canada
- Anirudha Kemtur
  - Cognitive and Computational Neuroscience Laboratory (CoCo Lab), University of Montreal, 2900, boul. Edouard-Montpetit, Montreal, H3T 1J4, Quebec, Canada
- Loubna Mekki Berrada
  - Cognitive and Computational Neuroscience Laboratory (CoCo Lab), University of Montreal, 2900, boul. Edouard-Montpetit, Montreal, H3T 1J4, Quebec, Canada
- Myriam Sahraoui
  - Cognitive and Computational Neuroscience Laboratory (CoCo Lab), University of Montreal, 2900, boul. Edouard-Montpetit, Montreal, H3T 1J4, Quebec, Canada
- Tammy Young
  - Cognitive and Computational Neuroscience Laboratory (CoCo Lab), University of Montreal, 2900, boul. Edouard-Montpetit, Montreal, H3T 1J4, Quebec, Canada; Department of Computing Science, University of Alberta, 116 St & 85 Ave, Edmonton, T6G 2R3, AB, Canada
- Antoine Bellemare Pépin
  - Cognitive and Computational Neuroscience Laboratory (CoCo Lab), University of Montreal, 2900, boul. Edouard-Montpetit, Montreal, H3T 1J4, Quebec, Canada; Department of Music, Concordia University, 1550 De Maisonneuve Blvd. W., Montreal, H3H 1G8, QC, Canada
- Clara El Khantour
  - Cognitive and Computational Neuroscience Laboratory (CoCo Lab), University of Montreal, 2900, boul. Edouard-Montpetit, Montreal, H3T 1J4, Quebec, Canada
- Mathieu Landry
  - Cognitive and Computational Neuroscience Laboratory (CoCo Lab), University of Montreal, 2900, boul. Edouard-Montpetit, Montreal, H3T 1J4, Quebec, Canada
- Annalisa Pascarella
  - Institute for Applied Mathematics Mauro Picone, National Research Council, Roma, Italy
- Vanessa Hadid
  - Cognitive and Computational Neuroscience Laboratory (CoCo Lab), University of Montreal, 2900, boul. Edouard-Montpetit, Montreal, H3T 1J4, Quebec, Canada
- Etienne Combrisson
  - Institut de Neurosciences de la Timone (INT), CNRS, Aix Marseille University, Marseille, 13005, France
- Jordan O'Byrne
  - Cognitive and Computational Neuroscience Laboratory (CoCo Lab), University of Montreal, 2900, boul. Edouard-Montpetit, Montreal, H3T 1J4, Quebec, Canada
- Karim Jerbi
  - Cognitive and Computational Neuroscience Laboratory (CoCo Lab), University of Montreal, 2900, boul. Edouard-Montpetit, Montreal, H3T 1J4, Quebec, Canada; Mila (Quebec Machine Learning Institute), 6666 Rue Saint-Urbain, Montreal, H2S 3H1, QC, Canada; UNIQUE Centre (Quebec Neuro-AI Research Centre), 3744 rue Jean-Brillant, Montreal, H3T 1P1, QC, Canada
|
13
|
Veturi YA, Woof W, Lazebnik T, Moghul I, Woodward-Court P, Wagner SK, Cabral de Guimarães TA, Daich Varela M, Liefers B, Patel PJ, Beck S, Webster AR, Mahroo O, Keane PA, Michaelides M, Balaskas K, Pontikos N. SynthEye: Investigating the Impact of Synthetic Data on Artificial Intelligence-assisted Gene Diagnosis of Inherited Retinal Disease. Ophthalmol Sci 2023; 3:100258. [PMID: 36685715 PMCID: PMC9852957 DOI: 10.1016/j.xops.2022.100258] [Received: 06/30/2022] [Revised: 11/08/2022] [Accepted: 11/09/2022] [Indexed: 11/23/2022]
Abstract
Purpose Rare disease diagnosis is challenging in medical image-based artificial intelligence due to a natural class imbalance in datasets, leading to biased prediction models. Inherited retinal diseases (IRDs) are a research domain that particularly faces this issue. This study investigates the applicability of synthetic data in improving artificial intelligence-enabled diagnosis of IRDs using generative adversarial networks (GANs). Design Diagnostic study of gene-labeled fundus autofluorescence (FAF) IRD images using deep learning. Participants Moorfields Eye Hospital (MEH) dataset of 15 692 FAF images obtained from 1800 patients with confirmed genetic diagnosis of 1 of 36 IRD genes. Methods A StyleGAN2 model is trained on the IRD dataset to generate 512 × 512 resolution images. Convolutional neural networks are trained for classification using different synthetically augmented datasets, including real IRD images plus 1800 and 3600 synthetic images, and a fully rebalanced dataset. We also perform an experiment with only synthetic data. All models are compared against a baseline convolutional neural network trained only on real data. Main Outcome Measures We evaluated synthetic data quality using a Visual Turing Test conducted with 4 ophthalmologists from MEH. Synthetic and real images were compared using feature space visualization, similarity analysis to detect memorized images, and Blind/Referenceless Image Spatial Quality Evaluator (BRISQUE) score for no-reference-based quality evaluation. Convolutional neural network diagnostic performance was determined on a held-out test set using the area under the receiver operating characteristic curve (AUROC) and Cohen's Kappa (κ). Results An average true recognition rate of 63% and fake recognition rate of 47% was obtained from the Visual Turing Test. Thus, a considerable proportion of the synthetic images were classified as real by clinical experts. 
Similarity analysis showed that the synthetic images were not copies of the real images, indicating that the GAN was able to generalize. However, BRISQUE score analysis indicated that synthetic images were of significantly lower quality overall than real images (P < 0.05). Comparing the rebalanced model (RB) with the baseline (R), no significant change in the average AUROC and κ was found (R-AUROC = 0.86[0.85-0.88], RB-AUROC = 0.88[0.86-0.89], R-κ = 0.51[0.49-0.53], and RB-κ = 0.52[0.50-0.54]). The synthetic data trained model (S) achieved similar performance to the baseline (S-AUROC = 0.86[0.85-0.87], S-κ = 0.48[0.46-0.50]). Conclusions Synthetic generation of realistic IRD FAF images is feasible. Synthetic data augmentation does not deliver improvements in classification performance. However, synthetic data alone deliver a similar performance to real data, and hence may be useful as a proxy for real data. Financial Disclosure(s): Proprietary or commercial disclosure may be found after the references.
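For readers unfamiliar with Cohen's Kappa (κ), one of the two outcome measures above, the following sketch computes it from its standard definition, κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The function name and toy labels are illustrative assumptions; this is not the study's evaluation code.

```python
import numpy as np

def cohens_kappa(y_true, y_pred):
    """Cohen's kappa: agreement between labels and predictions beyond chance.

    Undefined (division by zero) when chance agreement p_e equals 1,
    e.g. when both label vectors contain a single identical class.
    """
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    labels = np.unique(np.concatenate([y_true, y_pred]))
    p_o = np.mean(y_true == y_pred)  # observed agreement
    # Chance agreement: product of the two marginal frequencies, summed over classes.
    p_e = sum(np.mean(y_true == c) * np.mean(y_pred == c) for c in labels)
    return float((p_o - p_e) / (1.0 - p_e))

# A classifier that always predicts the majority class on a 90/10 split
# reaches 90% raw agreement but a kappa of 0.
y_true = np.array([0] * 90 + [1] * 10)
y_majority = np.zeros(100, dtype=int)
kappa_majority = cohens_kappa(y_true, y_majority)
```

This is why κ is a useful companion to AUROC on class-imbalanced diagnostic datasets: agreement that merely reflects the class ratio is discounted.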
Key Words
- AUROC, area under the receiver operating characteristic curve
- BRISQUE, Blind/Referenceless Image Spatial Quality Evaluator
- Class imbalance
- Clinical Decision-Support Model
- DL, deep learning
- Deep Learning
- FAF, fundus autofluorescence
- FRR, Fake Recognition Rate
- GAN, generative adversarial network
- Generative Adversarial Networks
- IRD, inherited retinal disease
- Inherited Retinal Diseases
- MEH, Moorfields Eye Hospital
- R, baseline model
- RB, rebalanced model
- S, synthetic data trained model
- Synthetic data
- TRR, True Recognition Rate
- UMAP, Universal Manifold Approximation and Projection
Affiliation(s)
- Yoga Advaith Veturi
  - University College London Institute of Ophthalmology, University College London, London, UK
  - Moorfields Eye Hospital, London, UK
- William Woof
  - University College London Institute of Ophthalmology, University College London, London, UK
  - Moorfields Eye Hospital, London, UK
- Teddy Lazebnik
  - University College London Cancer Institute, University College London, London, UK
- Peter Woodward-Court
  - University College London Institute of Ophthalmology, University College London, London, UK
  - Moorfields Eye Hospital, London, UK
- Siegfried K. Wagner
  - University College London Institute of Ophthalmology, University College London, London, UK
  - Moorfields Eye Hospital, London, UK
- Malena Daich Varela
  - University College London Institute of Ophthalmology, University College London, London, UK
  - Moorfields Eye Hospital, London, UK
- Stephan Beck
  - University College London Cancer Institute, University College London, London, UK
- Andrew R. Webster
  - University College London Institute of Ophthalmology, University College London, London, UK
  - Moorfields Eye Hospital, London, UK
- Omar Mahroo
  - University College London Institute of Ophthalmology, University College London, London, UK
  - Moorfields Eye Hospital, London, UK
- Pearse A. Keane
  - University College London Institute of Ophthalmology, University College London, London, UK
  - Moorfields Eye Hospital, London, UK
- Michel Michaelides
  - University College London Institute of Ophthalmology, University College London, London, UK
  - Moorfields Eye Hospital, London, UK
- Konstantinos Balaskas
  - University College London Institute of Ophthalmology, University College London, London, UK
  - Moorfields Eye Hospital, London, UK
- Nikolas Pontikos
  - University College London Institute of Ophthalmology, University College London, London, UK
  - Moorfields Eye Hospital, London, UK
|
14
|
Junaid M, Ali S, Eid F, El-Sappagh S, Abuhmed T. Explainable machine learning models based on multimodal time-series data for the early detection of Parkinson's disease. Comput Methods Programs Biomed 2023; 234:107495. [PMID: 37003039 DOI: 10.1016/j.cmpb.2023.107495] [Received: 07/20/2022] [Revised: 02/23/2023] [Accepted: 03/17/2023] [Indexed: 06/19/2023]
Abstract
BACKGROUND AND OBJECTIVES Parkinson's Disease (PD) is a devastating chronic neurological condition. Machine learning (ML) techniques have been used in the early prediction of PD progression. Fusion of heterogeneous data modalities proved its capability to improve the performance of ML models. Time series data fusion supports the tracking of the disease over time. In addition, the trustworthiness of the resulting models is improved by adding model explainability features. The literature on PD has not sufficiently explored these three points. METHODS In this work, we proposed an ML pipeline for predicting the progression of PD that is both accurate and explainable. We explore the fusion of different combinations of five time series modalities from the Parkinson's Progression Markers Initiative (PPMI) real-world dataset, including patient characteristics, biosamples, medication history, motor, and non-motor function data. Each patient has six visits. The problem has been formulated in two ways: ❶ a three-class based progression prediction with 953 patients in each time series modality, and ❷ a four-class based progression prediction with 1,060 patients in each time series modality. The statistical features of these six visits were calculated from each modality and diverse feature selection methods were applied to select the most informative feature sets. The extracted features were used to train a set of well-known ML models including Support vector machines (SVM), random forests (RF), extra tree classifier (ETC), light gradient boosting machines (LGBM), and stochastic gradient descent (SGD). We examined a number of data-balancing strategies in the pipeline with different combinations of modalities. ML models have been optimized using the Bayesian optimizer. A comprehensive evaluation of various ML methods has been conducted, and the best models have been extended to provide different explainability features. 
RESULTS We compare the performance of ML models before and after optimization, and with and without feature selection. In the three-class experiment and with various modality fusions, the LGBM model produced the most accurate results with a 10-fold cross-validation (10-CV) accuracy of 90.73% using non-motor function modality. RF produced the best results in the four-class experiment with various modality fusions with a 10-CV accuracy of 94.57% using non-motor modality. With the fused dataset of non-motor and motor function modalities, the LGBM model outperformed the other ML models in both the 3-class and 4-class experiments (i.e., 10-CV accuracy of 94.89% and 93.73%, respectively). Using the SHapley Additive exPlanations (SHAP) framework, we employed global and instance-based explanations to explain the behavior of each ML classifier. Moreover, we extended the explainability by implementing the LIME and SHAPASH local explainers. The consistency of these explainers has been explored. The resultant classifiers were accurate, explainable, and thus medically more relevant and applicable. CONCLUSIONS The selected modalities and feature sets were confirmed by the literature and medical experts. The various explainers suggest that the bradykinesia (NP3BRADY) feature was the most dominant and consistent. By providing thorough insights into the influence of multiple modalities on the disease risk, the suggested approach is expected to help improve the clinical knowledge of PD progression processes.
Affiliation(s)
- Muhammad Junaid
  - Information Laboratory (InfoLab), Department of Electrical and Computer Engineering, Sungkyunkwan University, Suwon 16419, South Korea.
- Sajid Ali
  - Information Laboratory (InfoLab), Department of Electrical and Computer Engineering, Sungkyunkwan University, Suwon 16419, South Korea.
- Fatma Eid
  - Technology Management, Stony Brook University, New York 11794, USA.
- Shaker El-Sappagh
  - Information Laboratory (InfoLab), College of Computing and Informatics, Sungkyunkwan University, Suwon 16419, South Korea; Faculty of Computer Science and Engineering, Galala University, Suez 435611, Egypt; Information Systems Department, Faculty of Computers and Artificial Intelligence, Benha University, Banha, 13518, Egypt.
- Tamer Abuhmed
  - Information Laboratory (InfoLab), College of Computing and Informatics, Sungkyunkwan University, Suwon 16419, South Korea.
|
15
|
Iqbal S, Qureshi AN, Li J, Choudhry IA, Mahmood T. Dynamic learning for imbalanced data in learning chest X-ray and CT images. Heliyon 2023; 9:e16807. [PMID: 37313141 PMCID: PMC10258426 DOI: 10.1016/j.heliyon.2023.e16807] [Received: 01/23/2023] [Revised: 05/26/2023] [Accepted: 05/29/2023] [Indexed: 06/15/2023]
Abstract
Deep learning networks require massive annotated datasets. When a topic is being researched for the first time, as in the case of a viral epidemic, working with limited annotated datasets can be difficult. The datasets in this situation are also highly imbalanced, with few examples of significant findings from the novel illness. We offer a technique that allows a class balancing algorithm to understand and detect lung disease signs from chest X-ray and CT images. Deep learning techniques are used to train and evaluate images, enabling the extraction of basic visual attributes. The training objects' characteristics, instances, categories, and relative data modeling are all represented probabilistically. An imbalance-based sample analyzer makes it possible to identify a minority category in the classification process, and learning samples from the minority class are examined to address the imbalance problem. The Support Vector Machine (SVM) is used to categorize images in clustering. Physicians and medical professionals can use the CNN model to validate their initial assessments of malignant and benign categorization. The proposed technique for class imbalance, 3-Phase Dynamic Learning (3PDL), and the parallel CNN model, Hybrid Feature Fusion (HFF), for multiple modalities achieve a high F1 score of 96.83 and a precision of 96.87. This accuracy and generalization suggest that the approach may be useful for building a pathologist's assistance tool.
Affiliation(s)
- Saeed Iqbal
  - Faculty of Information Technology, Beijing University of Technology, Beijing, 100124, China
  - Department of Computer Science, Faculty of Information Technology & Computer Science, University of Central Punjab, Lahore, Pakistan
- Adnan N. Qureshi
  - Department of Computer Science, Faculty of Information Technology & Computer Science, University of Central Punjab, Lahore, Pakistan
- Jianqiang Li
  - Faculty of Information Technology, Beijing University of Technology, Beijing, 100124, China
  - Beijing Engineering Research Center for IoT Software and Systems, 100124, China
- Imran Arshad Choudhry
  - Department of Computer Science, Faculty of Information Technology & Computer Science, University of Central Punjab, Lahore, Pakistan
- Tariq Mahmood
  - Faculty of Information Sciences, University of Education, Vehari Campus, Vehari, 61100, Pakistan
  - Artificial Intelligence and Data Analytics (AIDA) Lab, College of Computer & Information Sciences (CCIS), Prince Sultan University, Riyadh, 11586, Kingdom of Saudi Arabia
|
16
|
Zhang H, Zhong X, Li G, Liu W, Liu J, Ji D, Li X, Wu J. BCU-Net: Bridging ConvNeXt and U-Net for medical image segmentation. Comput Biol Med 2023; 159:106960. [PMID: 37099973 DOI: 10.1016/j.compbiomed.2023.106960] [Received: 10/18/2022] [Revised: 04/12/2023] [Accepted: 04/17/2023] [Indexed: 04/28/2023]
Abstract
Medical image segmentation enables doctors to observe lesion regions better and make accurate diagnostic decisions. Single-branch models such as U-Net have achieved great progress in this field. However, the complementary local and global pathological semantics of heterogeneous neural networks have not yet been fully explored. The class-imbalance problem remains a serious issue. To alleviate these two problems, we propose a novel model called BCU-Net, which leverages the advantages of ConvNeXt in global interaction and U-Net in local processing. We propose a new multilabel recall loss (MRL) module to relieve the class imbalance problem and facilitate deep-level fusion of local and global pathological semantics between the two heterogeneous branches. Extensive experiments were conducted on six medical image datasets including retinal vessel and polyp images. The qualitative and quantitative results demonstrate the superiority and generalizability of BCU-Net. In particular, BCU-Net can handle diverse medical images with diverse resolutions. It has a flexible structure owing to its plug-and-play characteristics, which promotes its practicality.
Affiliation(s)
- Hongbin Zhang
  - School of Software, East China Jiaotong University, China.
- Xiang Zhong
  - School of Software, East China Jiaotong University, China.
- Guangli Li
  - School of Information Engineering, East China Jiaotong University, China.
- Wei Liu
  - School of Software, East China Jiaotong University, China.
- Jiawei Liu
  - School of Software, East China Jiaotong University, China.
- Donghong Ji
  - School of Cyber Science and Engineering, Wuhan University, China.
- Xiong Li
  - School of Software, East China Jiaotong University, China.
- Jianguo Wu
  - The Second Affiliated Hospital of Nanchang University, China.
|
17
|
Lu H, Barrett A, Pierce A, Zheng J, Wang Y, Chiang C, Rakovski C. Predicting suicidal and self-injurious events in a correctional setting using AI algorithms on unstructured medical notes and structured data. J Psychiatr Res 2023; 160:19-27. [PMID: 36773344 DOI: 10.1016/j.jpsychires.2023.01.032] [Received: 11/19/2022] [Revised: 01/23/2023] [Accepted: 01/26/2023] [Indexed: 01/31/2023]
Abstract
Suicidal and self-injurious incidents in correctional settings deplete institutional and healthcare resources and create disorder and stress for staff and other inmates. Traditional statistical analyses provide some guidance, but they can only be applied to structured data that are often difficult to collect, and their recommendations are often expensive to act upon. This study aims to extract information from medical and mental health progress notes using AI algorithms to make actionable predictions of suicidal and self-injurious events, in order to improve the efficiency of triage for health care services and prevent suicidal and injurious events from happening at California's Orange County Jails. The results showed that the notes data contain more information with respect to suicidal or injurious behaviors than the structured data available in the EHR database at the Orange County Jails. Using the notes data alone (under-sampled to 50%) in a Transformer Encoder model produced an AUC-ROC of 0.862, a Sensitivity of 0.816, and a Specificity of 0.738. Incorporating the information extracted from the notes data into traditional Machine Learning models as a feature alongside structured data (under-sampled to 50%) yielded better performance in terms of Sensitivity (AUC-ROC: 0.77, Sensitivity: 0.89, Specificity: 0.65). In addition, under-sampling is an effective approach to mitigating the impact of the extremely imbalanced classes.
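Random under-sampling of the majority class can be sketched as follows. The sketch assumes that "under-sampled to 50%" means the classes are balanced so that each makes up half of the training data; the abstract does not spell this out, and the function name `undersample_majority` and its `seed` parameter are hypothetical.

```python
import numpy as np

def undersample_majority(X, y, seed=0):
    """Randomly drop samples so every class has as many rows as the rarest class.

    Assumption: a balanced (50/50 for two classes) training set is the target.
    """
    X, y = np.asarray(X), np.asarray(y)
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    # Draw n_min row indices per class, without replacement.
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
        for c in classes
    ])
    keep.sort()  # preserve the original row order
    return X[keep], y[keep]

# Toy data: 8 negative and 2 positive samples become 2 and 2.
X = np.arange(20).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)
X_bal, y_bal = undersample_majority(X, y)
```

Under-sampling should be applied to the training split only, so that the evaluation data keep the original class ratio.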
|
18
|
Lin F, Xia Y, Song S, Ravikumar N, Frangi AF. High-throughput 3DRA segmentation of brain vasculature and aneurysms using deep learning. Comput Methods Programs Biomed 2023; 230:107355. [PMID: 36709557 DOI: 10.1016/j.cmpb.2023.107355] [Received: 08/01/2022] [Revised: 01/10/2023] [Accepted: 01/13/2023] [Indexed: 06/18/2023]
Abstract
BACKGROUND AND OBJECTIVES Automatic segmentation of the cerebral vasculature and aneurysms facilitates incidental detection of aneurysms. The assessment of aneurysm rupture risk assists with pre-operative treatment planning and enables in-silico investigation of cerebral hemodynamics within and in the vicinity of aneurysms. However, ensuring precise and robust segmentation of cerebral vessels and aneurysms in neuroimaging modalities such as three-dimensional rotational angiography (3DRA) is challenging. The vasculature constitutes a small proportion of the image volume, resulting in a large class imbalance (relative to surrounding brain tissue). Additionally, aneurysms and vessels have similar image/appearance characteristics, making it challenging to distinguish the aneurysm sac from the vessel lumen. METHODS We propose a novel multi-class convolutional neural network to tackle these challenges and facilitate the automatic segmentation of cerebral vessels and aneurysms in 3DRA images. The proposed model is trained and evaluated on an internal multi-center dataset and an external publicly available challenge dataset. RESULTS On the internal clinical dataset, our method consistently outperformed several state-of-the-art approaches for vessel and aneurysm segmentation, achieving an average Dice score of 0.81 (0.15 higher than nnUNet) and an average surface-to-surface error of 0.20 mm (less than the in-plane resolution (0.35 mm/pixel)) for aneurysm segmentation; and an average Dice score of 0.91 and average surface-to-surface error of 0.25 mm for vessel segmentation. In 223 cases of a clinical dataset, our method accurately segmented 190 aneurysm cases. CONCLUSIONS The proposed approach can help address class imbalance problems and inter-class interference problems in multi-class segmentation. Besides, this method performs consistently on clinical datasets from four different sources and the generated results are qualified for hemodynamic simulation. 
Code available at https://github.com/cistib/vessel-aneurysm-segmentation.
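The Dice score reported above measures overlap between a predicted and a reference mask: Dice = 2|A ∩ B| / (|A| + |B|). The sketch below is a generic illustration on a toy binary mask, not code from the linked repository; the function name and the `eps` smoothing term are assumptions.

```python
import numpy as np

def dice_score(pred, target, eps=1e-8):
    """Dice overlap between two binary masks; eps avoids 0/0 for empty masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    return float((2.0 * intersection + eps) / (pred.sum() + target.sum() + eps))

# Sparse foreground, as with vessels in a 3DRA volume: 2 of 100 voxels.
target = np.zeros(100, dtype=bool)
target[:2] = True
all_background = np.zeros(100, dtype=bool)

voxel_accuracy = float(np.mean(all_background == target))  # 0.98
dice = dice_score(all_background, target)                  # close to 0
```

The toy case shows why overlap metrics are preferred for sparse structures: predicting only background scores 98% voxel accuracy yet a Dice score near zero.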
Affiliation(s)
- Fengming Lin
  - Centre for Computational Imaging and Simulation Technologies in Biomedicine (CISTIB), The University of Leeds, Leeds LS2 9JT, UK
- Yan Xia
  - Centre for Computational Imaging and Simulation Technologies in Biomedicine (CISTIB), The University of Leeds, Leeds LS2 9JT, UK.
- Shuang Song
  - Centre for Computational Imaging and Simulation Technologies in Biomedicine (CISTIB), The University of Leeds, Leeds LS2 9JT, UK
- Nishant Ravikumar
  - Centre for Computational Imaging and Simulation Technologies in Biomedicine (CISTIB), The University of Leeds, Leeds LS2 9JT, UK
- Alejandro F Frangi
  - Centre for Computational Imaging and Simulation Technologies in Biomedicine (CISTIB), The University of Leeds, Leeds LS2 9JT, UK; Leeds Institute for Cardiovascular and Metabolic Medicine (LICAMM), School of Medicine, University of Leeds, Leeds LS2 9JT, UK; Medical Imaging Research Center (MIRC), Cardiovascular Science and Electronic Engineering Departments, KU Leuven, Leuven, Belgium; Alan Turing Institute, London, UK
|
19
|
Bigoulaeva I, Hangya V, Gurevych I, Fraser A. Label modification and bootstrapping for zero-shot cross-lingual hate speech detection. LANG RESOUR EVAL 2023; 57:1515-1546. [PMID: 38021031 PMCID: PMC10656307 DOI: 10.1007/s10579-023-09637-4] [Accepted: 01/13/2023] [Indexed: 02/21/2023]
Abstract
The goal of hate speech detection is to filter negative online content aimed at certain groups of people. Due to the easy accessibility and multilinguality of social media platforms, it is crucial to protect everyone, which requires building hate speech detection systems for a wide range of languages. However, the available labeled hate speech datasets are limited, making it difficult to build systems for many languages. In this paper we focus on cross-lingual transfer learning to support hate speech detection in low-resource languages, while highlighting label issues across application scenarios, such as inconsistent label sets of corpora or differing hate speech definitions, which hinder the application of such methods. We leverage cross-lingual word embeddings to train our neural network systems on the source language and apply them to the target language, which lacks labeled examples, and show that good performance can be achieved. We then incorporate unlabeled target language data for further model improvements by bootstrapping labels using an ensemble of different model architectures. Furthermore, we investigate the issue of label imbalance in hate speech datasets, since the high ratio of non-hate examples compared to hate examples often leads to low model performance. We test simple data undersampling and oversampling techniques and show their effectiveness.
Affiliation(s)
- Irina Bigoulaeva
- Ubiquitous Knowledge Processing Lab (UKP Lab), Department of Computer Science, Technical University of Darmstadt, Darmstadt, Germany
| | - Viktor Hangya
- Center for Information and Language Processing, LMU Munich, Munich, Germany
| | - Iryna Gurevych
- Ubiquitous Knowledge Processing Lab (UKP Lab), Department of Computer Science, Technical University of Darmstadt, Darmstadt, Germany
| | - Alexander Fraser
- Center for Information and Language Processing, LMU Munich, Munich, Germany
| |
|
20
|
Hu L, Fu C, Ren Z, Cai Y, Yang J, Xu S, Xu W, Tang D. SSELM-neg: spherical search-based extreme learning machine for drug-target interaction prediction. BMC Bioinformatics 2023; 24:38. [PMID: 36737694 PMCID: PMC9896467 DOI: 10.1186/s12859-023-05153-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/03/2022] [Accepted: 01/18/2023] [Indexed: 02/05/2023] Open
Abstract
BACKGROUND The experimental verification of a drug discovery process is expensive and time-consuming. Therefore, efficiently and effectively identifying drug-target interactions (DTIs) has been the focus of research. At present, many machine learning algorithms are used for predicting DTIs. The key idea is to train the classifier using an existing DTI to predict a new or unknown DTI. However, there are various challenges, such as class imbalance and the parameter optimization of many classifiers, that need to be solved before an optimal DTI model is developed. METHODS In this study, we propose a framework called SSELM-neg for DTI prediction, in which we use a screening approach to choose high-quality negative samples and a spherical search approach to optimize the parameters of the extreme learning machine. RESULTS The results demonstrated that the proposed technique outperformed other state-of-the-art methods in 10-fold cross-validation experiments in terms of the area under the receiver operating characteristic curve (0.986, 0.993, 0.988, and 0.969) and AUPR (0.982, 0.991, 0.982, and 0.946) for the enzyme dataset, G-protein coupled receptor dataset, ion channel dataset, and nuclear receptor dataset, respectively. CONCLUSION The screening approach produced high-quality negative samples with the same number of positive samples, which solved the class imbalance problem. We optimized an extreme learning machine using a spherical search approach to identify DTIs. Therefore, our models performed better than other state-of-the-art methods.
Affiliation(s)
- Lingzhi Hu
- grid.411847.f, 0000 0004 1804 4300, School of Medical Information Engineering, Guangdong Pharmaceutical University, Guangzhou, People’s Republic of China
| | - Chengzhou Fu
- grid.411847.f, 0000 0004 1804 4300, School of Medical Information Engineering, Guangdong Pharmaceutical University, Guangzhou, People’s Republic of China; Guangdong Province Precise Medicine Big Data of Traditional Chinese Medicine Engineering Technology Research Center, Guangzhou, People’s Republic of China
| | - Zhonglu Ren
- grid.411847.f, 0000 0004 1804 4300, School of Medical Information Engineering, Guangdong Pharmaceutical University, Guangzhou, People’s Republic of China
| | - Yongming Cai
- grid.411847.f, 0000 0004 1804 4300, School of Medical Information Engineering, Guangdong Pharmaceutical University, Guangzhou, People’s Republic of China; Guangdong Province Precise Medicine Big Data of Traditional Chinese Medicine Engineering Technology Research Center, Guangzhou, People’s Republic of China
| | - Jin Yang
- grid.411847.f, 0000 0004 1804 4300, School of Medical Information Engineering, Guangdong Pharmaceutical University, Guangzhou, People’s Republic of China; Guangdong Province Precise Medicine Big Data of Traditional Chinese Medicine Engineering Technology Research Center, Guangzhou, People’s Republic of China
| | - Siwen Xu
- grid.411847.f, 0000 0004 1804 4300, School of Medical Information Engineering, Guangdong Pharmaceutical University, Guangzhou, People’s Republic of China
| | - Wenhua Xu
- grid.411847.f, 0000 0004 1804 4300, School of Medical Information Engineering, Guangdong Pharmaceutical University, Guangzhou, People’s Republic of China
| | - Deyu Tang
- grid.411847.f, 0000 0004 1804 4300, School of Medical Information Engineering, Guangdong Pharmaceutical University, Guangzhou, People’s Republic of China; grid.79703.3a, 0000 0004 1764 3838, School of Computer Science and Engineering, South China University of Technology, Guangzhou, People’s Republic of China; Guangdong Province Precise Medicine Big Data of Traditional Chinese Medicine Engineering Technology Research Center, Guangzhou, People’s Republic of China
| |
|
21
|
Cartus AR, Samuels EA, Cerdá M, Marshall BDL. Outcome class imbalance and rare events: An underappreciated complication for overdose risk prediction modeling. Addiction 2023; 118:1167-1176. [PMID: 36683137 PMCID: PMC10175167 DOI: 10.1111/add.16133] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 04/22/2022] [Accepted: 12/22/2022] [Indexed: 01/24/2023]
Abstract
BACKGROUND AND AIMS Low outcome prevalence, often observed with opioid-related outcomes, poses an underappreciated challenge to accurate predictive modeling. Outcome class imbalance, where non-events (i.e. negative class observations) outnumber events (i.e. positive class observations) by a moderate to extreme degree, can distort measures of predictive accuracy in misleading ways, and make the overall predictive accuracy and the discriminatory ability of a predictive model appear spuriously high. We conducted a simulation study to measure the impact of outcome class imbalance on predictive performance of a simple SuperLearner ensemble model and suggest strategies for reducing that impact. DESIGN, SETTING, PARTICIPANTS Using a Monte Carlo design with 250 repetitions, we trained and evaluated these models on four simulated data sets with 100 000 observations each: one with perfect balance between events and non-events, and three where non-events outnumbered events by an approximate factor of 10:1, 100:1, and 1000:1, respectively. MEASUREMENTS We evaluated the performance of these models using a comprehensive suite of measures, including measures that are more appropriate for imbalanced data. FINDINGS Increasing imbalance tended to spuriously improve overall accuracy (using a high threshold to classify events vs non-events, overall accuracy improved from 0.45 with perfect balance to 0.99 with the most severe outcome class imbalance), but diminished predictive performance was evident using other metrics (corresponding positive predictive value decreased from 0.99 to 0.14). CONCLUSION Increasing reliance on algorithmic risk scores in consequential decision-making processes raises critical fairness and ethical concerns. This paper provides broad guidance for analytic strategies that clinical investigators can use to remedy the impacts of outcome class imbalance on risk prediction tools.
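The effect described in this abstract, where overall accuracy rises while positive predictive value falls as events become rarer, can be reproduced with simple arithmetic. The sketch below is not the study's simulation: the sensitivity (0.5) and specificity (0.99) are assumed values, and the function name is hypothetical.

```python
def confusion_metrics(n, prevalence, sensitivity, specificity):
    """Return (accuracy, positive predictive value) for a classifier with
    fixed sensitivity and specificity applied to n observations."""
    positives = n * prevalence
    negatives = n - positives
    true_pos = sensitivity * positives
    true_neg = specificity * negatives
    false_pos = negatives - true_neg
    accuracy = (true_pos + true_neg) / n
    flagged = true_pos + false_pos
    ppv = true_pos / flagged if flagged > 0 else float("nan")
    return accuracy, ppv


# Event prevalence for 1:1, 10:1, 100:1 and 1000:1 non-event to event ratios.
for ratio in (1, 10, 100, 1000):
    prevalence = 1 / (ratio + 1)
    acc, ppv = confusion_metrics(100_000, prevalence, 0.5, 0.99)
    print(f"{ratio}:1  accuracy={acc:.3f}  PPV={ppv:.3f}")
```

Because non-events dominate the accuracy calculation as imbalance grows, accuracy converges toward the specificity, while the few true positives are swamped by false positives and the PPV collapses.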
Affiliation(s)
- Abigail R Cartus
- Department of Epidemiology, Brown University School of Public Health, Providence, Rhode Island, USA
| | - Elizabeth A Samuels
- Department of Epidemiology, Brown University School of Public Health, Providence, Rhode Island, USA; Department of Emergency Medicine, Alpert Medical School of Brown University, Providence, Rhode Island, USA
| | - Magdalena Cerdá
- Division of Epidemiology, Department of Population Health, Center for Opioid Epidemiology and Policy, School of Medicine, New York University, New York, New York, USA
| | - Brandon D L Marshall
- Department of Epidemiology, Brown University School of Public Health, Providence, Rhode Island, USA
| |
|
22
|
Liang J, Zhou Q, Li W. [Single-channel electroencephalogram signal used for sleep state recognition based on one-dimensional width kernel convolutional neural networks and long-short-term memory networks]. Sheng Wu Yi Xue Gong Cheng Xue Za Zhi 2022; 39:1089-1096. [PMID: 36575077 PMCID: PMC9927194 DOI: 10.7507/1001-5515.202204021] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/11/2022] [Revised: 10/26/2022] [Indexed: 12/29/2022]
Abstract
To address the problem that the unbalanced distribution of data in sleep electroencephalogram (EEG) signals and the poor comfort of polysomnography information collection reduce a model's classification ability, this paper proposes a sleep state recognition method using single-channel EEG signals (WKCNN-LSTM) based on one-dimensional width kernel convolutional neural networks (WKCNN) and long-short-term memory networks (LSTM). First, wavelet denoising and the synthetic minority over-sampling technique-Tomek link (SMOTE-Tomek) algorithm were used to preprocess the original sleep EEG signals. Second, the one-dimensional sleep EEG signals were used as the input of the model, and the WKCNN was used to extract frequency-domain features and suppress high-frequency noise. Then, the LSTM layer was used to learn the time-domain features. Finally, a normalized exponential function (softmax) was applied to the fully connected layer to perform sleep state classification. The experimental results showed that the classification accuracy of the one-dimensional WKCNN-LSTM model was 91.80%, which was better than that of similar studies in recent years, and the model had good generalization ability. This study improved the classification accuracy of single-channel sleep EEG signals, which can be easily utilized in portable sleep monitoring devices.
Affiliation(s)
- Jin Liang
- School of Electrical and Control Engineering, Shaanxi University of Science & Technology, Xi’an 710021, P. R. China
- School of Electronic Information and Artificial Intelligence, Shaanxi University of Science & Technology, Xi’an 710021, P. R. China
| | - Qiang Zhou
- School of Electrical and Control Engineering, Shaanxi University of Science & Technology, Xi’an 710021, P. R. China
- School of Electronic Information and Artificial Intelligence, Shaanxi University of Science & Technology, Xi’an 710021, P. R. China
| | - Wan Li
- School of Electrical and Control Engineering, Shaanxi University of Science & Technology, Xi’an 710021, P. R. China
- School of Electronic Information and Artificial Intelligence, Shaanxi University of Science & Technology, Xi’an 710021, P. R. China
| |
|
23
|
Gökkan O, Kuntalp M. A new imbalance-aware loss function to be used in a deep neural network for colorectal polyp segmentation. Comput Biol Med 2022; 151:106205. [PMID: 36370582 DOI: 10.1016/j.compbiomed.2022.106205] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/28/2022] [Revised: 09/14/2022] [Accepted: 10/09/2022] [Indexed: 12/27/2022]
Abstract
Colorectal cancers may occur in the colon region of the human body because of late detection of polyps. Therefore, colonoscopists often use a colonoscopy device to view the entire colon in their routine practice and remove polyps by excisional biopsy. The aim of this study is to develop a new imbalance-aware loss function, i.e., omni-comprehensive loss, to be used in deep neural networks to overcome both the imbalanced dataset and the vanishing gradient problem in identifying the related regions of a polyp. Another reason for developing a new loss function is to produce a more comprehensive one that combines the evaluation capabilities of region-based, shape-aware, and pixel-wise distribution loss approaches. To measure the performance of the new loss function, two scenarios were evaluated. First, an 18-layer residual network as the backbone with a UNet as the decoder is implemented. Second, a 34-layer residual network as the encoder and a UNet as the decoder is designed. For both scenarios, the results of using popular imbalance-aware losses are compared with those of using our proposed new loss function. During the training and 5-fold cross-validation steps, multiple publicly available datasets are used. In addition to the original data in these datasets, augmented versions are also created by flipping, scaling, rotating, and contrast-limited adaptive histogram equalization operations. As a result, our proposed new custom loss function produced the best performance metrics compared with the popular loss functions.
Affiliation(s)
- Ozan Gökkan
- Ege University, Graduate School of Natural and Applied Sciences, Dept. of Biomedical Technologies, 35040, Turkey.
| | - Mehmet Kuntalp
- Dokuz Eylül University, Graduate School of Natural and Applied Sciences, Dept. of Biomedical Technologies, 35390, Turkey
| |
|
24
|
Chatterjee S, Maity S, Bhattacharjee M, Banerjee S, Das AK, Ding W. Variational Autoencoder Based Imbalanced COVID-19 Detection Using Chest X-Ray Images. New Gener Comput 2022; 41:25-60. [PMID: 36439303 PMCID: PMC9676807 DOI: 10.1007/s00354-022-00194-y] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 12/03/2021] [Accepted: 10/16/2022] [Indexed: 06/12/2023]
Abstract
Early and fast detection of disease is essential for the fight against the COVID-19 pandemic. Researchers have focused on developing robust and cost-effective detection methods using deep learning based chest X-ray image processing. However, such prediction models are often not well suited to address the challenge of highly imbalanced datasets. The current work is an attempt to address the issue by utilizing unsupervised Variational Auto Encoders (VAEs). Firstly, chest X-ray images are converted to a latent space by learning the most important features using VAEs. Secondly, a wide range of well-established data resampling techniques are used to balance the preexisting imbalanced classes in the latent vector form of the dataset. Finally, the modified dataset in the new feature space is used to train well-known classification models to classify chest X-ray images into three different classes, viz., "COVID-19", "Pneumonia", and "Normal". In order to capture the quality of the resampling methods, a 10-fold cross-validation technique is applied to the dataset. Extensive experimental analyses have been carried out, and the results obtained indicate significant improvement in COVID-19 detection using the proposed VAE based method. Furthermore, the genuineness of the results has been established by performing a Wilcoxon rank test at the 95% level of significance.
Affiliation(s)
- Sankhadeep Chatterjee
- Department of Computer Science and Technology, Indian Institute of Engineering Science and Technology, Shibpur, West Bengal India
| | - Soumyajit Maity
- Department of Computer Science and Engineering, University of Engineering & Management, Kolkata, West Bengal India
| | - Mayukh Bhattacharjee
- Department of Computer Science and Engineering, University of Engineering & Management, Kolkata, West Bengal India
| | - Soumen Banerjee
- Department of Electronics and Communication Engineering, Budge Budge Institute of Technology, Budge Budge, Kolkata, West Bengal 700137 India
| | - Asit Kumar Das
- Department of Computer Science and Technology, Indian Institute of Engineering Science and Technology, Shibpur, West Bengal India
| | - Weiping Ding
- School of Information Science and Technology, Nantong University, 66479, Nantong, 226019 Jiangsu China
| |
|
25
|
Kalsotra R, Arora S. Performance analysis of U-Net with hybrid loss for foreground detection. Multimed Syst 2022; 29:771-786. [PMID: 36406901 PMCID: PMC9641683 DOI: 10.1007/s00530-022-01014-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/06/2022] [Accepted: 10/10/2022] [Indexed: 06/16/2023]
Abstract
With the latest developments in deep neural networks, the convolutional neural network (CNN) has made considerable progress in the area of foreground detection. However, the top-rank background subtraction algorithms for foreground detection still have many shortcomings. It is challenging to extract the true foreground against complex background. To tackle the bottleneck, we propose a hybrid loss-assisted U-Net framework for foreground detection. A proposed deep learning model integrates transfer learning and hybrid loss for better feature representation and faster model convergence. The core idea is to incorporate reference background image and change detection mask in the learning network. Furthermore, we empirically investigate the potential of hybrid loss over single loss function. The advantages of two significant loss functions are combined to tackle the class imbalance problem in foreground detection. The proposed technique demonstrates its effectiveness on standard datasets and performs better than the top-rank methods in challenging environment. Moreover, experiments on unseen videos also confirm the efficacy of proposed method.
Affiliation(s)
- Rudrika Kalsotra
- Department of Computer Science and Engineering, Shri Mata Vaishno Devi University, Katra, 182320 India
| | - Sakshi Arora
- Department of Computer Science and Engineering, Shri Mata Vaishno Devi University, Katra, 182320 India
| |
|
26
|
Qian S, Ren K, Zhang W, Ning H. Skin lesion classification using CNNs with grouping of multi-scale attention and class-specific loss weighting. Comput Methods Programs Biomed 2022; 226:107166. [PMID: 36209623 DOI: 10.1016/j.cmpb.2022.107166] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 07/07/2022] [Revised: 09/05/2022] [Accepted: 09/29/2022] [Indexed: 06/16/2023]
Abstract
As one of the most common cancers globally, the incidence of skin cancer has been rising. Dermoscopy-based classification has become the most effective method for the diagnosis of skin lesion types due to its accuracy and non-invasive characteristics, which plays a significant role in reducing mortality. Although a great breakthrough of the task of skin lesion classification has been made with the application of convolutional neural network, the inter-class similarity and intra-class variation in skin lesions images, the high class imbalance of the dataset and the lack of ability to focus on the lesion area all affect the classification results of the model. In order to solve these problems, on the one hand, we use the grouping of multi-scale attention blocks (GMAB) to extract multi-scale fine-grained features so as to improve the model's ability to focus on the lesion area. On the other hand, we adopt the method of class-specific loss weighting for the problem of category imbalance. In this paper, we propose a deep convolution neural network dermatoscopic image classification method based on the grouping of multi-scale attention blocks and class-specific loss weighting. We evaluated our model on the HAM10000 dataset, and the results showed that the ACC and AUC of the proposed method were 91.6% and 97.1% respectively, which can achieve good results in dermatoscopic classification tasks.
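Class-specific loss weighting, as named in this abstract, can be sketched as a weighted cross-entropy. The abstract does not state the weighting formula, so the inverse-frequency weights below are an assumption, and the function names and class counts are hypothetical.

```python
import math


def inverse_frequency_weights(class_counts):
    """Assumed scheme: weight each class by total / (num_classes * count),
    so rare classes receive proportionally larger weights."""
    total = sum(class_counts)
    k = len(class_counts)
    return [total / (k * c) for c in class_counts]


def weighted_cross_entropy(probs, targets, weights):
    """Weighted mean of -log p[target], normalised by the sum of the
    weights of the samples in the batch."""
    numerator = 0.0
    denominator = 0.0
    for p, y in zip(probs, targets):
        numerator += -weights[y] * math.log(p[y])
        denominator += weights[y]
    return numerator / denominator


# Hypothetical two-class batch: class 0 is common (90), class 1 is rare (10).
weights = inverse_frequency_weights([90, 10])
probs = [[0.9, 0.1], [0.9, 0.1]]  # the model always favours class 0
targets = [0, 1]                  # the second sample is a rare-class miss
print(weighted_cross_entropy(probs, targets, [1.0, 1.0]))  # unweighted
print(weighted_cross_entropy(probs, targets, weights))     # rare class upweighted
```

With the weights applied, the misclassified rare-class sample dominates the loss, which pushes training to pay more attention to minority classes.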
Affiliation(s)
- Shenyi Qian
- Information management center, Zhengzhou University of Light Industry, Zhengzhou 450001, China.
| | - Kunpeng Ren
- School of Computer and Communication Engineering, Zhengzhou University of Light Industry, Zhengzhou 450001, China
| | - Weiwei Zhang
- School of Computer and Communication Engineering, Zhengzhou University of Light Industry, Zhengzhou 450001, China
| | - Haohan Ning
- School of Computer and Communication Engineering, Zhengzhou University of Light Industry, Zhengzhou 450001, China
| |
|
27
|
Xing M, Zhang Y, Yu H, Yang Z, Li X, Li Q, Zhao Y, Zhao Z, Luo Y. Predict DLBCL patients' recurrence within two years with Gaussian mixture model cluster oversampling and multi-kernel learning. Comput Methods Programs Biomed 2022; 226:107103. [PMID: 36088813 DOI: 10.1016/j.cmpb.2022.107103] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/14/2021] [Revised: 08/05/2022] [Accepted: 08/30/2022] [Indexed: 06/15/2023]
Abstract
BACKGROUND AND OBJECTIVE Diffuse large B-cell lymphoma (DLBCL) is a common non-Hodgkin's lymphoma in adults. Relapse mainly occurs within two years after diagnosis and has a poor prognosis. Relapse after two years is less frequent and has a better prognosis. In this work, we constructed a model to predict relapse within two years in DLBCL patients, expecting to provide a reference for clinicians to implement individualized treatment. METHOD We propose a secondary-level class imbalance method based on Gaussian mixture model (GMM) clustering resampling to balance the data. We then use a multi-kernel support vector machine (SVM) to characterize heterogeneous clinical data. Finally, we merge them to identify patients with recurrence within two years. RESULTS Among all the class imbalance methods in this work, Inverse Weighted-GMM+SMOTEENN has the best performance. Compared with NO-GMM (directly using SMOTEENN without the GMM clustering process), its area under the ROC curve (AUC) increases by 8.75%, and its expected calibration error (ECE) and Brier score decrease by 2.07% and 3.09%, respectively. Among the four classification algorithms in this work, multiple kernel learning (MKL) has the lowest Brier score and ECE and the largest AUC, accuracy, recall, precision and F1, and thus the best discrimination and calibration. CONCLUSION Our inverse weighted-GMM+SMOTEENN+MKL (GMM-SENN-MKL) method can handle class imbalance and clinically heterogeneous data well and can be used to predict recurrence in DLBCL patients.
Affiliation(s)
- Meng Xing
- Department of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, China; Shanxi Provincial Key Laboratory of Major Diseases Risk Assessment, Taiyuan, China
| | - Yanbo Zhang
- Department of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, China; Shanxi Provincial Key Laboratory of Major Diseases Risk Assessment, Taiyuan, China
| | - Hongmei Yu
- Department of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, China; Shanxi Provincial Key Laboratory of Major Diseases Risk Assessment, Taiyuan, China
| | - Zhenhuan Yang
- Department of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, China; Shanxi Provincial Key Laboratory of Major Diseases Risk Assessment, Taiyuan, China
| | - Xueling Li
- Department of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, China; Shanxi Provincial Key Laboratory of Major Diseases Risk Assessment, Taiyuan, China
| | - Qiong Li
- Department of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, China; Shanxi Provincial Key Laboratory of Major Diseases Risk Assessment, Taiyuan, China
| | - Yanlin Zhao
- Department of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, China; Shanxi Provincial Key Laboratory of Major Diseases Risk Assessment, Taiyuan, China
| | - Zhiqiang Zhao
- Department of Hematology, Shanxi Cancer Hospital, Taiyuan, China.
| | - Yanhong Luo
- Department of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, China; Shanxi Provincial Key Laboratory of Major Diseases Risk Assessment, Taiyuan, China.
| |
|
28
|
Matharaarachchi S, Domaratzki M, Muthukumarana S. Minimizing features while maintaining performance in data classification problems. PeerJ Comput Sci 2022; 8:e1081. [PMID: 36262135 PMCID: PMC9575878 DOI: 10.7717/peerj-cs.1081] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/02/2022] [Accepted: 08/10/2022] [Indexed: 06/16/2023]
Abstract
High dimensional classification problems have gained increasing attention in machine learning, and feature selection has become essential in executing machine learning algorithms. In general, most feature selection methods compare the scores of several feature subsets and select the one that gives the maximum score. There may be other selections of a lower number of features with a lower score, yet the difference is negligible. This article proposes and applies an extended version of such feature selection methods, which selects a smaller feature subset with similar performance to the original subset under a pre-defined threshold. It further validates the suggested extended version of the Principal Component Loading Feature Selection (PCLFS-ext) results by simulating data for several practical scenarios with different numbers of features and different imbalance rates on several classification methods. Our simulated results show that the proposed method outperforms the original PCLFS and existing Recursive Feature Elimination (RFE) by giving reasonable feature reduction on various data sets, which is important in some applications.
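The selection rule outlined in this abstract, preferring a smaller feature subset whose score is within a pre-defined threshold of the best score, can be sketched as follows. This is not the PCLFS-ext algorithm itself: the function name, the dictionary input, and the scores are hypothetical.

```python
def smallest_subset_within_tolerance(score_by_size, tolerance):
    """Return the smallest subset size whose score is at most `tolerance`
    below the best score observed across all candidate sizes."""
    best = max(score_by_size.values())
    eligible = [size for size, score in score_by_size.items()
                if best - score <= tolerance]
    return min(eligible)


# Hypothetical validation scores for candidate subset sizes.
scores = {5: 0.80, 10: 0.90, 15: 0.905, 20: 0.91}
print(smallest_subset_within_tolerance(scores, 0.02))  # accepts a small score drop
print(smallest_subset_within_tolerance(scores, 0.0))   # plain maximum-score rule
```

A tolerance of zero reduces to the usual maximum-score rule, so the threshold directly controls how much performance is traded for fewer features.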
Affiliation(s)
| | - Mike Domaratzki
- Computer Science, University of Western Ontario, London, Ontario, Canada
| | | |
|
29
|
Zhu S, Meng Q. What can we learn from autonomous vehicle collision data on crash severity? A cost-sensitive CART approach. Accid Anal Prev 2022; 174:106769. [PMID: 35858521 DOI: 10.1016/j.aap.2022.106769] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/16/2021] [Revised: 04/17/2022] [Accepted: 07/02/2022] [Indexed: 06/15/2023]
Abstract
Autonomous vehicles (AVs) are emerging in the automobile industry with potential benefits to reduce traffic congestion, improve mobility and accessibility, as well as safety. According to the AV collision data managed by the California Department of Motor Vehicles (DMV), however, the safety issue of AVs has continuously been a concern. This paper aims to learn the contributing factors to AV crash severity from the latest 3-year AV collision data. To achieve the objective, we develop an AV crash severity classification tree with the possible contributing factors by the cost-sensitive classification and regression tree (CART) model, which can deal with the class imbalance issue raised from the AV collision dataset. Our results show that the main factors affecting AV crash severity level include manufacturer, facility type, movement preceding collision, collision type, light condition and year. These findings could provide useful insights for traffic engineers or AV manufacturers to raise effective counter measures or policies to mitigate AV crash severity.
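One common way to make a CART split cost-sensitive is to weight each observation by a class-specific cost inside the Gini impurity. The abstract does not specify how costs enter the authors' model, so the sketch below is an assumed variant with hypothetical names, costs, and data.

```python
def weighted_gini(labels, class_weight):
    """Gini impurity where each observation counts with its class weight."""
    totals = {}
    for y in labels:
        totals[y] = totals.get(y, 0.0) + class_weight[y]
    total = sum(totals.values())
    if total == 0:
        return 0.0
    return 1.0 - sum((w / total) ** 2 for w in totals.values())


def best_split(values, labels, class_weight):
    """Scan midpoints of one numeric feature and return (threshold, score),
    where score is the weight-averaged impurity of the two child nodes."""
    pairs = sorted(zip(values, labels))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no threshold can separate identical feature values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        w_left = sum(class_weight[y] for y in left)
        w_right = sum(class_weight[y] for y in right)
        score = (w_left * weighted_gini(left, class_weight)
                 + w_right * weighted_gini(right, class_weight)) / (w_left + w_right)
        if score < best[1]:
            best = ((pairs[i - 1][0] + pairs[i][0]) / 2, score)
    return best


# Hypothetical data: label 1 (severe crash) is rare and three times as costly to miss.
print(weighted_gini([0, 0, 0, 1], {0: 1.0, 1: 1.0}))
print(weighted_gini([0, 0, 0, 1], {0: 1.0, 1: 3.0}))
print(best_split([1, 2, 3, 4], [0, 0, 0, 1], {0: 1.0, 1: 3.0}))
```

Raising the cost of the rare class makes a node containing even one such observation look more impure, so the tree is encouraged to isolate rare, costly outcomes.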
Affiliation(s)
- Siying Zhu
- Department of Civil and Environmental Engineering, National University of Singapore, Singapore 117576, Singapore
| | - Qiang Meng
- Department of Civil and Environmental Engineering, National University of Singapore, Singapore 117576, Singapore.
| |
|
30
|
Narwane SV, Sawarkar SD. Is handling unbalanced datasets for machine learning uplifts system performance?: A case of diabetic prediction. Diabetes Metab Syndr 2022; 16:102609. [PMID: 36099677 DOI: 10.1016/j.dsx.2022.102609] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/05/2022] [Revised: 08/21/2022] [Accepted: 08/23/2022] [Indexed: 11/30/2022]
Abstract
BACKGROUND AND AIMS Healthcare is a sensitive sector, and addressing class imbalance in the healthcare domain is a time-consuming task for machine learning-based systems due to the vast amount of data. This study looks into the impact of socioeconomic disparities on the healthcare data of diabetic patients to make accurate disease predictions. METHODS This study proposed a systematic approach combining Closest Distance Ranking and Principal Component Analysis to deal with the unbalanced dataset. A typical machine learning technique was used to analyze the proposed approach. A dataset of pregnant diabetic women was analysed for accurate detection. RESULTS The results of the case were analysed using sensitivity, which demonstrated that the minority class's lack of information makes it impossible to forecast the results. In contrast, when the unbalanced dataset was treated using the proposed technique and evaluated with the machine learning algorithm, the performance of the system increased significantly. CONCLUSION The performance of the machine learning-based system was significantly enhanced when the unbalanced dataset was processed with the proposed technique and evaluated with the machine learning algorithm. For the first time, an unbalanced dataset was treated with a combination of Closest Distance Ranking and Principal Component Analysis.
Affiliation(s)
- Swati V Narwane
- Department of Computer Engineering, Datta Meghe College of Engineering, Navi Mumbai, Pin Code: 400 708, India.
| | - Sudhir D Sawarkar
- Department of Computer Engineering, Datta Meghe College of Engineering, Navi Mumbai, Pin Code: 400 708, India.
| |
|
31
|
Holste G, Wang S, Jiang Z, Shen TC, Shih G, Summers RM, Peng Y, Wang Z. Long-Tailed Classification of Thorax Diseases on Chest X-Ray: A New Benchmark Study. Data Augment Label Imperfections (2022) 2022; 13567:22-32. [PMID: 36318048 PMCID: PMC9618235 DOI: 10.1007/978-3-031-17027-0_3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/16/2023]
Abstract
Imaging exams, such as chest radiography, will yield a small set of common findings and a much larger set of uncommon findings. While a trained radiologist can learn the visual presentation of rare conditions by studying a few representative examples, teaching a machine to learn from such a "long-tailed" distribution is much more difficult, as standard methods would be easily biased toward the most frequent classes. In this paper, we present a comprehensive benchmark study of the long-tailed learning problem in the specific domain of thorax diseases on chest X-rays. We focus on learning from naturally distributed chest X-ray data, optimizing classification accuracy over not only the common "head" classes, but also the rare yet critical "tail" classes. To accomplish this, we introduce a challenging new long-tailed chest X-ray benchmark to facilitate research on developing long-tailed learning methods for medical image classification. The benchmark consists of two chest X-ray datasets for 19- and 20-way thorax disease classification, containing classes with as many as 53,000 and as few as 7 labeled training images. We evaluate both standard and state-of-the-art long-tailed learning methods on this new benchmark, analyzing which aspects of these methods are most beneficial for long-tailed medical image classification and summarizing insights for future algorithm design. The datasets, trained models, and code are available at https://github.com/VITA-Group/LongTailCXR.
Affiliation(s)
- Song Wang
- The University of Texas at Austin, Austin, TX, USA
- Ziyu Jiang
- Texas A&M University, College Station, TX, USA
- Yifan Peng
- Weill Cornell Medicine, New York, NY, USA

32
A Romero RA, Y Deypalan MN, Mehrotra S, Jungao JT, Sheils NE, Manduchi E, Moore JH. Benchmarking AutoML frameworks for disease prediction using medical claims. BioData Min 2022; 15:15. [PMID: 35883154 PMCID: PMC9327416 DOI: 10.1186/s13040-022-00300-2] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 12/07/2021] [Accepted: 06/27/2022] [Indexed: 11/10/2022] Open
Abstract
Objectives Ascertain and compare the performances of Automated Machine Learning (AutoML) tools on large, highly imbalanced healthcare datasets. Materials and Methods We generated a large dataset using historical de-identified administrative claims including demographic information and flags for disease codes in four different time windows prior to 2019. We then trained three AutoML tools on this dataset to predict six different disease outcomes in 2019 and evaluated model performances on several metrics. Results The AutoML tools showed improvement from the baseline random forest model but did not differ significantly from each other. All models recorded low area under the precision-recall curve and failed to predict true positives while keeping the true negative rate high. Model performance was not directly related to prevalence. We provide a specific use-case to illustrate how to select a threshold that gives the best balance between true and false positive rates, as this is an important consideration in medical applications. Discussion Healthcare datasets present several challenges for AutoML tools, including large sample size, high imbalance, and limitations in the available features. Improvements in scalability, combinations of imbalance-learning resampling and ensemble approaches, and curated feature selection are possible next steps to achieve better performance. Conclusion Among the three explored, no AutoML tool consistently outperforms the rest in terms of predictive performance. The performances of the models in this study suggest that there may be room for improvement in handling medical claims data. Finally, selection of the optimal prediction threshold should be guided by the specific practical application. Supplementary Information The online version contains supplementary material available at (10.1186/s13040-022-00300-2).
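The abstract mentions selecting a threshold that balances true and false positive rates but does not say which criterion was used. As a minimal sketch, assuming Youden's J statistic (TPR minus FPR) as the balancing criterion, which is a common choice and not necessarily the authors', the selection could look like this; the function names and the toy scores are hypothetical.

```python
def rates_at_threshold(scores, labels, threshold):
    """Return (TPR, FPR) when predicting positive for scores >= threshold."""
    tp = fp = tn = fn = 0
    for s, y in zip(scores, labels):
        pred = 1 if s >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr


def best_threshold_youden(scores, labels):
    """Scan every observed score as a candidate threshold and keep the one
    that maximizes Youden's J = TPR - FPR (assumed criterion)."""
    best_t, best_j = None, float("-inf")
    for t in sorted(set(scores)):
        tpr, fpr = rates_at_threshold(scores, labels, t)
        j = tpr - fpr
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j


# Toy example with made-up scores (not data from the study).
toy_scores = [0.1, 0.2, 0.3, 0.8, 0.9]
toy_labels = [0, 0, 0, 1, 1]
threshold, j_stat = best_threshold_youden(toy_scores, toy_labels)
```

In a clinical setting the two error types rarely carry equal cost, so a weighted criterion may be more appropriate than the symmetric J statistic used here.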
Affiliation(s)
- Elisabetta Manduchi
- Department of Computational Biomedicine, Cedars-Sinai Medical Center, 700 N. San Vicente Blvd., Pacific Design Center Suite G540, West Hollywood, 90069, CA, USA
- Jason H Moore
- Department of Computational Biomedicine, Cedars-Sinai Medical Center, 700 N. San Vicente Blvd., Pacific Design Center Suite G540, West Hollywood, 90069, CA, USA

33
Tappeiner E, Welk M, Schubert R. Tackling the class imbalance problem of deep learning-based head and neck organ segmentation. Int J Comput Assist Radiol Surg 2022; 17:2103-2111. [PMID: 35578086 PMCID: PMC9515025 DOI: 10.1007/s11548-022-02649-5] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 01/05/2022] [Accepted: 04/20/2022] [Indexed: 12/03/2022]
Abstract
Purpose The segmentation of organs at risk (OAR) is a required precondition for cancer treatment with image-guided radiation therapy. The automation of the segmentation task is therefore of high clinical relevance. Deep learning (DL)-based medical image segmentation is currently the most successful approach, but suffers from the over-presence of the background class and the anatomically given organ size difference, which is most severe in the head and neck (HAN) area. Methods To tackle the HAN area-specific class imbalance problem, we first optimize the patch size of the currently best performing general-purpose segmentation framework, the nnU-Net, based on the introduced class imbalance measurement, and second introduce the class adaptive Dice loss to further compensate for the highly imbalanced setting. Results Both the patch size and the loss function are parameters with direct influence on the class imbalance, and their optimization leads to a 3% increase in the Dice score and 22% reduction in the 95% Hausdorff distance compared to the baseline, finally reaching 0.8 ± 0.15 and 3.17 ± 1.7 mm for the segmentation of seven HAN organs using a single and simple neural network. Conclusion The patch size optimization and the class adaptive Dice loss can both be integrated easily into current DL-based segmentation approaches and improve performance on class-imbalanced segmentation tasks.
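The abstract does not define the class adaptive Dice loss, so the following is only a rough illustration of the general idea of reweighting a Dice loss by class frequency. It assumes inverse-frequency class weights, which is a generic choice and not the authors' formulation; all function names are hypothetical.

```python
def soft_dice_per_class(probs, targets, num_classes, eps=1e-6):
    """Soft Dice score for each class.

    probs:   list of per-voxel probability lists (one probability per class)
    targets: list of per-voxel integer class labels
    """
    scores = []
    for c in range(num_classes):
        intersection = sum(p[c] for p, t in zip(probs, targets) if t == c)
        pred_mass = sum(p[c] for p in probs)
        true_mass = sum(1 for t in targets if t == c)
        scores.append((2.0 * intersection + eps) / (pred_mass + true_mass + eps))
    return scores


def weighted_dice_loss(probs, targets, num_classes, eps=1e-6):
    """1 - weighted mean Dice. Weights are proportional to inverse class
    frequency (an assumption), so small structures count more than background."""
    n = len(targets)
    counts = [sum(1 for t in targets if t == c) for c in range(num_classes)]
    # The +1 avoids division by zero for classes absent from this sample.
    raw = [n / (count + 1.0) for count in counts]
    total = sum(raw)
    weights = [w / total for w in raw]
    dice = soft_dice_per_class(probs, targets, num_classes, eps)
    return 1.0 - sum(w * d for w, d in zip(weights, dice))


# Toy example: three background voxels (class 0) and one organ voxel (class 1).
toy_targets = [0, 0, 0, 1]
perfect = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
all_background = [[1.0, 0.0]] * 4
loss_perfect = weighted_dice_loss(perfect, toy_targets, 2)
loss_background = weighted_dice_loss(all_background, toy_targets, 2)
```

Predicting background everywhere scores well on the dominant class but is penalized heavily here because the rare class carries the larger weight.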
Affiliation(s)
- Elias Tappeiner
- Department for Biomedical Computer Science and Mechatronics, UMIT-Private University for Health Sciences, Medical Informatics and Technology, Eduard-Wallnöfer-Zentrum 1, 6060, Hall in Tyrol, Tyrol, Austria
- Martin Welk
- Department for Biomedical Computer Science and Mechatronics, UMIT-Private University for Health Sciences, Medical Informatics and Technology, Eduard-Wallnöfer-Zentrum 1, 6060, Hall in Tyrol, Tyrol, Austria
- Rainer Schubert
- Department for Biomedical Computer Science and Mechatronics, UMIT-Private University for Health Sciences, Medical Informatics and Technology, Eduard-Wallnöfer-Zentrum 1, 6060, Hall in Tyrol, Tyrol, Austria

34
Naga D, Muster W, Musvasva E, Ecker GF. Off-targetP ML: an open source machine learning framework for off-target panel safety assessment of small molecules. J Cheminform 2022; 14:27. [PMID: 35525988 PMCID: PMC9077900 DOI: 10.1186/s13321-022-00603-w] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 12/29/2021] [Accepted: 03/26/2022] [Indexed: 11/10/2022] Open
Abstract
Unpredicted drug safety issues constitute the majority of failures in the pharmaceutical industry according to several studies. Some of these preclinical safety issues could be attributed to the non-selective binding of compounds to targets other than their intended therapeutic target, causing undesired adverse events. Consequently, pharmaceutical companies routinely run in-vitro safety screens to detect off-target activities prior to preclinical and clinical studies. Here we present an open source machine learning framework aiming at the prediction of our in-house 50 off-target panel activities for ~4000 compounds, directly from their structure. This framework is intended to guide chemists in the drug design process prior to synthesis and to accelerate drug discovery. We also present a set of ML approaches that require minimum programming experience for deployment. The workflow incorporates different ML approaches such as deep learning and automated machine learning. It also accommodates common issues faced in bioactivity predictions, such as data imbalance, inter-target duplicated measurements and duplicated public compound identifiers. Throughout the workflow development, we explore and compare the capability of Neural Networks and AutoML in constructing prediction models for fifty off-targets of different protein classes, different dataset sizes, and high class imbalance. Outcomes from different methods are compared in terms of efficiency and efficacy. The most important challenges and factors impacting model construction and performance, in addition to suggestions on how to overcome such challenges, are also discussed.
Affiliation(s)
- Doha Naga
- Roche Pharma Research & Early Development, Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd., Basel, Switzerland; Department of Pharmaceutical Sciences, University of Vienna, Vienna, Austria
- Wolfgang Muster
- Roche Pharma Research & Early Development, Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd., Basel, Switzerland
- Eunice Musvasva
- Roche Pharma Research & Early Development, Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd., Basel, Switzerland
- Gerhard F Ecker
- Department of Pharmaceutical Sciences, University of Vienna, Vienna, Austria

35
Huynh T, Nibali A, He Z. Semi-supervised learning for medical image classification using imbalanced training data. Comput Methods Programs Biomed 2022; 216:106628. [PMID: 35101700 DOI: 10.1016/j.cmpb.2022.106628] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 09/01/2021] [Revised: 12/20/2021] [Accepted: 01/07/2022] [Indexed: 06/14/2023]
Abstract
BACKGROUND AND OBJECTIVE Medical image classification is often challenging for two reasons: a lack of labelled examples due to expensive and time-consuming annotation protocols, and imbalanced class labels due to the relative scarcity of disease-positive individuals in the wider population. Semi-supervised learning methods exist for dealing with a lack of labels, but they generally do not address the problem of class imbalance. Hence, the purpose of this study is to explore a new approach to perturbation-based semi-supervised learning which tackles the problem of applying semi-supervised learning to medical image classification with imbalanced training data. METHODS In this study we propose Adaptive Blended Consistency Loss (ABCL), a simple yet effective drop-in replacement for consistency loss in perturbation-based semi-supervised learning methods. ABCL counteracts data skew by adaptively mixing the target class distribution of the consistency loss in accordance with class frequency. Our proposed method is evaluated and compared with existing methods on two different imbalanced medical image classification datasets. An ablation study is also provided to analyse the properties and effectiveness of our proposed method. RESULTS Our experiments with ABCL reveal improvements to unweighted average recall (UAR) when compared with existing consistency losses that are not designed to counteract class imbalance and other existing methods. Our proposed ABCL method is able to improve the performance of the baseline consistency loss approach from 0.59 to 0.67 UAR and outperforms methods that address the class imbalance problem for labelled data (between 0.51 and 0.59 UAR) and for unlabelled data (0.61 UAR) on the imbalanced skin cancer dataset. On the imbalanced retinal fundus glaucoma dataset, ABCL (combined with Weighted Cross Entropy loss) achieves 0.67 UAR, which is an improvement over the best existing approach (0.57 UAR). CONCLUSIONS Overall the results show the effectiveness of ABCL to alleviate the class imbalance problem for semi-supervised classification for medical images.
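The results above are reported in unweighted average recall (UAR), which is the plain mean of the per-class recalls, so every class counts equally regardless of its size. A minimal sketch of the metric follows; the function name and the toy labels are illustrative and not taken from the study.

```python
def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recall over the classes that occur in y_true."""
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        total = sum(1 for t in y_true if t == c)
        correct = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        recalls.append(correct / total)
    return sum(recalls) / len(recalls)


# Toy imbalanced example: 8 negatives, 2 positives (made-up labels).
toy_true = [0] * 8 + [1] * 2
always_negative = [0] * 10
accuracy = sum(t == p for t, p in zip(toy_true, always_negative)) / len(toy_true)
uar = unweighted_average_recall(toy_true, always_negative)
```

A classifier that always predicts the majority class reaches 80% accuracy on this toy set, while its UAR stays at chance level, which is why the metric suits imbalanced data.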
Affiliation(s)
- Tri Huynh
- Department of Computer Science and Information Technology, La Trobe University, Melbourne, Australia
- Aiden Nibali
- Department of Computer Science and Information Technology, La Trobe University, Melbourne, Australia
- Zhen He
- Department of Computer Science and Information Technology, La Trobe University, Melbourne, Australia

36
Jiao J, Du Y, Li X, Guo Y, Ren Y, Wang Y. Prenatal prediction of neonatal respiratory morbidity: a radiomics method based on imbalanced few-shot fetal lung ultrasound images. BMC Med Imaging 2022; 22:2. [PMID: 34983431 PMCID: PMC8725479 DOI: 10.1186/s12880-021-00731-z] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 06/03/2021] [Accepted: 12/30/2021] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND To develop a non-invasive method for the prenatal prediction of neonatal respiratory morbidity (NRM) by a novel radiomics method based on imbalanced few-shot fetal lung ultrasound images. METHODS A total of 210 fetal lung ultrasound images were enrolled in this study, including 159 normal newborns and 51 NRM newborns. Fetal lungs were delineated as the region of interest (ROI), where radiomics features were designed and extracted. Integrating radiomics features selected and two clinical features, including gestational age and gestational diabetes mellitus, the prediction model was developed and evaluated. The modelling methods used were data augmentation, cost-sensitive learning, and ensemble learning. Furthermore, two methods, which embed data balancing into ensemble learning, were employed to address the problems of imbalance and few-shot simultaneously. RESULTS Our model achieved sensitivity values of 0.82, specificity values of 0.84, balanced accuracy values of 0.83 and area under the curve values of 0.87 in the test set. The radiomics features extracted from the ROIs at different locations within the lung region achieved similar classification performance outcomes. CONCLUSION The feature set we designed can efficiently and robustly describe fetal lungs for NRM prediction. RUSBoost shows excellent performance compared to state-of-the-art classifiers on the imbalanced few-shot dataset. The diagnostic efficacy of the model we developed is similar to that of several previous reports of amniocentesis and can serve as a non-invasive, precise evaluation tool for NRM prediction.
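The reported balanced accuracy is consistent with the reported sensitivity and specificity, since balanced accuracy is their arithmetic mean: (0.82 + 0.84) / 2 = 0.83. The short sketch below spells out the three definitions; the confusion-matrix counts are hypothetical values chosen only to reproduce those rates and are not the study's test-set counts.

```python
def sensitivity(tp, fn):
    """True positive rate: fraction of actual positives that are detected."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """True negative rate: fraction of actual negatives that are cleared."""
    return tn / (tn + fp)


def balanced_accuracy(sens, spec):
    """Arithmetic mean of sensitivity and specificity."""
    return (sens + spec) / 2.0


# Hypothetical counts (assumption, for illustration only).
sens = sensitivity(tp=41, fn=9)    # 41 / 50
spec = specificity(tn=42, fp=8)    # 42 / 50
bal_acc = balanced_accuracy(sens, spec)
```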
Affiliation(s)
- Jing Jiao
- Department of Electronic Engineering, Fudan University, No. 220, Handan Road, Yangpu District, Shanghai, 200433, China; Key Laboratory of Medical Imaging Computing and Computer Assisted Intervention of Shanghai, Shanghai, 200433, China
- Yanran Du
- Department of Ultrasound, Ruijin Hospital, Shanghai Jiaotong University School of Medicine, No.197 Rui Jin 2nd Road, Shanghai, 200025, China
- Xiaokang Li
- Department of Electronic Engineering, Fudan University, No. 220, Handan Road, Yangpu District, Shanghai, 200433, China; Key Laboratory of Medical Imaging Computing and Computer Assisted Intervention of Shanghai, Shanghai, 200433, China
- Yi Guo
- Department of Electronic Engineering, Fudan University, No. 220, Handan Road, Yangpu District, Shanghai, 200433, China; Key Laboratory of Medical Imaging Computing and Computer Assisted Intervention of Shanghai, Shanghai, 200433, China
- Yunyun Ren
- Department of Ultrasound, Obstetrics and Gynecology Hospital of Fudan University, No. 128, Shenyang Road, Shanghai, 200090, China
- Yuanyuan Wang
- Department of Electronic Engineering, Fudan University, No. 220, Handan Road, Yangpu District, Shanghai, 200433, China; Key Laboratory of Medical Imaging Computing and Computer Assisted Intervention of Shanghai, Shanghai, 200433, China

37
Pes B, Lai G. Cost-sensitive learning strategies for high-dimensional and imbalanced data: a comparative study. PeerJ Comput Sci 2021; 7:e832. [PMID: 35036539 PMCID: PMC8725666 DOI: 10.7717/peerj-cs.832] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 09/15/2021] [Accepted: 12/06/2021] [Indexed: 05/28/2023]
Abstract
High dimensionality and class imbalance have been largely recognized as important issues in machine learning. A vast amount of literature has indeed investigated suitable approaches to address the multiple challenges that arise when dealing with high-dimensional feature spaces (where each problem instance is described by a large number of features). As well, several learning strategies have been devised to cope with the adverse effects of imbalanced class distributions, which may severely impact on the generalization ability of the induced models. Nevertheless, although both the issues have been largely studied for several years, they have mostly been addressed separately, and their combined effects are yet to be fully understood. Indeed, little research has been so far conducted to investigate which approaches might be best suited to deal with datasets that are, at the same time, high-dimensional and class-imbalanced. To make a contribution in this direction, our work presents a comparative study among different learning strategies that leverage both feature selection, to cope with high dimensionality, as well as cost-sensitive learning methods, to cope with class imbalance. Specifically, different ways of incorporating misclassification costs into the learning process have been explored. Also different feature selection heuristics have been considered, both univariate and multivariate, to comparatively evaluate their effectiveness on imbalanced data. The experiments have been conducted on three challenging benchmarks from the genomic domain, gaining interesting insight into the beneficial impact of combining feature selection and cost-sensitive learning, especially in the presence of highly skewed data distributions.
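The abstract refers to different ways of incorporating misclassification costs without detailing them. One standard option, shown here as a sketch and not as the study's method, is to apply the costs at prediction time by choosing the label with the lower expected cost. With zero cost for correct decisions, this is equivalent to thresholding the positive-class probability at cost_fp / (cost_fp + cost_fn). The function names and cost values are hypothetical.

```python
def cost_sensitive_threshold(cost_fp, cost_fn):
    """Probability threshold above which predicting positive is cheaper,
    assuming correct decisions cost nothing."""
    return cost_fp / (cost_fp + cost_fn)


def predict_min_expected_cost(p_positive, cost_fp, cost_fn):
    """Return 1 if predicting positive has the lower expected cost, else 0."""
    expected_cost_if_negative = p_positive * cost_fn          # risk of a miss
    expected_cost_if_positive = (1.0 - p_positive) * cost_fp  # risk of a false alarm
    return 1 if expected_cost_if_positive <= expected_cost_if_negative else 0


# Illustrative costs: missing a minority-class case is 9x worse than a false alarm.
threshold = cost_sensitive_threshold(cost_fp=1.0, cost_fn=9.0)
decision_costly_miss = predict_min_expected_cost(0.2, cost_fp=1.0, cost_fn=9.0)
decision_equal_costs = predict_min_expected_cost(0.2, cost_fp=1.0, cost_fn=1.0)
```

Raising the cost of false negatives lowers the threshold, so the same 0.2 probability is labelled positive under the asymmetric costs and negative under equal costs.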
Affiliation(s)
- Barbara Pes
- Dipartimento di Matematica e Informatica, Università degli Studi di Cagliari, Cagliari, Italy
- Giuseppina Lai
- Dipartimento di Matematica e Informatica, Università degli Studi di Cagliari, Cagliari, Italy

38
Yeung M, Sala E, Schönlieb CB, Rundo L. Unified Focal loss: Generalising Dice and cross entropy-based losses to handle class imbalanced medical image segmentation. Comput Med Imaging Graph 2021; 95:102026. [PMID: 34953431 PMCID: PMC8785124 DOI: 10.1016/j.compmedimag.2021.102026] [Citation(s) in RCA: 78] [Impact Index Per Article: 26.0] [Received: 05/26/2021] [Revised: 11/18/2021] [Accepted: 12/04/2021] [Indexed: 12/18/2022]
Abstract
Automatic segmentation methods are an important advancement in medical image analysis. Machine learning techniques, and deep neural networks in particular, are the state-of-the-art for most medical image segmentation tasks. Issues with class imbalance pose a significant challenge in medical datasets, with lesions often occupying a considerably smaller volume relative to the background. Loss functions used in the training of deep learning algorithms differ in their robustness to class imbalance, with direct consequences for model convergence. The most commonly used loss functions for segmentation are based on either the cross entropy loss, Dice loss or a combination of the two. We propose the Unified Focal loss, a new hierarchical framework that generalises Dice and cross entropy-based losses for handling class imbalance. We evaluate our proposed loss function on five publicly available, class imbalanced medical imaging datasets: CVC-ClinicDB, Digital Retinal Images for Vessel Extraction (DRIVE), Breast Ultrasound 2017 (BUS2017), Brain Tumour Segmentation 2020 (BraTS20) and Kidney Tumour Segmentation 2019 (KiTS19). We compare our loss function performance against six Dice or cross entropy-based loss functions, across 2D binary, 3D binary and 3D multiclass segmentation tasks, demonstrating that our proposed loss function is robust to class imbalance and consistently outperforms the other loss functions. Source code is available at: https://github.com/mlyg/unified-focal-loss. Loss function choice is crucial for class-imbalanced medical imaging datasets. Understanding the relationship between loss functions is key to inform choice. Unified Focal loss generalises Dice and cross-entropy based loss functions. Unified Focal loss outperforms various Dice and cross-entropy based loss functions.
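The Unified Focal loss itself is not specified in the abstract; the authors' implementation is at the repository linked above. As background only, the sketch below shows the standard binary focal loss, one of the cross entropy-based losses that this family generalises: it scales cross entropy by (1 - p_t)^gamma so that easy, well-classified pixels contribute less. The default values for gamma and alpha are the commonly used ones and are assumptions here.

```python
import math


def binary_cross_entropy(p, y, eps=1e-7):
    """Cross entropy for one prediction p = P(y = 1)."""
    p = min(max(p, eps), 1.0 - eps)
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)


def binary_focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-7):
    """Standard binary focal loss: -alpha_t * (1 - p_t)^gamma * log(p_t).

    gamma down-weights easy examples; alpha balances the two classes.
    """
    p = min(max(p, eps), 1.0 - eps)
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# An easy positive (p = 0.9) versus a hard positive (p = 0.1).
easy = binary_focal_loss(0.9, 1, gamma=2.0, alpha=0.5)
hard = binary_focal_loss(0.1, 1, gamma=2.0, alpha=0.5)
```

With gamma set to 0 the modulating factor disappears and the loss reduces to alpha-weighted cross entropy.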
Affiliation(s)
- Michael Yeung
- Department of Radiology, University of Cambridge, Cambridge CB2 0QQ, United Kingdom; School of Clinical Medicine, University of Cambridge, Cambridge CB2 0SP, United Kingdom
- Evis Sala
- Department of Radiology, University of Cambridge, Cambridge CB2 0QQ, United Kingdom; Cancer Research UK Cambridge Centre, University of Cambridge, Cambridge CB2 0RE, United Kingdom
- Carola-Bibiane Schönlieb
- Department of Applied Mathematics and Theoretical Physics, University of Cambridge, Cambridge CB3 0WA, United Kingdom
- Leonardo Rundo
- Department of Radiology, University of Cambridge, Cambridge CB2 0QQ, United Kingdom; Cancer Research UK Cambridge Centre, University of Cambridge, Cambridge CB2 0RE, United Kingdom; Department of Information and Electrical Engineering and Applied Mathematics (DIEM), University of Salerno, Fisciano, SA 84084, Italy

39
Peng L, Yuan R, Shen L, Gao P, Zhou L. LPI-EnEDT: an ensemble framework with extra tree and decision tree classifiers for imbalanced lncRNA-protein interaction data classification. BioData Min 2021; 14:50. [PMID: 34861891 PMCID: PMC8642957 DOI: 10.1186/s13040-021-00277-4] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 04/29/2021] [Accepted: 08/22/2021] [Indexed: 12/14/2022] Open
Abstract
BACKGROUND Long noncoding RNAs (lncRNAs) have dense linkages with various biological processes. Identifying interacting lncRNA-protein pairs contributes to understanding the functions and mechanisms of lncRNAs. Wet experiments are costly and time-consuming. Most computational methods fail to account for the imbalanced characteristics of lncRNA-protein interaction (LPI) data. More importantly, they were evaluated on a single dataset, which produced prediction bias. RESULTS In this study, we develop an Ensemble framework (LPI-EnEDT) with Extra tree and Decision Tree classifiers to implement imbalanced LPI data classification. First, five LPI datasets are arranged. Second, lncRNAs and proteins are separately characterized based on Pyfeat and BioTriangle and concatenated as a vector to represent each lncRNA-protein pair. Finally, an ensemble framework with Extra tree and decision tree classifiers is developed to classify unlabeled lncRNA-protein pairs. The comparative experiments demonstrate that LPI-EnEDT outperforms four classical LPI prediction methods (LPI-BLS, LPI-CatBoost, LPI-SKF, and PLIPCOM) under cross validations on lncRNAs, proteins, and LPIs. The average AUC values on the five datasets are 0.8480, 0.7078, and 0.9066 under the three cross validations, respectively. The average AUPRs are 0.8175, 0.7265, and 0.8882, respectively. Case analyses suggest that there are underlying associations between HOTTIP and Q9Y6M1, NRON and Q15717. CONCLUSIONS Fusing diverse biological features of lncRNAs and proteins and exploiting an ensemble learning model with Extra tree and decision tree classifiers, this work focuses on imbalanced LPI data classification as well as interaction information inference for a new lncRNA (or protein).
Affiliation(s)
- Lihong Peng
- School of Computer Science, Hunan University of Technology, No.88, Taishan West Road, Tianyuan District, Zhuzhou, China; College of Life Sciences and Chemistry, Hunan University of Technology, No.88, Taishan West Road, Tianyuan District, Zhuzhou, China
- Ruya Yuan
- School of Computer Science, Hunan University of Technology, No.88, Taishan West Road, Tianyuan District, Zhuzhou, China
- Ling Shen
- School of Computer Science, Hunan University of Technology, No.88, Taishan West Road, Tianyuan District, Zhuzhou, China
- Pengfei Gao
- College of Life Sciences and Chemistry, Hunan University of Technology, No.88, Taishan West Road, Tianyuan District, Zhuzhou, China
- Liqian Zhou
- School of Computer Science, Hunan University of Technology, No.88, Taishan West Road, Tianyuan District, Zhuzhou, China

40
Han S, Williamson BD, Fong Y. Improving random forest predictions in small datasets from two-phase sampling designs. BMC Med Inform Decis Mak 2021; 21:322. [PMID: 34809631 PMCID: PMC8607560 DOI: 10.1186/s12911-021-01688-3] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 01/14/2021] [Accepted: 11/10/2021] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND While random forests are one of the most successful machine learning methods, it is necessary to optimize their performance for use with datasets resulting from a two-phase sampling design with a small number of cases, a common situation in biomedical studies, which often have rare outcomes and covariates whose measurement is resource-intensive. METHODS Using an immunologic marker dataset from a phase III HIV vaccine efficacy trial, we seek to optimize random forest prediction performance using combinations of variable screening, class balancing, weighting, and hyperparameter tuning. RESULTS Our experiments show that while class balancing helps improve random forest prediction performance when variable screening is not applied, class balancing has a negative impact on performance in the presence of variable screening. The impact of the weighting similarly depends on whether variable screening is applied. Hyperparameter tuning is ineffective in situations with small sample sizes. We further show that random forests under-perform generalized linear models for some subsets of markers, that prediction performance on this dataset can be improved by stacking random forests and generalized linear models trained on different subsets of predictors, and that the extent of improvement depends critically on the dissimilarities between candidate learner predictions. CONCLUSION In small datasets from two-phase sampling design, variable screening and inverse sampling probability weighting are important for achieving good prediction performance of random forests. In addition, stacking random forests and simple linear models can offer improvements over random forests.
Affiliation(s)
- Sunwoo Han
- Vaccine and Infectious Disease Division, Fred Hutchinson Cancer Research Center, Seattle, USA
- Brian D. Williamson
- Vaccine and Infectious Disease Division, Fred Hutchinson Cancer Research Center, Seattle, USA
- Youyi Fong
- Vaccine and Infectious Disease Division, Fred Hutchinson Cancer Research Center, Seattle, USA

41
Ding S, Wu Z, Zheng Y, Liu Z, Yang X, Yang X, Yuan G, Xie J. Deep attention branch networks for skin lesion classification. Comput Methods Programs Biomed 2021; 212:106447. [PMID: 34678529 DOI: 10.1016/j.cmpb.2021.106447] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 03/07/2021] [Accepted: 09/28/2021] [Indexed: 06/13/2023]
Abstract
BACKGROUND AND OBJECTIVE A skin lesion usually covers a small region of the dermoscopy image, and lesions of different categories can be highly similar. Therefore, it is essential to design an elaborate network for accurate skin lesion classification, which can focus on semantically meaningful lesion parts. Although the Class Activation Mapping (CAM) shows good localization capability of highlighting the discriminative parts, it cannot be obtained in the forward propagation process. METHODS We propose a Deep Attention Branch Network (DABN) model, which introduces the attention branches to expand the conventional Deep Convolutional Neural Networks (DCNN). The attention branch is designed to obtain the CAM in the training stage, which is then utilized as an attention map to make the network focus on discriminative parts of skin lesions. DABN is applicable to multiple DCNN structures and can be trained in an end-to-end manner. Moreover, a novel Entropy-guided Loss Weighting (ELW) strategy is designed to counter class imbalance influence in the skin lesion datasets. RESULTS The proposed method achieves an Average Precision (AP) of 0.719 on the ISIC-2016 dataset and an average area under the ROC curve (AUC) of 0.922 on the ISIC-2017 dataset. Compared with other state-of-the-art methods, our method obtains better performance without external data and ensemble learning. Moreover, extensive experiments demonstrate that it can be applied to multi-class classification tasks and improves mean sensitivity by more than 2.6% in different DCNN structures. CONCLUSIONS The proposed method can adaptively focus on the discriminative regions of dermoscopy images and allows for effective training when facing class imbalance, leading to improved skin lesion classification performance, and could also be applied to other clinical applications.
Collapse
Affiliation(s)
- Saisai Ding
- School of Communication and Information Engineering, Shanghai University, Shanghai, 200444, China
- Zhongyi Wu
- School of Biomedical Engineering (Suzhou), Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, 230026, China; Department of Medical Imaging, Suzhou Institute of Biomedical Engineering and Technology, Chinese Academy of Sciences, Suzhou, 215163, China
- Yanyan Zheng
- The Wenzhou Third Clinical Institute Affiliated To Wenzhou Medical University, Wenzhou, 325000, China; Wenzhou People's Hospital, Wenzhou, 325000, China
- Zhaobang Liu
- School of Biomedical Engineering (Suzhou), Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, 230026, China; Department of Medical Imaging, Suzhou Institute of Biomedical Engineering and Technology, Chinese Academy of Sciences, Suzhou, 215163, China
- Xiaodong Yang
- School of Biomedical Engineering (Suzhou), Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, 230026, China; Department of Medical Imaging, Suzhou Institute of Biomedical Engineering and Technology, Chinese Academy of Sciences, Suzhou, 215163, China
- Xiaokai Yang
- The Wenzhou Third Clinical Institute Affiliated To Wenzhou Medical University, Wenzhou, 325000, China; Wenzhou People's Hospital, Wenzhou, 325000, China
- Gang Yuan
- School of Biomedical Engineering (Suzhou), Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, 230026, China; Department of Medical Imaging, Suzhou Institute of Biomedical Engineering and Technology, Chinese Academy of Sciences, Suzhou, 215163, China
- Jing Xie
- The Wenzhou Third Clinical Institute Affiliated To Wenzhou Medical University, Wenzhou, 325000, China; Wenzhou People's Hospital, Wenzhou, 325000, China
|
42
|
Huang D, Wang M, Zhang L, Li H, Ye M, Li A. Learning rich features with hybrid loss for brain tumor segmentation. BMC Med Inform Decis Mak 2021; 21:63. [PMID: 34330265 PMCID: PMC8323198 DOI: 10.1186/s12911-021-01431-y] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 01/24/2021] [Accepted: 02/09/2021] [Indexed: 11/10/2022] Open
Abstract
Background Accurately segmenting the tumor region in MRI images is important for brain tumor diagnosis and radiotherapy planning. At present, manual segmentation is widely adopted in clinical practice, and there is a strong need for an automatic and objective system to alleviate the workload of radiologists. Methods We propose a parallel multi-scale feature fusing architecture to generate rich feature representations for accurate brain tumor segmentation. It comprises two parts: (1) a Feature Extraction Network (FEN) for brain tumor feature extraction at different levels and (2) a Multi-scale Feature Fusing Network (MSFFN) for merging all the different-scale features in a parallel manner. In addition, we use two hybrid loss functions to optimize the proposed network for the class imbalance issue. Results We validate our method on BRATS 2015, with 0.86, 0.73 and 0.61 in Dice for the three tumor regions (complete, core and enhancing), and the model parameter size is only 6.3 MB. Without any post-processing operations, our method still outperforms published state-of-the-art methods on the segmentation results of complete tumor regions and obtains competitive performance in the other two regions. Conclusions The proposed parallel structure can effectively fuse multi-level features to generate rich feature representations for high-resolution results. Moreover, the hybrid loss functions can alleviate the class imbalance issue and guide the training process. The proposed method can be used in other medical segmentation tasks.
Affiliation(s)
- Daobin Huang
- School of Information Science and Technology, and Centers for Biomedical Engineering, University of Science and Technology of China, Hefei, 230027, China; School of Medical Information, Wannan Medical College, Wuhu, 241002, China; Research Center of Health Big Data Mining and Applications, Wannan Medical College, Wuhu, 241002, China
- Minghui Wang
- School of Information Science and Technology, and Centers for Biomedical Engineering, University of Science and Technology of China, Hefei, 230027, China
- Ling Zhang
- Department of Biochemistry, Wannan Medical College, Wuhu, 241002, China
- Haichun Li
- School of Information Science and Technology, and Centers for Biomedical Engineering, University of Science and Technology of China, Hefei, 230027, China
- Minquan Ye
- School of Information Science and Technology, and Centers for Biomedical Engineering, University of Science and Technology of China, Hefei, 230027, China; Research Center of Health Big Data Mining and Applications, Wannan Medical College, Wuhu, 241002, China
- Ao Li
- School of Information Science and Technology, and Centers for Biomedical Engineering, University of Science and Technology of China, Hefei, 230027, China
|
43
|
Wang YC, Cheng CH. A multiple combined method for rebalancing medical data with class imbalances. Comput Biol Med 2021; 134:104527. [PMID: 34091384 DOI: 10.1016/j.compbiomed.2021.104527] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 03/02/2021] [Revised: 05/24/2021] [Accepted: 05/25/2021] [Indexed: 11/24/2022]
Abstract
Most classification algorithms assume that classes are in a balanced state. However, datasets with class imbalances are everywhere. The classes of actual medical datasets are imbalanced, severely impacting identification models and even sacrificing the classification accuracy of the minority class, even though it is the most influential and representative. The medical field has irreversible characteristics. Its tolerance rate for misjudgment is relatively low, and errors may cause irreparable harm to patients. Therefore, this study proposes a multiple combined method to rebalance medical data featuring class imbalances. The combined methods include (1) resampling methods (synthetic minority oversampling technique [SMOTE] and undersampling [US]), (2) particle swarm optimization (PSO), and (3) MetaCost. This study conducted two experiments with nine medical datasets to verify and compare the proposed method with the listing methods. A decision tree is used to generate decision rules for easy understanding of the research results. The results show that (1) the proposed method with ensemble learning can improve the area under a receiver operating characteristic curve (AUC), recall, precision, and F1 metrics; (2) MetaCost can increase sensitivity; (3) SMOTE can effectively enhance AUC; (4) US can improve sensitivity, F1, and misclassification costs in data with a high-class imbalance ratio; and (5) PSO-based attribute selection can increase sensitivity and reduce data dimension. Finally, we suggest that the dataset with an imbalanced ratio >9 must use the US results to make the decision. As the imbalanced ratio is < 9, the decision-maker can simultaneously consider the results of SMOTE and US to identify the best decision.
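The rule of thumb that closes this abstract can be illustrated with a minimal Python sketch. It assumes the imbalance ratio is the majority-class count divided by the minority-class count, which the abstract does not define, and the function names `imbalance_ratio` and `suggested_resampling` are hypothetical. The abstract leaves a ratio of exactly 9 unspecified; the sketch arbitrarily groups it with the lower range.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Majority-class count divided by minority-class count (assumed definition)."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def suggested_resampling(labels, threshold=9.0):
    """Return which resampling results the abstract suggests consulting.

    Ratio > threshold: undersampling (US) only.
    Otherwise: both SMOTE and US (a ratio equal to the threshold is an
    assumption of this sketch, not something the abstract states).
    """
    if imbalance_ratio(labels) > threshold:
        return ["US"]
    return ["SMOTE", "US"]


# Example: 95 negatives and 5 positives give a ratio of 19.
labels = [0] * 95 + [1] * 5
print(imbalance_ratio(labels), suggested_resampling(labels))
```

The sketch only selects which results to consult; it does not perform SMOTE, undersampling, PSO, or MetaCost.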
|
44
|
Deng L, Yang B, Kang Z, Yang S, Wu S. A noisy label and negative sample robust loss function for DNN-based distant supervised relation extraction. Neural Netw 2021; 139:358-70. [PMID: 33901772 DOI: 10.1016/j.neunet.2021.03.030] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 11/07/2020] [Revised: 03/08/2021] [Accepted: 03/19/2021] [Indexed: 11/20/2022]
Abstract
As a major method for relation extraction, distantly supervised relation extraction (DSRE) suffers from the noisy label problem and the class imbalance problem (these two problems are also common in many other NLP tasks, e.g., text classification). However, there seems to be no existing research in DSRE or other NLP tasks that can solve both problems simultaneously, which is a significant gap in related research. In this paper, we propose a loss function that is robust to noisy labels and efficient for class-imbalanced datasets. More specifically, we first quantify the negative impacts of the noisy label and class imbalance problems. We then construct a loss function that can minimize these negative impacts through a linear programming method. As far as we know, this seems to be the first attempt to address the noisy label problem and the class imbalance problem simultaneously. We evaluated the constructed loss function on the distantly labeled dataset, our artificially noised dataset, and the human-annotated dataset of DocRED, as well as the artificially noised dataset of CoNLL 2003. Experimental results indicate that a DNN model adopting the constructed loss function can outperform other models that adopt the state-of-the-art noisy label robust or negative sample robust loss functions.
|
45
|
Harerimana G, Kim JW, Jang B. A deep attention model to forecast the Length Of Stay and the in-hospital mortality right on admission from ICD codes and demographic data. J Biomed Inform 2021; 118:103778. [PMID: 33872817 DOI: 10.1016/j.jbi.2021.103778] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 02/24/2020] [Revised: 03/15/2021] [Accepted: 04/06/2021] [Indexed: 11/28/2022]
Abstract
Leveraging the Electronic Health Records (EHR) longitudinal data to produce actionable clinical insights has always been a critical issue for recent studies. Non-forecasted extended hospitalizations account for a disproportionate amount of resource use, the mediocre quality of inpatient care, and avoidable fatalities. The capability to predict the Length of Stay (LoS) and mortality in the early stages of the admission provides opportunities to improve care and prevent many preventable losses. Forecasting the in-hospital mortality is important in providing clinicians with enough insights to make decisions and hospitals to allocate resources, hence predicting the LoS and mortality within the first day of admission is a difficult but a paramount endeavor. The biggest challenge is that few data are available by this time, thus the prediction has to bring in the previous admissions history and free text diagnosis that are recorded immediately on admission. We propose a model that uses the multi-modal EHR structured medical codes and key demographic information to classify the LoS in 3 classes; Short Los (LoS⩽10 days), Medium LoS (10<LoS⩽30 days) and Long LoS (LoS>30 days) as well as mortality as a binary classification of a patient's death during current admission. The prediction has to use data available only within 24 h of admission. The key predictors include previous ICD9 diagnosis codes, ICD9 procedures, key demographic data, and free text diagnosis of the current admission recorded right on admission. We propose a Hierarchical Attention Network (HAN-LoS and HAN-Mor) model and train it to a dataset of over 45321 admissions recorded in the de-identified MIMIC-III dataset. For improved prediction, our attention mechanisms can focus on the most influential past admissions and most influential codes in these admissions. For fair performance evaluation, we implemented and compared the HAN model with previous approaches. 
With dataset balancing techniques, HAN-LoS achieved an AUROC of over 0.82 and a Micro-F1 score of 0.24, and HAN-Mor achieved an AUROC of 0.87, hence outperforming the existing baselines that use structured medical codes as well as clinical time series for LoS and mortality forecasting. By predicting mortality and LoS using the same model, we show that with little tuning the proposed model can be used for other clinical predictive tasks such as phenotyping, decompensation, re-admission prediction, and survival analysis.
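The three Length of Stay classes defined in this abstract can be written down as a short labeling function. This is only a sketch of the target encoding stated in the abstract (Short: LoS ≤ 10 days, Medium: 10 < LoS ≤ 30 days, Long: LoS > 30 days); the function name `los_class` is hypothetical, and the sketch says nothing about the HAN model itself.

```python
def los_class(days):
    """Map a Length of Stay in days to the abstract's three classes."""
    if days < 0:
        raise ValueError("length of stay cannot be negative")
    if days <= 10:
        return "Short"
    if days <= 30:
        return "Medium"
    return "Long"


# Boundary values fall into the lower class, following the inequalities above.
print([los_class(d) for d in (3, 10, 11, 30, 45)])
```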
Affiliation(s)
- Gaspard Harerimana
- Department of Computer Science, Sangmyung University, Seoul, Republic of Korea
- Jong Wook Kim
- Department of Computer Science, Sangmyung University, Seoul, Republic of Korea
- Beakcheol Jang
- Graduate School of Information, Yonsei University, Seoul, Republic of Korea
|
46
|
Yahaya M, Guo R, Jiang X, Bashir K, Matara C, Xu S. Ensemble-based model selection for imbalanced data to investigate the contributing factors to multiple fatality road crashes in Ghana. Accid Anal Prev 2021; 151:105851. [PMID: 33383521 DOI: 10.1016/j.aap.2020.105851] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 07/27/2020] [Revised: 09/25/2020] [Accepted: 10/16/2020] [Indexed: 06/12/2023]
Abstract
The study aims to identify relevant variables to improve the prediction performance of the crash injury severity (CIS) classification model. Unfortunately, the CIS database is invariably characterized by the class imbalance. For instance, the samples of multiple fatal injury (MFI) severity class are typically rare as opposed to other classes. The imbalance phenomenon may introduce a prediction bias in favour of the majority class and affect the quality of the learning algorithm. The paper proposes an ensemble-based variable ranking scheme that incorporates the data resampling. At the data pre-processing level, majority weighted minority oversampling (MWMOTE) is employed to treat the imbalanced training data. Ensemble of classifiers induced from the balanced data is used to evaluate and rank the individual variables according to their importance to the injury severity prediction. The relevant variables selected are then applied to the balanced data to form a training set for the CIS classification modelling. An empirical comparison is conducted through considering the variable ranking by: 1) the learning of single inductive algorithm with imbalanced data where the relevant variables are applied to the imbalanced data to form the training data; 2) the learning of single inductive algorithm with MWMOTE data and the relevant variables identified are applied to the balanced data to form the training data; and 3) the learning of ensembles with imbalanced data where the relevant variables identified are applied to the imbalanced data to form the training data. Bayesian Networks (BNs) classifiers are then developed for each ranking method, where nested subsets of the top ranked variables are adopted. The model predictions are captured in four performance indicators in the comparative study. Based on three-year (2014-2016) crash data in Ghana, the empirical results show that the proposed method is effective to identify the most prolific predictors of the CIS level. 
Finally, based on the inference results of BNs developed on the best subset, the study offers the most probable explanations to the occurrence of MFI crashes in Ghana.
Affiliation(s)
- Mahama Yahaya
- School of Transportation and Logistics, Southwest Jiaotong University, West Park, High-Tech District, Chengdu, China 611756; National Engineering Laboratory of Integrated Transportation Big Data Application Technology, West Park, High-Tech District, Chengdu, 611756, China
- Runhua Guo
- Department of Civil Engineering, Suite 217, Heshangheng Bldg, Tsinghua University, 100084, Beijing, China
- Xinguo Jiang
- School of Transportation and Logistics, Southwest Jiaotong University, West Park, High-Tech District, Chengdu, China 611756; National Engineering Laboratory of Integrated Transportation Big Data Application Technology, West Park, High-Tech District, Chengdu, 611756, China
- Kamal Bashir
- Department of Information Technology, Karare University, Omdurman, 12304, Sudan
- Caroline Matara
- Department of Civil and Construction Engineering, University of Nairobi, 30197, Nairobi, Kenya
- Shiwei Xu
- Guangzhou Transportation Planning Institute, 510030, Guangzhou, China
|
47
|
Wang D, Zhang X, Chen H, Zhou Y, Cheng F. Sintering conditions recognition of rotary kiln based on kernel modification considering class imbalance. ISA Trans 2020; 106:271-282. [PMID: 32674852 DOI: 10.1016/j.isatra.2020.07.010] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 11/02/2019] [Revised: 07/06/2020] [Accepted: 07/07/2020] [Indexed: 06/11/2023]
Abstract
Accurate sintering condition recognition (SCR) is an important precondition for optimal control of rotary kilns. However, the occurrence probability of abnormal conditions in the industrial field is much lower than normal, resulting in imbalanced class sintering samples in general. This significantly deteriorates the effectiveness of existing recognition models in abnormal condition detection. In this paper, an integrated framework considering class imbalance is proposed for sintering condition recognition. In the proposed framework, after analysing the characteristics of thermal signals by the Lipschitz method, four discriminant features are extracted to comprehensively describe different sintering conditions. In addition, focusing on the class imbalance of sintering samples, the kernel modification method is introduced to enhance the optimal marginal distribution machine (ODM), and a novel recognition model kernel modified the ODM (KMODM) is proposed for SCR. By constructing a new conformal transformation function to modify the ODM kernel function, KMODM optimizes the spatial distribution of training samples in the kernel space, thereby alleviating the detection accuracy deterioration of the minority class. The experimental results on real thermal signals and standard datasets show that the KMODM model can effectively handle imbalanced data. Based on this, the proposed SCR framework can reduce the misjudgement of abnormal conditions and balance the recognition accuracy of each condition.
Affiliation(s)
- Dingxiang Wang
- College of Electrical and Information Engineering, Hunan University, Changsha 410082, China
- Xiaogang Zhang
- College of Electrical and Information Engineering, Hunan University, Changsha 410082, China
- Hua Chen
- College of Computer Science and Electronic Engineering, Hunan University, Changsha 410082, China
- Yicong Zhou
- Department of Computer and Information Science, University of Macau, Macau 999078, China
- Fanyong Cheng
- College of Electrical Engineering, Anhui Polytechnic University, Wuhu 241000, China
|
48
|
Qu W, Balki I, Mendez M, Valen J, Levman J, Tyrrell PN. Assessing and mitigating the effects of class imbalance in machine learning with application to X-ray imaging. Int J Comput Assist Radiol Surg 2020; 15:2041-8. [PMID: 32965624 DOI: 10.1007/s11548-020-02260-6] [Citation(s) in RCA: 18] [Impact Index Per Article: 4.5] [Received: 05/08/2020] [Accepted: 09/04/2020] [Indexed: 10/23/2022]
Abstract
PURPOSE Machine learning (ML) algorithms are well known to exhibit variations in prediction accuracy when provided with imbalanced training sets typically seen in medical imaging (MI) due to the imbalanced ratio of pathological and normal cases. This paper presents a thorough investigation of the effects of class imbalance and methods for mitigating class imbalance in ML algorithms applied to MI. METHODS We first selected five classes from the Image Retrieval in Medical Applications (IRMA) dataset, performed multiclass classification using the random forest model (RFM), and then performed binary classification using convolutional neural network (CNN) on a chest X-ray dataset. An imbalanced class was created in the training set by varying the number of images in that class. Methods tested to mitigate class imbalance included oversampling, undersampling, and changing class weights of the RFM. Model performance was assessed by overall classification accuracy, overall F1 score, and specificity, recall, and precision of the imbalanced class. RESULTS A close-to-balanced training set resulted in the best model performance, and a large imbalance with overrepresentation was more detrimental to model performance than underrepresentation. Oversampling and undersampling methods were both effective in mitigating class imbalance, and efficacy of oversampling techniques was class specific. CONCLUSION This study systematically demonstrates the effect of class imbalance on two public X-ray datasets on RFM and CNN, making these findings widely applicable as a reference. Furthermore, the methods employed here can guide researchers in assessing and addressing the effects of class imbalance, while considering the data-specific characteristics to optimize imbalance mitigating methods.
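Two of the mitigation methods named in this abstract, oversampling and undersampling, can be sketched with the Python standard library. This is a generic random-resampling illustration under stated assumptions, not the authors' implementation: the function name `resample_to_balance`, the `mode` argument, and the fixed seed are all hypothetical, and the class labels in the example are made up.

```python
import random
from collections import Counter, defaultdict


def resample_to_balance(labels, mode="under", seed=0):
    """Return sorted indices into `labels` that form a class-balanced set.

    mode="under": randomly keep only as many samples per class as the
                  smallest class has (random undersampling).
    mode="over":  randomly duplicate samples until every class matches
                  the largest class (random oversampling).
    """
    if mode not in ("under", "over"):
        raise ValueError("mode must be 'under' or 'over'")
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    sizes = [len(idx) for idx in by_class.values()]
    target = min(sizes) if mode == "under" else max(sizes)

    chosen = []
    for idx in by_class.values():
        if mode == "under":
            chosen.extend(rng.sample(idx, target))
        else:
            extra = [rng.choice(idx) for _ in range(target - len(idx))]
            chosen.extend(idx + extra)
    return sorted(chosen)


labels = ["normal"] * 90 + ["pathological"] * 10
under = resample_to_balance(labels, mode="under")
over = resample_to_balance(labels, mode="over")
print(Counter(labels[i] for i in under), Counter(labels[i] for i in over))
```

Returning indices rather than copies keeps the sketch independent of how the images or feature vectors are stored.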
|
49
|
Ashraf S, Saleem S, Ahmed T, Aslam Z, Muhammad D. Conversion of adverse data corpus to shrewd output using sampling metrics. Vis Comput Ind Biomed Art 2020; 3:19. [PMID: 32779031 PMCID: PMC7417470 DOI: 10.1186/s42492-020-00055-9] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Received: 06/02/2020] [Accepted: 07/24/2020] [Indexed: 11/11/2022] Open
Abstract
In an imbalanced dataset, at least one class is typically outnumbered by the others. A machine learning algorithm (classifier) trained with an imbalanced dataset predicts the majority (frequently occurring) class more often than the minority (rarely occurring) classes. Training with an imbalanced dataset poses challenges for classifiers; however, applying suitable techniques for reducing class imbalance issues can enhance classifiers' performance. In this study, we consider an imbalanced dataset from an educational context. Initially, we examine all shortcomings regarding the classification of an imbalanced dataset. Then, we apply data-level algorithms for class balancing and compare the performance of classifiers. The performance of the classifiers is measured using the underlying information in their confusion matrices, such as accuracy, precision, recall, and F measure. The results show that classification with an imbalanced dataset may produce high accuracy but low precision and recall for the minority class. The analysis confirms that undersampling and oversampling are effective for balancing datasets, but the latter dominates.
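The observation that an imbalanced dataset can yield high accuracy alongside low minority-class precision and recall follows directly from the confusion-matrix definitions. The Python sketch below is an illustration with made-up labels, not data from the study; the function name `binary_metrics` is hypothetical, and the F measure is taken to be F1, which the abstract does not specify.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, plus precision, recall, and F1 for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs)
    # Guard against empty denominators when a class is never predicted or absent.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up example: 5 minority (1) and 95 majority (0) samples.
# The classifier finds 1 of the 5 minority samples and raises 3 false alarms.
y_true = [1] * 5 + [0] * 95
y_pred = [1] + [0] * 4 + [1] * 3 + [0] * 92
print(binary_metrics(y_true, y_pred))
```

Here 93 of 100 predictions are correct, yet only one in four positive predictions is right and only one in five minority samples is found.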
Affiliation(s)
- Shahzad Ashraf
- College of Internet of Things Engineering, Hohai University, Changzhou, Jiangsu, 210032, China
- Sehrish Saleem
- Muhammad Nawaz Sharif University of Engineering & Technology, Multan, 66000, Pakistan
- Tauqeer Ahmed
- College of Internet of Things Engineering, Hohai University, Changzhou, Jiangsu, 210032, China
- Durr Muhammad
- Pakistan Steel Mills Karachi, Karachi, 75200, Pakistan
|
50
|
Sleeman Iv WC, Nalluri J, Syed K, Ghosh P, Krawczyk B, Hagan M, Palta J, Kapoor R. A Machine Learning method for relabeling arbitrary DICOM structure sets to TG-263 defined labels. J Biomed Inform 2020; 109:103527. [PMID: 32777484 DOI: 10.1016/j.jbi.2020.103527] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 02/06/2020] [Revised: 07/11/2020] [Accepted: 08/02/2020] [Indexed: 10/23/2022]
Abstract
PURPOSE To present a Machine Learning pipeline for automatically relabeling anatomical structure sets in the Digital Imaging and Communications in Medicine (DICOM) format to a standard nomenclature that will enable data abstraction for research and quality improvement. METHODS DICOM structure sets from approximately 1200 lung and prostate cancer patients across 40 treatment centers were used to build predictive models to automate the relabeling of clinically specified structure labels to standardized labels as defined by the American Association of Physics in Medicine's (AAPM) Task Group 263 (TG-263). Volumetric bitmaps were created based on the delineated volumes and were combined with associated bony anatomy data to build feature vectors. Feature reduction was performed with singular value decomposition and the resulting vectors were used for predicting the label of each structure using five different classifier algorithms on the Apache Spark platform with 5-fold cross-validation. Undersampling methods were used to deal with underlying class imbalance that hindered the performance of classifiers. Experiments were performed on both a curated version of the data, which included only annotated structures, and the non-curated data that included all structures from the original treatment plans. RESULTS Random Forest provided the highest accuracies with F1 scores of 98.77 for lung and 95.06 for prostate on the curated data sets. Scores were lower with 95.67 for lung and 90.22 for prostate on the non-curated data sets, highlighting some of the challenges of classifying real clinical data. Including bony anatomy data and pooling information from all structures for the same patient both increased accuracies. In some cases, undersampling with k-Means clustering for class balancing improved classifier accuracy but in all experiments it significantly reduced run time compared to random undersampling. 
CONCLUSION This work shows that structure sets can be relabeled using our approach with accuracies over 95% for many structure types when presented with curated data. Although accuracies dropped when using the full non-curated data sets, some structure types were still correctly labeled over 90% of the time. With similar results obtained on an external test data set, we can infer that the proposed models are likely to work on other clinical data sets.
Affiliation(s)
- William C Sleeman Iv
- Virginia Commonwealth University, Department of Radiation Oncology, Richmond, VA, United States of America; Virginia Commonwealth University, Department of Computer Science, Richmond, VA, United States of America; National Radiation Oncology Program, Department of Veteran Affairs, Richmond, VA, United States of America
- Joseph Nalluri
- Virginia Commonwealth University, Department of Radiation Oncology, Richmond, VA, United States of America; National Radiation Oncology Program, Department of Veteran Affairs, Richmond, VA, United States of America
- Khajamoinuddin Syed
- Virginia Commonwealth University, Department of Computer Science, Richmond, VA, United States of America
- Preetam Ghosh
- Virginia Commonwealth University, Department of Computer Science, Richmond, VA, United States of America
- Bartosz Krawczyk
- Virginia Commonwealth University, Department of Computer Science, Richmond, VA, United States of America
- Michael Hagan
- Virginia Commonwealth University, Department of Radiation Oncology, Richmond, VA, United States of America; National Radiation Oncology Program, Department of Veteran Affairs, Richmond, VA, United States of America
- Jatinder Palta
- Virginia Commonwealth University, Department of Radiation Oncology, Richmond, VA, United States of America; National Radiation Oncology Program, Department of Veteran Affairs, Richmond, VA, United States of America
- Rishabh Kapoor
- Virginia Commonwealth University, Department of Radiation Oncology, Richmond, VA, United States of America; National Radiation Oncology Program, Department of Veteran Affairs, Richmond, VA, United States of America
|