1
|
Shoombuatong W, Schaduangrat N, Homdee N, Ahmed S, Chumnanpuen P. Advancing the accuracy of tyrosinase inhibitory peptides prediction via a multiview feature fusion strategy. Sci Rep 2025; 15:4762. [PMID: 39922825 PMCID: PMC11807091 DOI: 10.1038/s41598-024-81807-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/02/2024] [Accepted: 11/29/2024] [Indexed: 02/10/2025]
Abstract
Tyrosinase is a key enzyme in the production of melanin, the pigment that determines the color of the hair, eyes, and skin. Tyrosinase inhibitory peptides (TIPs), mainly designed to regulate the activity of tyrosinase, are of interest in various domains, including cosmetics, dermatology, and pharmaceuticals, due to their potential applications in controlling skin pigmentation. To date, a few machine learning-based models have been proposed for predicting TIPs, but their predictive performance remains unsatisfactory. In this study, we propose an innovative computational approach, named TIPred-MVFF, to accurately predict TIPs using only sequence information. First, we established an up-to-date and high-quality dataset by collecting samples from various sources. Second, we applied a multi-view feature fusion (MVFF) strategy to extract and explore probability and category information embedded in TIPs, employing several machine learning (ML) algorithms coupled with different commonly used sequence-based feature encodings. Third, we employed resampling approaches to address the class imbalance issue. Finally, to maximize the utility of each feature, we fused probability-based and sequence-based features, generating more informative features that were used to develop the final prediction model. On the independent test, TIPred-MVFF outperformed several conventional ML classifiers and existing methods in terms of prediction accuracy and robustness, achieving an accuracy of 0.937 and a Matthews correlation coefficient of 0.847. This new computational approach is anticipated to aid community-wide efforts in rapidly and cost-effectively discovering novel peptides with strong tyrosinase inhibitory activities.
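The abstract reports accuracy together with the Matthews correlation coefficient (MCC), which is commonly preferred on imbalanced data because it uses all four cells of the confusion matrix. The following is a minimal sketch of how both metrics are computed from confusion-matrix counts. The counts are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import math


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """MCC from binary confusion-matrix counts; returns 0.0 when undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # common convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denom


# Hypothetical counts (assumption, not from the paper).
tp, tn, fp, fn = 40, 50, 5, 5
print(round(accuracy(tp, tn, fp, fn), 3))           # 0.9
print(round(matthews_corrcoef(tp, tn, fp, fn), 3))  # 0.798
```

Note that MCC is lower than accuracy for the same counts, which is typical: MCC ranges from -1 to 1 and penalizes errors in both classes.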
Affiliation(s)
- Watshara Shoombuatong
  - Center for Research Innovation and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok, 10700, Thailand.
- Nalini Schaduangrat
  - Center for Research Innovation and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok, 10700, Thailand
- Nutta Homdee
  - Center for Research Innovation and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok, 10700, Thailand
- Saeed Ahmed
  - Center for Research Innovation and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok, 10700, Thailand
  - Department of Computer Science, University of Swabi, Swabi, 23561, Pakistan
- Pramote Chumnanpuen
  - Department of Zoology, Faculty of Science, Kasetsart University, Bangkok, 10900, Thailand.
  - Kasetsart University International College (KUIC), Kasetsart University, Bangkok, 10900, Thailand.
|
2
|
Zhang Z, Liu Z, Ning L, Martin A, Xiong J. Representation of Imprecision in Deep Neural Networks for Image Classification. IEEE Transactions on Neural Networks and Learning Systems 2025; 36:1199-1212. [PMID: 37948150 DOI: 10.1109/tnnls.2023.3329712] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/12/2023]
Abstract
Quantification and reduction of uncertainty in deep-learning techniques have received much attention, but how to characterize the imprecision caused by such uncertainty has largely been ignored. In some tasks, we prefer to obtain an imprecise result because we are unwilling or unable to bear the cost of an error. For this purpose, we investigate the representation of imprecision in deep-learning (RIDL) techniques based on the theory of belief functions (TBF). First, the labels of some training images are reconstructed using the learning mechanism of neural networks to characterize the imprecision in the training set. In the process, a label assignment rule is proposed to reassign one or more labels to each training image. When an image is assigned multiple labels, this indicates that the image may lie in an overlapping region of different categories from the feature perspective or that the original label is wrong. Second, those images with multiple labels are rechecked. As a result, the imprecision (multiple labels) caused by the original labeling errors is corrected, while the imprecision caused by insufficient knowledge is retained. Images with multiple labels are called imprecise ones, and they are considered to belong to meta-categories, each the union of some specific categories. Third, the deep network model is retrained on the reconstructed training set, and the test images are then classified. Finally, test images that cannot be distinguished among specific categories are assigned to meta-categories to characterize the imprecision in the results. Experiments based on several well-known networks have shown that RIDL can improve accuracy (AC) and reasonably represent imprecision in both the training and testing sets.
|
3
|
Gurcan F, Soylu A. Synthetic Boosted Resampling Using Deep Generative Adversarial Networks: A Novel Approach to Improve Cancer Prediction from Imbalanced Datasets. Cancers (Basel) 2024; 16:4046. [PMID: 39682233 DOI: 10.3390/cancers16234046] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/14/2024] [Revised: 11/28/2024] [Accepted: 11/30/2024] [Indexed: 12/18/2024]
Abstract
BACKGROUND/OBJECTIVES This study examines the effectiveness of different resampling methods and classifier models for handling imbalanced datasets, with a specific focus on critical healthcare applications such as cancer diagnosis and prognosis. METHODS To address the class imbalance issue, traditional sampling methods like SMOTE and ADASYN were replaced by Generative Adversarial Networks (GANs), which leverage deep neural network architectures to generate high-quality synthetic data. The study highlights the advantage of GANs in creating realistic, diverse, and homogeneous samples for the minority class, which plays a significant role in mitigating the diagnostic challenges posed by imbalanced data. Four types of classifiers (Boosting, Bagging, Linear, and Non-linear) were evaluated using metrics such as accuracy, precision, recall, F1 score, and ROC AUC. RESULTS Baseline performance without resampling showed significant limitations, underscoring the need for resampling strategies. Using GAN-generated data notably improved the detection of minority instances and overall classification performance. The average ROC AUC value increased from a baseline of approximately 0.8276 to over 0.9734, indicating more balanced detection across classes. With GAN-based resampling, the GradientBoosting classifier achieved a ROC AUC of 0.9890, the highest among all models. CONCLUSIONS The findings underscore that advanced models like Boosting and Bagging, when paired with effective resampling strategies such as GANs, are better suited for handling imbalanced datasets and improving predictive accuracy in healthcare applications.
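For context on the traditional samplers that the abstract says were replaced, the following is a minimal SMOTE-style sketch: each synthetic minority sample is a random interpolation between a minority point and one of its nearest minority neighbours. This illustrates the baseline idea only and is not the GAN-based method of the study. The function name `smote_like`, its parameters, and the toy points are hypothetical.

```python
import random


def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours.

    minority: list of equal-length numeric tuples (at least two points).
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance to x.
        others = [p for p in minority if p is not x]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment from x to the neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, neighbour)))
    return synthetic


# Hypothetical two-feature minority samples (assumption, for illustration only).
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_like(minority, n_new=5)
print(len(new_points))  # 5
```

Because every synthetic point lies on a segment between two existing minority points, this kind of sampler cannot produce samples outside the convex hull of the minority class, which is one motivation usually given for generative alternatives.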
Affiliation(s)
- Fatih Gurcan
  - Department of Management Information Systems, Faculty of Economics and Administrative Sciences, Karadeniz Technical University, 61080 Trabzon, Turkey
- Ahmet Soylu
  - Department of Computer Science, Faculty of Information Technology and Electrical Engineering, Norwegian University of Science and Technology, 2815 Gjøvik, Norway
|
4
|
Yasin P, Yimit Y, Cai X, Aimaiti A, Sheng W, Mamat M, Nijiati M. Machine learning-enabled prediction of prolonged length of stay in hospital after surgery for tuberculosis spondylitis patients with unbalanced data: a novel approach using explainable artificial intelligence (XAI). Eur J Med Res 2024; 29:383. [PMID: 39054495 PMCID: PMC11270948 DOI: 10.1186/s40001-024-01988-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/01/2023] [Accepted: 07/18/2024] [Indexed: 07/27/2024]
Abstract
BACKGROUND Tuberculosis spondylitis (TS), commonly known as Pott's disease, is a severe type of skeletal tuberculosis that typically requires surgical treatment. However, this treatment option has led to an increase in healthcare costs due to prolonged postoperative length of stay (PLOS). Therefore, identifying risk factors associated with extended PLOS is necessary. In this research, we aimed to develop an interpretable machine learning model that could predict extended PLOS and provide valuable insights for treatment, and to implement it as a web-based application. METHODS We obtained patient data from the spine surgery department at our hospital. Extended PLOS was defined as a hospitalization duration equal to or exceeding the 75th percentile following spine surgery. To identify relevant variables, we employed several approaches, such as the least absolute shrinkage and selection operator (LASSO), recursive feature elimination (RFE) based on support vector machine classification (SVC), correlation analysis, and permutation importance value. Several models were implemented, and some of them were ensembled using soft voting. Models were constructed using grid search with nested cross-validation. The performance of each algorithm was assessed through various metrics, including the AUC value (area under the receiver operating characteristic curve) and the Brier Score. Model interpretation involved methods such as Shapley additive explanations (SHAP), the Gini Impurity Index, permutation importance, and local interpretable model-agnostic explanations (LIME). Furthermore, to facilitate the practical application of the model, a web-based interface was developed and deployed.
RESULTS The study included a cohort of 580 patients, and 11 features (CRP, transfusions, infusion volume, blood loss, X-ray bone bridge, X-ray osteophyte, CT-vertebral destruction, CT-paravertebral abscess, MRI-paravertebral abscess, MRI-epidural abscess, postoperative drainage) were selected. Most of the classifiers performed well, with the XGBoost model showing a higher AUC value (0.86) and a lower Brier Score (0.126). The XGBoost model was chosen as the optimal model. The calibration and decision curve analysis (DCA) plots demonstrate that XGBoost achieved promising performance. In tenfold cross-validation, the XGBoost model demonstrated a mean AUC of 0.85 ± 0.09. SHAP and LIME were used to display the variables' contributions to the predicted value. The stacked bar plots indicated that infusion volume was the primary contributor, as determined by Gini, permutation importance (PFI), and the LIME algorithm. CONCLUSIONS Our methods not only effectively predicted extended PLOS but also identified risk factors that can be utilized for future treatments. The XGBoost model developed in this study is easily accessible through the deployed web application and can aid in clinical research.
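Two of the ingredients named in the abstract, soft voting and the Brier Score, have simple definitions that a short sketch can make concrete. Soft voting averages the predicted positive-class probabilities of several models, and the Brier Score is the mean squared difference between predicted probability and the 0/1 outcome (lower is better). The probabilities and labels below are hypothetical and are not data from the study.

```python
def soft_vote(prob_lists):
    """Average per-sample positive-class probabilities across models."""
    n_models = len(prob_lists)
    return [sum(ps) / n_models for ps in zip(*prob_lists)]


def brier_score(y_true, y_prob):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


# Hypothetical outcomes (1 = extended stay) and model outputs (assumptions).
y_true = [1, 0, 1, 0]
model_a = [0.8, 0.2, 0.6, 0.4]
model_b = [0.6, 0.4, 0.8, 0.2]

ensemble = soft_vote([model_a, model_b])
print([round(p, 2) for p in ensemble])           # [0.7, 0.3, 0.7, 0.3]
print(round(brier_score(y_true, ensemble), 3))   # 0.09
```

A model that always predicts 0.5 would score 0.25 on this metric, which gives a rough reference point when reading reported values.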
Affiliation(s)
- Parhat Yasin
  - Department of Spine Surgery, The Sixth Affiliated Hospital of Xinjiang Medical University, Urumqi, 830000, Xinjiang, People's Republic of China
  - Department of Spine Surgery, The First Affiliated Hospital of Xinjiang Medical University, Urumqi, 830054, Xinjiang, People's Republic of China
- Yasen Yimit
  - Department of Radiology, The First People's Hospital of Kashi Prefecture, Kashi, 844000, Xinjiang, People's Republic of China
- Xiaoyu Cai
  - Department of Spine Surgery, The First Affiliated Hospital of Xinjiang Medical University, Urumqi, 830054, Xinjiang, People's Republic of China
- Abasi Aimaiti
  - Department of Anesthesiology, The First Affiliated Hospital of Xinjiang Medical University, Urumqi, 830054, Xinjiang, People's Republic of China
- Weibin Sheng
  - Department of Spine Surgery, The First Affiliated Hospital of Xinjiang Medical University, Urumqi, 830054, Xinjiang, People's Republic of China
- Mardan Mamat
  - Department of Spine Surgery, The First Affiliated Hospital of Xinjiang Medical University, Urumqi, 830054, Xinjiang, People's Republic of China.
- Mayidili Nijiati
  - Department of Radiology, The Fourth Affiliated Hospital of Xinjiang Medical University (Xinjiang Hospital of Traditional Chinese Medicine), Urumqi, 830002, Xinjiang, People's Republic of China.
  - Xinjiang Key Laboratory of Artificial Intelligence Assisted Imaging Diagnosis, Kashi, 844000, Xinjiang, People's Republic of China.
|
5
|
Chung J, Zhang J, Saimon AI, Liu Y, Johnson BN, Kong Z. Imbalanced spectral data analysis using data augmentation based on the generative adversarial network. Sci Rep 2024; 14:13230. [PMID: 38853181 PMCID: PMC11163007 DOI: 10.1038/s41598-024-63285-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/18/2024] [Accepted: 05/27/2024] [Indexed: 06/11/2024]
Abstract
Spectroscopic techniques generate one-dimensional spectra with distinct peaks and specific widths in the frequency domain. These features act as unique identities for material characteristics. Deep neural networks (DNNs) have recently been considered a powerful tool for automatically categorizing experimental spectral data by supervised classification to evaluate material characteristics. However, most existing work assumes balanced spectral data among the classes in the training data, contrary to actual experiments, where the spectral data are usually imbalanced. Imbalanced training data degrade supervised classification performance, hindering understanding of phase behavior, specifically the sol-gel transition (gelation) of soft materials and glycomaterials. To address this issue, this paper applies a novel data augmentation method based on a generative adversarial network (GAN) proposed by the authors in their prior work. The method consists of three DNNs: the generator, discriminator, and classifier. It generates samples that are not only authentic but also emphasize the differentiation between material characteristics to provide balanced training data, improving the classification results. To demonstrate the effectiveness of the method, actual imbalanced spectral data from Pluronic F-127 hydrogel and Alpha-Cyclodextrin hydrogel are used to classify the phases. Our approach improves on existing data augmentation methods by 8.8%, 6.4%, and 6.2% on average in the classifier's F-score, Precision, and Recall, respectively. Based on these validated results, we expect the method to find broader application in addressing imbalanced measurement data across diverse domains in materials science and chemical engineering.
Affiliation(s)
- Jihoon Chung
  - Department of Industrial Engineering, Pusan National University, Busan, South Korea
- Junru Zhang
  - Grado Department of Industrial and Systems Engineering, Virginia Tech, Blacksburg, VA, USA
- Amirul Islam Saimon
  - Grado Department of Industrial and Systems Engineering, Virginia Tech, Blacksburg, VA, USA
- Yang Liu
  - Grado Department of Industrial and Systems Engineering, Virginia Tech, Blacksburg, VA, USA
- Blake N Johnson
  - Grado Department of Industrial and Systems Engineering, Virginia Tech, Blacksburg, VA, USA.
- Zhenyu Kong
  - Grado Department of Industrial and Systems Engineering, Virginia Tech, Blacksburg, VA, USA.
|
6
|
Jia Q, Chen C, Xu A, Wang S, He X, Shen G, Luo Y, Tu H, Sun T, Wu X. A biological age model based on physical examination data to predict mortality in a Chinese population. iScience 2024; 27:108891. [PMID: 38384842 PMCID: PMC10879664 DOI: 10.1016/j.isci.2024.108891] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/23/2023] [Revised: 09/02/2023] [Accepted: 01/09/2024] [Indexed: 02/23/2024]
Abstract
Biological age can reflect an individual's health status and degree of aging. Few estimators of biological aging based on physical examination data have been developed in the Chinese population to quantify the rate of aging. We developed and validated a novel aging measure (Balanced-AGE) based on readily available physical health examination data. In this study, a repeated sub-sampling approach was applied to address the data imbalance issue, and this approach significantly improved the performance of biological age (Balanced-AGE) in predicting all-cause mortality, with a 10-year time-dependent AUC of 0.908. This mortality prediction tool was found to be effective across different subgroups by age, sex, smoking, and alcohol consumption status. Additionally, this study revealed that individuals who were underweight, smokers, or drinkers had a higher extent of age acceleration. The Balanced-AGE may serve as an effective and generally applicable tool for health assessment and management among the elderly population.
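The abstract does not describe how the repeated sub-sampling was carried out. One common reading, sketched here purely as an assumption, is to draw several random subsets of the majority class, each the size of the minority class, and pair each with all minority samples so that every round yields a balanced training set. The function name, record layout, and class labels below are hypothetical.

```python
import random


def repeated_subsamples(majority, minority, n_rounds, seed=0):
    """Yield n_rounds balanced sets: a random majority subset plus all minority samples.

    Assumes len(majority) >= len(minority); random.sample raises ValueError otherwise.
    """
    rng = random.Random(seed)
    k = len(minority)
    for _ in range(n_rounds):
        # Sample without replacement within a round; rounds are drawn independently.
        yield rng.sample(majority, k) + list(minority)


# Hypothetical records as (label, id) pairs: 100 survivors vs 10 deaths (assumption).
majority = [("alive", i) for i in range(100)]
minority = [("dead", i) for i in range(10)]

balanced_sets = list(repeated_subsamples(majority, minority, n_rounds=5))
print(len(balanced_sets), len(balanced_sets[0]))  # 5 20
```

A model can then be fitted on each balanced set and the results aggregated, so that most majority samples are eventually used without any single fit being dominated by the majority class.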
Affiliation(s)
- Qingqing Jia
  - Department of Big Data in Health Science School of Public Health, Center of Clinical Big Data and Analytics of The Second Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou 310058, China
- Chen Chen
  - Department of Big Data in Health Science School of Public Health, Center of Clinical Big Data and Analytics of The Second Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou 310058, China
- Andi Xu
  - Department of Big Data in Health Science School of Public Health, Center of Clinical Big Data and Analytics of The Second Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou 310058, China
- Sicong Wang
  - Department of Big Data in Health Science School of Public Health, Center of Clinical Big Data and Analytics of The Second Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou 310058, China
- Xiaojie He
  - Health Management Center, The Second Affiliated Hospital of Zhejiang University School of Medicine, Hangzhou 310009, China
- Guoli Shen
  - Health Management Center, The Second Affiliated Hospital of Zhejiang University School of Medicine, Hangzhou 310009, China
- Yihong Luo
  - Department of Big Data in Health Science School of Public Health, Center of Clinical Big Data and Analytics of The Second Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou 310058, China
- Huakang Tu
  - Department of Big Data in Health Science School of Public Health, Center of Clinical Big Data and Analytics of The Second Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou 310058, China
- Ting Sun
  - Health Management Center, The Second Affiliated Hospital of Zhejiang University School of Medicine, Hangzhou 310009, China
- Xifeng Wu
  - Department of Big Data in Health Science School of Public Health, Center of Clinical Big Data and Analytics of The Second Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou 310058, China
  - National Institute for Data Science in Health and Medicine, Zhejiang University, Hangzhou, Zhejiang, China
  - The Key Laboratory of Intelligent Preventive Medicine of Zhejiang Province, Hangzhou, Zhejiang, China
  - Cancer Center, Zhejiang University, Hangzhou, Zhejiang, China
  - School of Medicine and Health Science, George Washington University, Washington, DC, USA
|
7
|
Xie Y, Wan Q, Xie H, Xu Y, Wang T, Wang S, Lei B. Fundus Image-Label Pairs Synthesis and Retinopathy Screening via GANs With Class-Imbalanced Semi-Supervised Learning. IEEE Transactions on Medical Imaging 2023; 42:2714-2725. [PMID: 37030825 DOI: 10.1109/tmi.2023.3263216] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/19/2023]
Abstract
Retinopathy is the primary cause of irreversible yet preventable blindness. Numerous deep-learning algorithms have been developed for automatic retinal fundus image analysis. However, existing methods are usually data-driven and rarely consider the costs associated with fundus image collection and annotation, or the class-imbalanced distribution that arises from the relative scarcity of disease-positive individuals in the population. Semi-supervised learning on class-imbalanced data, despite being a realistic problem, has been relatively little studied. To fill this research gap, we explore generative adversarial networks (GANs) as a potential answer to that problem. Specifically, we present a novel framework, named CISSL-GANs, for class-imbalanced semi-supervised learning (CISSL) by leveraging a dynamic class-rebalancing (DCR) sampler, which exploits the property that a classifier trained on class-imbalanced data produces high-precision pseudo-labels on minority classes to leverage the bias inherent in pseudo-labels. Also, given the well-known difficulty of training GANs on complex data, we investigate three practical techniques to improve the training dynamics without altering the global equilibrium. Experimental results demonstrate that our CISSL-GANs are capable of simultaneously improving fundus image class-conditional generation and classification performance under a typical label-insufficient and imbalanced scenario. Our code is available at: https://github.com/Xyporz/CISSL-GANs.
|
8
|
An imbalanced binary classification method via space mapping using normalizing flows with class discrepancy constraints. Inf Sci (N Y) 2022. [DOI: 10.1016/j.ins.2022.12.029] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/23/2022]
|
9
|
|