1
Nejadshamsi S, Karami V, Ghourchian N, Armanfard N, Bergman H, Grad R, Wilchesky M, Khanassov V, Vedel I, Abbasgholizadeh Rahimi S. Development and Feasibility Study of HOPE Model for Prediction of Depression Among Older Adults Using Wi-Fi-based Motion Sensor Data: Machine Learning Study. JMIR Aging 2025; 8:e67715. [PMID: 40053734 PMCID: PMC11914842 DOI: 10.2196/67715] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/18/2024] [Revised: 12/12/2024] [Accepted: 12/19/2024] [Indexed: 03/09/2025] Open
Abstract
BACKGROUND Depression, characterized by persistent sadness and loss of interest in daily activities, greatly reduces quality of life. Early detection is vital for effective treatment and intervention. While many studies use wearable devices to classify depression based on physical activity, these often rely on intrusive methods. Additionally, most depression classification studies involve large participant groups and use single-stage classifiers without explainability. OBJECTIVE This study aims to assess the feasibility of classifying depression from nonintrusive Wi-Fi-based motion sensor data using a novel machine learning model on a limited number of participants. We also conduct an explainability analysis to interpret the model's predictions and identify key features associated with depression classification. METHODS In this study, we recruited adults aged 65 years and older through web-based and in-person methods, supported by a McGill University health care facility directory. Participants provided consent, and we collected 6 months of activity and sleep data via nonintrusive Wi-Fi-based sensors, along with Edmonton Frailty Scale and Geriatric Depression Scale data. For depression classification, we proposed a HOPE (Home-Based Older Adults' Depression Prediction) machine learning model with feature selection, dimensionality reduction, and classification stages, evaluating various model combinations using accuracy, sensitivity, precision, and F1-score. Shapley additive explanations and local interpretable model-agnostic explanations were used to explain the model's predictions. RESULTS A total of 6 participants were enrolled in this study; however, 2 participants withdrew later due to internet connectivity issues. Among the 4 remaining participants, 3 participants were classified as not having depression, while 1 participant was identified as having depression. 
The most accurate classification model, which combined sequential forward selection for feature selection, principal component analysis for dimensionality reduction, and a decision tree for classification, achieved an accuracy of 87.5%, sensitivity of 90%, and precision of 88.3%, effectively distinguishing individuals with and those without depression. The explainability analysis revealed that the most influential features in depression classification, in order of importance, were "average sleep duration," "total number of sleep interruptions," "percentage of nights with sleep interruptions," "average duration of sleep interruptions," and "Edmonton Frailty Scale." CONCLUSIONS The findings from this preliminary study demonstrate the feasibility of using Wi-Fi-based motion sensors for depression classification and highlight the effectiveness of our proposed HOPE machine learning model, even with a small sample size. These results suggest the potential for further research with a larger cohort for more comprehensive validation. Additionally, the nonintrusive data collection method and model architecture proposed in this study offer promising applications in remote health monitoring, particularly for older adults who may face challenges in using wearable devices. Furthermore, the importance of sleep patterns identified in our explainability analysis aligns with findings from previous research, emphasizing the need for more in-depth studies on the role of sleep in mental health, as suggested in the explainable machine learning study.
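The evaluation metrics named in the abstract above (accuracy, sensitivity, precision, and F1-score) all follow from the four cells of a binary confusion matrix. The sketch below is illustrative only: the counts are made-up placeholders, not the study's data, and `ConfusionCounts` and `classification_metrics` are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Cells of a binary confusion matrix (positive class = depression)."""
    tp: int  # true positives
    fp: int  # false positives
    tn: int  # true negatives
    fn: int  # false negatives


def classification_metrics(c: ConfusionCounts) -> dict:
    """Return accuracy, sensitivity (recall), precision, and F1-score."""
    total = c.tp + c.fp + c.tn + c.fn
    accuracy = (c.tp + c.tn) / total if total else 0.0
    sensitivity = c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0
    precision = c.tp / (c.tp + c.fp) if (c.tp + c.fp) else 0.0
    denom = precision + sensitivity
    # F1 is the harmonic mean of precision and sensitivity.
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "precision": precision,
        "f1": f1,
    }


# Placeholder counts chosen for illustration; NOT taken from the paper.
example = ConfusionCounts(tp=9, fp=2, tn=8, fn=1)
metrics = classification_metrics(example)
```

With very few participants, as in this study, each metric rests on a small number of cases, so reporting all four together gives a fuller picture than accuracy alone.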
Affiliation(s)
- Shayan Nejadshamsi
  - Mila-Quebec Artificial Intelligence Institute, Montreal, QC, Canada
  - Family Medicine Department, Faculty of Medicine and Health Sciences, McGill University, Montreal, QC, Canada
  - Lady Davis Institute for Medical Research, Jewish General Hospital, Montreal, QC, Canada
- Vania Karami
  - Mila-Quebec Artificial Intelligence Institute, Montreal, QC, Canada
  - Family Medicine Department, Faculty of Medicine and Health Sciences, McGill University, Montreal, QC, Canada
  - Lady Davis Institute for Medical Research, Jewish General Hospital, Montreal, QC, Canada
- Narges Armanfard
  - Mila-Quebec Artificial Intelligence Institute, Montreal, QC, Canada
  - Department of Electrical and Computer Engineering, Faculty of Engineering, McGill University, Montreal, QC, Canada
- Howard Bergman
  - Family Medicine Department, Faculty of Medicine and Health Sciences, McGill University, Montreal, QC, Canada
- Roland Grad
  - Family Medicine Department, Faculty of Medicine and Health Sciences, McGill University, Montreal, QC, Canada
- Machelle Wilchesky
  - Family Medicine Department, Faculty of Medicine and Health Sciences, McGill University, Montreal, QC, Canada
  - Lady Davis Institute for Medical Research, Jewish General Hospital, Montreal, QC, Canada
  - Donald Berman Maimonides Centre for Research in Aging, Montreal, QC, Canada
- Vladimir Khanassov
  - Family Medicine Department, Faculty of Medicine and Health Sciences, McGill University, Montreal, QC, Canada
- Isabelle Vedel
  - Family Medicine Department, Faculty of Medicine and Health Sciences, McGill University, Montreal, QC, Canada
- Samira Abbasgholizadeh Rahimi
  - Mila-Quebec Artificial Intelligence Institute, Montreal, QC, Canada
  - Family Medicine Department, Faculty of Medicine and Health Sciences, McGill University, Montreal, QC, Canada
  - Lady Davis Institute for Medical Research, Jewish General Hospital, Montreal, QC, Canada
  - Faculty of Dental Medicine and Oral Health Sciences, McGill University, Montreal, Canada

2
Ghavidel A, Pazos P. Machine learning (ML) techniques to predict breast cancer in imbalanced datasets: a systematic review. J Cancer Surviv 2025; 19:270-294. [PMID: 37749361 DOI: 10.1007/s11764-023-01465-3] [Citation(s) in RCA: 7] [Impact Index Per Article: 7.0] [Received: 08/02/2023] [Accepted: 09/09/2023] [Indexed: 09/27/2023]
Abstract
Knowledge discovery in databases (KDD) is crucial in analyzing data to extract valuable insights. In medical outcome prediction, KDD is increasingly applied, particularly in diseases with high incidence, mortality, and costs, like cancer. ML techniques can develop more accurate predictive models for cancer patients' clinical outcomes, aiding informed healthcare decision-making. However, cancer prediction modeling faces challenges because of the unbalanced nature of the datasets, where there is a small minority category of patients with a cancer diagnosis compared to a majority category of cancer-free patients. Imbalanced datasets pose statistical hurdles like bias and overfitting when developing accurate prediction models. This systematic review focuses on breast cancer prediction articles published from 2008 to 2023. The objective is to examine ML methods used in three critical steps of KDD: preprocessing, data mining, and interpretation which address the imbalanced data problem in breast cancer prediction. This work synthesizes prior research in ML methods for breast cancer prediction. The findings help identify effective preprocessing strategies, including balancing and feature selection methods, robust predictive models, and evaluation metrics of those models. The study aims to inform healthcare providers and researchers about effective techniques for accurate breast cancer prediction.
Affiliation(s)
- Arman Ghavidel
  - Engineering Management and Systems Engineering, Old Dominion University, Norfolk, VA, USA
- Pilar Pazos
  - Engineering Management and Systems Engineering, Old Dominion University, Norfolk, VA, USA

3
Gómez-Martínez V, Chushig-Muzo D, Veierød MB, Granja C, Soguero-Ruiz C. Ensemble feature selection and tabular data augmentation with generative adversarial networks to enhance cutaneous melanoma identification and interpretability. BioData Min 2024; 17:46. [PMID: 39478549 PMCID: PMC11526724 DOI: 10.1186/s13040-024-00397-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/10/2024] [Accepted: 10/09/2024] [Indexed: 11/02/2024] Open
Abstract
BACKGROUND Cutaneous melanoma is the most aggressive form of skin cancer, responsible for most skin cancer-related deaths. Recent advances in artificial intelligence, together with the availability of public dermoscopy image datasets, have made it possible to assist dermatologists in melanoma identification. While image feature extraction holds potential for melanoma detection, it often leads to high-dimensional data. Furthermore, most image datasets present the class imbalance problem, where a few classes have numerous samples, whereas others are under-represented. METHODS In this paper, we propose to combine ensemble feature selection (FS) methods and data augmentation with the conditional tabular generative adversarial networks (CTGAN) to enhance melanoma identification in imbalanced datasets. We employed dermoscopy images from two public datasets, PH2 and Derm7pt, which contain melanoma and not-melanoma lesions. To capture intrinsic information from skin lesions, we conducted two feature extraction (FE) approaches: handcrafted features and embedding features. For the former, color, geometric and first-, second-, and higher-order texture features were extracted, whereas for the latter, embeddings were obtained using ResNet-based models. To alleviate the high dimensionality resulting from FE, ensemble FS with filter methods was used and evaluated. For data augmentation, we conducted a progressive analysis of the imbalance ratio (IR), related to the number of synthetic samples created, and evaluated the impact on the predictive results. To gain interpretability on predictive models, we used SHAP, bootstrap resampling statistical tests and UMAP visualizations. RESULTS The combination of ensemble FS, CTGAN, and linear models yielded the best predictive results, with AUCROC values of 87% (with support vector machine and IR=0.9) and 76% (with LASSO and IR=1.0) for PH2 and Derm7pt, respectively. 
We also identified that melanoma lesions were mainly characterized by features related to color, while not-melanoma lesions were characterized by texture features. CONCLUSIONS Our results demonstrate the effectiveness of ensemble FS and synthetic data in the development of models that accurately identify melanoma. This research advances skin lesion analysis, contributing to both melanoma detection and the interpretation of main features for its identification.
Affiliation(s)
- Vanesa Gómez-Martínez
  - Department of Signal Theory and Communications, Telematics and Computing Systems, Rey Juan Carlos University, Madrid, 28943, Spain
- David Chushig-Muzo
  - Department of Signal Theory and Communications, Telematics and Computing Systems, Rey Juan Carlos University, Madrid, 28943, Spain
- Marit B Veierød
  - Oslo Centre for Biostatistics and Epidemiology, Department of Biostatistics, Institute of Basic Medical Sciences, University of Oslo, Oslo, Norway
- Conceição Granja
  - Norwegian Centre for E-health Research, University Hospital of North Norway, Tromsø, 9019, Norway
- Cristina Soguero-Ruiz
  - Department of Signal Theory and Communications, Telematics and Computing Systems, Rey Juan Carlos University, Madrid, 28943, Spain

4
Long D, Chan M, Han M, Kamdar Z, Ma RK, Tsai PY, Francisco AB, Barrow J, Shackelford DB, Yarchoan M, McBride MJ, Orre LM, Vacanti NM, Gujral TS, Sethupathy P. Proteo-metabolomics and patient tumor slice experiments point to amino acid centrality for rewired mitochondria in fibrolamellar carcinoma. Cell Rep Med 2024; 5:101699. [PMID: 39208801 PMCID: PMC11528240 DOI: 10.1016/j.xcrm.2024.101699] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/26/2024] [Revised: 06/12/2024] [Accepted: 08/03/2024] [Indexed: 09/04/2024]
Abstract
Fibrolamellar carcinoma (FLC) is a rare, lethal, early-onset liver cancer with a critical need for new therapeutics. The primary driver in FLC is the fusion oncoprotein, DNAJ-PKAc, which remains challenging to target therapeutically. It is critical, therefore, to expand understanding of the FLC molecular landscape to identify druggable pathways/targets. Here, we perform the most comprehensive integrative proteo-metabolomic analysis of FLC. We also conduct nutrient manipulation, respirometry analyses, as well as key loss-of-function assays in FLC tumor tissue slices from patients. We propose a model of cellular energetics in FLC pointing to proline anabolism being mediated by ornithine aminotransferase hyperactivity and ornithine transcarbamylase hypoactivity with serine and glutamine catabolism fueling the process. We highlight FLC's potential dependency on voltage-dependent anion channel (VDAC), a mitochondrial gatekeeper for anions including pyruvate. The metabolic rewiring in FLC that we propose in our model, with an emphasis on mitochondria, can be exploited for therapeutic vulnerabilities.
Affiliation(s)
- Donald Long
  - Department of Biomedical Sciences, College of Veterinary Medicine, Cornell University, Ithaca, NY, USA
- Marina Chan
  - Division of Human Biology, Fred Hutchinson Cancer Center, Seattle, WA, USA
- Mingqi Han
  - Jonsson Comprehensive Cancer Center, UCLA, Los Angeles, CA, USA
- Zeal Kamdar
  - Department of Oncology, Sidney Kimmel Comprehensive Cancer Center, Bloomberg-Kimmel Institute for Cancer Immunotherapy, Johns Hopkins University School of Medicine, Baltimore, MD, USA
- Rosanna K Ma
  - Department of Biomedical Sciences, College of Veterinary Medicine, Cornell University, Ithaca, NY, USA
- Pei-Yin Tsai
  - Division of Nutritional Sciences, Cornell University, Ithaca, NY, USA
- Adam B Francisco
  - Department of Biomedical Sciences, College of Veterinary Medicine, Cornell University, Ithaca, NY, USA
- Joeva Barrow
  - Division of Nutritional Sciences, Cornell University, Ithaca, NY, USA
- Mark Yarchoan
  - Department of Oncology, Sidney Kimmel Comprehensive Cancer Center, Bloomberg-Kimmel Institute for Cancer Immunotherapy, Johns Hopkins University School of Medicine, Baltimore, MD, USA
- Matthew J McBride
  - Department of Chemical Biology, Ernest Mario School of Pharmacy, Rutgers University, Piscataway, NJ, USA
- Lukas M Orre
  - Department of Oncology and Pathology, Karolinska Institute, SciLifeLab, Solna, Sweden
- Taranjit S Gujral
  - Division of Human Biology, Fred Hutchinson Cancer Center, Seattle, WA, USA
- Praveen Sethupathy
  - Department of Biomedical Sciences, College of Veterinary Medicine, Cornell University, Ithaca, NY, USA

5
Vahed SZ, Khatibi SMH, Saadat YR, Emdadi M, Khodaei B, Alishani MM, Boostani F, Dizaj SM, Pirmoradi S. Introducing effective genes in lymph node metastasis of breast cancer patients using SHAP values based on the mRNA expression data. PLoS One 2024; 19:e0308531. [PMID: 39150915 PMCID: PMC11329117 DOI: 10.1371/journal.pone.0308531] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/24/2024] [Accepted: 07/24/2024] [Indexed: 08/18/2024] Open
Abstract
OBJECTIVE Breast cancer, a global concern predominantly impacting women, poses a significant threat when not identified early. While survival rates for breast cancer patients are typically favorable, the emergence of regional metastases markedly diminishes survival prospects. Detecting metastases and comprehending their molecular underpinnings are crucial for tailoring effective treatments and improving patient survival outcomes. METHODS Various artificial intelligence methods and techniques were employed in this study to achieve accurate outcomes. Initially, the data was organized and underwent hold-out cross-validation, data cleaning, and normalization. Subsequently, feature selection was conducted using ANOVA and binary Particle Swarm Optimization (PSO). During the analysis phase, the discriminative power of the selected features was evaluated using machine learning classification algorithms. Finally, the selected features were considered, and the SHAP algorithm was utilized to identify the most significant features for enhancing the decoding of dominant molecular mechanisms in lymph node metastases. RESULTS In this study, five main steps were followed for the analysis of mRNA expression data: reading, preprocessing, feature selection, classification, and SHAP algorithm. The RF classifier utilized the candidate mRNAs to differentiate between negative and positive categories with an accuracy of 61% and an AUC of 0.6. During the SHAP process, intriguing relationships between the selected mRNAs and positive/negative lymph node status were discovered. The results indicate that GDF5, BAHCC1, LCN2, FGF14-AS2, and IDH2 are among the top five most impactful mRNAs based on their SHAP values. CONCLUSION The prominent identified mRNAs including GDF5, BAHCC1, LCN2, FGF14-AS2, and IDH2, are implicated in lymph node metastasis. 
This study holds promise in elucidating a thorough insight into key candidate genes that could significantly impact the early detection and tailored therapeutic strategies for lymph node metastasis in patients with breast cancer.
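ANOVA-based feature selection, the first filter step mentioned in the abstract above, scores each feature by the one-way F statistic (between-group variance divided by within-group variance) and keeps the top-scoring features. The following is a minimal pure-Python sketch with invented toy values; `anova_f`, `rank_features`, and the data are assumptions for illustration, and the binary PSO stage described in the abstract is not shown.

```python
import math
from statistics import fmean


def anova_f(groups):
    """One-way ANOVA F statistic for a list of per-class value lists."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    if k < 2 or n <= k or any(len(g) == 0 for g in groups):
        raise ValueError("need at least 2 non-empty groups and n > k")
    grand_mean = fmean([x for g in groups for x in g])
    means = [fmean(g) for g in groups]
    # Variation of the class means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the samples around their own class mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    if ss_within == 0:
        return math.inf if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def rank_features(rows, labels, top_k):
    """Return indices of the top_k features with the highest F statistic."""
    classes = sorted(set(labels))
    scores = []
    for j in range(len(rows[0])):
        groups = [[row[j] for row, y in zip(rows, labels) if y == c] for c in classes]
        scores.append((anova_f(groups), j))
    scores.sort(key=lambda s: s[0], reverse=True)
    return [j for _, j in scores[:top_k]]


# Toy expression matrix (rows = samples, columns = features); values are invented.
toy_rows = [[1.0, 5.0], [2.0, 4.0], [3.0, 6.0], [4.0, 5.0], [5.0, 4.0], [6.0, 6.0]]
toy_labels = [0, 0, 0, 1, 1, 1]  # hypothetical node-negative / node-positive labels
selected = rank_features(toy_rows, toy_labels, top_k=1)
```

In the toy data, only the first column differs between the two classes, so it is the one retained. A filter like this treats each feature independently, which is why it is typically followed by a wrapper or embedded method.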
Affiliation(s)
- Seyed Mahdi Hosseiniyan Khatibi
  - Kidney Research Center, Tabriz University of Medical Sciences, Tabriz, Iran
  - Rahat Breath and Sleep Research Center, Tabriz University of Medical Science, Tabriz, Iran
- Manijeh Emdadi
  - Department of Computer Engineering, Abadan Branch, Islamic Azad University, Abadan, Iran
- Bahareh Khodaei
  - Clinical Research Development Unit of Tabriz Valiasr Hospital, Tabriz University of Medical Sciences, Tabriz, Iran
- Mohammad Matin Alishani
  - Department of Computer Science, Faculty of Information Technology, University of Shahid Madani of Tabriz, Tabriz, Iran
- Farnaz Boostani
  - Clinical Research Development Unit of Tabriz Valiasr Hospital, Tabriz University of Medical Sciences, Tabriz, Iran
- Solmaz Maleki Dizaj
  - Dental and Periodontal Research Center, Tabriz University of Medical Sciences, Tabriz, Iran
- Saeed Pirmoradi
  - Clinical Research Development Unit of Tabriz Valiasr Hospital, Tabriz University of Medical Sciences, Tabriz, Iran

6
Jain S, Safo SE. DeepIDA-GRU: a deep learning pipeline for integrative discriminant analysis of cross-sectional and longitudinal multiview data with applications to inflammatory bowel disease classification. Brief Bioinform 2024; 25:bbae339. [PMID: 39007595 PMCID: PMC11771283 DOI: 10.1093/bib/bbae339] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/08/2023] [Revised: 02/29/2024] [Accepted: 06/28/2024] [Indexed: 07/16/2024] Open
Abstract
Biomedical research now commonly integrates diverse data types or views from the same individuals to better understand the pathobiology of complex diseases, but the challenge lies in meaningfully integrating these diverse views. Existing methods often require the same type of data from all views (cross-sectional data only or longitudinal data only) or do not consider any class outcome in the integration method, which presents limitations. To overcome these limitations, we have developed a pipeline that harnesses the power of statistical and deep learning methods to integrate cross-sectional and longitudinal data from multiple sources. In addition, it identifies key variables that contribute to the association between views and the separation between classes, providing deeper biological insights. This pipeline includes variable selection/ranking using linear and nonlinear methods, feature extraction using functional principal component analysis and Euler characteristics, and joint integration and classification using dense feed-forward networks for cross-sectional data and recurrent neural networks for longitudinal data. We applied this pipeline to cross-sectional and longitudinal multiomics data (metagenomics, transcriptomics and metabolomics) from an inflammatory bowel disease (IBD) study and identified microbial pathways, metabolites and genes that discriminate by IBD status, providing information on the etiology of IBD. We conducted simulations to compare the two feature extraction methods.
Affiliation(s)
- Sarthak Jain
  - Department of Electrical Engineering, University of Minnesota, Minneapolis, MN 55455, United States
- Sandra E Safo
  - Division of Biostatistics and Health Data Science, University of Minnesota, Minneapolis, MN 55455, United States

7
Tang S, Li Z. EEG complexity measures for detecting mind wandering during video-based learning. Sci Rep 2024; 14:8209. [PMID: 38589498 PMCID: PMC11001605 DOI: 10.1038/s41598-024-58889-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/15/2024] [Accepted: 04/04/2024] [Indexed: 04/10/2024] Open
Abstract
This study explores the efficacy of various EEG complexity measures in detecting mind wandering during video-based learning. Employing a modified probe-caught method, we recorded EEG data from participants engaged in viewing educational videos and subsequently focused on the discrimination between mind wandering (MW) and non-MW states. We systematically investigated various EEG complexity metrics, including metrics that reflect a system's regularity like multiscale permutation entropy (MPE), and metrics that reflect a system's dimensionality like detrended fluctuation analysis (DFA). We also compare these features to traditional band power (BP) features. Data augmentation methods and feature selection were applied to optimize detection accuracy. Results show BP features excelled (mean area under the receiver operating characteristic curve (AUC) 0.646) in datasets without eye-movement artifacts, while MPE showed similar performance (mean AUC 0.639) without requiring removal of eye-movement artifacts. Combining all kinds of features improved decoding performance to 0.66 mean AUC. Our findings demonstrate the potential of these complexity metrics in EEG analysis for mind wandering detection, highlighting their practical implications in educational contexts.
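Permutation entropy, the quantity underlying the MPE features mentioned above, counts how often each ordinal pattern occurs in short windows of a signal and takes the Shannon entropy of that distribution; the multiscale variant repeats this on coarse-grained copies of the signal. The sketch below is a minimal pure-Python version under assumed defaults: `order`, `delay`, and `scales` are illustrative choices rather than the study's settings, and mean-based coarse-graining is one common convention, not necessarily the authors'.

```python
import math
from collections import Counter


def permutation_entropy(signal, order=3, delay=1, normalize=True):
    """Shannon entropy (bits) of the ordinal patterns of length `order`."""
    n_windows = len(signal) - (order - 1) * delay
    if order < 2 or n_windows <= 0:
        raise ValueError("signal too short for the requested order/delay")
    patterns = Counter()
    for i in range(n_windows):
        window = [signal[i + j * delay] for j in range(order)]
        # Ordinal pattern = indices that sort the window (ties keep index order).
        patterns[tuple(sorted(range(order), key=window.__getitem__))] += 1
    entropy = -sum((c / n_windows) * math.log2(c / n_windows) for c in patterns.values())
    if normalize:
        entropy /= math.log2(math.factorial(order))  # scale to [0, 1]
    return entropy


def coarse_grain(signal, scale):
    """Average consecutive, non-overlapping blocks of `scale` samples."""
    n_blocks = len(signal) // scale
    return [sum(signal[b * scale:(b + 1) * scale]) / scale for b in range(n_blocks)]


def multiscale_permutation_entropy(signal, order=3, scales=(1, 2, 3)):
    """Permutation entropy of the signal at each coarse-graining scale."""
    return [permutation_entropy(coarse_grain(signal, s), order=order) for s in scales]


# Short textbook-style series; with order=2 there are 4 rising and 2 falling pairs.
series = [4, 7, 9, 10, 6, 11, 3]
pe_order2 = permutation_entropy(series, order=2)
```

A perfectly regular (for example, strictly increasing) signal has a single ordinal pattern and therefore zero permutation entropy, whereas an irregular signal approaches 1 after normalization.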
Affiliation(s)
- Shaohua Tang
  - School of Systems Science, Beijing Normal University, Beijing, China
  - International Academic Center of Complex Systems, Beijing Normal University, Zhuhai, China
  - Center for Cognition and Neuroergonomics, State Key Laboratory of Cognitive Neuroscience and Learning, Beijing Normal University, Zhuhai, China
- Zheng Li
  - Center for Cognition and Neuroergonomics, State Key Laboratory of Cognitive Neuroscience and Learning, Beijing Normal University, Zhuhai, China

8
Pirmoradi S, Hosseiniyan Khatibi SM, Zununi Vahed S, Homaei Rad H, Khamaneh AM, Akbarpour Z, Seyedrezazadeh E, Teshnehlab M, Chapman KR, Ansarin K. Unraveling the link between PTBP1 and severe asthma through machine learning and association rule mining method. Sci Rep 2023; 13:15399. [PMID: 37717070 PMCID: PMC10505163 DOI: 10.1038/s41598-023-42581-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/22/2023] [Accepted: 09/12/2023] [Indexed: 09/18/2023] Open
Abstract
Severe asthma is a chronic inflammatory airway disease with great therapeutic challenges. Understanding the genetic and molecular mechanisms of severe asthma may help identify therapeutic strategies for this complex condition. RNA expression data were analyzed using a combination of artificial intelligence methods to identify novel genes related to severe asthma. Through the ANOVA feature selection approach, 100 candidate genes were selected among 54,715 mRNAs in blood samples from the severe asthma and healthy groups. A deep learning model was used to validate the significance of the candidate genes. The accuracy, F1-score, AUC-ROC, and precision obtained with the 100 genes were 83%, 0.86, 0.89, and 0.9, respectively. To discover hidden associations among selected genes, association rule mining was applied. The top 20 genes, including PTBP1, RAB11FIP3, APH1A, and MYD88, were recognized as the most frequent items among severe asthma association rules. PTBP1 was the gene most frequently associated with severe asthma among those 20 genes. Identification of master genes involved in the initiation and development of asthma can offer novel targets for its diagnosis, prognosis, and targeted-signaling therapy.
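Association rule mining, as used in the abstract above, looks for item combinations that co-occur often (support) and for rules "A implies B" that hold in most records containing A (confidence). The sketch below is a brute-force illustration restricted to item pairs, not the authors' pipeline; the item names (`gene_A`, `gene_B`, `severe_asthma`) and transactions are invented placeholders, and `support` and `pairwise_rules` are hypothetical helper names.

```python
from itertools import combinations


def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def pairwise_rules(transactions, min_support, min_confidence):
    """Return (antecedent, consequent, support, confidence) for item pairs."""
    items = sorted(set().union(*transactions))
    found = []
    for a, b in combinations(items, 2):
        pair_support = support(transactions, {a, b})
        if pair_support < min_support:
            continue  # infrequent pair: skip both rule directions
        for antecedent, consequent in ((a, b), (b, a)):
            # confidence(A -> B) = support(A and B) / support(A)
            confidence = pair_support / support(transactions, {antecedent})
            if confidence >= min_confidence:
                found.append((antecedent, consequent, pair_support, confidence))
    return found


# Invented records: each set lists the "items" observed for one sample.
toy_transactions = [
    frozenset({"gene_A", "gene_B", "severe_asthma"}),
    frozenset({"gene_A", "severe_asthma"}),
    frozenset({"gene_B"}),
    frozenset({"gene_A", "gene_B", "severe_asthma"}),
]
toy_rules = pairwise_rules(toy_transactions, min_support=0.6, min_confidence=0.9)
```

Counting how often each item appears across the retained rules is one simple way to arrive at a "most frequent item" ranking of the kind the abstract reports; practical implementations use Apriori or FP-growth rather than brute-force enumeration.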
Affiliation(s)
- Saeed Pirmoradi
  - Clinical Research Development Unit of Tabriz Valiasr Hospital, Tabriz University of Medical Sciences, Tabriz, Iran
- Seyed Mahdi Hosseiniyan Khatibi
  - Kidney Research Center, Tabriz University of Medical Sciences, Tabriz, Iran
  - Rahat Breath and Sleep Research Center, Tabriz University of Medical Science, Tabriz, Iran
- Hamed Homaei Rad
  - Rahat Breath and Sleep Research Center, Tabriz University of Medical Science, Tabriz, Iran
- Amir Mahdi Khamaneh
  - Faculty of Advanced Medical Sciences, Tabriz University of Medical Sciences, Tabriz, Iran
- Zahra Akbarpour
  - Rahat Breath and Sleep Research Center, Tabriz University of Medical Science, Tabriz, Iran
- Ensiyeh Seyedrezazadeh
  - Tuberculosis and Lung Disease Research Center, Tabriz University of Medical Sciences, Tabriz, Iran
- Mohammad Teshnehlab
  - Department of Electric and Computer Engineering, K.N. Toosi University of Technology, Tehran, Iran
- Kenneth R Chapman
  - Division of Respiratory Medicine, Department of Medicine, University of Toronto, Toronto, ON, Canada
- Khalil Ansarin
  - Rahat Breath and Sleep Research Center, Tabriz University of Medical Science, Tabriz, Iran

9
Wu X, Jia W. Multimodal deep learning as a next challenge in nutrition research: tailoring fermented dairy products based on cytidine diphosphate-diacylglycerol synthase-mediated lipid metabolism. Crit Rev Food Sci Nutr 2023; 64:12272-12283. [PMID: 37615630 DOI: 10.1080/10408398.2023.2248633] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/25/2023]
Abstract
Deep learning is evolving in nutritional epidemiology to address challenges including precise nutrition and data-driven disease modeling. Consumption of fermented dairy products, as the implementation of a specific dietary priority, contributes to a lower risk of all-cause mortality, cardiovascular disease, and obesity. Various lipid types play different roles in cardiometabolic health, and the fermentation process changes the lipid profile of dairy products. Leveraging multiple biological datasets can provide mechanistic insights into how proteins impact lipid pathways and establish connections among fermentation, lipid biomarkers, and proteins. Deep learning has recently been applied to food category recognition, agro-food freshness detection, and food flavor prediction and regulation. The proposed multimodal deep learning method includes four steps: (i) forming data matrices based on data generated from different omics layers; (ii) decomposing high-dimensional omics data according to a self-attention mechanism; (iii) constructing a View Correlation Discovery Network to learn the cross-omics correlations and integrate different omics datasets; and (iv) depicting a biological network for lipid metabolism-centered quantitative multi-omics data analysis. Cytidine diphosphate-diacylglycerol synthase-mediated lipid metabolism effectively regulates the glycerophospholipid composition of fermented dairy. Innovative processing strategies including ohmic heating and pulsed electric field improve the sensory qualities and nutritional characteristics of the products.
Affiliation(s)
- Xixuan Wu
  - School of Food and Biological Engineering, Shaanxi University of Science and Technology, Xi'an, China
- Wei Jia
  - School of Food and Biological Engineering, Shaanxi University of Science and Technology, Xi'an, China
  - Shaanxi Research Institute of Agricultural Products Processing Technology, Xi'an, China

10
Jiang X, Hu Z, Wang S, Zhang Y. Deep Learning for Medical Image-Based Cancer Diagnosis. Cancers (Basel) 2023; 15:3608. [PMID: 37509272 PMCID: PMC10377683 DOI: 10.3390/cancers15143608] [Citation(s) in RCA: 40] [Impact Index Per Article: 20.0] [Received: 06/22/2023] [Revised: 07/10/2023] [Accepted: 07/10/2023] [Indexed: 07/30/2023] Open
Abstract
(1) Background: The application of deep learning technology to realize cancer diagnosis based on medical images is one of the research hotspots in the field of artificial intelligence and computer vision. Due to the rapid development of deep learning methods, cancer diagnosis requires very high accuracy and timeliness as well as the inherent particularity and complexity of medical imaging. A comprehensive review of relevant studies is necessary to help readers better understand the current research status and ideas. (2) Methods: Five radiological images, including X-ray, ultrasound (US), computed tomography (CT), magnetic resonance imaging (MRI), positron emission computed tomography (PET), and histopathological images, are reviewed in this paper. The basic architecture of deep learning and classical pretrained models are comprehensively reviewed. In particular, advanced neural networks emerging in recent years, including transfer learning, ensemble learning (EL), graph neural network, and vision transformer (ViT), are introduced. Five overfitting prevention methods are summarized: batch normalization, dropout, weight initialization, and data augmentation. The application of deep learning technology in medical image-based cancer analysis is sorted out. (3) Results: Deep learning has achieved great success in medical image-based cancer diagnosis, showing good results in image classification, image reconstruction, image detection, image segmentation, image registration, and image synthesis. However, the lack of high-quality labeled datasets limits the role of deep learning and faces challenges in rare cancer diagnosis, multi-modal image fusion, model explainability, and generalization. (4) Conclusions: There is a need for more public standard databases for cancer. The pre-training model based on deep neural networks has the potential to be improved, and special attention should be paid to the research of multimodal data fusion and supervised paradigm. 
Technologies such as ViT, ensemble learning, and few-shot learning will bring surprises to cancer diagnosis based on medical images.
Grants
- MC_PC_17171 MRC, UK
- RP202G0230 Royal Society, UK
- AA/18/3/34220 BHF, UK
- RM60G0680 Hope Foundation for Cancer Research, UK
- P202PF11 GCRF, UK
- RP202G0289 Sino-UK Industrial Fund, UK
- P202ED10, P202RE969 LIAS, UK
- P202RE237 Data Science Enhancement Fund, UK
- 24NN201 Fight for Sight, UK
- OP202006 Sino-UK Education Fund, UK
- RM32G0178B8 BBSRC, UK
- 2023SJZD125 Major project of philosophy and social science research in colleges and universities in Jiangsu Province, China
Affiliation(s)
- Xiaoyan Jiang
- School of Mathematics and Information Science, Nanjing Normal University of Special Education, Nanjing 210038, China
- Zuojin Hu
- School of Mathematics and Information Science, Nanjing Normal University of Special Education, Nanjing 210038, China
- Shuihua Wang
- School of Computing and Mathematical Sciences, University of Leicester, Leicester LE1 7RH, UK
- Yudong Zhang
- School of Computing and Mathematical Sciences, University of Leicester, Leicester LE1 7RH, UK
|
11
|
Khan Mamun MMR, Elfouly T. Detection of Cardiovascular Disease from Clinical Parameters Using a One-Dimensional Convolutional Neural Network. Bioengineering (Basel) 2023; 10:796. [PMID: 37508823 PMCID: PMC10376462 DOI: 10.3390/bioengineering10070796] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/01/2023] [Revised: 06/29/2023] [Accepted: 06/30/2023] [Indexed: 07/30/2023] Open
Abstract
Heart disease is a significant public health problem, and early detection is crucial for effective treatment and management. Conventional and noninvasive techniques are cumbersome, time-consuming, inconvenient, expensive, and unsuitable for frequent measurement or diagnosis. With the advance of artificial intelligence (AI), new invasive techniques emerging in research are detecting heart conditions using machine learning (ML) and deep learning (DL). Machine learning models have been used with the publicly available dataset from the internet about heart health; in contrast, deep learning techniques have recently been applied to analyze electrocardiograms (ECG) or similar vital data to detect heart diseases. Significant limitations of these datasets are their small size regarding the number of patients and features and the fact that many are imbalanced datasets. Furthermore, the trained models must be more reliable and accurate in medical settings. This study proposes a hybrid one-dimensional convolutional neural network (1D CNN), which uses a large dataset accumulated from online survey data and selected features using feature selection algorithms. The 1D CNN proved to show better accuracy compared to contemporary machine learning algorithms and artificial neural networks. The non-coronary heart disease (no-CHD) and CHD validation data showed an accuracy of 80.1% and 76.9%, respectively. The model was compared with an artificial neural network, random forest, AdaBoost, and a support vector machine. Overall, 1D CNN proved to show better performance in terms of accuracy, false negative rates, and false positive rates. Similar strategies were applied for four more heart conditions, and the analysis proved that using the hybrid 1D CNN produced better accuracy.
Affiliation(s)
- Tarek Elfouly
- Department of Electrical and Computer Engineering, Tennessee Technological University, Cookeville, TN 38505, USA
|
12
|
Lin Y, Ma J, Sun DW, Cheng JH, Wang Q. A pH-Responsive colourimetric sensor array based on machine learning for real-time monitoring of beef freshness. Food Control 2023. [DOI: 10.1016/j.foodcont.2023.109729] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/12/2023]
|
13
|
Dimitsaki S, Gavriilidis GI, Dimitriadis VK, Natsiavas P. Benchmarking of Machine Learning classifiers on plasma proteomic for COVID-19 severity prediction through interpretable artificial intelligence. Artif Intell Med 2023; 137:102490. [PMID: 36868685 PMCID: PMC9846931 DOI: 10.1016/j.artmed.2023.102490] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/18/2022] [Revised: 01/10/2023] [Accepted: 01/11/2023] [Indexed: 01/19/2023]
Abstract
The SARS-CoV-2 pandemic highlighted the need for software tools that could facilitate patient triage regarding potential disease severity or even death. In this article, an ensemble of Machine Learning (ML) algorithms is evaluated in terms of predicting the severity of their condition using plasma proteomics and clinical data as input. An overview of AI-based technical developments to support COVID-19 patient management is presented outlining the landscape of relevant technical developments. Based on this review, the use of an ensemble of ML algorithms that analyze clinical and biological data (i.e., plasma proteomics) of COVID-19 patients is designed and deployed to evaluate the potential use of AI for early COVID-19 patient triage. The proposed pipeline is evaluated using three publicly available datasets for training and testing. Three ML "tasks" are defined, and several algorithms are tested through a hyperparameter tuning method to identify the highest-performance models. As overfitting is one of the typical pitfalls for such approaches (mainly due to the size of the training/validation datasets), a variety of evaluation metrics are used to mitigate this risk. In the evaluation procedure, recall scores ranged from 0.6 to 0.74 and F1-score from 0.62 to 0.75. The best performance is observed via Multi-Layer Perceptron (MLP) and Support Vector Machines (SVM) algorithms. Additionally, input data (proteomics and clinical data) were ranked based on corresponding Shapley additive explanation (SHAP) values and evaluated for their prognosticated capacity and immuno-biological credence. This "interpretable" approach revealed that our ML models could discern critical COVID-19 cases predominantly based on patient's age and plasma proteins on B cell dysfunction, hyper-activation of inflammatory pathways like Toll-like receptors, and hypo-activation of developmental and immune pathways like SCF/c-Kit signaling. 
Finally, the computational workflow presented here is corroborated in an independent dataset, confirming both MLP superiority and the implication of the abovementioned predictive biological pathways. Regarding limitations of the presented ML pipeline, the datasets used in this study contain fewer than 1000 observations and a significant number of input features, hence constituting a high-dimensional low-sample (HDLS) dataset, which could be sensitive to overfitting. An advantage of the proposed pipeline is that it combines biological data (plasma proteomics) with clinical-phenotypic data. Thus, in principle, the presented approach could enable patient triage in a timely fashion if used on already trained models. However, larger datasets and further systematic validation are needed to confirm the potential clinical value of this approach. The code is available on Github: https://github.com/inab-certh/Predicting-COVID-19-severity-through-interpretable-AI-analysis-of-plasma-proteomics.
Affiliation(s)
- Stella Dimitsaki
- Institute of Applied Biosciences, Centre for Research & Technology Hellas, Thermi, Thessaloniki, Greece
- George I Gavriilidis
- Institute of Applied Biosciences, Centre for Research & Technology Hellas, Thermi, Thessaloniki, Greece
- Vlasios K Dimitriadis
- Institute of Applied Biosciences, Centre for Research & Technology Hellas, Thermi, Thessaloniki, Greece
- Pantelis Natsiavas
- Institute of Applied Biosciences, Centre for Research & Technology Hellas, Thermi, Thessaloniki, Greece
|
14
|
Re-ranking and TOPSIS-based ensemble feature selection with multi-stage aggregation for text categorization. Pattern Recognit Lett 2023. [DOI: 10.1016/j.patrec.2023.02.027] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/03/2023]
|
15
|
Wu Y, Zhu D, Wang X. Tree enhanced deep adaptive network for cancer prediction with high dimension low sample size microarray data. Appl Soft Comput 2023. [DOI: 10.1016/j.asoc.2023.110078] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/05/2023]
|
16
|
Improved swarm-optimization-based filter-wrapper gene selection from microarray data for gene expression tumor classification. Pattern Anal Appl 2022. [DOI: 10.1007/s10044-022-01117-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/09/2022]
|
17
|
Nematzadeh H, García-Nieto J, Navas-Delgado I, Aldana-Montes JF. Automatic frequency-based feature selection using discrete weighted evolution strategy. Appl Soft Comput 2022. [DOI: 10.1016/j.asoc.2022.109699] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/27/2022]
|
18
|
Colombelli F, Kowalski TW, Recamonde-Mendoza M. A hybrid ensemble feature selection design for candidate biomarkers discovery from transcriptome profiles. Knowl Based Syst 2022. [DOI: 10.1016/j.knosys.2022.109655] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/15/2022]
|
19
|
Panels of mRNAs and miRNAs for decoding molecular mechanisms of Renal Cell Carcinoma (RCC) subtypes utilizing Artificial Intelligence approaches. Sci Rep 2022; 12:16393. [PMID: 36180558 PMCID: PMC9525704 DOI: 10.1038/s41598-022-20783-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/21/2022] [Accepted: 09/19/2022] [Indexed: 11/12/2022] Open
Abstract
Renal Cell Carcinoma (RCC) encompasses three histological subtypes, including clear cell RCC (KIRC), papillary RCC (KIRP), and chromophobe RCC (KICH), each of which has different clinical courses, genetic/epigenetic drivers, and therapeutic responses. This study aimed to identify the significant mRNA and microRNA panels involved in the pathogenesis of RCC subtypes. The mRNA and microRNA transcript profiles were obtained from The Cancer Genome Atlas (TCGA), which included 611 ccRCC patients, 321 pRCC patients, and 89 chRCC patients for mRNA data and 616 patients in the ccRCC subtype, 326 patients in the pRCC subtype, and 91 patients in the chRCC subtype for miRNA data. To identify mRNAs and miRNAs, feature selection based on filter and graph algorithms was applied. Then, a deep model was used to classify the subtypes of RCC. Finally, an association rule mining algorithm was used to disclose features with significant roles in triggering the molecular mechanisms that cause RCC subtypes. Panels of 77 mRNAs and 73 miRNAs could discriminate the KIRC, KIRP, and KICH subtypes from each other with 92% (F1-score ≥ 0.9, AUC ≥ 0.89) and 95% accuracy (F1-score ≥ 0.93, AUC ≥ 0.95), respectively. The Association Rule Mining analysis could identify miR-28 (repeat count = 2642) and CSN7A (repeat count = 5794) along with miR-125a (repeat count = 2591) and NMD3 (repeat count = 2306) with the highest repeat counts in the KIRC and KIRP rules, respectively. This study found new panels of mRNAs and miRNAs to distinguish among RCC subtypes, which were able to provide new insights into the underlying mechanisms responsible for the initiation and progression of KIRC and KIRP. The proposed mRNA and miRNA panels have a high potential to serve as biomarkers of RCC subtypes and should be examined in future clinical studies.
|
20
|
Network-based dimensionality reduction of high-dimensional, low-sample-size datasets. Knowl Based Syst 2022. [DOI: 10.1016/j.knosys.2022.109180] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/23/2022]
|
21
|
Prasetiyowati MI, Maulidevi NU, Surendro K. The accuracy of Random Forest performance can be improved by conducting a feature selection with a balancing strategy. PeerJ Comput Sci 2022; 8:e1041. [PMID: 35875646 PMCID: PMC9299283 DOI: 10.7717/peerj-cs.1041] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 12/29/2021] [Accepted: 06/22/2022] [Indexed: 06/12/2023]
Abstract
One of the significant purposes of building a model is to increase its accuracy within a shorter timeframe through the feature selection process. It is carried out by determining the importance of available features in a dataset using Information Gain (IG). The process is used to calculate the amounts of information contained in features with high values selected to accelerate the performance of an algorithm. In selecting informative features, a threshold value (cut-off) is used by the Information Gain (IG). Therefore, this research aims to determine the time and accuracy-performance needed to improve feature selection by integrating IG, the Fast Fourier Transform (FFT), and Synthetic Minority Oversampling Technique (SMOTE) methods. The feature selection model is then applied to the Random Forest, a tree-based machine learning algorithm with random feature selection. A total of eight datasets consisting of three balanced and five imbalanced datasets were used to conduct this research. Furthermore, the SMOTE found in the imbalance dataset was used to balance the data. The result showed that the feature selection using Information Gain, FFT, and SMOTE improved the performance accuracy of Random Forest.
Affiliation(s)
- Maria Irmina Prasetiyowati
- Doctoral Program of Electrical Engineering and Informatics, School of Electrical Engineering and Informatics, Institut Teknologi Bandung, Bandung, Jawa Barat, Indonesia
- Nur Ulfa Maulidevi
- Department of Electrical Engineering and Informatics, School of Electrical Engineering and Informatics, Institut Teknologi Bandung, Bandung, Jawa Barat, Indonesia
- Kridanto Surendro
- Department of Electrical Engineering and Informatics, School of Electrical Engineering and Informatics, Institut Teknologi Bandung, Bandung, Jawa Barat, Indonesia
|
22
|
An ensemble framework for microarray data classification based on feature subspace partitioning. Comput Biol Med 2022; 148:105820. [PMID: 35872409 DOI: 10.1016/j.compbiomed.2022.105820] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 03/20/2022] [Revised: 06/05/2022] [Accepted: 07/03/2022] [Indexed: 12/14/2022]
Abstract
Feature selection is exposed to the curse of dimensionality risk, and it is even more exacerbated with high-dimensional data such as microarrays. Moreover, the low-instance/high-feature (LIHF) property of microarray data needs considerable processing time to do some calculations and comparisons among features to choose the best subset of them, which has led to many efforts to subdue the LIHF property of such genomic medicine data. Due to the promising results of the ensemble models in machine learning problems, this paper presents a novel framework, named feature-level aggregation-based ensemble based on overlapped feature subspace partitioning (FLAE-OFSP) for microarray data classification. The proposed ensemble has three main steps: after generating several subsets by the proposed partitioning approach, a feature selection algorithm (i.e., a feature ranker) is applied on each subset, and finally, their results are combined into a single ranked list using six defined aggregation functions. Evaluation of the presented framework based on seven microarray datasets and using four measures, including stability, classification accuracy, runtime, and Modscore shows substantial runtime improvement and also quality results in other evaluated measures compared to individual methods.
|
23
|
Pudjihartono N, Fadason T, Kempa-Liehr AW, O'Sullivan JM. A Review of Feature Selection Methods for Machine Learning-Based Disease Risk Prediction. FRONTIERS IN BIOINFORMATICS 2022; 2:927312. [PMID: 36304293 PMCID: PMC9580915 DOI: 10.3389/fbinf.2022.927312] [Citation(s) in RCA: 172] [Impact Index Per Article: 57.3] [Received: 04/24/2022] [Accepted: 06/03/2022] [Indexed: 01/14/2023] Open
Abstract
Machine learning has shown utility in detecting patterns within large, unstructured, and complex datasets. One of the promising applications of machine learning is in precision medicine, where disease risk is predicted using patient genetic data. However, creating an accurate prediction model based on genotype data remains challenging due to the so-called “curse of dimensionality” (i.e., extensively larger number of features compared to the number of samples). Therefore, the generalizability of machine learning models benefits from feature selection, which aims to extract only the most “informative” features and remove noisy “non-informative,” irrelevant and redundant features. In this article, we provide a general overview of the different feature selection methods, their advantages, disadvantages, and use cases, focusing on the detection of relevant features (i.e., SNPs) for disease risk prediction.
Affiliation(s)
- Tayaza Fadason
- Liggins Institute, University of Auckland, Auckland, New Zealand
- Maurice Wilkins Centre for Molecular Biodiscovery, Auckland, New Zealand
- Andreas W. Kempa-Liehr
- Department of Engineering Science, The University of Auckland, Auckland, New Zealand
- *Correspondence: Andreas W. Kempa-Liehr; Justin M. O'Sullivan
- Justin M. O'Sullivan
- Liggins Institute, University of Auckland, Auckland, New Zealand
- Maurice Wilkins Centre for Molecular Biodiscovery, Auckland, New Zealand
- MRC Lifecourse Epidemiology Unit, University of Southampton, Southampton, United Kingdom
- Singapore Institute for Clinical Sciences, Agency for Science, Technology and Research (A*STAR), Singapore, Singapore
- Australian Parkinson’s Mission, Garvan Institute of Medical Research, Sydney, NSW, Australia
- *Correspondence: Andreas W. Kempa-Liehr; Justin M. O'Sullivan
|
24
|
An Effective Ensemble Automatic Feature Selection Method for Network Intrusion Detection. INFORMATION 2022. [DOI: 10.3390/info13070314] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022] Open
Abstract
The mass of redundant and irrelevant data in network traffic brings serious challenges to intrusion detection, and feature selection can effectively remove meaningless information from the data. Most current filtered and embedded feature selection methods use a fixed threshold or ratio to determine the number of features in a subset, which requires a priori knowledge. In contrast, wrapped feature selection methods are computationally complex and time-consuming; meanwhile, individual feature selection methods have a bias in evaluating features. This work designs an ensemble-based automatic feature selection method called EAFS. Firstly, we calculate the feature importance or ranks based on individual methods, then add features to subsets sequentially by importance and evaluate subset performance comprehensively by designing an NSOM to obtain the subset with the largest NSOM value. When searching for a subset, the subset with higher accuracy is retained to lower the computational complexity by calculating the accuracy when the full set of features is used. Finally, the obtained subsets are ensembled, and by comparing the experimental results on three large-scale public datasets, the method described in this study can help in the classification, and also compared with other methods, we discover that our method outperforms other recent methods in terms of performance.
|
25
|
Aguilera A, Pezoa R, Rodríguez-Delherbe A. A novel ensemble feature selection method for pixel-level segmentation of HER2 overexpression. COMPLEX INTELL SYST 2022. [DOI: 10.1007/s40747-022-00774-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/30/2022]
Abstract
Classifying histopathology images on a pixel-level requires sets of features able to capture the complex characteristics of the images, like the irregular cell morphology and the color heterogeneity on the tissue aspect. In this context, feature selection becomes a crucial step in the classification process such that it reduces model complexity and computational costs, avoids overfitting, and thereby improves the model performance. In this study, we propose a new ensemble feature selection method by combining a set of base selectors, classifiers, and rank aggregation methods, aiming to determine, from any initial set of handcrafted features, a smaller set of relevant color and texture pixel-level features, subsequently used for segmenting HER2 overexpression on a pixel-level in breast cancer tissue images. We have been able to significantly reduce the set of initial features using the proposed ensemble feature selection method. The best results are obtained using $\chi^2$, Random Forest, and Runoff as the base selector, classifier, and aggregation method, respectively. The classification performance of the best model trained on the selected features set results in 0.939 recall, 0.866 specificity, 0.903 accuracy, 0.875 precision, and 0.906 F1-score.
|
26
|
Hoffmann Souza ML, da Costa CA, de Oliveira Ramos G, da Rosa Righi R. A feature identification method to explain anomalies in condition monitoring. COMPUT IND 2021. [DOI: 10.1016/j.compind.2021.103528] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/30/2022]
|
27
|
Yang P, Huang H, Liu C. Feature selection revisited in the single-cell era. Genome Biol 2021; 22:321. [PMID: 34847932 PMCID: PMC8638336 DOI: 10.1186/s13059-021-02544-3] [Citation(s) in RCA: 37] [Impact Index Per Article: 9.3] [Received: 07/26/2021] [Accepted: 11/15/2021] [Indexed: 12/13/2022] Open
Abstract
Recent advances in single-cell biotechnologies have resulted in high-dimensional datasets with increased complexity, making feature selection an essential technique for single-cell data analysis. Here, we revisit feature selection techniques and summarise recent developments. We review their application to a range of single-cell data types generated from traditional cytometry and imaging technologies and the latest array of single-cell omics technologies. We highlight some of the challenges and future directions and finally consider their scalability and make general recommendations on each type of feature selection method. We hope this review stimulates future research and application of feature selection in the single-cell era.
Affiliation(s)
- Pengyi Yang
- School of Mathematics and Statistics, University of Sydney, Sydney, NSW, 2006, Australia
- Computational Systems Biology Group, Children's Medical Research Institute, University of Sydney, Westmead, NSW, 2145, Australia
- Charles Perkins Centre, University of Sydney, Sydney, NSW, 2006, Australia
- Hao Huang
- School of Mathematics and Statistics, University of Sydney, Sydney, NSW, 2006, Australia
- Computational Systems Biology Group, Children's Medical Research Institute, University of Sydney, Westmead, NSW, 2145, Australia
- Chunlei Liu
- Computational Systems Biology Group, Children's Medical Research Institute, University of Sydney, Westmead, NSW, 2145, Australia
|
28
|
Ma B, Tang Q, Qin Y, Bashir MF. Policyholder cluster divergence based differential premium in diabetes insurance. MANAGERIAL AND DECISION ECONOMICS 2021; 42:1793-1807. [DOI: 10.1002/mde.3345] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 01/22/2021] [Accepted: 04/04/2021] [Indexed: 01/04/2025]
Abstract
Traditional health insurance pricing, which is based on experience rates, cannot correctly estimate the risk types of policyholders and can lead to serious adverse selection. Due to massive data volumes and developments in data analysis technology, the underwriting process can more accurately reflect the insured's risk type. Therefore, this paper proposes a differential premium approach based on policyholder cluster divergence, employing the fuzzy c-means algorithm (FCM) with an extended initial multistate Markov model to formulate the differential premium that matches the policyholder's risk category. Our results confirm that the proposed differential premium approach better reveals the policyholder's risk type as compared with unified pricing and effectively counteracts adverse selection.
Affiliation(s)
- Benjiang Ma
- School of Business, Central South University, Changsha, China
- Qing Tang
- School of Business, Central South University, Changsha, China
- Yifang Qin
- College of Tourism and Cultural Industries, Hunan University of Science and Engineering, Yongzhou, China
|
29
|
Jiang Z, Zhang Y, Wang J. A multi-surrogate-assisted dual-layer ensemble feature selection algorithm. Appl Soft Comput 2021. [DOI: 10.1016/j.asoc.2021.107625] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/21/2022]
|
30
|
Chen X, Wang Q, Zhuang S. Ensemble dimension reduction based on spectral disturbance for subspace clustering. Knowl Based Syst 2021. [DOI: 10.1016/j.knosys.2021.107182] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 10/21/2022]
|
31
|
Feature selection via max-independent ratio and min-redundant ratio based on adaptive weighted kernel density estimation. Inf Sci (N Y) 2021. [DOI: 10.1016/j.ins.2021.03.049] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Indexed: 11/24/2022]
|
32
|
Wang Z, Tsai CF, Lin WC. Data cleaning issues in class imbalanced datasets: instance selection and missing values imputation for one-class classifiers. DATA TECHNOLOGIES AND APPLICATIONS 2021. [DOI: 10.1108/dta-01-2021-0027] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 11/17/2022]
Abstract
Purpose: Class imbalance learning, which exists in many domain problem datasets, is an important research topic in data mining and machine learning. One-class classification techniques, which aim to identify anomalies as the minority class from the normal data as the majority class, are one representative solution for class imbalanced datasets. Since one-class classifiers are trained using only normal data to create a decision boundary for later anomaly detection, the quality of the training set, i.e. the majority class, is one key factor that affects the performance of one-class classifiers. Design/methodology/approach: In this paper, we focus on two data cleaning or preprocessing methods to address class imbalanced datasets. The first method examines whether performing instance selection to remove some noisy data from the majority class can improve the performance of one-class classifiers. The second method combines instance selection and missing value imputation, where the latter is used to handle incomplete datasets that contain missing values. Findings: The experimental results are based on 44 class imbalanced datasets; three instance selection algorithms, including IB3, DROP3 and the GA, the CART decision tree for missing value imputation, and three one-class classifiers, which include OCSVM, IFOREST and LOF, show that if the instance selection algorithm is carefully chosen, performing this step could improve the quality of the training data, which makes one-class classifiers outperform the baselines without instance selection. Moreover, when class imbalanced datasets contain some missing values, combining missing value imputation and instance selection, regardless of which step is first performed, can maintain similar data quality as datasets without missing values. Originality/value: The novelty of this paper is to investigate the effect of performing instance selection on the performance of one-class classifiers, which has never been done before. Moreover, this study is the first attempt to consider the scenario of missing values that exist in the training set for training one-class classifiers. In this case, performing missing value imputation and instance selection with different orders are compared.
|
33
|
Qu Y, Wang P, Liu B, Song C, Wang D, Yang H, Zhang Z, Chen P, Kang X, Du K, Yao H, Zhou B, Han T, Zuo N, Han Y, Lu J, Yu C, Zhang X, Jiang T, Zhou Y, Liu Y. AI4AD: Artificial intelligence analysis for Alzheimer's disease classification based on a multisite DTI database. BRAIN DISORDERS 2021. [DOI: 10.1016/j.dscb.2021.100005] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Indexed: 12/12/2022] Open
|