1. Islam T, Sheakh MA, Tahosin MS, Hena MH, Akash S, Bin Jardan YA, FentahunWondmie G, Nafidi HA, Bourhia M. Predictive modeling for breast cancer classification in the context of Bangladeshi patients by use of machine learning approach with explainable AI. Sci Rep 2024; 14:8487. [PMID: 38605059 PMCID: PMC11009331 DOI: 10.1038/s41598-024-57740-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/12/2023] [Accepted: 03/21/2024] [Indexed: 04/13/2024] Open
Abstract
Breast cancer has rapidly increased in prevalence in recent years, making it one of the leading causes of mortality worldwide. Among all cancers, it is by far the most common. Diagnosing this illness manually requires significant time and expertise. Since detecting breast cancer is a time-consuming process, preventing its further spread can be aided by creating machine-based forecasts. Machine learning and Explainable AI are crucial in classification as they not only provide accurate predictions but also offer insights into how the model arrives at its decisions, aiding in the understanding and trustworthiness of the classification results. In this study, we evaluate and compare the classification accuracy, precision, recall, and F1 scores of five different machine learning methods using a primary dataset (500 patients from Dhaka Medical College Hospital). Five different supervised machine learning techniques, including decision tree, random forest, logistic regression, naive bayes, and XGBoost, have been used to achieve optimal results on our dataset. Additionally, this study applied SHAP analysis to the XGBoost model to interpret the model's predictions and understand the impact of each feature on the model's output. We compared the accuracy with which several algorithms classified the data, as well as contrasted with other literature in this field. After final evaluation, this study found that XGBoost achieved the best model accuracy, which is 97%.
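The four evaluation measures named in this abstract can be computed directly from confusion-matrix counts. The following is a minimal sketch using only the Python standard library; the toy labels are invented for illustration and are not data from the study, and the function name `classification_metrics` is a hypothetical helper.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall, and F1 for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    # Guard against empty denominators when a class is never predicted.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels (assumed for illustration): 1 = malignant, 0 = benign.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Comparing several classifiers on the same held-out labels then amounts to calling this helper once per model and tabulating the results.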
Affiliation(s)
- Taminul Islam
  - School of Computing, Southern Illinois University Carbondale, Carbondale, IL, USA
- Md Alif Sheakh
  - Department of Computer Science and Engineering, Daffodil International University, Dhaka, Bangladesh
- Mst Sazia Tahosin
  - Department of Computer Science and Engineering, Daffodil International University, Dhaka, Bangladesh
- Most Hasna Hena
  - Department of Computer Science and Engineering, Daffodil International University, Dhaka, Bangladesh
- Shopnil Akash
  - Department of Pharmacy, Faculty of Allied Health Sciences, Daffodil International University, Dhaka, Bangladesh
- Yousef A Bin Jardan
  - Department of Pharmaceutics, College of Pharmacy, King Saud University, P.O. Box 11451, Riyadh, Saudi Arabia
- Hiba-Allah Nafidi
  - Department of Food Science, Faculty of Agricultural and Food Sciences, Laval University, 2325, Quebec City, QC, G1V 0A6, Canada
- Mohammed Bourhia
  - Laboratory of Biotechnology and Natural Resources Valorization, Ibn Zohr University, 80060, Agadir, Morocco
2. Gao J, Merchant AM. A Machine Learning Approach in Predicting Mortality Following Emergency General Surgery. Am Surg 2021; 87:1379-1385. [PMID: 34378431 DOI: 10.1177/00031348211038568] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Indexed: 01/16/2023]
Abstract
BACKGROUND There is a significant mortality burden associated with emergency general surgery (EGS) procedures. The objective of this study was to develop and validate the use of a machine learning approach to predict mortality following EGS. METHODS The American College of Surgeons National Surgical Quality Improvement Program database was queried for patients who underwent EGS between 2012 and 2017. We developed a machine learning algorithm to predict mortality following EGS and compared its performance with the existing risk-prediction models of American Society of Anesthesiologists (ASA) classification, American College of Surgeons Surgical Risk Calculator (ACS-SRC), and the modified frailty index (mFI) using the area under the receiver operating characteristic curve (AUC), sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV). RESULTS The machine learning algorithm had a very high performance for predicting mortality following EGS, and it had superior performance compared to the ASA classification, ACS-SRC, and the mFI, as measured by the AUC, sensitivity, specificity, PPV, and NPV. DISCUSSION Machine learning approaches may be a promising tool to predict outcomes for EGS, aiding clinicians in surgical decision-making and counseling of patients and family, improving clinical outcomes by identifying modifiable risk factors that can be optimized, and decreasing treatment costs through resource allocation.
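The AUC used to compare risk models in this abstract equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below computes it by direct pairwise comparison in plain Python. The labels and risk scores are invented for illustration, and `roc_auc` is a hypothetical helper name, not code from the study.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (assumed): 1 = died after surgery, 0 = survived.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]  # predicted mortality risk
auc = roc_auc(labels, scores)
print(round(auc, 3))
```

The pairwise form is quadratic in the number of cases, so it suits small illustrations; rank-based formulas give the same value more efficiently on large registries.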
Affiliation(s)
- Jeff Gao
  - Department of Surgery, Rutgers New Jersey Medical School, Newark, NJ, USA
- Aziz M Merchant
  - Department of Surgery, Rutgers New Jersey Medical School, Newark, NJ, USA
3. Mirsadeghi L, Haji Hosseini R, Banaei-Moghaddam AM, Kavousi K. EARN: an ensemble machine learning algorithm to predict driver genes in metastatic breast cancer. BMC Med Genomics 2021; 14:122. [PMID: 33962648 PMCID: PMC8105935 DOI: 10.1186/s12920-021-00974-3] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 12/17/2020] [Accepted: 04/27/2021] [Indexed: 12/27/2022] Open
Abstract
BACKGROUND Many markers are now available for the prognosis and diagnosis of complex diseases such as primary breast cancer. However, our understanding of the drivers that influence cancer aggression is limited. METHODS In this work, we study somatic mutation data consisting of 450 metastatic breast tumor samples from the cBio Cancer Genomics Portal. We use four software tools to extract features from these data. Then, an ensemble classifier (EC) learning algorithm called EARN (Ensemble of Artificial Neural Network, Random Forest, and non-linear Support Vector Machine) is proposed to evaluate plausible driver genes for metastatic breast cancer (MBCA). The decision-making strategy of the proposed ensemble is based on aggregating the predicted scores obtained from the individual classifiers in order to prioritize Homo sapiens genes annotated as protein-coding by NCBI. RESULTS This study focuses on findings in several aspects of MBCA prognosis and diagnosis. First, drivers and passengers predicted by SVM, ANN, RF, and EARN are introduced. Second, biological inferences of the predictions are discussed based on gene set enrichment analysis. Third, statistical validation and comparison of all learning methods are performed using several evaluation metrics. Finally, the pathway enrichment analysis (PEA) using the ReactomeFIViz tool (FDR < 0.03) for the top 100 genes predicted by EARN leads us to propose a new gene set panel for MBCA. It includes HDAC3, ABAT, GRIN1, PLCB1, and KPNA2 as well as NCOR1, TBL1XR1, SIRT4, KRAS, CACNA1E, PRKCG, GPS2, SIN3A, ACTB, KDM6B, and PRMT1. Furthermore, we compare the results for MBCA with outputs for 983 primary tumor samples of breast invasive carcinoma (BRCA) obtained from The Cancer Genome Atlas (TCGA). The comparison shows that the ROC-AUC reaches 99.24% using EARN for MBCA and 99.79% for BRCA. This result is better than that of the three individual classifiers in each case. CONCLUSIONS This research, using an integrative approach, assists precision oncologists in designing compact targeted panels that eliminate the need for whole-genome/exome sequencing. The schematic representation of the proposed model is presented as the graphical abstract.
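The abstract describes an ensemble that aggregates the scores predicted by its individual classifiers in order to prioritize genes. It does not state the aggregation rule, so the sketch below assumes a plain unweighted mean purely for illustration. The gene identifiers, scores, and the helper name `aggregate_scores` are all hypothetical and are not taken from EARN.

```python
def aggregate_scores(per_model_scores):
    """Rank genes by the unweighted mean of per-classifier driver scores.

    per_model_scores maps a model name to a {gene: score} dictionary.
    Only genes scored by every model are ranked.
    """
    models = list(per_model_scores.values())
    shared = set(models[0])
    for scores in models[1:]:
        shared &= set(scores)

    combined = {
        gene: sum(scores[gene] for scores in models) / len(models)
        for gene in shared
    }
    # Highest mean score first; break ties alphabetically for determinism.
    return sorted(combined.items(), key=lambda kv: (-kv[1], kv[0]))


# Placeholder scores (assumed), one dictionary per base classifier.
per_model_scores = {
    "ann": {"GENE_A": 0.9, "GENE_B": 0.2, "GENE_C": 0.6},
    "rf":  {"GENE_A": 0.8, "GENE_B": 0.4, "GENE_C": 0.5},
    "svm": {"GENE_A": 0.7, "GENE_B": 0.3, "GENE_C": 0.7},
}
ranking = aggregate_scores(per_model_scores)
for gene, score in ranking:
    print(gene, round(score, 2))
```

Other aggregation rules, such as weighted means or rank averaging, fit the same interface by changing only the expression that builds `combined`.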
Affiliation(s)
- Leila Mirsadeghi
  - Department of Biology, Faculty of Science, Payame Noor University, Tehran, Iran
- Reza Haji Hosseini
  - Department of Biology, Faculty of Science, Payame Noor University, Tehran, Iran
- Ali Mohammad Banaei-Moghaddam
  - Laboratory of Genomics and Epigenomics (LGE), Department of Biochemistry, Institute of Biochemistry and Biophysics (IBB), University of Tehran, Tehran, Iran
- Kaveh Kavousi
  - Laboratory of Complex Biological Systems and Bioinformatics (CBB), Department of Bioinformatics, Institute of Biochemistry and Biophysics (IBB), University of Tehran, Tehran, Iran
4. Machine-learning Approach for the Development of a Novel Predictive Model for the Diagnosis of Hepatocellular Carcinoma. Sci Rep 2019; 9:7704. [PMID: 31147560 PMCID: PMC6543030 DOI: 10.1038/s41598-019-44022-8] [Citation(s) in RCA: 61] [Impact Index Per Article: 12.2] [Received: 01/11/2019] [Accepted: 05/07/2019] [Indexed: 02/08/2023] Open
Abstract
Because of its multifactorial nature, predicting the presence of cancer using a single biomarker is difficult. We aimed to establish a novel machine-learning model for predicting hepatocellular carcinoma (HCC) using real-world data obtained during clinical practice. To establish a predictive model, we developed a machine-learning framework that selects optimized classifiers and their respective hyperparameters, depending on the nature of the data, using a grid-search method. We applied this framework to 539 patients with HCC and 1043 patients without HCC to develop a predictive model for the diagnosis of HCC. Using the optimal hyperparameters, gradient boosting provided the highest predictive accuracy for the presence of HCC (87.34%) and produced an area under the curve (AUC) of 0.940. Using cut-offs of 200 ng/mL for AFP, 40 mAu/mL for DCP, and 15% for AFP-L3, the accuracies of AFP, DCP, and AFP-L3 for predicting HCC were 70.67% (AUC, 0.766), 74.91% (AUC, 0.644), and 71.05% (AUC, 0.683), respectively. A novel predictive model using a machine-learning approach reduced the misclassification rate by about half compared with a single tumor marker. The framework used in the current study can be applied to various kinds of data and could thus become a translational mechanism between academic research and clinical practice.
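The "about half" claim can be checked from the accuracies quoted in the abstract, since the misclassification rate is simply 100% minus accuracy. The short sketch below performs that arithmetic; only the accuracy figures come from the abstract, and the variable names are illustrative.

```python
# Accuracies (%) as reported in the abstract.
model_accuracy = 87.34  # gradient boosting
marker_accuracy = {"AFP": 70.67, "DCP": 74.91, "AFP-L3": 71.05}

# Misclassification rate is the complement of accuracy.
model_error = 100.0 - model_accuracy

# Ratio of the model's error to each single marker's error.
error_ratio = {
    marker: model_error / (100.0 - accuracy)
    for marker, accuracy in marker_accuracy.items()
}

for marker, ratio in error_ratio.items():
    print(f"{marker}: model error is {ratio:.2f}x the marker error")
```

Each ratio falls between roughly 0.43 and 0.50, which is consistent with the abstract's statement that the misclassification rate was reduced by about half.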
5.
Abstract
The application of machine learning models for prediction and prognosis of disease development has become an irrevocable part of cancer studies aimed at improving the subsequent therapy and management of patients. The main objective of the presented study is the application of machine learning models for accurate prediction of survival time in breast cancer on the basis of clinical data. The paper discusses an approach to the problem in which the main factor used to predict survival time is the originally developed tumor-integrated clinical feature, which combines tumor stage, tumor size, and age at diagnosis. Two datasets from corresponding breast cancer studies are united by applying a data integration approach based on horizontal and vertical integration, using document-oriented and graph databases that show good performance and no data losses. Aside from data normalization and classification, the applied machine learning methods provide promising results in terms of accuracy of survival time prediction. The analysis of our experiments shows an advantage of the linear Support Vector Regression, Lasso regression, Kernel Ridge regression, K-neighborhood regression, and Decision Tree regression; these models achieve the most accurate survival prognosis results. The cross-validation for accuracy demonstrates the best performance of the same models on the studied breast cancer data. In support of the proposed approach, a Python-based workflow has been developed, and plans for its further improvement are discussed at the end of the paper.
6. Mathieson L, Mendes A, Marsden J, Pond J, Moscato P. Computer-Aided Breast Cancer Diagnosis with Optimal Feature Sets: Reduction Rules and Optimization Techniques. Methods Mol Biol 2017; 1526:299-325. [PMID: 27896749 DOI: 10.1007/978-1-4939-6613-4_17] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/06/2023]
Abstract
This chapter introduces a new method for knowledge extraction from databases for the purpose of finding a discriminative set of features that is also a robust set for within-class classification. Our method is generic and we introduce it here in the field of breast cancer diagnosis from digital mammography data. The mathematical formalism is based on a generalization of the k-Feature Set problem called (α, β)-k-Feature Set problem, introduced by Cotta and Moscato (J Comput Syst Sci 67(4):686-690, 2003). This method proceeds in two steps: first, an optimal (α, β)-k-feature set of minimum cardinality is identified and then, a set of classification rules using these features is obtained. We obtain the (α, β)-k-feature set in two phases; first a series of extremely powerful reduction techniques, which do not lose the optimal solution, are employed; and second, a metaheuristic search to identify the remaining features to be considered or disregarded. Two algorithms were tested with a public domain digital mammography dataset composed of 71 malignant and 75 benign cases. Based on the results provided by the algorithms, we obtain classification rules that employ only a subset of these features.
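The abstract does not restate the (α, β)-k-Feature Set condition. The sketch below assumes the commonly cited formulation: every pair of samples from different classes must differ on at least α of the selected features, and every pair from the same class must agree on at least β of them. It only verifies a candidate feature set and does not implement the chapter's reduction rules or metaheuristic search. The data and the helper name `is_alpha_beta_feature_set` are invented for illustration.

```python
from itertools import combinations


def is_alpha_beta_feature_set(samples, labels, features, alpha, beta):
    """Check the assumed (alpha, beta) condition for a candidate feature set.

    samples  : list of equal-length tuples of discrete feature values
    labels   : class label for each sample
    features : indices of the selected features
    """
    for i, j in combinations(range(len(samples)), 2):
        differing = sum(
            1 for f in features if samples[i][f] != samples[j][f]
        )
        agreeing = len(features) - differing
        if labels[i] != labels[j] and differing < alpha:
            return False  # different classes not separated strongly enough
        if labels[i] == labels[j] and agreeing < beta:
            return False  # same class not consistent enough
    return True


# Toy binary features (assumed): "M" = malignant, "B" = benign.
samples = [(1, 1, 0, 0), (1, 1, 1, 0), (0, 0, 0, 1), (0, 0, 1, 1)]
labels = ["M", "M", "B", "B"]

good = is_alpha_beta_feature_set(samples, labels, [0, 1, 3], alpha=2, beta=2)
bad = is_alpha_beta_feature_set(samples, labels, [2], alpha=1, beta=0)
print(good, bad)
```

In this toy case features 0, 1, and 3 separate the classes on three positions each, while feature 2 alone fails to distinguish the first and third samples.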
Affiliation(s)
- Luke Mathieson
  - Centre for Bioinformatics, Biomarker Discovery and Information-Based Medicine (CIBM), Faculty of Engineering and Built Environment, The University of Newcastle, Callaghan, NSW, 2308, Australia
- Alexandre Mendes
  - Centre for Bioinformatics, Biomarker Discovery and Information-Based Medicine (CIBM), Faculty of Engineering and Built Environment, The University of Newcastle, Callaghan, NSW, 2308, Australia
- John Marsden
  - Centre for Bioinformatics, Biomarker Discovery and Information-Based Medicine (CIBM), Faculty of Engineering and Built Environment, The University of Newcastle, Callaghan, NSW, 2308, Australia
- Jeffrey Pond
  - Centre for Bioinformatics, Biomarker Discovery and Information-Based Medicine (CIBM), Faculty of Engineering and Built Environment, The University of Newcastle, Callaghan, NSW, 2308, Australia
- Pablo Moscato
  - Centre for Bioinformatics, Biomarker Discovery and Information-Based Medicine (CIBM), Faculty of Engineering and Built Environment, The University of Newcastle, Callaghan, NSW, 2308, Australia
7. SVM Feature Selection Based Rotation Forest Ensemble Classifiers to Improve Computer-Aided Diagnosis of Parkinson Disease. J Med Syst 2011; 36:2141-7. [DOI: 10.1007/s10916-011-9678-1] [Citation(s) in RCA: 55] [Impact Index Per Article: 4.2] [Received: 01/24/2011] [Accepted: 02/27/2011] [Indexed: 10/18/2022]