Abstract
BACKGROUND
Despite the increasing number of studies in breast cancer survival prediction, there is little attention put toward deceased patients and their survival lengths. Moreover, developing a model that is both accurate and interpretable remains a challenge.
OBJECTIVE
This paper proposes a two-stage data analytic framework, where Stage I classifies the survival and deceased statuses and Stage II predicts the number of survival months for deceased females with cancer. Since medical data are not entirely clean nor prepared for model development, we aim to show that data preparation can strengthen a simple Generalized Linear Model (GLM)1 to predict as accurate as the complex models like Extreme Gradient Boosting (XGB)2 and Multilayer Perceptron based on Artificial Neural Networks (MLP-ANNs)3 in both stages.
METHODS
In Stage I, we use recent Surveillance, Epidemiology, and End Results (SEER)4 data from 2004 to 2016 to predict short term survival statuses from 6-months to 3-years with 6 month increments. Synthetic Minority Over-sampling Technique (SMOTE),5 Relocating Safe-Level SMOTE (RSLS)6, Adaptive Synthetic (ADASYN)7 re-sampling techniques, Least Absolute Shrinkage and Selection Operator (LASSO)8 and Random Forest (RF)9 feature selection methods along with integer and one-hot encoding are combined with the three popular data mining methods: GLM, XGB, and MLP. In Stage II, we predict the number of survival months for patients who are correctly predicted as deceased within 3-years. Again, we employ GLM, XGB, and MLP for regression along with LASSO and RF for feature selection and one-hot encoding to encode the categorical features.
RESULTS
We obtain Area Under the Receiver Operating Characteristic Curve (AUC)10 values of 0.900, 0.898, 0.877, 0.852, 0.852, and 0.858 for 6-month, 1-, 1.5-, 2-, 2.5, and 3-year survival time-points, respectively, using OneHotEncoding-GLM-LASSO-ADASYN. We use the change in the Odds Ratio values in GLM to manifest the impact of individual categorical levels and numerical features on the odds of death. In Stage II, we obtain Mean Absolute Error (MAE)11 of 7.960 months using OneHotEncoding-GLM-LASSO when predicting the number of survival months for deceased patients. We present the top contributing features and their coefficient values to illustrate how the presence of each feature alters the predicted number of survival months.
CONCLUSION
To the best of our knowledge, this is the first study that implements both breast cancer survival classification and regression in a two-stage approach. All data-driven findings are presented in order to assist clinicians make better care decisions using GLM, an interpretable and computationally efficient method that predicts survival status and survival lengths for deceased patients, to help foster human and machine interactions.
Collapse