1
Leitão BN, Veríssimo A, Carvalho AM, Vinga S. Enhancing Prognostic Signatures in Glioblastoma with Feature Selection and Regularised Cox Regression. Genes (Basel) 2025; 16:473. [PMID: 40428295 PMCID: PMC12111402 DOI: 10.3390/genes16050473] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/07/2025] [Revised: 04/04/2025] [Accepted: 04/16/2025] [Indexed: 05/29/2025]
Abstract
BACKGROUND Glioblastoma is a highly aggressive brain tumour with poor survival outcomes, highlighting the need for reliable prognostic models. Developing robust and interpretable prognostic signatures is critical for improving patient stratification and guiding therapy. This study explored the integration of machine learning feature selection with regularised Cox regression to construct prognostic gene signatures for glioblastoma patients.
METHODS We combined the Boruta algorithm and Random Survival Forests (RSFs) with regularised Cox regression, along with network-based regularisation techniques (HubCox and OrphanCox), to develop interpretable prognostic signatures for stratifying high- and low-risk glioblastoma patients. Using mRNA-seq and survival data from The Cancer Genome Atlas (TCGA), we developed predictive models following WHO-2021 glioma guidelines.
RESULTS Integrating Boruta or RSF with regularised Cox regression improved the performance and interpretability. Boruta increased the concordance indexes (C-indexes) by 0.030 and 0.013 for LASSO and Elastic Net, respectively, while significantly reducing the feature numbers. RSF similarly enhanced the performance and feature reduction. The genes Lysyl Oxidase Like 1 (LOXL1) and Insulin Like Growth Factor Binding Protein 6 (IGFBP6) were consistently selected and linked to glioma survival, emphasising their clinical significance. The network-based methods demonstrated superior survival probability prediction (lower Integrated Brier Score), although with lower C-index values, highlighting limitations in ranking the survival times. To evaluate the generalisability, external validation using the Chinese Glioma Genome Atlas (CGGA) confirmed that a multigene signature derived from the most consistently selected genes significantly stratified the patients by risk.
CONCLUSIONS This study underscored the utility of combining machine learning feature selection with survival analysis to enhance prognostic modelling while balancing predictive performance and interpretability.
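The abstract above reports its gains as changes in the concordance index (C-index). As a rough illustration of what that metric measures, here is a minimal pure-Python sketch of Harrell's C-index for right-censored data. It is not the authors' code; the function name `concordance_index` and the toy inputs are assumptions made for this example.

```python
from typing import Sequence


def concordance_index(times: Sequence[float],
                      events: Sequence[int],
                      risk_scores: Sequence[float]) -> float:
    """Harrell's C-index for right-censored data (illustrative sketch).

    A pair (i, j) is comparable when subject i had an observed event and
    a strictly shorter time than subject j. The pair is concordant when
    subject i also has the higher predicted risk; ties in risk count 0.5.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data (hypothetical): shorter survival paired with higher risk.
toy_times = [5.0, 10.0, 15.0, 20.0]
toy_events = [1, 1, 0, 1]  # 1 = event observed, 0 = censored
c_perfect = concordance_index(toy_times, toy_events, [4.0, 3.0, 2.0, 1.0])
c_reversed = concordance_index(toy_times, toy_events, [1.0, 2.0, 3.0, 4.0])
c_constant = concordance_index(toy_times, toy_events, [1.0, 1.0, 1.0, 1.0])
```

Because the C-index only scores the ordering of predicted risks, a model can rank patients well while still giving poorly calibrated survival probabilities, which is consistent with the abstract's observation that C-index and Integrated Brier Score can disagree.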
Affiliation(s)
- Beatriz N. Leitão
- Instituto de Engenharia de Sistemas e Computadores: Investigação e Desenvolvimento (INESC-ID), Instituto Superior Técnico, Universidade de Lisboa, 1000-029 Lisbon, Portugal
- Instituto de Telecomunicações (IT–Lisboa), Instituto Superior Técnico, Universidade de Lisboa, 1049-001 Lisbon, Portugal
- Alexandra M. Carvalho
- Instituto de Telecomunicações (IT–Lisboa), Instituto Superior Técnico, Universidade de Lisboa, 1049-001 Lisbon, Portugal
- Susana Vinga
- Instituto de Engenharia de Sistemas e Computadores: Investigação e Desenvolvimento (INESC-ID), Instituto Superior Técnico, Universidade de Lisboa, 1000-029 Lisbon, Portugal
- Instituto de Engenharia Mecânica (IDMEC), Instituto Superior Técnico, Universidade de Lisboa, 1049-001 Lisbon, Portugal
2
Zhang Y, Muller S. Robust variable selection methods with Cox model-a selective practical benchmark study. Brief Bioinform 2024; 25:bbae508. [PMID: 39400113 PMCID: PMC11472364 DOI: 10.1093/bib/bbae508] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/17/2024] [Revised: 09/01/2024] [Accepted: 09/30/2024] [Indexed: 10/15/2024]
Abstract
With the advancement of biological and medical techniques, we can now obtain large amounts of high-dimensional omics data with censored survival information. This presents challenges in method development across various domains, particularly in variable selection. Given the inherently skewed distribution of the survival time outcome variable, robust variable selection methods offer potential solutions. Recently, there has been a focus on extending robust variable selection methods from linear regression models to survival models. However, despite these developments, robust methods are currently rarely used in practical applications, possibly due to a limited appreciation of their overall good performance. To address this gap, we conduct a selective review comparing the variable selection performance of twelve robust and non-robust penalised Cox models. Our study reveals the intricate relationship among covariates, survival outcomes, and modeling approaches, demonstrating how subtle variations can significantly impact the performance of the methods considered. Based on our empirical research, we recommend the use of robust Cox models for variable selection in practice, given their superior performance in the presence of outliers while maintaining good efficiency and accuracy when there are no outliers. This study provides valuable insights for method development and application, contributing to a better understanding of the relationship between correlated covariates and censored outcomes.
Affiliation(s)
- Yunwei Zhang
- School of Mathematics, Statistics, Chemistry and Physics, Murdoch University, 90 South St, Murdoch WA 6150, Australia
- School of Mathematical and Physical Sciences, Macquarie University, 12 Wally's Walk, Macquarie Park NSW 2109, Australia
- School of Mathematics and Statistics, The University of Sydney, F07 Eastern Ave, Camperdown NSW 2050, Australia
- Samuel Muller
- School of Mathematical and Physical Sciences, Macquarie University, 12 Wally's Walk, Macquarie Park NSW 2109, Australia
- School of Mathematics and Statistics, The University of Sydney, F07 Eastern Ave, Camperdown NSW 2050, Australia
3
Rahnenführer J, De Bin R, Benner A, Ambrogi F, Lusa L, Boulesteix AL, Migliavacca E, Binder H, Michiels S, Sauerbrei W, McShane L. Statistical analysis of high-dimensional biomedical data: a gentle introduction to analytical goals, common approaches and challenges. BMC Med 2023; 21:182. [PMID: 37189125 DOI: 10.1186/s12916-023-02858-y] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Received: 12/28/2022] [Accepted: 04/03/2023] [Indexed: 05/17/2023]
Abstract
BACKGROUND In high-dimensional data (HDD) settings, the number of variables associated with each observation is very large. Prominent examples of HDD in biomedical research include omics data with a large number of variables such as many measurements across the genome, proteome, or metabolome, as well as electronic health records data that have large numbers of variables recorded for each patient. The statistical analysis of such data requires knowledge and experience, sometimes of complex methods adapted to the respective research questions.
METHODS Advances in statistical methodology and machine learning methods offer new opportunities for innovative analyses of HDD, but at the same time require a deeper understanding of some fundamental statistical concepts. Topic group TG9 "High-dimensional data" of the STRATOS (STRengthening Analytical Thinking for Observational Studies) initiative provides guidance for the analysis of observational studies, addressing particular statistical challenges and opportunities for the analysis of studies involving HDD. In this overview, we discuss key aspects of HDD analysis to provide a gentle introduction for non-statisticians and for classically trained statisticians with little experience specific to HDD.
RESULTS The paper is organized with respect to subtopics that are most relevant for the analysis of HDD, in particular initial data analysis, exploratory data analysis, multiple testing, and prediction. For each subtopic, main analytical goals in HDD settings are outlined. For each of these goals, basic explanations for some commonly used analysis methods are provided. Situations are identified where traditional statistical methods cannot, or should not, be used in the HDD setting, or where adequate analytic tools are still lacking. Many key references are provided.
CONCLUSIONS This review aims to provide a solid statistical foundation for researchers, including statisticians and non-statisticians, who are new to research with HDD or simply want to better evaluate and understand the results of HDD analyses.
Affiliation(s)
- Axel Benner
- Division of Biostatistics, German Cancer Research Center (DKFZ), Heidelberg, Germany
- Federico Ambrogi
- Department of Clinical Sciences and Community Health, University of Milan, Milan, Italy
- Scientific Directorate, IRCCS Policlinico San Donato, San Donato Milanese, Italy
- Lara Lusa
- Department of Mathematics, Faculty of Mathematics, Natural Sciences and Information Technology, University of Primorska, Koper, Slovenia
- Institute of Biostatistics and Medical Informatics, University of Ljubljana, Ljubljana, Slovenia
- Anne-Laure Boulesteix
- Institute for Medical Information Processing, Biometry and Epidemiology, Ludwig Maximilian University of Munich, Munich, Germany
- Harald Binder
- Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center, University of Freiburg, Freiburg, Germany
- Stefan Michiels
- Service de Biostatistique et d'Épidémiologie, Gustave Roussy, Université Paris-Saclay, Villejuif, France
- Oncostat U1018, Inserm, Université Paris-Saclay, Labeled Ligue Contre le Cancer, Villejuif, France
- Willi Sauerbrei
- Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center, University of Freiburg, Freiburg, Germany
- Lisa McShane
- Biometric Research Program, Division of Cancer Treatment and Diagnosis, National Cancer Institute, Bethesda, MD, USA.
4
Sun H, Gao Q, Zhu G, Han C, Yan H, Wang T. Identification of influential observations in high-dimensional survival data through robust penalized Cox regression based on trimming. MATHEMATICAL BIOSCIENCES AND ENGINEERING : MBE 2023; 20:5352-5378. [PMID: 36896549 DOI: 10.3934/mbe.2023248] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/18/2023]
Abstract
Penalized Cox regression can be used efficiently to identify biomarkers related to disease prognosis in high-dimensional genomic data. However, the results of penalized Cox regression are influenced by heterogeneous samples whose dependence structure between survival time and covariates differs from that of most individuals. These observations are called influential observations or outliers. A robust penalized Cox model (Reweighted Elastic Net-type maximum trimmed partial likelihood estimator, Rwt MTPL-EN) is proposed to improve the prediction accuracy and identify influential observations. A new algorithm, AR-Cstep, is also proposed to solve the Rwt MTPL-EN model. The method was validated by a simulation study and an application to glioma microarray expression data. When there were no outliers, the results of Rwt MTPL-EN were close to those of the Elastic Net (EN). When outliers existed, the results of EN were affected by them. Whether the censoring rate was high or low, the robust Rwt MTPL-EN performed better than EN and could resist outliers in both predictors and response. In terms of outlier detection accuracy, Rwt MTPL-EN was much higher than EN. The outliers who "lived too long" made EN perform worse, but were accurately detected by Rwt MTPL-EN. In the analysis of glioma gene expression data, most of the outliers identified by EN were those who "failed too early", but most of them were not obvious outliers according to the risk estimated from omics data or clinical variables. Most of the outliers identified by Rwt MTPL-EN were those who "lived too long", and most of them were obvious outliers according to the risk estimated from omics data or clinical variables. Rwt MTPL-EN can be adopted to detect influential observations in high-dimensional survival data.
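For orientation, the Elastic Net (EN) baseline mentioned in this abstract minimises a penalised negative log partial likelihood. The following is a minimal pure-Python sketch of that objective only. It is not the Rwt MTPL-EN estimator or the AR-Cstep algorithm; the function name `penalised_cox_loss`, the 1/n scaling, the `lam`/`alpha` parametrisation, and the assumption of no tied event times are all choices made for this illustration.

```python
import math
from typing import Sequence


def penalised_cox_loss(beta: Sequence[float],
                       X: Sequence[Sequence[float]],
                       times: Sequence[float],
                       events: Sequence[int],
                       lam: float = 0.1,
                       alpha: float = 0.5) -> float:
    """Elastic-net penalised negative log partial likelihood (sketch).

    Assumes no tied event times. `lam` is the overall penalty strength
    and `alpha` mixes the L1 and L2 terms (alpha=1 is LASSO, alpha=0 is
    ridge). The 1/n scaling is an assumption of this sketch.
    """
    n = len(times)
    # Linear predictor eta_i = x_i . beta for every subject.
    eta = [sum(b * x for b, x in zip(beta, row)) for row in X]
    log_pl = 0.0
    for i in range(n):
        if not events[i]:
            continue  # censored subjects contribute only through risk sets
        # Risk set: everyone still under observation at time t_i.
        denom = sum(math.exp(eta[j]) for j in range(n) if times[j] >= times[i])
        log_pl += eta[i] - math.log(denom)
    l1 = sum(abs(b) for b in beta)
    l2 = sum(b * b for b in beta)
    penalty = lam * (alpha * l1 + 0.5 * (1.0 - alpha) * l2)
    return -log_pl / n + penalty


# Toy check (hypothetical data): with beta = 0 every subject has equal
# hazard, so the log partial likelihood is -log(3) - log(2) - log(1).
loss_null = penalised_cox_loss([0.0], [[1.0], [2.0], [3.0]],
                               [1.0, 2.0, 3.0], [1, 1, 1])
# A non-zero coefficient on an all-zero covariate changes only the penalty.
loss_pen = penalised_cox_loss([1.0], [[0.0], [0.0], [0.0]],
                              [1.0, 2.0, 3.0], [1, 1, 1], lam=0.2, alpha=0.5)
```

A trimming-based estimator, as the abstract describes at a high level, would evaluate an objective of this kind on a subset of observations rather than on all of them; how that subset is chosen and reweighted is specific to the paper and is not reproduced here.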
Affiliation(s)
- Hongwei Sun
- Department of Health Statistics, School of Public Health and Management, Binzhou Medical University, Yantai City, Shandong 264003, China
- Department of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan City, Shanxi 030001, China
- Qian Gao
- Department of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan City, Shanxi 030001, China
- Guiming Zhu
- Department of Health Statistics, School of Public Health and Management, Binzhou Medical University, Yantai City, Shandong 264003, China
- Chunlei Han
- Department of Health Statistics, School of Public Health and Management, Binzhou Medical University, Yantai City, Shandong 264003, China
- Haosen Yan
- Department of Health Statistics, School of Public Health and Management, Binzhou Medical University, Yantai City, Shandong 264003, China
- Tong Wang
- Department of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan City, Shanxi 030001, China
5
Kausar T, Akbar A, Qasim M. Influence diagnostics for the Cox proportional hazards regression model: method, simulation and applications. J STAT COMPUT SIM 2022. [DOI: 10.1080/00949655.2022.2145608] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/04/2022]
Affiliation(s)
- Tehzeeb Kausar
- Department of Statistics, Bahauddin Zakariya University, Multan, Pakistan
- Atif Akbar
- Department of Statistics, Bahauddin Zakariya University, Multan, Pakistan
- Muhammad Qasim
- Department of Economics, Finance and Statistics, Jönköping University, Jönköping, Sweden
6
A 5G Hubs Location Hierarchized Problem that Balances the Connection of the Users. MOBILE NETWORKS AND APPLICATIONS 2022. [PMCID: PMC9380672 DOI: 10.1007/s11036-022-02020-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/25/2022]
7
Mining subgraph coverage patterns from graph transactions. INTERNATIONAL JOURNAL OF DATA SCIENCE AND ANALYTICS 2021; 13:105-121. [PMID: 34873579 PMCID: PMC8636072 DOI: 10.1007/s41060-021-00292-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/23/2021] [Accepted: 10/24/2021] [Indexed: 11/25/2022]
Abstract
Pattern mining from graph transactional data (GTD) is an active area of research with applications in the domains of bioinformatics, chemical informatics and social networks. Existing works address the problem of mining frequent subgraphs from GTD. However, the knowledge concerning the coverage aspect of a set of subgraphs is also valuable for improving the performance of several applications. In this regard, we introduce the notion of subgraph coverage patterns (SCPs). Given a GTD, a subgraph coverage pattern is a set of subgraphs subject to relative frequency, coverage and overlap constraints provided by the user. We propose the Subgraph ID-based Flat Transactional (SIFT) framework for the efficient extraction of SCPs from a given GTD. Our performance evaluation using three real datasets demonstrates that our proposed SIFT framework is indeed capable of efficiently extracting SCPs from GTD. Furthermore, we demonstrate the effectiveness of SIFT through a case study in computer-aided drug design.