1
Goh V, Mallett S, Rodriguez-Justo M, Boulter V, Glynne-Jones R, Khan S, Lessels S, Patel D, Prezzi D, Taylor S, Halligan S. Evaluation of prognostic models to improve prediction of metastasis in patients following potentially curative treatment for primary colorectal cancer: the PROSPECT trial. Health Technol Assess 2025; 29:1-91. [PMID: 40230305 PMCID: PMC12010235 DOI: 10.3310/btmt7049] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/16/2025] Open
Abstract
Background Despite apparently curative treatment, many patients with colorectal cancer develop subsequent metastatic disease. Current prognostic models are criticised because they are based on standard staging and omit novel biomarkers. Improved prognostication is an unmet need. Objectives To improve prognostication for colorectal cancer by developing a baseline multivariable model of standard clinicopathological predictors, and to then improve prediction via addition of promising novel imaging, genetic and immunohistochemical biomarkers. Design Prospective multicentre cohort. Setting Thirteen National Health Service hospitals. Participants Consecutive adult patients with colorectal cancer. Interventions Collection of prespecified standard clinicopathological variables and more novel imaging, genetic and immunohistochemical biomarkers, followed by 3-year follow-up to identify postoperative metastasis. Main outcome Best multivariable prognostic model including perfusion computed tomography compared with tumour/node staging. Secondary outcomes: Additive benefit of perfusion computed tomography and other biomarkers to best baseline model comprising standard clinicopathological predictors; measurement variability between local and central review; biological relationships between perfusion computed tomography and pathology variables. Results Between 2011 and 2016, 448 participants were recruited; 122 (27%) were withdrawn, leaving 326 (226 male, 100 female; mean ± standard deviation 66 ± 10.7 years); 183 (56%) had rectal cancer. Most cancers were locally advanced [≥ T3 stage, 227 (70%)]; 151 (46%) were node-positive (≥ N1 stage); 306 (94%) had surgery; 79 (24%) had neoadjuvant therapy. The resection margin was positive in 15 (5%); 93 (28%) had venous invasion; 125 (38%) had postoperative adjuvant chemotherapy; 81 (25%, 57 male) developed recurrent disease. 
Prediction of recurrent disease by the baseline clinicopathological time-to-event Weibull multivariable model (age, sex, tumour/node stage, tumour size and location, treatment, venous invasion) was superior to tumour/node staging: sensitivity: 0.57 (95% confidence interval 0.45 to 0.68), specificity 0.74 (95% confidence interval 0.68 to 0.79) versus sensitivity 0.56 (95% confidence interval 0.44 to 0.67), specificity 0.58 (95% confidence interval 0.51 to 0.64), respectively. Addition of perfusion computed tomography variables did not improve prediction significantly: c-statistic: 0.77 (95% confidence interval 0.71 to 0.83) versus 0.76 (95% confidence interval 0.70 to 0.82). Perfusion computed tomography parameters did not differ significantly between patients with and without recurrence (e.g. mean ± standard deviation blood flow of 60.3 ± 24.2 vs. 61.7 ± 34.2 ml/minute/100 ml). Furthermore, baseline model prediction was not improved significantly by the addition of any novel genetic or immunohistochemical biomarkers. We observed variation between local and central computed tomography measurements but neither improved model prediction significantly. We found no clear association between perfusion computed tomography variables and any immunohistochemical measurement or genetic expression. Limitations The number of patients developing metastasis was lower than expected from historical data. Our findings should not be overinterpreted. While the baseline model was superior to tumour/node staging, any clinical utility needs definition in daily practice. Conclusions A prognostic model of standard clinicopathological variables outperformed tumour/node staging, but novel biomarkers did not improve prediction significantly. Biomarkers that appear promising in small single-centre studies may contribute nothing substantial to prognostication when evaluated rigorously. Future work It would be desirable for other researchers to externally evaluate the baseline model. 
Trial registration This trial is registered as ISRCTN95037515. Funding This award was funded by the National Institute for Health and Care Research (NIHR) Health Technology Assessment programme (NIHR award ref: 09/22/49) and is published in full in Health Technology Assessment; Vol. 29, No. 8. See the NIHR Funding and Awards website for further award information.
Affiliation(s)
- Vicky Goh
  - School of Biomedical Engineering and Imaging Sciences, King's College London, London, UK
- Sarah Lessels
  - Scottish Clinical Trials Research Unit (SCTRU), NHS National Services Scotland, Edinburgh, Scotland
- Dominic Patel
  - Research Department of Pathology, UCL Cancer Institute, London, UK
- Davide Prezzi
  - School of Biomedical Engineering and Imaging Sciences, King's College London, London, UK
2
Kızılaslan F, Michael Swanson D, Vitelli V. A Weibull mixture cure frailty model for high-dimensional covariates. Stat Methods Med Res 2025:9622802251327687. [PMID: 40165441 DOI: 10.1177/09622802251327687] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/02/2025]
Abstract
A novel mixture cure frailty model is introduced for handling censored survival data. Mixture cure models are preferable when the existence of a cured fraction among patients can be assumed. However, such models are heavily underexplored: frailty structures within cure models remain largely undeveloped, and furthermore, most existing methods do not work for high-dimensional datasets, when the number of predictors is significantly larger than the number of observations. In this study, we introduce a novel extension of the Weibull mixture cure model that incorporates a frailty component, employed to model an underlying latent population heterogeneity with respect to the outcome risk. Additionally, high-dimensional covariates are integrated into both the cure rate and survival part of the model, providing a comprehensive approach to employ the model in the context of high-dimensional omics data. We also perform variable selection via an adaptive elastic-net penalization, and propose a novel approach to inference using the expectation-maximization (EM) algorithm. Extensive simulation studies are conducted across various scenarios to demonstrate the performance of the model, and results indicate that our proposed method outperforms competitor models. We apply the novel approach to analyze RNAseq gene expression data from bulk breast cancer patients included in The Cancer Genome Atlas (TCGA) database. A set of prognostic biomarkers is then derived from selected genes, and subsequently validated via both functional enrichment analysis and comparison to the existing biological literature. Finally, a prognostic risk score index based on the identified biomarkers is proposed and validated by exploring the patients' survival.
Affiliation(s)
- Fatih Kızılaslan
  - Oslo Centre for Biostatistics and Epidemiology, Department of Biostatistics, University of Oslo, Norway
- David Michael Swanson
  - Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
- Valeria Vitelli
  - Oslo Centre for Biostatistics and Epidemiology, Department of Biostatistics, University of Oslo, Norway
3
Goh V, Mallett S, Boulter V, Glynne-Jones R, Khan S, Lessels S, Patel D, Prezzi D, Rodriguez-Justo M, Taylor SA, Beable R, Betts M, Breen DJ, Britton I, Brush J, Correa P, Dodds N, Dunlop J, Gourtsoyianni S, Griffin N, Higginson A, Lowe A, Slater A, Strugnell M, Tolan D, Zealley I, Halligan S. Multivariable prognostic modelling to improve prediction of colorectal cancer recurrence: the PROSPeCT trial. Eur Radiol 2024; 34:6992-7001. [PMID: 38836939 PMCID: PMC11519198 DOI: 10.1007/s00330-024-10803-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/19/2023] [Revised: 03/25/2024] [Accepted: 04/05/2024] [Indexed: 06/06/2024]
Abstract
OBJECTIVE Improving prognostication to direct personalised therapy remains an unmet need. This study prospectively investigated promising CT, genetic, and immunohistochemical markers to improve the prediction of colorectal cancer recurrence. MATERIAL AND METHODS This multicentre trial (ISRCTN 95037515) recruited patients with primary colorectal cancer undergoing CT staging from 13 hospitals. Follow-up identified cancer recurrence and death. A baseline model for cancer recurrence at 3 years was developed from pre-specified clinicopathological variables (age, sex, tumour-node stage, tumour size, location, extramural venous invasion, and treatment). Then, CT perfusion (blood flow, blood volume, transit time and permeability), genetic (RAS, RAF, and DNA mismatch repair), and immunohistochemical markers of angiogenesis and hypoxia (CD105, vascular endothelial growth factor, glucose transporter protein, and hypoxia-inducible factor) were added to assess whether prediction improved over tumour-node staging alone as the main outcome measure. RESULTS Three hundred twenty-six of 448 participants formed the final cohort (226 male; mean 66 ± 10 years). 227 (70%) had ≥ T3 stage cancers; 151 (46%) were node-positive; 81 (25%) developed subsequent recurrence. The sensitivity and specificity of staging alone for recurrence were 0.56 [95% CI: 0.44, 0.67] and 0.58 [0.51, 0.64], respectively. The baseline clinicopathologic model improved specificity (0.74 [0.68, 0.79]), with equivalent sensitivity of 0.57 [0.45, 0.68] for high vs medium/low-risk participants. The addition of prespecified CT perfusion, genetic, and immunohistochemical markers did not improve prediction over and above the clinicopathologic model (sensitivity, 0.58-0.68; specificity, 0.75-0.76). CONCLUSION A multivariable clinicopathological model outperformed staging in identifying patients at high risk of recurrence. 
Promising CT, genetic, and immunohistochemical markers investigated did not further improve prognostication in rigorous prospective evaluation. CLINICAL RELEVANCE STATEMENT A prognostic model based on clinicopathological variables including age, sex, tumour-node stage, size, location, and extramural venous invasion better identifies colorectal cancer patients at high risk of recurrence for neoadjuvant/adjuvant therapy than stage alone. KEY POINTS Identification of colorectal cancer patients at high risk of recurrence is an unmet need for treatment personalisation. This model for recurrence, incorporating many patient variables, had higher specificity than staging alone. Continued optimisation of risk stratification schema will help individualise treatment plans and follow-up schedules.
Affiliation(s)
- Vicky Goh
  - School of Biomedical Engineering & Imaging Sciences, King's College London, London, UK
  - Department of Radiology, Guys and St. Thomas' NHS Foundation Trust, London, UK
- Susan Mallett
  - Centre for Medical Imaging, Division of Medicine, University College London, London, UK
- Victor Boulter
  - Patient Representative, Mount Vernon Cancer Centre, Northwood, UK
- Saif Khan
  - Research Department of Pathology, UCL Cancer Institute, University College London, London, UK
- Sarah Lessels
  - Scottish Clinical Trials Research Unit, Public Health Scotland, Edinburgh, UK
- Dominic Patel
  - Research Department of Pathology, UCL Cancer Institute, University College London, London, UK
- Davide Prezzi
  - School of Biomedical Engineering & Imaging Sciences, King's College London, London, UK
  - Department of Radiology, Guys and St. Thomas' NHS Foundation Trust, London, UK
- Manuel Rodriguez-Justo
  - Research Department of Pathology, UCL Cancer Institute, University College London, London, UK
- Stuart A Taylor
  - Centre for Medical Imaging, Division of Medicine, University College London, London, UK
- Richard Beable
  - Department of Radiology, Portsmouth Hospitals University NHS Trust, Portsmouth, UK
- Margaret Betts
  - Department of Radiology, Oxford University Hospitals NHS Foundation Trust, Oxford, UK
- David J Breen
  - Department of Radiology, University Hospital Southampton NHS Foundation Trust, Southampton, UK
- Ingrid Britton
  - Department of Radiology, University Hospitals North Midlands NHS Trust, Stoke-On-Trent, UK
- John Brush
  - Department of Radiology, Western General Hospital, NHS Lothian, Edinburgh, UK
- Peter Correa
  - Department of Oncology, University Hospitals Coventry and Warwickshire NHS Trust, Coventry, UK
- Nicholas Dodds
  - Department of Radiology, Jersey General Hospital, St. Helier, Jersey
- Joanna Dunlop
  - Scottish Clinical Trials Research Unit, Public Health Scotland, Edinburgh, UK
- Sofia Gourtsoyianni
  - School of Biomedical Engineering & Imaging Sciences, King's College London, London, UK
- Nyree Griffin
  - Department of Radiology, Guys and St. Thomas' NHS Foundation Trust, London, UK
- Antony Higginson
  - Department of Radiology, Portsmouth Hospitals University NHS Trust, Portsmouth, UK
- Andrew Lowe
  - Department of Radiology, Musgrove Park Hospital, Somerset NHS Foundation Trust, Taunton, UK
- Andrew Slater
  - Department of Radiology, Oxford University Hospitals NHS Foundation Trust, Oxford, UK
- Damian Tolan
  - Department of Radiology, St James's University Hospital, Leeds Teaching Hospitals NHS Trust, Leeds, UK
- Ian Zealley
  - Department of Radiology, Ninewells Hospital, NHS Tayside, Dundee, UK
- Steve Halligan
  - Centre for Medical Imaging, Division of Medicine, University College London, London, UK
4
Li Y, Herold T, Mansmann U, Hornung R. Does combining numerous data types in multi-omics data improve or hinder performance in survival prediction? Insights from a large-scale benchmark study. BMC Med Inform Decis Mak 2024; 24:244. [PMID: 39223659 PMCID: PMC11370316 DOI: 10.1186/s12911-024-02642-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/06/2024] [Accepted: 08/21/2024] [Indexed: 09/04/2024] Open
Abstract
BACKGROUND Predictive modeling based on multi-omics data, which incorporates several types of omics data for the same patients, has shown potential to outperform single-omics predictive modeling. Most research in this domain focuses on incorporating numerous data types, despite the complexity and cost of acquiring them. The prevailing assumption is that increasing the number of data types necessarily improves predictive performance. However, the integration of less informative or redundant data types could potentially hinder this performance. Therefore, identifying the most effective combinations of omics data types that enhance predictive performance is critical for cost-effective and accurate predictions. METHODS In this study, we systematically evaluated the predictive performance of all 31 possible combinations including at least one of five genomic data types (mRNA, miRNA, methylation, DNAseq, and copy number variation) using 14 cancer datasets with right-censored survival outcomes, publicly available from the TCGA database. We employed various prediction methods and up-weighted clinical data in every model to leverage their predictive importance. Harrell's C-index and the integrated Brier Score were used as performance measures. To assess the robustness of our findings, we performed a bootstrap analysis at the level of the included datasets. Statistical testing was conducted for key results, limiting the number of tests to ensure a low risk of false positives. RESULTS Contrary to expectations, we found that using only mRNA data or a combination of mRNA and miRNA data was sufficient for most cancer types. For some cancer types, the additional inclusion of methylation data led to improved prediction results. Far from enhancing performance, the introduction of more data types most often resulted in a decline in performance, which varied between the two performance measures. 
CONCLUSIONS Our findings challenge the prevailing notion that combining multiple omics data types in multi-omics survival prediction improves predictive performance. Thus, the widespread approach in multi-omics prediction of incorporating as many data types as possible should be reconsidered to avoid suboptimal prediction results and unnecessary expenditure.
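The figure of 31 combinations quoted in the abstract above is simply the number of non-empty subsets of five data types (2^5 − 1). A minimal sketch that enumerates them is given below; the short labels and the helper name `nonempty_combinations` are illustrative assumptions, not taken from the paper or its code.

```python
from itertools import combinations

# The five data types named in the abstract; "CNV" is an assumed
# abbreviation for copy number variation.
DATA_TYPES = ["mRNA", "miRNA", "methylation", "DNAseq", "CNV"]


def nonempty_combinations(items):
    """Return every combination that contains at least one item."""
    return [
        combo
        for size in range(1, len(items) + 1)
        for combo in combinations(items, size)
    ]


ALL_COMBOS = nonempty_combinations(DATA_TYPES)
print(len(ALL_COMBOS))  # 2**5 - 1 = 31
```

Each tuple in `ALL_COMBOS` could then be treated as one candidate set of omics blocks to evaluate alongside the clinical variables.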
Affiliation(s)
- Yingxia Li
  - Institute for Medical Information Processing, Biometry and Epidemiology, LMU Munich, Marchioninistr. 15, 81377, Munich, Germany
- Tobias Herold
  - Laboratory for Leukemia Diagnostics, Department of Medicine III, LMU University Hospital, LMU Munich, Munich, Germany
- Ulrich Mansmann
  - Institute for Medical Information Processing, Biometry and Epidemiology, LMU Munich, Marchioninistr. 15, 81377, Munich, Germany
- Roman Hornung
  - Institute for Medical Information Processing, Biometry and Epidemiology, LMU Munich, Marchioninistr. 15, 81377, Munich, Germany
  - Munich Center for Machine Learning (MCML), Munich, Germany
5
Rahnenführer J, De Bin R, Benner A, Ambrogi F, Lusa L, Boulesteix AL, Migliavacca E, Binder H, Michiels S, Sauerbrei W, McShane L. Statistical analysis of high-dimensional biomedical data: a gentle introduction to analytical goals, common approaches and challenges. BMC Med 2023; 21:182. [PMID: 37189125 DOI: 10.1186/s12916-023-02858-y] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Received: 12/28/2022] [Accepted: 04/03/2023] [Indexed: 05/17/2023] Open
Abstract
BACKGROUND In high-dimensional data (HDD) settings, the number of variables associated with each observation is very large. Prominent examples of HDD in biomedical research include omics data with a large number of variables such as many measurements across the genome, proteome, or metabolome, as well as electronic health records data that have large numbers of variables recorded for each patient. The statistical analysis of such data requires knowledge and experience, sometimes of complex methods adapted to the respective research questions. METHODS Advances in statistical methodology and machine learning methods offer new opportunities for innovative analyses of HDD, but at the same time require a deeper understanding of some fundamental statistical concepts. Topic group TG9 "High-dimensional data" of the STRATOS (STRengthening Analytical Thinking for Observational Studies) initiative provides guidance for the analysis of observational studies, addressing particular statistical challenges and opportunities for the analysis of studies involving HDD. In this overview, we discuss key aspects of HDD analysis to provide a gentle introduction for non-statisticians and for classically trained statisticians with little experience specific to HDD. RESULTS The paper is organized with respect to subtopics that are most relevant for the analysis of HDD, in particular initial data analysis, exploratory data analysis, multiple testing, and prediction. For each subtopic, main analytical goals in HDD settings are outlined. For each of these goals, basic explanations for some commonly used analysis methods are provided. Situations are identified where traditional statistical methods cannot, or should not, be used in the HDD setting, or where adequate analytic tools are still lacking. Many key references are provided. 
CONCLUSIONS This review aims to provide a solid statistical foundation for researchers, including statisticians and non-statisticians, who are new to research with HDD or simply want to better evaluate and understand the results of HDD analyses.
Affiliation(s)
- Axel Benner
  - Division of Biostatistics, German Cancer Research Center (DKFZ), Heidelberg, Germany
- Federico Ambrogi
  - Department of Clinical Sciences and Community Health, University of Milan, Milan, Italy
  - Scientific Directorate, IRCCS Policlinico San Donato, San Donato Milanese, Italy
- Lara Lusa
  - Department of Mathematics, Faculty of Mathematics, Natural Sciences and Information Technology, University of Primorska, Koper, Slovenia
  - Institute of Biostatistics and Medical Informatics, University of Ljubljana, Ljubljana, Slovenia
- Anne-Laure Boulesteix
  - Institute for Medical Information Processing, Biometry and Epidemiology, Ludwig Maximilian University of Munich, Munich, Germany
- Harald Binder
  - Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center, University of Freiburg, Freiburg, Germany
- Stefan Michiels
  - Service de Biostatistique et d'Épidémiologie, Gustave Roussy, Université Paris-Saclay, Villejuif, France
  - Oncostat U1018, Inserm, Université Paris-Saclay, Labeled Ligue Contre le Cancer, Villejuif, France
- Willi Sauerbrei
  - Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center, University of Freiburg, Freiburg, Germany
- Lisa McShane
  - Biometric Research Program, Division of Cancer Treatment and Diagnosis, National Cancer Institute, Bethesda, MD, USA
6
Jardillier R, Koca D, Chatelain F, Guyon L. Optimal microRNA Sequencing Depth to Predict Cancer Patient Survival with Random Forest and Cox Models. Genes (Basel) 2022; 13:2275. [PMID: 36553544 PMCID: PMC9777708 DOI: 10.3390/genes13122275] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 10/27/2022] [Revised: 11/18/2022] [Accepted: 11/23/2022] [Indexed: 12/12/2022] Open
Abstract
(1) Background: tumor profiling enables patient survival prediction. The two essential parameters to be calibrated when designing a study based on tumor profiles from a cohort are the sequencing depth of RNA-seq technology and the number of patients. This calibration is carried out under cost constraints, and a compromise has to be found. In the context of survival data, the goal of this work is to benchmark the impact of the number of patients and of the sequencing depth of miRNA-seq and mRNA-seq on the predictive capabilities for both the Cox model with elastic net penalty and random survival forest. (2) Results: we first show that the Cox model and random survival forest provide comparable prediction capabilities, with significant differences for some cancers. Second, we demonstrate that miRNA and/or mRNA data improve prediction over clinical data alone. mRNA-seq data leads to slightly better prediction than miRNA-seq, with the notable exception of lung adenocarcinoma for which the tumor miRNA profile shows higher predictive power. Third, we demonstrate that the sequencing depth of RNA-seq data can be reduced for most of the investigated cancers without degrading the prediction abilities, allowing the creation of independent validation sets at a lower cost. Finally, we show that the number of patients in the training dataset can be reduced for the Cox model and random survival forest, allowing the use of different models on different patient subgroups.
Affiliation(s)
- Rémy Jardillier
  - Univ. Grenoble Alpes, CEA, Inserm, IRIG, BioSanté U1292, BCI, 38000 Grenoble, France
  - Univ. Grenoble Alpes, CNRS, Grenoble INP, GIPSA-Lab, Institute of Engineering University Grenoble Alpes, 38000 Grenoble, France
- Dzenis Koca
  - Univ. Grenoble Alpes, CEA, Inserm, IRIG, BioSanté U1292, BCI, 38000 Grenoble, France
- Florent Chatelain
  - Univ. Grenoble Alpes, CNRS, Grenoble INP, GIPSA-Lab, Institute of Engineering University Grenoble Alpes, 38000 Grenoble, France
- Laurent Guyon
  - Univ. Grenoble Alpes, CEA, Inserm, IRIG, BioSanté U1292, BCI, 38000 Grenoble, France
7
Jardillier R, Koca D, Chatelain F, Guyon L. Prognosis of lasso-like penalized Cox models with tumor profiling improves prediction over clinical data alone and benefits from bi-dimensional pre-screening. BMC Cancer 2022; 22:1045. [PMID: 36199072 PMCID: PMC9533541 DOI: 10.1186/s12885-022-10117-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/29/2022] [Accepted: 09/14/2022] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Prediction of patient survival from tumor molecular '-omics' data is a key step toward personalized medicine. Cox models performed on RNA profiling datasets are popular for clinical outcome predictions. But these models are applied in the context of "high dimension", as the number p of covariates (gene expressions) greatly exceeds the number n of patients and e of events. Thus, pre-screening together with penalization methods are widely used for dimensional reduction. METHODS In the present paper, (i) we benchmark the performance of the lasso penalization and three variants (i.e., ridge, elastic net, adaptive elastic net) on 16 cancers from TCGA after pre-screening, (ii) we propose a bi-dimensional pre-screening procedure based on both gene variability and p-values from single variable Cox models to predict survival, and (iii) we compare our results with iterative sure independence screening (ISIS). RESULTS First, we show that integration of mRNA-seq data with clinical data improves predictions over clinical data alone. Second, our bi-dimensional pre-screening procedure can only improve, in moderation, the C-index and/or the integrated Brier score, while excluding irrelevant genes for prediction. We demonstrate that the different penalization methods reached comparable prediction performances, with slight differences among datasets. Finally, we provide advice in the case of multi-omics data integration. CONCLUSIONS Tumor profiles convey more prognostic information than clinical variables such as stage for many cancer subtypes. Lasso and Ridge penalizations perform similarly to Elastic Net penalizations for Cox models in high-dimension. Pre-screening of the top 200 genes in terms of single variable Cox model p-values is a practical way to reduce dimension, which may be particularly useful when integrating multi-omics.
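The C-index used as a performance measure in the abstract above can be illustrated with a short, self-contained sketch of Harrell's concordance index for right-censored data. This is a plain quadratic-time version written for readability, not the authors' implementation; the function name `harrell_c_index` and the toy data are assumptions for illustration, and pairs with tied observed times are simply skipped.

```python
def harrell_c_index(times, events, risk_scores):
    """Harrell's concordance index for right-censored survival data.

    A pair (i, j) is comparable when subject i had an event strictly
    before subject j's observed time. The pair is concordant when
    subject i also has the higher predicted risk; tied risk scores
    contribute 0.5.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Made-up toy data: four subjects, the third one is censored.
toy_times = [2, 4, 6, 8]
toy_events = [1, 1, 0, 1]
toy_risks = [0.9, 0.7, 0.8, 0.1]
print(harrell_c_index(toy_times, toy_events, toy_risks))  # 0.8
```

In the toy data five pairs are comparable and four of them are ordered correctly by the risk score, giving 4/5 = 0.8. A value of 0.5 corresponds to random ordering and 1.0 to perfect ranking.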
Affiliation(s)
- Rémy Jardillier
  - IRIG, Biosanté U1292, Univ. Grenoble Alpes, Inserm, CEA, Grenoble, France
  - GIPSA-lab, Institute of Engineering University Grenoble Alpes, Univ. Grenoble Alpes, CNRS, Grenoble INP, Grenoble, France
- Dzenis Koca
  - IRIG, Biosanté U1292, Univ. Grenoble Alpes, Inserm, CEA, Grenoble, France
- Florent Chatelain
  - GIPSA-lab, Institute of Engineering University Grenoble Alpes, Univ. Grenoble Alpes, CNRS, Grenoble INP, Grenoble, France
- Laurent Guyon
  - IRIG, Biosanté U1292, Univ. Grenoble Alpes, Inserm, CEA, Grenoble, France
8
Couckuyt A, Seurinck R, Emmaneel A, Quintelier K, Novak D, Van Gassen S, Saeys Y. Challenges in translational machine learning. Hum Genet 2022; 141:1451-1466. [PMID: 35246744 PMCID: PMC8896412 DOI: 10.1007/s00439-022-02439-8] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 05/09/2021] [Accepted: 02/08/2022] [Indexed: 11/25/2022]
Abstract
Machine learning (ML) algorithms are increasingly being used to help implement clinical decision support systems. In this new field, which we define as "translational machine learning", joint efforts and strong communication between data scientists and clinicians help to span the gap between ML and its adoption in the clinic. These collaborations also improve interpretability and trust in translational ML methods and ultimately aim to result in generalizable and reproducible models. To help clinicians and bioinformaticians refine their translational ML pipelines, we review the steps from model building to the use of ML in the clinic. We discuss experimental setup, computational analysis, interpretability and reproducibility, and emphasize the challenges involved. We highly advise collaboration and data sharing between consortia and institutes to build multi-centric cohorts that facilitate ML methodologies that generalize across centers. In the end, we hope that this review provides a way to streamline translational ML and helps to tackle the challenges that come with it.
Affiliation(s)
- Artuur Couckuyt
  - Department of Applied Mathematics, Computer Science and Statistics, Ghent University, Gent, Belgium
  - Data Mining and Modeling for Biomedicine, VIB-UGent Center for Inflammation Research, Gent, Belgium
- Ruth Seurinck
  - Department of Applied Mathematics, Computer Science and Statistics, Ghent University, Gent, Belgium
  - Data Mining and Modeling for Biomedicine, VIB-UGent Center for Inflammation Research, Gent, Belgium
- Annelies Emmaneel
  - Department of Applied Mathematics, Computer Science and Statistics, Ghent University, Gent, Belgium
  - Data Mining and Modeling for Biomedicine, VIB-UGent Center for Inflammation Research, Gent, Belgium
- Katrien Quintelier
  - Department of Applied Mathematics, Computer Science and Statistics, Ghent University, Gent, Belgium
  - Data Mining and Modeling for Biomedicine, VIB-UGent Center for Inflammation Research, Gent, Belgium
  - Department of Pulmonary Diseases, Erasmus MC, Rotterdam, The Netherlands
- David Novak
  - Department of Applied Mathematics, Computer Science and Statistics, Ghent University, Gent, Belgium
  - Data Mining and Modeling for Biomedicine, VIB-UGent Center for Inflammation Research, Gent, Belgium
- Sofie Van Gassen
  - Department of Applied Mathematics, Computer Science and Statistics, Ghent University, Gent, Belgium
  - Data Mining and Modeling for Biomedicine, VIB-UGent Center for Inflammation Research, Gent, Belgium
- Yvan Saeys
  - Department of Applied Mathematics, Computer Science and Statistics, Ghent University, Gent, Belgium
  - Data Mining and Modeling for Biomedicine, VIB-UGent Center for Inflammation Research, Gent, Belgium
9
Diaz-Uriarte R, Gómez de Lope E, Giugno R, Fröhlich H, Nazarov PV, Nepomuceno-Chamorro IA, Rauschenberger A, Glaab E. Ten quick tips for biomarker discovery and validation analyses using machine learning. PLoS Comput Biol 2022; 18:e1010357. [PMID: 35951526 PMCID: PMC9371329 DOI: 10.1371/journal.pcbi.1010357] [Citation(s) in RCA: 20] [Impact Index Per Article: 6.7] [Indexed: 11/19/2022] Open
Affiliation(s)
- Ramon Diaz-Uriarte
  - Department of Biochemistry, School of Medicine, Universidad Autónoma de Madrid, Instituto de Investigaciones Biomédicas ‘Alberto Sols’ (UAM-CSIC), Madrid, Spain
- Elisa Gómez de Lope
  - Luxembourg Centre for Systems Biomedicine (LCSB), University of Luxembourg, Luxembourg
- Rosalba Giugno
  - Department of Computer Science, University of Verona, Verona, Italy
- Holger Fröhlich
  - Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Sankt Augustin, Germany
  - Bonn-Aachen International Centre for IT (b-it), Rheinische Friedrich-Wilhelms-Universität Bonn, Bonn, Germany
- Petr V. Nazarov
  - Department of Cancer Research, Luxembourg Institute of Health, Strassen, Luxembourg
- Armin Rauschenberger
  - Luxembourg Centre for Systems Biomedicine (LCSB), University of Luxembourg, Luxembourg
- Enrico Glaab
  - Luxembourg Centre for Systems Biomedicine (LCSB), University of Luxembourg, Luxembourg
10
|
Halligan S, Menu Y, Mallett S. Why did European Radiology reject my radiomic biomarker paper? How to correctly evaluate imaging biomarkers in a clinical setting. Eur Radiol 2021; 31:9361-9368. [PMID: 34003349 PMCID: PMC8589811 DOI: 10.1007/s00330-021-07971-1] [Citation(s) in RCA: 29] [Impact Index Per Article: 7.3] [Received: 01/15/2021] [Revised: 03/06/2021] [Accepted: 03/31/2021] [Indexed: 12/23/2022]
Abstract
This review explains in simple terms, accessible to the non-statistician, general principles regarding the correct research methods to develop and then evaluate imaging biomarkers in a clinical setting, including radiomic biomarkers. The distinction between diagnostic and prognostic biomarkers is made and emphasis placed on the need to assess clinical utility within the context of a multivariable model. Such models should not be restricted to imaging biomarkers and must include relevant disease and patient characteristics likely to be clinically useful. Biomarker utility is based on whether its addition to the basic clinical model improves diagnosis or prediction. Approaches to both model development and evaluation are explained and the need for adequate amounts of representative data stressed so as to avoid underpowering and overfitting. Advice is provided regarding how to report the research correctly.
KEY POINTS:
- Imaging biomarker research is common but methodological errors are encountered frequently that may mean the research is not clinically useful.
- The clinical utility of imaging biomarkers is best assessed by their additive effect on multivariable models based on clinical factors known to be important.
- The data used to develop such models should be sufficient for the number of variables investigated and the model should be evaluated, preferably using data unrelated to development.
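The additive-effect idea in this abstract can be illustrated with a small sketch. It assumes that discrimination is summarised by the area under the ROC curve (AUC) on held-out patients, which is one common choice and not something the review prescribes. The function names, the risk scores and the outcome labels below are all hypothetical and exist only to show the comparison between a baseline clinical model and the same model extended with a biomarker.

```python
from typing import Sequence


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Mann-Whitney estimate of the AUC.

    Returns the fraction of (positive, negative) pairs in which the positive
    case has the higher score; tied pairs count as one half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both outcome classes are required")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def added_value(base: Sequence[float], extended: Sequence[float],
                labels: Sequence[int]) -> float:
    """Change in AUC when the biomarker is added to the baseline model."""
    return auc(extended, labels) - auc(base, labels)


# Hypothetical held-out data: 1 = event occurred, 0 = no event.
labels = [0, 0, 1, 1, 0, 1]
# Hypothetical predicted risks from a clinical-only model ...
base = [0.20, 0.40, 0.35, 0.60, 0.50, 0.70]
# ... and from the same model with one biomarker added.
extended = [0.10, 0.30, 0.50, 0.70, 0.40, 0.80]

print(f"baseline AUC: {auc(base, labels):.3f}")
print(f"extended AUC: {auc(extended, labels):.3f}")
print(f"difference:   {added_value(base, extended, labels):.3f}")
```

A difference in AUC alone says nothing about calibration or clinical usefulness, and a real evaluation would also need an uncertainty estimate for that difference.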
Affiliation(s)
- Steve Halligan
  - Centre for Medical Imaging, University College London UCL, 43-45 Foley Street, London, W1W 7TS, UK
- Yves Menu
  - Department of Diagnostic and Interventional Radiology, Saint Antoine Hospital, APHP-Sorbonne University, Paris, France
- Sue Mallett
  - Centre for Medical Imaging, University College London UCL, 43-45 Foley Street, London, W1W 7TS, UK
|
11
|
Wilkinson J, Arnold KF, Murray EJ, van Smeden M, Carr K, Sippy R, de Kamps M, Beam A, Konigorski S, Lippert C, Gilthorpe MS, Tennant PWG. Time to reality check the promises of machine learning-powered precision medicine. Lancet Digit Health 2020; 2:e677-e680. [PMID: 33328030 PMCID: PMC9060421 DOI: 10.1016/s2589-7500(20)30200-4] [Citation(s) in RCA: 116] [Impact Index Per Article: 23.2] [Received: 06/24/2020] [Revised: 07/29/2020] [Accepted: 08/07/2020] [Indexed: 12/14/2022]
Abstract
Machine learning methods, combined with large electronic health databases, could enable a personalised approach to medicine through improved diagnosis and prediction of individual responses to therapies. If successful, this strategy would represent a revolution in clinical research and practice. However, although the vision of individually tailored medicine is alluring, there is a need to distinguish genuine potential from hype. We argue that the goal of personalised medical care faces serious challenges, many of which cannot be addressed through algorithmic complexity, and call for collaboration between traditional methodologists and experts in medical machine learning to avoid extensive research waste.
Affiliation(s)
- Jack Wilkinson
  - Centre for Biostatistics, Manchester Academic Health Science Centre, Division of Population Health, Health Services Research and Primary Care, University of Manchester, Manchester, UK
- Kellyn F Arnold
  - Leeds Institute for Data Analytics, University of Leeds, Leeds, UK; Faculty of Medicine and Health, University of Leeds, Leeds, UK
- Eleanor J Murray
  - Department of Epidemiology, Boston University School of Public Health, Boston, MA, USA
- Maarten van Smeden
  - Department of Clinical Epidemiology, Leiden University Medical Center, Leiden, Netherlands
- Kareem Carr
  - Department of Biostatistics, Harvard TH Chan School of Public Health, Boston, MA, USA
- Rachel Sippy
  - Institute for Global Health and Translational Science, SUNY Upstate Medical University, Syracuse, NY, USA; Department of Geography, University of Florida, Gainesville, FL, USA; Emerging Pathogens Institute, University of Florida, Gainesville, FL, USA
- Marc de Kamps
  - Leeds Institute for Data Analytics, University of Leeds, Leeds, UK; School of Computing, University of Leeds, Leeds, UK
- Andrew Beam
  - Department of Epidemiology, Boston University School of Public Health, Boston, MA, USA
- Stefan Konigorski
  - Digital Health & Machine Learning Research Group, Hasso Plattner Institut for Digital Engineering, Potsdam, Germany; Hasso Plattner Institute for Digital Health at Mount Sinai, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Christoph Lippert
  - Digital Health & Machine Learning Research Group, Hasso Plattner Institut for Digital Engineering, Potsdam, Germany; Hasso Plattner Institute for Digital Health at Mount Sinai, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Mark S Gilthorpe
  - Leeds Institute for Data Analytics, University of Leeds, Leeds, UK; Faculty of Medicine and Health, University of Leeds, Leeds, UK; Alan Turing Institute, London, UK
- Peter W G Tennant
  - Leeds Institute for Data Analytics, University of Leeds, Leeds, UK; Faculty of Medicine and Health, University of Leeds, Leeds, UK; Alan Turing Institute, London, UK
|
12
|
Abstract
OBJECTIVES: Barrett's esophagus (BE) is the precursor lesion and a major risk factor for esophageal adenocarcinoma (EAC). Although patients with BE undergo routine endoscopic surveillance, current screening methodologies have proven ineffective at identifying individuals at risk of EAC. Since microRNAs (miRNAs) have potential diagnostic and prognostic value as disease biomarkers, we sought to identify an miRNA signature of BE and EAC. METHODS: High-throughput sequencing of miRNAs was performed on serum and tissue biopsies from 31 patients identified either as normal, gastroesophageal reflux disease (GERD), BE, BE with low-grade dysplasia (LGD), or EAC. Logistic regression modeling of miRNA profiles with Lasso regularization was used to identify discriminating miRNA. Quantitative reverse transcription polymerase chain reaction was used to validate changes in miRNA expression using 46 formalin-fixed, paraffin-embedded specimens obtained from normal, GERD, BE, BE with LGD or high-grade dysplasia (HGD), and EAC subjects. RESULTS: A 3-class predictive model was able to classify tissue samples into normal, GERD/BE, or LGD/EAC classes with an accuracy of 80%. Sixteen miRNAs were identified that predicted 1 of the 3 classes. Our analysis confirmed previous reports indicating that miR-29c-3p and miR-193b-5p expressions are altered in BE and EAC and identified miR-4485-5p as a novel biomarker of esophageal dysplasia. Quantitative reverse transcription polymerase chain reaction validated 11 of 16 discriminating miRNAs. DISCUSSION: Our data provide an miRNA signature of normal, precancerous, and cancerous tissue that may stratify patients at risk of progressing to EAC. We found that serum miRNAs have a limited ability to distinguish between disease states, thus limiting their potential utility in early disease detection.
|
13
|
Engebretsen S, Glad IK. Partially linear monotone methods with automatic variable selection and monotonicity direction discovery. Stat Med 2020; 39:3549-3568. [PMID: 32851696 DOI: 10.1002/sim.8680] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/05/2020] [Revised: 05/07/2020] [Accepted: 06/10/2020] [Indexed: 11/10/2022]
Abstract
In many statistical regression and prediction problems, it is reasonable to assume monotone relationships between certain predictor variables and the outcome. Genomic effects on phenotypes are, for instance, often assumed to be monotone. However, in some settings, it may be reasonable to assume a partially linear model, where some of the covariates can be assumed to have a linear effect. One example is a prediction model using both high-dimensional gene expression data, and low-dimensional clinical data, or when combining continuous and categorical covariates. We study methods for fitting the partially linear monotone model, where some covariates are assumed to have a linear effect on the response, and some are assumed to have a monotone (potentially nonlinear) effect. Most existing methods in the literature for fitting such models are subject to the limitation that they have to be provided the monotonicity directions a priori for the different monotone effects. We here present methods for fitting partially linear monotone models which perform both automatic variable selection, and monotonicity direction discovery. The proposed methods perform comparably to, or better than, existing methods, in terms of estimation, prediction, and variable selection performance, in simulation experiments in both classical and high-dimensional data settings.
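As a reading aid, the partially linear monotone model described in this abstract can be written schematically as follows. The symbols and the additive form of the monotone part are assumptions made here for illustration; they are not taken from the paper, whose notation may differ.

```latex
\[
  y_i \;=\; \beta_0
        \;+\; \sum_{j=1}^{p} \beta_j \, x_{ij}
        \;+\; \sum_{k=1}^{q} g_k(z_{ik})
        \;+\; \varepsilon_i ,
  \qquad i = 1, \dots, n ,
\]
% x_{ij}: covariates assumed to act linearly (for example, clinical variables)
% z_{ik}: covariates assumed to act monotonically (for example, gene expression)
% g_k:    unknown functions, each either nondecreasing or nonincreasing
```

In this sketch, variable selection corresponds to setting some $\beta_j$ to zero or some $g_k$ to a constant, and monotonicity direction discovery corresponds to learning from the data whether each retained $g_k$ is nondecreasing or nonincreasing, so that the direction does not have to be supplied a priori.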
Affiliation(s)
- Ingrid K Glad
  - Department of Mathematics, University of Oslo, Oslo, Norway
|