1
Erboz A, Kesekler E, Gentili PL, Uversky VN, Coskuner-Weber O. Electromagnetic radiation and biophoton emission in neuronal communication and neurodegenerative diseases. Progress in Biophysics and Molecular Biology 2025; 195:87-99. [PMID: 39732343 DOI: 10.1016/j.pbiomolbio.2024.12.004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/15/2024] [Revised: 12/08/2024] [Accepted: 12/24/2024] [Indexed: 12/30/2024]
Abstract
The intersection of electromagnetic radiation and neuronal communication, focusing on the potential role of biophoton emission in brain function and neurodegenerative diseases, is an emerging research area. Traditionally, it is believed that neurons encode and communicate information via electrochemical impulses, generating electromagnetic fields detectable by EEG and MEG. Recent discoveries indicate that neurons may also emit biophotons, suggesting an additional communication channel alongside the regular synaptic interactions. This dual signaling system is analyzed for its potential in synchronizing neuronal activity and improving information transfer, with implications for brain-like computing systems. The clinical relevance is explored through the lens of neurodegenerative diseases and intrinsically disordered proteins, where oxidative stress may alter biophoton emission, offering clues for pathological conditions, such as Alzheimer's and Parkinson's diseases. The potential therapeutic use of Low-Level Laser Therapy (LLLT) is also examined for its ability to modulate biophoton activity and mitigate oxidative stress, presenting new opportunities for treatment. Here, we invite further exploration into the intricate roles that electromagnetic phenomena play in brain function, potentially leading to breakthroughs in computational neuroscience and medical therapies for neurodegenerative diseases.
Affiliation(s)
- Aysin Erboz
- Molecular Biotechnology, Turkish-German University, Sahinkaya Caddesi No. 106, Beykoz, Istanbul, 34820, Turkey
- Elif Kesekler
- Molecular Biotechnology, Turkish-German University, Sahinkaya Caddesi No. 106, Beykoz, Istanbul, 34820, Turkey
- Pier Luigi Gentili
- Department of Chemistry, Biology, and Biotechnology, Università degli Studi di Perugia, 06123, Perugia, Italy.
- Vladimir N Uversky
- Department of Molecular Medicine and USF Health Byrd Alzheimer's Institute, Morsani College of Medicine, University of South Florida, 12901 Bruce B. Downs Blvd., MDC07, Tampa, FL 33612, USA.
- Orkid Coskuner-Weber
- Molecular Biotechnology, Turkish-German University, Sahinkaya Caddesi No. 106, Beykoz, Istanbul, 34820, Turkey.
2
Schreck N, Slynko A, Saadati M, Benner A. Statistical plasmode simulations - Potentials, challenges and recommendations. Stat Med 2024; 43:1804-1825. [PMID: 38356231 DOI: 10.1002/sim.10012] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/10/2023] [Revised: 12/18/2023] [Accepted: 01/02/2024] [Indexed: 02/16/2024]
Abstract
Statistical data simulation is essential in the development of statistical models and methods as well as in their performance evaluation. To capture complex data structures, in particular for high-dimensional data, a variety of simulation approaches have been introduced including parametric and the so-called plasmode simulations. While there are concerns about the realism of parametrically simulated data, it is widely claimed that plasmodes come very close to reality with some aspects of the "truth" known. However, there are no explicit guidelines or state-of-the-art on how to perform plasmode data simulations. In the present paper, we first review existing literature and introduce the concept of statistical plasmode simulation. We then discuss advantages and challenges of statistical plasmodes and provide a step-wise procedure for their generation, including key steps to their implementation and reporting. Finally, we illustrate the concept of statistical plasmodes as well as the proposed plasmode generation procedure by means of a public real RNA data set on breast carcinoma patients.
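As a rough illustration of the plasmode idea summarized in this abstract, the sketch below resamples rows of an observed covariate matrix (so the real correlation structure is kept) and attaches an outcome generated from coefficients chosen by the analyst (so that part of the "truth" is known). This is only one simple variant under stated assumptions; the function name `plasmode_dataset`, the linear outcome model, and the toy data are hypothetical and are not taken from the paper's step-wise procedure.

```python
import random

def plasmode_dataset(real_covariates, true_coefs, noise_sd, n, seed=0):
    """Resample real covariate rows and attach a simulated outcome.

    real_covariates: list of rows (lists of floats) from an observed data set.
    true_coefs: coefficients chosen by the analyst, so the "truth" is known.
    noise_sd: standard deviation of the Gaussian noise added to the outcome.
    """
    rng = random.Random(seed)
    # Draw whole rows with replacement, preserving the observed dependence
    # between covariates.
    rows = [rng.choice(real_covariates) for _ in range(n)]
    # Generate the outcome from an assumed linear model (an illustrative choice).
    outcomes = [
        sum(b * x for b, x in zip(true_coefs, row)) + rng.gauss(0.0, noise_sd)
        for row in rows
    ]
    return rows, outcomes

# Hypothetical toy "real" data: 4 samples, 2 covariates.
observed = [[1.0, 0.5], [2.0, 1.5], [0.0, 3.0], [4.0, 2.0]]
X, y = plasmode_dataset(observed, true_coefs=[2.0, -1.0], noise_sd=0.0, n=10)
```

With `noise_sd=0.0` every simulated outcome equals the linear predictor exactly, which makes it easy to check that a method under evaluation recovers the chosen coefficients.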
Affiliation(s)
- Nicholas Schreck
- Division of Biostatistics, German Cancer Research Center (DKFZ), Heidelberg, Germany
- Alla Slynko
- Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, Ontario, Canada
- Maral Saadati
- Division of Biostatistics, German Cancer Research Center (DKFZ), Heidelberg, Germany
- Axel Benner
- Division of Biostatistics, German Cancer Research Center (DKFZ), Heidelberg, Germany
3
Rahnenführer J, De Bin R, Benner A, Ambrogi F, Lusa L, Boulesteix AL, Migliavacca E, Binder H, Michiels S, Sauerbrei W, McShane L. Statistical analysis of high-dimensional biomedical data: a gentle introduction to analytical goals, common approaches and challenges. BMC Med 2023; 21:182. [PMID: 37189125 DOI: 10.1186/s12916-023-02858-y] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Received: 12/28/2022] [Accepted: 04/03/2023] [Indexed: 05/17/2023] Open
Abstract
BACKGROUND In high-dimensional data (HDD) settings, the number of variables associated with each observation is very large. Prominent examples of HDD in biomedical research include omics data with a large number of variables such as many measurements across the genome, proteome, or metabolome, as well as electronic health records data that have large numbers of variables recorded for each patient. The statistical analysis of such data requires knowledge and experience, sometimes of complex methods adapted to the respective research questions. METHODS Advances in statistical methodology and machine learning methods offer new opportunities for innovative analyses of HDD, but at the same time require a deeper understanding of some fundamental statistical concepts. Topic group TG9 "High-dimensional data" of the STRATOS (STRengthening Analytical Thinking for Observational Studies) initiative provides guidance for the analysis of observational studies, addressing particular statistical challenges and opportunities for the analysis of studies involving HDD. In this overview, we discuss key aspects of HDD analysis to provide a gentle introduction for non-statisticians and for classically trained statisticians with little experience specific to HDD. RESULTS The paper is organized with respect to subtopics that are most relevant for the analysis of HDD, in particular initial data analysis, exploratory data analysis, multiple testing, and prediction. For each subtopic, main analytical goals in HDD settings are outlined. For each of these goals, basic explanations for some commonly used analysis methods are provided. Situations are identified where traditional statistical methods cannot, or should not, be used in the HDD setting, or where adequate analytic tools are still lacking. Many key references are provided. 
CONCLUSIONS This review aims to provide a solid statistical foundation for researchers, including statisticians and non-statisticians, who are new to research with HDD or simply want to better evaluate and understand the results of HDD analyses.
Affiliation(s)
- Axel Benner
- Division of Biostatistics, German Cancer Research Center (DKFZ), Heidelberg, Germany
- Federico Ambrogi
- Department of Clinical Sciences and Community Health, University of Milan, Milan, Italy
- Scientific Directorate, IRCCS Policlinico San Donato, San Donato Milanese, Italy
- Lara Lusa
- Department of Mathematics, Faculty of Mathematics, Natural Sciences and Information Technology, University of Primorska, Koper, Slovenia
- Institute of Biostatistics and Medical Informatics, University of Ljubljana, Ljubljana, Slovenia
- Anne-Laure Boulesteix
- Institute for Medical Information Processing, Biometry and Epidemiology, Ludwig Maximilian University of Munich, Munich, Germany
- Harald Binder
- Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center, University of Freiburg, Freiburg, Germany
- Stefan Michiels
- Service de Biostatistique et d'Épidémiologie, Gustave Roussy, Université Paris-Saclay, Villejuif, France
- Oncostat U1018, Inserm, Université Paris-Saclay, Labeled Ligue Contre le Cancer, Villejuif, France
- Willi Sauerbrei
- Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center, University of Freiburg, Freiburg, Germany
- Lisa McShane
- Biometric Research Program, Division of Cancer Treatment and Diagnosis, National Cancer Institute, Bethesda, MD, USA.
4
Jardillier R, Koca D, Chatelain F, Guyon L. Prognosis of lasso-like penalized Cox models with tumor profiling improves prediction over clinical data alone and benefits from bi-dimensional pre-screening. BMC Cancer 2022; 22:1045. [PMID: 36199072 PMCID: PMC9533541 DOI: 10.1186/s12885-022-10117-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/29/2022] [Accepted: 09/14/2022] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Prediction of patient survival from tumor molecular '-omics' data is a key step toward personalized medicine. Cox models performed on RNA profiling datasets are popular for clinical outcome predictions. But these models are applied in the context of "high dimension", as the number p of covariates (gene expressions) greatly exceeds the number n of patients and e of events. Thus, pre-screening together with penalization methods are widely used for dimensional reduction. METHODS In the present paper, (i) we benchmark the performance of the lasso penalization and three variants (i.e., ridge, elastic net, adaptive elastic net) on 16 cancers from TCGA after pre-screening, (ii) we propose a bi-dimensional pre-screening procedure based on both gene variability and p-values from single variable Cox models to predict survival, and (iii) we compare our results with iterative sure independence screening (ISIS). RESULTS First, we show that integration of mRNA-seq data with clinical data improves predictions over clinical data alone. Second, our bi-dimensional pre-screening procedure can only improve, in moderation, the C-index and/or the integrated Brier score, while excluding irrelevant genes for prediction. We demonstrate that the different penalization methods reached comparable prediction performances, with slight differences among datasets. Finally, we provide advice in the case of multi-omics data integration. CONCLUSIONS Tumor profiles convey more prognostic information than clinical variables such as stage for many cancer subtypes. Lasso and Ridge penalizations perform similarly to Elastic Net penalizations for Cox models in high-dimension. Pre-screening of the top 200 genes in terms of single variable Cox model p-values is a practical way to reduce dimension, which may be particularly useful when integrating multi-omics.
Affiliation(s)
- Rémy Jardillier
- IRIG, Biosanté U1292, Univ. Grenoble Alpes, Inserm, CEA, Grenoble, France
- GIPSA-lab, Institute of Engineering University Grenoble Alpes, Univ. Grenoble Alpes, CNRS, Grenoble INP, Grenoble, France
- Dzenis Koca
- IRIG, Biosanté U1292, Univ. Grenoble Alpes, Inserm, CEA, Grenoble, France
- Florent Chatelain
- GIPSA-lab, Institute of Engineering University Grenoble Alpes, Univ. Grenoble Alpes, CNRS, Grenoble INP, Grenoble, France
- Laurent Guyon
- IRIG, Biosanté U1292, Univ. Grenoble Alpes, Inserm, CEA, Grenoble, France
5
Diaz-Uriarte R, Gómez de Lope E, Giugno R, Fröhlich H, Nazarov PV, Nepomuceno-Chamorro IA, Rauschenberger A, Glaab E. Ten quick tips for biomarker discovery and validation analyses using machine learning. PLoS Comput Biol 2022; 18:e1010357. [PMID: 35951526 PMCID: PMC9371329 DOI: 10.1371/journal.pcbi.1010357] [Citation(s) in RCA: 20] [Impact Index Per Article: 6.7] [Indexed: 11/19/2022] Open
Affiliation(s)
- Ramon Diaz-Uriarte
- Department of Biochemistry, School of Medicine, Universidad Autónoma de Madrid, Instituto de Investigaciones Biomédicas ‘Alberto Sols’ (UAM-CSIC), Madrid, Spain
- Elisa Gómez de Lope
- Luxembourg Centre for Systems Biomedicine (LCSB), University of Luxembourg, Luxembourg
- Rosalba Giugno
- Department of Computer Science, University of Verona, Verona, Italy
- Holger Fröhlich
- Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Sankt Augustin, Germany
- Bonn-Aachen International Centre for IT (b-it), Rheinische Friedrich-Wilhelms-Universität Bonn, Bonn, Germany
- Petr V. Nazarov
- Department of Cancer Research, Luxembourg Institute of Health, Strassen, Luxembourg
- Armin Rauschenberger
- Luxembourg Centre for Systems Biomedicine (LCSB), University of Luxembourg, Luxembourg
- Enrico Glaab
- Luxembourg Centre for Systems Biomedicine (LCSB), University of Luxembourg, Luxembourg
6
Sauerbrei W, Royston P. Investigating treatment-effect modification by a continuous covariate in IPD meta-analysis: an approach using fractional polynomials. BMC Med Res Methodol 2022; 22:98. [PMID: 35382744 PMCID: PMC8985287 DOI: 10.1186/s12874-022-01516-w] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 06/24/2021] [Accepted: 01/17/2022] [Indexed: 11/26/2022] Open
Abstract
BACKGROUND In clinical trials, there is considerable interest in investigating whether a treatment effect is similar in all patients, or that one or more prognostic variables indicate a differential response to treatment. To examine this, a continuous predictor is usually categorised into groups according to one or more cutpoints. Several weaknesses of categorization are well known. To avoid the disadvantages of cutpoints and to retain full information, it is preferable to keep continuous variables continuous in the analysis. To handle this issue, the Subpopulation Treatment Effect Pattern Plot (STEPP) was proposed about two decades ago, followed by the multivariable fractional polynomial interaction (MFPI) approach. Provided individual patient data (IPD) from several studies are available, it is possible to investigate for treatment heterogeneity with meta-analysis techniques. Meta-STEPP was recently proposed and in patients with primary breast cancer an interaction of estrogen receptors with chemotherapy was investigated in eight randomized controlled trials (RCTs). METHODS We use data from eight randomized controlled trials in breast cancer to illustrate issues from two main tasks. The first task is to derive a treatment effect function (TEF), that is, a measure of the treatment effect on the continuous scale of the covariate in the individual studies. The second is to conduct a meta-analysis of the continuous TEFs from the eight studies by applying pointwise averaging to obtain a mean function. We denote the method metaTEF. To improve reporting of available data and all steps of the analysis we introduce a three-part profile called MethProf-MA. 
RESULTS Although there are considerable differences between the studies (populations with large differences in prognosis, sample size, effective sample size, length of follow up, proportion of patients with very low estrogen receptor values) our results provide clear evidence of an interaction, irrespective of the choice of the FP function and random or fixed effect models. CONCLUSIONS In contrast to cutpoint-based analyses, metaTEF retains the full information from continuous covariates and avoids several critical issues when performing IPD meta-analyses of continuous effect modifiers in randomised trials. Early experience suggests it is a promising approach. TRIAL REGISTRATION Not applicable.
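The pointwise averaging step mentioned in this abstract can be sketched as follows. The sketch assumes that each study's treatment effect function has already been evaluated on a common grid of covariate values, and it uses inverse-variance (fixed-effect style) weights at each grid point. The weighting scheme, the function name `pointwise_mean`, and the toy numbers are illustrative assumptions, not the authors' exact metaTEF implementation.

```python
def pointwise_mean(tefs, variances):
    """Inverse-variance weighted mean of study-level functions at each grid point.

    tefs[s][g]: estimated treatment effect of study s at grid point g.
    variances[s][g]: variance of that estimate (must be positive).
    """
    n_grid = len(tefs[0])
    mean = []
    for g in range(n_grid):
        # Studies with more precise estimates at this grid point get more weight.
        weights = [1.0 / v[g] for v in variances]
        total = sum(weights)
        mean.append(sum(w * t[g] for w, t in zip(weights, tefs)) / total)
    return mean

# Hypothetical toy input: two studies evaluated on a grid of two covariate values.
tefs = [[0.0, 1.0], [2.0, 3.0]]
variances = [[1.0, 1.0], [3.0, 3.0]]
mean_tef = pointwise_mean(tefs, variances)
```

Because the first study has the smaller variance, the averaged function lies closer to its curve than to the second study's curve at both grid points.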
Affiliation(s)
- Willi Sauerbrei
- Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center - University of Freiburg, Freiburg, Germany.
- Patrick Royston
- MRC Clinical Trials Unit at UCL, Institute of Clinical Trials and Methodology, University College London, London, UK
7
Tarazona S, Arzalluz-Luque A, Conesa A. Undisclosed, unmet and neglected challenges in multi-omics studies. Nature Computational Science 2021; 1:395-402. [PMID: 38217236 DOI: 10.1038/s43588-021-00086-z] [Citation(s) in RCA: 77] [Impact Index Per Article: 19.3] [Received: 03/18/2021] [Accepted: 05/17/2021] [Indexed: 01/15/2024]
Abstract
Multi-omics approaches have become a reality in both large genomics projects and small laboratories. However, the multi-omics research community still faces a number of issues that have either not been sufficiently discussed or for which current solutions are still limited. In this Perspective, we elaborate on these limitations and suggest points of attention for future research. We finally discuss new opportunities and challenges brought to the field by the rapid development of single-cell high-throughput molecular technologies.
Affiliation(s)
- Sonia Tarazona
- Department of Applied Statistics, Operations Research and Quality, Universitat Politècnica de València, Valencia, Spain
- Angeles Arzalluz-Luque
- Department of Applied Statistics, Operations Research and Quality, Universitat Politècnica de València, Valencia, Spain
- Ana Conesa
- Microbiology and Cell Science Department, Institute for Food and Agricultural Research, University of Florida, Gainesville, FL, USA.
- Genetics Institute, University of Florida, Gainesville, FL, USA.
- Institute for Integrative Systems Biology, Spanish National Research Council, Valencia, Spain.
8
Boulesteix AL, Groenwold RH, Abrahamowicz M, Binder H, Briel M, Hornung R, Morris TP, Rahnenführer J, Sauerbrei W. Introduction to statistical simulations in health research. BMJ Open 2020; 10:e039921. [PMID: 33318113 PMCID: PMC7737058 DOI: 10.1136/bmjopen-2020-039921] [Citation(s) in RCA: 25] [Impact Index Per Article: 5.0] [Indexed: 12/23/2022] Open
Abstract
In health research, statistical methods are frequently used to address a wide variety of research questions. For almost every analytical challenge, different methods are available. But how do we choose between different methods and how do we judge whether the chosen method is appropriate for our specific study? Like in any science, in statistics, experiments can be run to find out which methods should be used under which circumstances. The main objective of this paper is to demonstrate that simulation studies, that is, experiments investigating synthetic data with known properties, are an invaluable tool for addressing these questions. We aim to provide a first introduction to simulation studies for data analysts or, more generally, for researchers involved at different levels in the analyses of health data, who (1) may rely on simulation studies published in statistical literature to choose their statistical methods and who, thus, need to understand the criteria of assessing the validity and relevance of simulation results and their interpretation; and/or (2) need to understand the basic principles of designing statistical simulations in order to efficiently collaborate with more experienced colleagues or start learning to conduct their own simulations. We illustrate the implementation of a simulation study and the interpretation of its results through a simple example inspired by recent literature, which is completely reproducible using the R-script available from online supplemental file 1.
Affiliation(s)
- Anne-Laure Boulesteix
- Institute for Medical Information Processing, Biometry and Epidemiology, Ludwig Maximilian University of Munich, Munich, Germany
- Rolf HH Groenwold
- Department of Clinical Epidemiology, Leiden University Medical Centre, Leiden, The Netherlands
- Department of Biomedical Data Science, Leiden University Medical Centre, Leiden, The Netherlands
- Michal Abrahamowicz
- Department of Epidemiology, Biostatistics and Occupational Health, McGill University, Montreal, Quebec, Canada
- Harald Binder
- Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center, University of Freiburg, Freiburg im Breisgau, Germany
- Matthias Briel
- Department of Clinical Research, Institute for Clinical Epidemiology and Biostatistics, University Hospital Basel and University of Basel, Basel, Switzerland
- Department of Health Research Methods, Evidence, and Impact, McMaster University, Hamilton, Ontario, Canada
- Roman Hornung
- Institute for Medical Information Processing, Biometry and Epidemiology, Ludwig Maximilian University of Munich, Munich, Germany
- Jörg Rahnenführer
- Department of Statistics, TU Dortmund University, Dortmund, Nordrhein-Westfalen, Germany
- Willi Sauerbrei
- Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center, University of Freiburg, Freiburg im Breisgau, Germany
9
Herrmann M, Probst P, Hornung R, Jurinovic V, Boulesteix AL. Large-scale benchmark study of survival prediction methods using multi-omics data. Brief Bioinform 2020; 22:5895463. [PMID: 32823283 PMCID: PMC8138887 DOI: 10.1093/bib/bbaa167] [Citation(s) in RCA: 46] [Impact Index Per Article: 9.2] [Received: 03/07/2020] [Revised: 06/25/2020] [Accepted: 07/03/2020] [Indexed: 12/18/2022] Open
Abstract
Multi-omics data, that is, datasets containing different types of high-dimensional molecular variables, are increasingly often generated for the investigation of various diseases. Nevertheless, questions remain regarding the usefulness of multi-omics data for the prediction of disease outcomes such as survival time. It is also unclear which methods are most appropriate to derive such prediction models. We aim to give some answers to these questions through a large-scale benchmark study using real data. Different prediction methods from machine learning and statistics were applied on 18 multi-omics cancer datasets (35 to 1000 observations, up to 100 000 variables) from the database 'The Cancer Genome Atlas' (TCGA). The considered outcome was the (censored) survival time. Eleven methods based on boosting, penalized regression and random forest were compared, comprising both methods that do and that do not take the group structure of the omics variables into account. The Kaplan-Meier estimate and a Cox model using only clinical variables were used as reference methods. The methods were compared using several repetitions of 5-fold cross-validation. Uno's C-index and the integrated Brier score served as performance metrics. The results indicate that methods taking into account the multi-omics structure have a slightly better prediction performance. Taking this structure into account can protect the predictive information in low-dimensional groups (especially clinical variables) from not being exploited during prediction. Moreover, only the block forest method outperformed the Cox model on average, and only slightly. This indicates, as a by-product of our study, that in the considered TCGA studies the utility of multi-omics data for prediction purposes was limited. Contact: moritz.herrmann@stat.uni-muenchen.de, +49 89 2180 3198. Supplementary information: Supplementary data are available at Briefings in Bioinformatics online.
All analyses are reproducible using R code freely available on Github.
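The evaluation scheme named in this abstract, several repetitions of 5-fold cross-validation, can be illustrated with a small index-splitting sketch. It only generates train/test indices; model fitting and the performance metrics (Uno's C-index, integrated Brier score) are left out. The function name `repeated_kfold` and its defaults are hypothetical, and the study's own R code may split the data differently (for example with stratification).

```python
import random

def repeated_kfold(n, k=5, repeats=3, seed=0):
    """Yield (repeat, fold, train_indices, test_indices) for repeated k-fold CV."""
    rng = random.Random(seed)
    for r in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)  # a fresh random partition for every repetition
        # Deal the shuffled indices into k folds of near-equal size.
        folds = [idx[i::k] for i in range(k)]
        for f in range(k):
            test = folds[f]
            train = [i for j, fold in enumerate(folds) if j != f for i in fold]
            yield r, f, train, test

# Example: 23 observations, 3 repetitions of 5-fold cross-validation.
splits = list(repeated_kfold(23, k=5, repeats=3, seed=1))
```

Averaging a metric over all `repeats * k` test folds reduces the dependence of the result on one particular random partition.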
Affiliation(s)
- Moritz Herrmann
- Department of Statistics, Ludwig Maximilian University, Munich, 80539, Germany
- Philipp Probst
- Institute for Medical Information Processing, Biometry, and Epidemiology, Ludwig Maximilian University, Munich, 81377, Germany
- Roman Hornung
- Institute for Medical Information Processing, Biometry, and Epidemiology, Ludwig Maximilian University, Munich, 81377, Germany
- Vindi Jurinovic
- Institute for Medical Information Processing, Biometry, and Epidemiology, Ludwig Maximilian University, Munich, 81377, Germany
- Anne-Laure Boulesteix
- Institute for Medical Information Processing, Biometry, and Epidemiology, Ludwig Maximilian University, Munich, 81377, Germany
10
Samaga D, Hornung R, Braselmann H, Hess J, Zitzelsberger H, Belka C, Boulesteix AL, Unger K. Single-center versus multi-center data sets for molecular prognostic modeling: a simulation study. Radiat Oncol 2020; 15:109. [PMID: 32410693 PMCID: PMC7227093 DOI: 10.1186/s13014-020-01543-1] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 11/19/2019] [Accepted: 04/22/2020] [Indexed: 02/07/2023] Open
Abstract
Background Prognostic models based on high-dimensional omics data generated from clinical patient samples, such as tumor tissues or biopsies, are increasingly used for prognosis of radio-therapeutic success. The model development process requires two independent discovery and validation data sets. Each of them may contain samples collected in a single center or a collection of samples from multiple centers. Multi-center data tend to be more heterogeneous than single-center data but are less affected by potential site-specific biases. Optimal use of limited data resources for discovery and validation with respect to the expected success of a study requires dispassionate, objective decision-making. In this work, we addressed the impact of the choice of single-center and multi-center data as discovery and validation data sets, and assessed how this impact depends on the three data characteristics signal strength, number of informative features and sample size. Methods We set up a simulation study to quantify the predictive performance of a model trained and validated on different combinations of in silico single-center and multi-center data. The standard bioinformatical analysis workflow of batch correction, feature selection and parameter estimation was emulated. For the determination of model quality, four measures were used: false discovery rate, prediction error, chance of successful validation (significant correlation of predicted and true validation data outcome) and model calibration. Results In agreement with literature about generalizability of signatures, prognostic models fitted to multi-center data consistently outperformed their single-center counterparts when the prediction error was the quality criterion of interest. However, for low signal strengths and small sample sizes, single-center discovery sets showed superior performance with respect to false discovery rate and chance of successful validation. 
Conclusions With regard to decision making, this simulation study underlines the importance of study aims being defined precisely a priori. Minimization of the prediction error requires multi-center discovery data, whereas single-center data are preferable with respect to false discovery rate and chance of successful validation when the expected signal or sample size is low. In contrast, the choice of validation data solely affects the quality of the estimator of the prediction error, which was more precise on multi-center validation data.
Affiliation(s)
- Daniel Samaga
- Helmholtz Zentrum, München, Ingolstädter Landstr. 1, Neuherberg, 85764, Germany.
- Roman Hornung
- Department of Medical Information Processing, Biometry and Epidemiology, University of Munich, Marchioninistr. 15, Munich, 81377, Germany
- Herbert Braselmann
- Helmholtz Zentrum, München, Ingolstädter Landstr. 1, Neuherberg, 85764, Germany
- Julia Hess
- Helmholtz Zentrum, München, Ingolstädter Landstr. 1, Neuherberg, 85764, Germany
- Clinical Cooperation Group Personalized Radiotherapy in Head and Neck Cancer, Helmholtz Zentrum München, Research Center for Environmental Health (GmbH), Munich, Ingolstädter Landstr. 1, Munich, 85764, Germany
- Department of Radiation Oncology, University Hospital, LMU Munich, Marchioninistr. 15, Munich, 81377, Germany
- Horst Zitzelsberger
- Helmholtz Zentrum, München, Ingolstädter Landstr. 1, Neuherberg, 85764, Germany
- Clinical Cooperation Group Personalized Radiotherapy in Head and Neck Cancer, Helmholtz Zentrum München, Research Center for Environmental Health (GmbH), Munich, Ingolstädter Landstr. 1, Munich, 85764, Germany
- Department of Radiation Oncology, University Hospital, LMU Munich, Marchioninistr. 15, Munich, 81377, Germany
- Claus Belka
- Clinical Cooperation Group Personalized Radiotherapy in Head and Neck Cancer, Helmholtz Zentrum München, Research Center for Environmental Health (GmbH), Munich, Ingolstädter Landstr. 1, Munich, 85764, Germany
- Department of Radiation Oncology, University Hospital, LMU Munich, Marchioninistr. 15, Munich, 81377, Germany
- Anne-Laure Boulesteix
- Department of Medical Information Processing, Biometry and Epidemiology, University of Munich, Marchioninistr. 15, Munich, 81377, Germany
- Kristian Unger
- Helmholtz Zentrum, München, Ingolstädter Landstr. 1, Neuherberg, 85764, Germany
- Clinical Cooperation Group Personalized Radiotherapy in Head and Neck Cancer, Helmholtz Zentrum München, Research Center for Environmental Health (GmbH), Munich, Ingolstädter Landstr. 1, Munich, 85764, Germany
- Department of Radiation Oncology, University Hospital, LMU Munich, Marchioninistr. 15, Munich, 81377, Germany