1
Tsamardinos I. Don't lose samples to estimation. Patterns (N Y) 2022; 3:100612. [PMID: 36569551 PMCID: PMC9782254 DOI: 10.1016/j.patter.2022.100612]
Abstract
In a typical predictive modeling task, we are asked to produce a final predictive model to employ operationally for predictions, as well as an estimate of its out-of-sample predictive performance. Typically, analysts hold out a portion of the available data, called a Test set, to estimate the model predictive performance on unseen (out-of-sample) records, thus "losing these samples to estimation." However, this practice is unacceptable when the total sample size is low. To avoid losing data to estimation, we need a shift in our perspective: we do not estimate the performance of a specific model instance; we estimate the performance of the pipeline that produces the model. This pipeline is applied on all available samples to produce the final model; no samples are lost to estimation. An estimate of its performance is provided by training the same pipeline on subsets of the samples. When multiple pipelines are tried, additional considerations that correct for the "winner's curse" need to be in place.
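A minimal sketch of this pipeline-centric view (scikit-learn is an assumed, illustrative choice, not code from the article): the cross-validation estimate is attached to the whole pipeline, and the final model is that same pipeline refit on every available sample.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=80, n_features=200, random_state=0)  # small-sample regime

# The pipeline, not a single fitted model instance, is the object whose performance we estimate.
pipe = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression(max_iter=1000))])

# Performance estimate: train the same pipeline on subsets of the samples (CV folds).
scores = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc")
print(f"estimated AUC of the pipeline: {scores.mean():.3f}")

# Final model: apply the pipeline to ALL samples; no samples are lost to estimation.
final_model = pipe.fit(X, y)
```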
Affiliation(s)
- Ioannis Tsamardinos
- Computer Science Department, University of Crete, Heraklion, Greece; JADBio – Gnosis DA S.A., Heraklion, Greece; Institute of Applied and Computational Mathematics, Foundation for Research and Technology, Hellas, Heraklion, Greece. Corresponding author.
2
Herrmann M, Probst P, Hornung R, Jurinovic V, Boulesteix AL. Large-scale benchmark study of survival prediction methods using multi-omics data. Brief Bioinform 2020; 22:5895463. [PMID: 32823283 PMCID: PMC8138887 DOI: 10.1093/bib/bbaa167]
Abstract
Multi-omics data, that is, datasets containing different types of high-dimensional molecular variables, are increasingly often generated for the investigation of various diseases. Nevertheless, questions remain regarding the usefulness of multi-omics data for the prediction of disease outcomes such as survival time. It is also unclear which methods are most appropriate to derive such prediction models. We aim to give some answers to these questions through a large-scale benchmark study using real data. Different prediction methods from machine learning and statistics were applied on 18 multi-omics cancer datasets (35 to 1000 observations, up to 100 000 variables) from the database 'The Cancer Genome Atlas' (TCGA). The considered outcome was the (censored) survival time. Eleven methods based on boosting, penalized regression and random forest were compared, comprising both methods that do and that do not take the group structure of the omics variables into account. The Kaplan-Meier estimate and a Cox model using only clinical variables were used as reference methods. The methods were compared using several repetitions of 5-fold cross-validation. Uno's C-index and the integrated Brier score served as performance metrics. The results indicate that methods taking the multi-omics structure into account have a slightly better prediction performance. Taking this structure into account can keep the predictive information in low-dimensional groups, especially clinical variables, from going unexploited during prediction. Moreover, only the block forest method outperformed the Cox model on average, and only slightly. This indicates, as a by-product of our study, that in the considered TCGA studies the utility of multi-omics data for prediction purposes was limited. Contact: moritz.herrmann@stat.uni-muenchen.de. Supplementary information: Supplementary data are available at Briefings in Bioinformatics online. All analyses are reproducible using R code freely available on GitHub.
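A sketch of the benchmark's evaluation loop, assuming the scikit-survival package and simulated data; the paper scored with Uno's C-index and the integrated Brier score, whereas this sketch uses Harrell's concordance index for brevity.

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold
from sksurv.linear_model import CoxPHSurvivalAnalysis
from sksurv.metrics import concordance_index_censored

rng = np.random.default_rng(0)
n, p = 200, 30
X = rng.normal(size=(n, p))
risk = X[:, 0] - 0.5 * X[:, 1]                               # two informative variables
time = rng.exponential(scale=np.exp(-risk))                  # event times
cens = rng.exponential(scale=np.exp(-risk).mean(), size=n)   # censoring times
event = time <= cens
y = np.empty(n, dtype=[("event", bool), ("time", float)])
y["event"], y["time"] = event, np.minimum(time, cens)

# Several repetitions of 5-fold CV, as in the benchmark study.
cv = RepeatedKFold(n_splits=5, n_repeats=5, random_state=1)
scores = []
for train, test in cv.split(X):
    model = CoxPHSurvivalAnalysis().fit(X[train], y[train])
    c = concordance_index_censored(y["event"][test], y["time"][test],
                                   model.predict(X[test]))[0]
    scores.append(c)
print(f"mean C-index over repeated CV: {np.mean(scores):.3f}")
```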
Affiliation(s)
- Moritz Herrmann
- Department of Statistics, Ludwig Maximilian University, Munich, 80539, Germany
- Philipp Probst
- Institute for Medical Information Processing, Biometry, and Epidemiology, Ludwig Maximilian University, Munich, 81377, Germany
- Roman Hornung
- Institute for Medical Information Processing, Biometry, and Epidemiology, Ludwig Maximilian University, Munich, 81377, Germany
- Vindi Jurinovic
- Institute for Medical Information Processing, Biometry, and Epidemiology, Ludwig Maximilian University, Munich, 81377, Germany
- Anne-Laure Boulesteix
- Institute for Medical Information Processing, Biometry, and Epidemiology, Ludwig Maximilian University, Munich, 81377, Germany
3
Tibshirani RJ, Rosset S. Excess Optimism: How Biased is the Apparent Error of an Estimator Tuned by SURE? J Am Stat Assoc 2019. [DOI: 10.1080/01621459.2018.1429276]
Affiliation(s)
- Ryan J. Tibshirani
- Department of Statistics and Machine Learning Department, Carnegie Mellon University, Pittsburgh, PA
- Saharon Rosset
- Department of Statistics, Tel Aviv University, Tel Aviv, Israel
4
On the overestimation of random forest's out-of-bag error. PLoS One 2018; 13:e0201904. [PMID: 30080866 PMCID: PMC6078316 DOI: 10.1371/journal.pone.0201904]
Abstract
The ensemble method random forests has become a popular classification tool in bioinformatics and related fields. The out-of-bag error is an error estimation technique often used to evaluate the accuracy of a random forest and to select appropriate values for tuning parameters, such as the number of candidate predictors that are randomly drawn for a split, referred to as mtry. However, for binary classification problems with metric predictors it has been shown that the out-of-bag error can overestimate the true prediction error depending on the choice of random forest parameters. Based on simulated and real data, this paper aims to identify settings for which this overestimation is likely. It is, moreover, questionable whether the out-of-bag error can be used in classification tasks for selecting tuning parameters like mtry, because the overestimation is seen to depend on the parameter mtry. The simulation-based and real-data-based studies with metric predictor variables performed in this paper show that the overestimation is largest in balanced settings and in settings with few observations, a large number of predictor variables, small correlations between predictors and weak effects. There was hardly any impact of the overestimation on tuning parameter selection. However, although the prediction performance of random forests was not substantially affected when using the out-of-bag error for tuning parameter selection in the present studies, one cannot be sure that this applies to all future data. For settings with metric predictor variables it is therefore strongly recommended to use stratified subsampling, with sampling fractions proportional to the class sizes, for both tuning parameter selection and error estimation in random forests. This yielded less biased estimates of the true prediction error. In unbalanced settings, in which there is a strong interest in predicting observations from the smaller classes well, sampling the same number of observations from each class is a promising alternative.
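A sketch of the comparison, with one caveat: scikit-learn's random forest (assumed here) does not expose per-class sampling fractions, so stratified cross-validation stands in for the stratified subsampling the paper recommends.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Small n, large p, weak effects: the regime where OOB overestimation was largest.
X, y = make_classification(n_samples=60, n_features=500, n_informative=5,
                           class_sep=0.5, random_state=0)

rf = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=0).fit(X, y)
print(f"OOB error:           {1 - rf.oob_score_:.3f}")

# Stratification keeps the class proportions intact in every training subset.
cv_acc = cross_val_score(RandomForestClassifier(n_estimators=500, random_state=0),
                         X, y, cv=StratifiedKFold(5, shuffle=True, random_state=0))
print(f"stratified CV error: {1 - cv_acc.mean():.3f}")
```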
5
Tsamardinos I, Greasidou E, Borboudakis G. Bootstrapping the out-of-sample predictions for efficient and accurate cross-validation. Mach Learn 2018; 107:1895-1922. [PMID: 30393425 PMCID: PMC6191021 DOI: 10.1007/s10994-018-5714-4]
Abstract
Cross-Validation (CV), and out-of-sample performance-estimation protocols in general, are often employed both for (a) selecting the optimal combination of algorithms and values of hyper-parameters (called a configuration) for producing the final predictive model, and (b) estimating the predictive performance of the final model. However, the cross-validated performance of the best configuration is optimistically biased. We present an efficient bootstrap method that corrects for the bias, called Bootstrap Bias Corrected CV (BBC-CV). BBC-CV's main idea is to bootstrap the whole process of selecting the best-performing configuration on the out-of-sample predictions of each configuration, without additional training of models. In comparison to the alternatives, namely nested cross-validation (Varma and Simon in BMC Bioinform 7(1):91, 2006) and a method by Tibshirani and Tibshirani (Ann Appl Stat 822-829, 2009), BBC-CV is computationally more efficient, has smaller variance and bias, and is applicable to any metric of performance (accuracy, AUC, concordance index, mean squared error). Subsequently, we again employ the idea of bootstrapping the out-of-sample predictions to speed up the CV process. Specifically, using a bootstrap-based statistical criterion we stop training models on new folds of configurations that are (with high probability) inferior. We name this method Bootstrap Bias Corrected with Dropping CV (BBCD-CV); it is both efficient and provides accurate performance estimates.
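A sketch of BBC-CV's main idea as stated in the abstract: bootstrap the configuration-selection step on the matrix of pooled out-of-sample predictions, scoring each bootstrap winner on the rows it did not see (illustrative Python, not the authors' implementation).

```python
import numpy as np

def bbc_cv(out_of_sample_pred, y, metric, B=1000, rng=None):
    """Bootstrap Bias Corrected CV (sketch of the main idea).

    out_of_sample_pred: (n_samples, n_configurations) matrix of pooled
    cross-validated predictions, one column per configuration.
    """
    rng = np.random.default_rng(rng)
    n, m = out_of_sample_pred.shape
    estimates = []
    for _ in range(B):
        b = rng.integers(0, n, size=n)            # bootstrap sample of row indices
        oob = np.setdiff1d(np.arange(n), b)       # rows left out of the bootstrap
        if oob.size == 0:
            continue
        # Select the winning configuration on the bootstrap rows...
        perf_in = [metric(y[b], out_of_sample_pred[b, j]) for j in range(m)]
        best = int(np.argmax(perf_in))
        # ...but score it on the held-out rows, removing the winner's-curse bias.
        estimates.append(metric(y[oob], out_of_sample_pred[oob, best]))
    return float(np.mean(estimates))

# Example with accuracy as the metric and 20 signal-free configurations:
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=100)
preds = rng.integers(0, 2, size=(100, 20))        # random predictions, true accuracy 0.5
acc = lambda t, p: float(np.mean(t == p))
print(f"naive best-config accuracy: {max(acc(y, preds[:, j]) for j in range(20)):.3f}")
print(f"BBC-CV corrected estimate:  {bbc_cv(preds, y, acc, rng=1):.3f}")  # near 0.5
```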
Affiliation(s)
- Ioannis Tsamardinos
- Computer Science Department, University of Crete and Gnosis Data Analysis PC, Heraklion, Greece
- Elissavet Greasidou
- Computer Science Department, University of Crete and Gnosis Data Analysis PC, Heraklion, Greece
- Giorgos Borboudakis
- Computer Science Department, University of Crete and Gnosis Data Analysis PC, Heraklion, Greece
6
Meisner A, Parikh CR, Kerr KF. Using ordinal outcomes to construct and select biomarker combinations for single-level prediction. Diagn Progn Res 2018; 2:8. [PMID: 31093558 PMCID: PMC6460803 DOI: 10.1186/s41512-018-0028-3]
Abstract
BACKGROUND Biomarker studies may involve an ordinal outcome, such as no, mild, or severe disease. There is often interest in predicting one particular level of the outcome due to its clinical significance. METHODS A simple approach to constructing biomarker combinations in this context involves dichotomizing the outcome and using a binary logistic regression model. We assessed whether more sophisticated methods offer advantages over this simple approach. It is often necessary to select among several candidate biomarker combinations. One strategy involves selecting a combination based on its ability to predict the outcome level of interest. We propose an algorithm that leverages the ordinal outcome to inform combination selection. We apply this algorithm to data from a study of acute kidney injury after cardiac surgery, where kidney injury may be absent, mild, or severe. RESULTS Using more sophisticated modeling approaches to construct combinations provided gains over the simple binary logistic regression approach in specific settings. In the examples considered, the proposed algorithm for combination selection tended to reduce the impact of bias due to selection and to provide combinations with improved performance. CONCLUSIONS Methods that utilize the ordinal nature of the outcome in the construction and/or selection of biomarker combinations have the potential to yield better combinations.
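A sketch of the contrast the abstract draws, assuming statsmodels (version 0.13 or later) for the cumulative-logit model; the data and cutoffs are invented for illustration.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.miscmodels.ordinal_model import OrderedModel

rng = np.random.default_rng(0)
n = 300
x1, x2 = rng.normal(size=n), rng.normal(size=n)      # two candidate biomarkers
latent = 1.0 * x1 + 0.5 * x2 + rng.logistic(size=n)
y = np.digitize(latent, [-0.5, 1.5])                 # 0 = none, 1 = mild, 2 = severe

# Simple approach: dichotomize at the clinically important level (severe vs. not)
# and fit a binary logistic regression.
X = sm.add_constant(np.column_stack([x1, x2]))
binary_fit = sm.Logit((y == 2).astype(int), X).fit(disp=0)

# Ordinal alternative: a cumulative-logit (proportional odds) model that uses
# all three outcome levels to estimate the biomarker combination.
ordinal_fit = OrderedModel(y, np.column_stack([x1, x2]),
                           distr="logit").fit(method="bfgs", disp=0)
print("binary combination: ", binary_fit.params[1:])   # biomarker coefficients
print("ordinal combination:", ordinal_fit.params[:2])  # biomarker coefficients
```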
Affiliation(s)
- Allison Meisner
- Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA
- Chirag R. Parikh
- Program of Applied Translational Research, Department of Medicine, Yale School of Medicine, New Haven, CT, USA
- Department of Internal Medicine, Veterans Affairs Medical Center, West Haven, CT USA
- Kathleen F. Kerr
- Department of Biostatistics, University of Washington, Seattle, WA, USA
7
Methodological issues in current practice may lead to bias in the development of biomarker combinations for predicting acute kidney injury. Kidney Int 2017; 89:429-38. [PMID: 26398494 PMCID: PMC4805513 DOI: 10.1038/ki.2015.283]
Abstract
Individual biomarkers of renal injury are only modestly predictive of acute kidney injury (AKI). Using multiple biomarkers has the potential to improve predictive capacity. In this systematic review, we assessed the statistical methods of articles developing biomarker combinations to predict acute kidney injury. We identified and described three potential sources of bias (resubstitution bias, model selection bias and bias due to center differences) that may compromise the development of biomarker combinations. Fifteen studies reported developing kidney injury biomarker combinations for the prediction of AKI after cardiac surgery (8 articles), in the intensive care unit (4 articles) or in other settings (3 articles). All studies were susceptible to at least one source of bias and did not account for or acknowledge the bias. Inadequate reporting often hindered our assessment of the articles. We then evaluated, when possible (7 articles), the performance of published biomarker combinations in the TRIBE-AKI cardiac surgery cohort. Predictive performance was markedly attenuated in six out of seven cases. Such deficiencies in analysis and reporting are avoidable, and care should be taken to provide accurate estimates of risk prediction model performance. Rigorous design, analysis and reporting of biomarker combination studies are essential to realizing the promise of biomarkers in clinical practice.
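Resubstitution bias, the first of the three sources, is easy to demonstrate; a sketch with scikit-learn and pure-noise biomarkers (illustrative, not data from the review):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
n, p = 80, 30                                   # few patients, many candidate biomarkers
X = rng.normal(size=(n, p))                     # pure noise: the true AUC is 0.5
y = rng.integers(0, 2, size=n)

model = LogisticRegression(max_iter=1000).fit(X, y)
apparent = roc_auc_score(y, model.decision_function(X))   # resubstitution estimate

oos = cross_val_predict(LogisticRegression(max_iter=1000), X, y,
                        cv=5, method="decision_function")
honest = roc_auc_score(y, oos)                  # out-of-sample estimate

print(f"apparent (resubstitution) AUC: {apparent:.2f}")   # optimistic, well above 0.5
print(f"cross-validated AUC:           {honest:.2f}")     # near 0.5, as it should be
```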
8
Jong VL, Ahout IML, van den Ham HJ, Jans J, Zaaraoui-Boutahar F, Zomer A, Simonetti E, Bijl MA, Brand HK, van IJcken WFJ, de Jonge MI, Fraaij PL, de Groot R, Osterhaus ADME, Eijkemans MJ, Ferwerda G, Andeweg AC. Transcriptome assists prognosis of disease severity in respiratory syncytial virus infected infants. Sci Rep 2016; 6:36603. [PMID: 27833115 PMCID: PMC5105123 DOI: 10.1038/srep36603]
Abstract
Respiratory syncytial virus (RSV) causes infections that range from the common cold to severe lower respiratory tract infection requiring high-level medical care. Predicting the course of disease in individual patients remains challenging at the first visit to the pediatric ward, and RSV infections may rapidly progress to severe disease. In this study we investigate whether there exists a genomic signature that can accurately predict the course of RSV infection. We used early blood microarray transcriptome profiles from 39 hospitalized infants who were followed until recovery and whose level of disease severity was determined retrospectively. Applying support vector machine learning to transcriptomic data standardized by age and sex, an 84-gene signature was identified that discriminated hospitalized infants with eventually less severe RSV infection from infants who suffered the most severe RSV disease. This signature yielded an area under the receiver operating characteristic curve (AUC) of 0.966 using leave-one-out cross-validation on the experimental data and an AUC of 0.858 on an independent validation cohort consisting of 53 infants. A combination of the gene signature with age and sex yielded an AUC of 0.971. Thus, the presented signature may serve as the basis to develop a prognostic test to support clinical management of RSV patients.
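A sketch of the evaluation scheme (scikit-learn assumed; simulated stand-in data, and the paper's gene-selection step is omitted):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score

# 39 "infants", high-dimensional expression profiles, binary severity outcome.
X, y = make_classification(n_samples=39, n_features=500, n_informative=20, random_state=0)

clf = make_pipeline(StandardScaler(), SVC(kernel="linear"))
# Leave-one-out: each infant's score comes from a model trained on the other 38.
scores = cross_val_predict(clf, X, y, cv=LeaveOneOut(), method="decision_function")
print(f"LOOCV AUC: {roc_auc_score(y, scores):.3f}")
```

Note that any gene selection must also happen inside each training fold; selecting a signature on all 39 infants first would inflate the LOOCV AUC (compare the incomplete-CV entry below).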
Affiliation(s)
- Victor L. Jong
- Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, Utrecht, The Netherlands
- Department of Viroscience, Erasmus Medical Center, Rotterdam, The Netherlands
- Inge M. L. Ahout
- Department of Pediatrics, Laboratory of Pediatric Infectious Diseases, Radboud Institute for Molecular Life Sciences, Radboud University Medical Center, Nijmegen, The Netherlands
- Jop Jans
- Department of Pediatrics, Laboratory of Pediatric Infectious Diseases, Radboud Institute for Molecular Life Sciences, Radboud University Medical Center, Nijmegen, The Netherlands
- Aldert Zomer
- Department of Pediatrics, Laboratory of Pediatric Infectious Diseases, Radboud Institute for Molecular Life Sciences, Radboud University Medical Center, Nijmegen, The Netherlands
- Elles Simonetti
- Department of Pediatrics, Laboratory of Pediatric Infectious Diseases, Radboud Institute for Molecular Life Sciences, Radboud University Medical Center, Nijmegen, The Netherlands
- Maarten A. Bijl
- Department of Viroscience, Erasmus Medical Center, Rotterdam, The Netherlands
- H. Kim Brand
- Department of Pediatrics, Laboratory of Pediatric Infectious Diseases, Radboud Institute for Molecular Life Sciences, Radboud University Medical Center, Nijmegen, The Netherlands
- Marien I. de Jonge
- Department of Pediatrics, Laboratory of Pediatric Infectious Diseases, Radboud Institute for Molecular Life Sciences, Radboud University Medical Center, Nijmegen, The Netherlands
- Pieter L. Fraaij
- Department of Viroscience, Erasmus Medical Center, Rotterdam, The Netherlands
- Department of Pediatrics, Erasmus Medical Center, Rotterdam, The Netherlands
- Ronald de Groot
- Department of Pediatrics, Laboratory of Pediatric Infectious Diseases, Radboud Institute for Molecular Life Sciences, Radboud University Medical Center, Nijmegen, The Netherlands
- Albert D. M. E. Osterhaus
- Department of Viroscience, Erasmus Medical Center, Rotterdam, The Netherlands
- Research Institute for Infectious Diseases and Zoonoses, Veterinary University Hannover, Germany
- Marinus J. Eijkemans
- Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, Utrecht, The Netherlands
- Gerben Ferwerda
- Department of Pediatrics, Laboratory of Pediatric Infectious Diseases, Radboud Institute for Molecular Life Sciences, Radboud University Medical Center, Nijmegen, The Netherlands
- Arno C. Andeweg
- Department of Viroscience, Erasmus Medical Center, Rotterdam, The Netherlands
9
Jong VL, Novianti PW, Roes KCB, Eijkemans MJC. Selecting a classification function for class prediction with gene expression data. Bioinformatics 2016; 32:1814-22. [PMID: 26873933 DOI: 10.1093/bioinformatics/btw034]
Abstract
MOTIVATION Class prediction with gene expression data is widely used to generate diagnostic and/or prognostic models. The literature reveals that classification functions perform differently across gene expression datasets. The question of which classification function should be used for a given dataset remains to be answered. In this study, a predictive model for choosing an optimal function for class prediction on a given dataset was devised. RESULTS To achieve this, gene expression data were simulated for different values of gene-pair correlations, sample size, gene variances, numbers of differentially expressed genes and fold changes. For each simulated dataset, ten classifiers were built and evaluated using ten classification functions. The resulting accuracies from 1152 different simulation scenarios by ten classification functions were then modeled using a linear mixed effects regression on the studied data characteristics, yielding a model that predicts the accuracy of the functions on given data. An application of our model to eight real-life datasets showed positive correlations (0.33-0.82) between the predicted and expected accuracies. CONCLUSION The predictive model presented here might serve as a guide for choosing an optimal classification function, among the ten studied functions, for any given gene expression dataset. AVAILABILITY AND IMPLEMENTATION The R source code for the analysis and the R package 'SPreFuGED' are available at Bioinformatics online. CONTACT v.l.jong@umcutecht.nl SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
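A sketch of the modeling step, with an invented results table (all variable names are illustrative) and statsmodels for the linear mixed effects regression; the paper's actual fixed- and random-effects structure is an assumption here.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical simulation results: accuracy of several classification functions
# under varying data characteristics.
rng = np.random.default_rng(0)
rows = []
for func in ["lda", "svm", "rf", "knn"]:
    for scenario in range(50):
        n, p = rng.integers(20, 200), rng.integers(100, 5000)
        fold_change = rng.uniform(0.5, 3.0)
        acc = (0.6 + 0.05 * np.log(n) - 0.01 * np.log(p)
               + 0.03 * fold_change + rng.normal(scale=0.05))
        rows.append(dict(func=func, n=n, p=p, fold_change=fold_change, accuracy=acc))
df = pd.DataFrame(rows)

# Mixed model: data characteristics as fixed effects, a random intercept per
# classification function.
fit = smf.mixedlm("accuracy ~ np.log(n) + np.log(p) + fold_change",
                  df, groups=df["func"]).fit()
print(fit.summary())   # fitted effects predict accuracy for new data characteristics
```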
Affiliation(s)
- Victor L Jong
- Biostatistics & Research Support, Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, 3508 GA, Utrecht, The Netherlands; Viroscience Lab, Erasmus Medical Center Rotterdam, Rotterdam, CE 3015, The Netherlands
- Putri W Novianti
- Biostatistics & Research Support, Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, 3508 GA, Utrecht, The Netherlands; Epidemiology & Biostatistics Department, Vrije University Medical Center Amsterdam, HV Amsterdam 1081, The Netherlands
- Kit C B Roes
- Biostatistics & Research Support, Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, 3508 GA, Utrecht, The Netherlands
- Marinus J C Eijkemans
- Biostatistics & Research Support, Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, 3508 GA, Utrecht, The Netherlands
10
Hornung R, Bernau C, Truntzer C, Wilson R, Stadler T, Boulesteix AL. A measure of the impact of CV incompleteness on prediction error estimation with application to PCA and normalization. BMC Med Res Methodol 2015; 15:95. [PMID: 26537575 PMCID: PMC4634762 DOI: 10.1186/s12874-015-0088-9]
Abstract
Background In applications of supervised statistical learning in the biomedical field it is necessary to assess the prediction error of the respective prediction rules. Often, data preparation steps are performed on the dataset—in its entirety—before training/test set based prediction error estimation by cross-validation (CV)—an approach referred to as “incomplete CV”. Whether incomplete CV can result in an optimistically biased error estimate depends on the data preparation step under consideration. Several empirical studies have investigated the extent of bias induced by performing preliminary supervised variable selection before CV. To our knowledge, however, the potential bias induced by other data preparation steps has not yet been examined in the literature. In this paper we investigate this bias for two common data preparation steps: normalization and principal component analysis for dimension reduction of the covariate space (PCA). Furthermore we obtain preliminary results for the following steps: optimization of tuning parameters, variable filtering by variance and imputation of missing values. Methods We devise the easily interpretable and general measure CVIIM (“CV Incompleteness Impact Measure”) to quantify the extent of bias induced by incomplete CV with respect to a data preparation step of interest. This measure can be used to determine whether a specific data preparation step should, as a general rule, be performed in each CV iteration or whether an incomplete CV procedure would be acceptable in practice. We apply CVIIM to large collections of microarray datasets to answer this question for normalization and PCA. Results Performing normalization on the entire dataset before CV did not result in a noteworthy optimistic bias in any of the investigated cases. In contrast, when performing PCA before CV, medium to strong underestimates of the prediction error were observed in multiple settings. Conclusions While the investigated forms of normalization can be safely performed before CV, PCA has to be performed anew in each CV split to protect against optimistic bias.
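The PCA result is easy to reproduce in a sketch (scikit-learn assumed): fitting PCA on the entire dataset before CV, versus refitting it inside every split.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=50, n_features=1000, n_informative=10, random_state=0)

# Incomplete CV: PCA is fitted on the ENTIRE dataset before cross-validation,
# so every training fold has already "seen" its test fold through the components.
Z = PCA(n_components=10).fit_transform(X)
incomplete = cross_val_score(LogisticRegression(max_iter=1000), Z, y, cv=5)

# Full CV: PCA is refitted inside each CV split, as the paper recommends.
pipe = make_pipeline(PCA(n_components=10), LogisticRegression(max_iter=1000))
full = cross_val_score(pipe, X, y, cv=5)

e_inc, e_full = 1 - incomplete.mean(), 1 - full.mean()
print(f"incomplete CV error: {e_inc:.3f}")   # typically optimistic (too small)
print(f"full CV error:       {e_full:.3f}")
# A CVIIM-style global ratio; the paper's exact definition is an assumption here.
print(f"CVIIM-like measure:  {max(0.0, 1.0 - e_inc / e_full):.3f}")
```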
Affiliation(s)
- Roman Hornung
- Department of Medical Informatics, Biometry and Epidemiology, University of Munich, Marchioninistr. 15, Munich, D-81377, Germany.
- Christoph Bernau
- Department of Medical Informatics, Biometry and Epidemiology, University of Munich, Marchioninistr. 15, Munich, D-81377, Germany; Leibniz Supercomputing Center, Boltzmannstr. 1, Garching, D-85748, Germany
- Caroline Truntzer
- Clinical and Innovation Proteomic Platform, Pôle de Recherche Université de Bourgogne, 15 Bd Maréchal de Lattre de Tassigny, Dijon, F-21000, France
- Rory Wilson
- Department of Medical Informatics, Biometry and Epidemiology, University of Munich, Marchioninistr. 15, Munich, D-81377, Germany
- Thomas Stadler
- Department of Urology, University of Munich, Marchioninistr. 15, Munich, D-81377, Germany
- Anne-Laure Boulesteix
- Department of Medical Informatics, Biometry and Epidemiology, University of Munich, Marchioninistr. 15, Munich, D-81377, Germany
11
Safonov I, Gartseev I, Pikhletsky M, Tishutin O, Bailey MJA. An approach for model assessment for activity recognition. Pattern Recognition and Image Analysis 2015. [DOI: 10.1134/s1054661815020224]
12
Jung SH, Chen Y, Ahn H. Type I error control for tree classification. Cancer Inform 2014; 13:11-8. [PMID: 25452689 PMCID: PMC4237155 DOI: 10.4137/cin.s16342]
Abstract
Binary tree classification is useful for classifying a population based on the levels of an outcome variable associated with chosen predictors. Often we start a classification with a large number of candidate predictors, and each predictor takes a number of different cutoff values. Because of these types of multiplicity, the binary tree classification method is subject to a severely inflated type I error probability. Nonetheless, there have not been many publications addressing this issue. In this paper, we propose a binary tree classification method that controls the probability of accepting a predictor below a certain level, say 5%.
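The multiplicity problem can be made concrete with a permutation-based sketch; this illustrates controlling the type I error of split selection, not the authors' exact procedure.

```python
import numpy as np

def max_chi2(x, y):
    """Best chi-square statistic over all cutoffs of a single predictor."""
    p = y.mean()
    best = 0.0
    for c in np.unique(x)[:-1]:
        mask = x <= c
        n1, n2 = mask.sum(), (~mask).sum()
        p1, p2 = y[mask].mean(), y[~mask].mean()
        stat = (p1 - p2) ** 2 * (n1 * n2 / (n1 + n2)) / (p * (1 - p))
        best = max(best, stat)
    return best

def split_pvalue(x, y, n_perm=500, rng=None):
    """Permutation p-value for the best cutoff, adjusted for the multiplicity
    of cutoff choices: the null distribution is that of the MAXIMAL statistic."""
    rng = np.random.default_rng(rng)
    observed = max_chi2(x, y)
    null = np.array([max_chi2(x, rng.permutation(y)) for _ in range(n_perm)])
    return float(np.mean(null >= observed))

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=100).astype(float)
x_noise = rng.normal(size=100)               # predictor unrelated to the outcome
x_signal = rng.normal(size=100) + 0.8 * y    # predictor related to the outcome
print(f"adjusted p, noise predictor:  {split_pvalue(x_noise, y, rng=1):.3f}")
print(f"adjusted p, signal predictor: {split_pvalue(x_signal, y, rng=1):.3f}")
# A predictor would be accepted for splitting only if its adjusted p-value < 0.05.
```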
Affiliation(s)
- Sin-Ho Jung
- Department of Biostatistics and Bioinformatics, Duke University, Durham, NC 27710, USA
- Yong Chen
- Department of Applied Mathematics and Statistics, Stony Brook University, Stony Brook, NY 11794-3600, USA
- Hongshik Ahn
- Department of Applied Mathematics and Statistics, Stony Brook University, Stony Brook, NY 11794-3600, USA
13
Ding Y, Tang S, Liao SG, Jia J, Oesterreich S, Lin Y, Tseng GC. Bias correction for selecting the minimal-error classifier from many machine learning models. Bioinformatics 2014; 30:3152-8. [PMID: 25086004 DOI: 10.1093/bioinformatics/btu520]
Abstract
MOTIVATION Supervised machine learning is commonly applied in genomic research to construct a classifier from the training data that is generalizable to predict independent testing data. When test datasets are not available, cross-validation is commonly used to estimate the error rate. Many machine learning methods are available, and it is well known that no universally best method exists in general. It has been a common practice to apply many machine learning methods and report the method that produces the smallest cross-validation error rate. Theoretically, such a procedure produces a selection bias. Consequently, many clinical studies with moderate sample sizes (e.g. n = 30-60) risk reporting a falsely small cross-validation error rate that could not be validated later in independent cohorts. RESULTS In this article, we illustrated the probabilistic framework of the problem and explored the statistical and asymptotic properties. We proposed a new bias correction method based on learning curve fitting by the inverse power law (IPL) and compared it with three existing methods: nested cross-validation, weighted mean correction and the Tibshirani-Tibshirani procedure. All methods were compared on simulated datasets, five moderate-size real datasets and two large breast cancer datasets. The results showed that IPL outperforms the other methods in bias correction, with smaller variance, and it has the additional advantage of extrapolating error estimates to larger sample sizes, a practical feature for recommending whether more samples should be recruited to improve the classifier and its accuracy. An R package 'MLbias' and all source files are publicly available. AVAILABILITY AND IMPLEMENTATION tsenglab.biostat.pitt.edu/software.htm. CONTACT ctseng@pitt.edu SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
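The learning-curve idea can be sketched with scipy; the exact IPL parameterization used in the paper is an assumption here, and the error rates are invented for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit

# Cross-validated error rates measured at increasing training-set sizes.
n = np.array([20, 30, 40, 50, 60])
err = np.array([0.38, 0.33, 0.30, 0.285, 0.275])

def inverse_power_law(n, a, b, c):
    # err(n) = a + b * n^(-c): 'a' plays the role of the asymptotic error.
    return a + b * n ** (-c)

(a, b, c), _ = curve_fit(inverse_power_law, n, err, p0=(0.2, 1.0, 0.5), maxfev=10000)
print(f"asymptotic error estimate a = {a:.3f}")
print(f"extrapolated error at n=200: {inverse_power_law(200, a, b, c):.3f}")
```

The fitted curve supports exactly the practical decision the abstract names: whether recruiting more samples is likely to improve the classifier.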
Affiliation(s)
- Ying Ding
- Joint Carnegie Mellon University-University of Pittsburgh Ph.D. Program in Computational Biology, Department of Biostatistics, Graduate School of Public Health, University of Pittsburgh, Pittsburgh, PA 15261, USA and Magee-Womens Research Institute, Pittsburgh, PA 15213, USA
- Shaowu Tang
- Joint Carnegie Mellon University-University of Pittsburgh Ph.D. Program in Computational Biology, Department of Biostatistics, Graduate School of Public Health, University of Pittsburgh, Pittsburgh, PA 15261, USA and Magee-Womens Research Institute, Pittsburgh, PA 15213, USA
- Serena G Liao
- Joint Carnegie Mellon University-University of Pittsburgh Ph.D. Program in Computational Biology, Department of Biostatistics, Graduate School of Public Health, University of Pittsburgh, Pittsburgh, PA 15261, USA and Magee-Womens Research Institute, Pittsburgh, PA 15213, USA
- Jia Jia
- Joint Carnegie Mellon University-University of Pittsburgh Ph.D. Program in Computational Biology, Department of Biostatistics, Graduate School of Public Health, University of Pittsburgh, Pittsburgh, PA 15261, USA and Magee-Womens Research Institute, Pittsburgh, PA 15213, USA
- Steffi Oesterreich
- Joint Carnegie Mellon University-University of Pittsburgh Ph.D. Program in Computational Biology, Department of Biostatistics, Graduate School of Public Health, University of Pittsburgh, Pittsburgh, PA 15261, USA and Magee-Womens Research Institute, Pittsburgh, PA 15213, USA
- Yan Lin
- Joint Carnegie Mellon University-University of Pittsburgh Ph.D. Program in Computational Biology, Department of Biostatistics, Graduate School of Public Health, University of Pittsburgh, Pittsburgh, PA 15261, USA and Magee-Womens Research Institute, Pittsburgh, PA 15213, USA
- George C Tseng
- Joint Carnegie Mellon University-University of Pittsburgh Ph.D. Program in Computational Biology, Department of Biostatistics, Graduate School of Public Health, University of Pittsburgh, Pittsburgh, PA 15261, USA and Magee-Womens Research Institute, Pittsburgh, PA 15213, USA
14
Krstajic D, Buturovic LJ, Leahy DE, Thomas S. Cross-validation pitfalls when selecting and assessing regression and classification models. J Cheminform 2014; 6:10. [PMID: 24678909 PMCID: PMC3994246 DOI: 10.1186/1758-2946-6-10]
Abstract
BACKGROUND We address the problem of selecting and assessing classification and regression models using cross-validation. Current state-of-the-art methods can yield models with high variance, rendering them unsuitable for a number of practical applications including QSAR. In this paper we describe and evaluate best practices which improve reliability and increase confidence in selected models. A key operational component of the proposed methods is cloud computing which enables routine use of previously infeasible approaches. METHODS We describe in detail an algorithm for repeated grid-search V-fold cross-validation for parameter tuning in classification and regression, and we define a repeated nested cross-validation algorithm for model assessment. As regards variable selection and parameter tuning we define two algorithms (repeated grid-search cross-validation and double cross-validation), and provide arguments for using the repeated grid-search in the general case. RESULTS We show results of our algorithms on seven QSAR datasets. The variation of the prediction performance, which is the result of choosing different splits of the dataset in V-fold cross-validation, needs to be taken into account when selecting and assessing classification and regression models. CONCLUSIONS We demonstrate the importance of repeating cross-validation when selecting an optimal model, as well as the importance of repeating nested cross-validation when assessing a prediction error.
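A sketch of repeated nested cross-validation with scikit-learn (an illustrative stand-in for the paper's algorithms): grid-search CV in the inner loop selects the hyper-parameters, and a repeated outer CV assesses the whole selection procedure, averaging out the variation caused by any particular choice of fold splits.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import (GridSearchCV, RepeatedStratifiedKFold,
                                     cross_val_score)
from sklearn.svm import SVC

X, y = make_classification(n_samples=120, n_features=50, random_state=0)

# Inner loop: grid-search CV tunes the hyper-parameters.
inner = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": ["scale", 0.01]}, cv=5)

# Outer loop: repeated CV assesses the entire tuning-plus-fitting procedure.
outer = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=1)
scores = cross_val_score(inner, X, y, cv=outer)
print(f"repeated nested CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```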
Affiliation(s)
- Damjan Krstajic
- Research Centre for Cheminformatics, Jasenova 7, 11030, Beograd, Serbia; Laboratory for Molecular Biomedicine, Institute of Molecular Genetics and Genetic Engineering, University of Belgrade, Vojvode Stepe 444a, 11010, Beograd, Serbia; Clinical Persona Inc, 932 Mouton Circle, East Palo Alto, CA, 94303, USA
- David E Leahy
- Molplex Pharmaceuticals, Alderly Park, Macclesfield, SK10 4TF, UK
- Simon Thomas
- Cyprotex Discovery Ltd, 15 Beech Lane, Macclesfield, SK10 2DR, UK
15
Boulesteix AL, Schmid M. Machine learning versus statistical modeling. Biom J 2014; 56:588-93. [DOI: 10.1002/bimj.201300226]
Affiliation(s)
- Anne-Laure Boulesteix
- Department of Medical Informatics, Biometry and Epidemiology, University of Munich, Germany