1
Tsamardinos I. Don't lose samples to estimation. Patterns (N Y) 2022; 3:100612. [PMID: 36569551 PMCID: PMC9782254 DOI: 10.1016/j.patter.2022.100612]
Abstract
In a typical predictive modeling task, we are asked to produce a final predictive model to employ operationally for predictions, as well as an estimate of its out-of-sample predictive performance. Typically, analysts hold out a portion of the available data, called a Test set, to estimate the model predictive performance on unseen (out-of-sample) records, thus "losing these samples to estimation." However, this practice is unacceptable when the total sample size is low. To avoid losing data to estimation, we need a shift in our perspective: we do not estimate the performance of a specific model instance; we estimate the performance of the pipeline that produces the model. This pipeline is applied on all available samples to produce the final model; no samples are lost to estimation. An estimate of its performance is provided by training the same pipeline on subsets of the samples. When multiple pipelines are tried, additional considerations that correct for the "winner's curse" need to be in place.
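A minimal sketch of this pipeline-centric view (scikit-learn is an assumed, illustrative choice, not code from the article): the cross-validation estimate is attached to the whole pipeline, and the final model is that same pipeline refit on every available sample.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=80, n_features=200, random_state=0)  # small-sample regime

# The pipeline, not a single fitted model instance, is the object whose performance we estimate.
pipe = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression(max_iter=1000))])

# Performance estimate: train the same pipeline on subsets of the samples (CV folds).
scores = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc")
print(f"estimated AUC of the pipeline: {scores.mean():.3f}")

# Final model: apply the pipeline to ALL samples; no samples are lost to estimation.
final_model = pipe.fit(X, y)
```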
Affiliation(s)
- Ioannis Tsamardinos
- Computer Science Department, University of Crete, Heraklion, Greece; JADBio – Gnosis DA S.A., Heraklion, Greece; Institute of Applied and Computational Mathematics, Foundation for Research and Technology, Hellas, Heraklion, Greece. Corresponding author.
2
Herrmann M, Probst P, Hornung R, Jurinovic V, Boulesteix AL. Large-scale benchmark study of survival prediction methods using multi-omics data. Brief Bioinform 2020; 22:5895463. [PMID: 32823283 PMCID: PMC8138887 DOI: 10.1093/bib/bbaa167]
Abstract
Multi-omics data, that is, datasets containing different types of high-dimensional molecular variables, are increasingly often generated for the investigation of various diseases. Nevertheless, questions remain regarding the usefulness of multi-omics data for the prediction of disease outcomes such as survival time. It is also unclear which methods are most appropriate to derive such prediction models. We aim to give some answers to these questions through a large-scale benchmark study using real data. Different prediction methods from machine learning and statistics were applied on 18 multi-omics cancer datasets (35 to 1000 observations, up to 100 000 variables) from the database 'The Cancer Genome Atlas' (TCGA). The considered outcome was the (censored) survival time. Eleven methods based on boosting, penalized regression and random forest were compared, comprising both methods that do and that do not take the group structure of the omics variables into account. The Kaplan-Meier estimate and a Cox model using only clinical variables were used as reference methods. The methods were compared using several repetitions of 5-fold cross-validation. Uno's C-index and the integrated Brier score served as performance metrics. The results indicate that methods taking the multi-omics structure into account have a slightly better prediction performance. Taking this structure into account can keep the predictive information in low-dimensional groups, especially clinical variables, from going unexploited during prediction. Moreover, only the block forest method outperformed the Cox model on average, and only slightly. This indicates, as a by-product of our study, that in the considered TCGA studies the utility of multi-omics data for prediction purposes was limited. Contact: moritz.herrmann@stat.uni-muenchen.de. Supplementary information: Supplementary data are available at Briefings in Bioinformatics online. All analyses are reproducible using R code freely available on GitHub.
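A sketch of the benchmark's evaluation loop, assuming the scikit-survival package and simulated data; the paper scored with Uno's C-index and the integrated Brier score, whereas this sketch uses Harrell's concordance index for brevity.

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold
from sksurv.linear_model import CoxPHSurvivalAnalysis
from sksurv.metrics import concordance_index_censored

rng = np.random.default_rng(0)
n, p = 200, 30
X = rng.normal(size=(n, p))
risk = X[:, 0] - 0.5 * X[:, 1]                               # two informative variables
time = rng.exponential(scale=np.exp(-risk))                  # event times
cens = rng.exponential(scale=np.exp(-risk).mean(), size=n)   # censoring times
event = time <= cens
y = np.empty(n, dtype=[("event", bool), ("time", float)])
y["event"], y["time"] = event, np.minimum(time, cens)

# Several repetitions of 5-fold CV, as in the benchmark study.
cv = RepeatedKFold(n_splits=5, n_repeats=5, random_state=1)
scores = []
for train, test in cv.split(X):
    model = CoxPHSurvivalAnalysis().fit(X[train], y[train])
    c = concordance_index_censored(y["event"][test], y["time"][test],
                                   model.predict(X[test]))[0]
    scores.append(c)
print(f"mean C-index over repeated CV: {np.mean(scores):.3f}")
```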
Affiliation(s)
- Moritz Herrmann
- Department of Statistics, Ludwig Maximilian University, Munich, 80539, Germany
- Philipp Probst
- Institute for Medical Information Processing, Biometry, and Epidemiology, Ludwig Maximilian University, Munich, 81377, Germany
- Roman Hornung
- Institute for Medical Information Processing, Biometry, and Epidemiology, Ludwig Maximilian University, Munich, 81377, Germany
- Vindi Jurinovic
- Institute for Medical Information Processing, Biometry, and Epidemiology, Ludwig Maximilian University, Munich, 81377, Germany
- Anne-Laure Boulesteix
- Institute for Medical Information Processing, Biometry, and Epidemiology, Ludwig Maximilian University, Munich, 81377, Germany
3
Tibshirani RJ, Rosset S. Excess Optimism: How Biased is the Apparent Error of an Estimator Tuned by SURE? J Am Stat Assoc 2019. [DOI: 10.1080/01621459.2018.1429276]
Affiliation(s)
- Ryan J. Tibshirani
- Department of Statistics and Machine Learning Department, Carnegie Mellon University, Pittsburgh, PA
- Saharon Rosset
- Department of Statistics, Tel Aviv University, Tel Aviv, Israel
4
On the overestimation of random forest's out-of-bag error. PLoS One 2018; 13:e0201904. [PMID: 30080866 PMCID: PMC6078316 DOI: 10.1371/journal.pone.0201904]
Abstract
The ensemble method random forests has become a popular classification tool in bioinformatics and related fields. The out-of-bag error is an error estimation technique often used to evaluate the accuracy of a random forest and to select appropriate values for tuning parameters, such as the number of candidate predictors that are randomly drawn for a split, referred to as mtry. However, for binary classification problems with metric predictors it has been shown that the out-of-bag error can overestimate the true prediction error depending on the choice of random forest parameters. Based on simulated and real data, this paper aims to identify settings for which this overestimation is likely. It is, moreover, questionable whether the out-of-bag error can be used in classification tasks for selecting tuning parameters like mtry, because the overestimation is seen to depend on the parameter mtry. The simulation-based and real-data-based studies with metric predictor variables performed in this paper show that the overestimation is largest in balanced settings and in settings with few observations, a large number of predictor variables, small correlations between predictors and weak effects. There was hardly any impact of the overestimation on tuning parameter selection. However, although the prediction performance of random forests was not substantially affected when using the out-of-bag error for tuning parameter selection in the present studies, one cannot be sure that this applies to all future data. For settings with metric predictor variables it is therefore strongly recommended to use stratified subsampling, with sampling fractions proportional to the class sizes, for both tuning parameter selection and error estimation in random forests. This yielded less biased estimates of the true prediction error. In unbalanced settings, in which there is a strong interest in predicting observations from the smaller classes well, sampling the same number of observations from each class is a promising alternative.
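A sketch of the comparison, with one caveat: scikit-learn's random forest (assumed here) does not expose per-class sampling fractions, so stratified cross-validation stands in for the stratified subsampling the paper recommends.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Small n, large p, weak effects: the regime where OOB overestimation was largest.
X, y = make_classification(n_samples=60, n_features=500, n_informative=5,
                           class_sep=0.5, random_state=0)

rf = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=0).fit(X, y)
print(f"OOB error:           {1 - rf.oob_score_:.3f}")

# Stratification keeps the class proportions intact in every training subset.
cv_acc = cross_val_score(RandomForestClassifier(n_estimators=500, random_state=0),
                         X, y, cv=StratifiedKFold(5, shuffle=True, random_state=0))
print(f"stratified CV error: {1 - cv_acc.mean():.3f}")
```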
5
Tsamardinos I, Greasidou E, Borboudakis G. Bootstrapping the out-of-sample predictions for efficient and accurate cross-validation. Mach Learn 2018; 107:1895-1922. [PMID: 30393425 PMCID: PMC6191021 DOI: 10.1007/s10994-018-5714-4]
Abstract
Cross-Validation (CV), and out-of-sample performance-estimation protocols in general, are often employed both for (a) selecting the optimal combination of algorithms and values of hyper-parameters (called a configuration) for producing the final predictive model, and (b) estimating the predictive performance of the final model. However, the cross-validated performance of the best configuration is optimistically biased. We present an efficient bootstrap method that corrects for the bias, called Bootstrap Bias Corrected CV (BBC-CV). BBC-CV's main idea is to bootstrap the whole process of selecting the best-performing configuration on the out-of-sample predictions of each configuration, without additional training of models. In comparison to the alternatives, namely nested cross-validation (Varma and Simon in BMC Bioinform 7(1):91, 2006) and a method by Tibshirani and Tibshirani (Ann Appl Stat 822-829, 2009), BBC-CV is computationally more efficient, has smaller variance and bias, and is applicable to any metric of performance (accuracy, AUC, concordance index, mean squared error). Subsequently, we again employ the idea of bootstrapping the out-of-sample predictions to speed up the CV process. Specifically, using a bootstrap-based statistical criterion we stop training models on new folds of configurations that are (with high probability) inferior. We name this method Bootstrap Bias Corrected with Dropping CV (BBCD-CV); it is both efficient and provides accurate performance estimates.
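A sketch of BBC-CV's main idea as stated in the abstract: bootstrap the configuration-selection step on the matrix of pooled out-of-sample predictions, scoring each bootstrap winner on the rows it did not see (illustrative Python, not the authors' implementation).

```python
import numpy as np

def bbc_cv(out_of_sample_pred, y, metric, B=1000, rng=None):
    """Bootstrap Bias Corrected CV (sketch of the main idea).

    out_of_sample_pred: (n_samples, n_configurations) matrix of pooled
    cross-validated predictions, one column per configuration.
    """
    rng = np.random.default_rng(rng)
    n, m = out_of_sample_pred.shape
    estimates = []
    for _ in range(B):
        b = rng.integers(0, n, size=n)            # bootstrap sample of row indices
        oob = np.setdiff1d(np.arange(n), b)       # rows left out of the bootstrap
        if oob.size == 0:
            continue
        # Select the winning configuration on the bootstrap rows...
        perf_in = [metric(y[b], out_of_sample_pred[b, j]) for j in range(m)]
        best = int(np.argmax(perf_in))
        # ...but score it on the held-out rows, removing the winner's-curse bias.
        estimates.append(metric(y[oob], out_of_sample_pred[oob, best]))
    return float(np.mean(estimates))

# Example with accuracy as the metric and 20 signal-free configurations:
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=100)
preds = rng.integers(0, 2, size=(100, 20))        # random predictions, true accuracy 0.5
acc = lambda t, p: float(np.mean(t == p))
print(f"naive best-config accuracy: {max(acc(y, preds[:, j]) for j in range(20)):.3f}")
print(f"BBC-CV corrected estimate:  {bbc_cv(preds, y, acc, rng=1):.3f}")  # near 0.5
```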
Affiliation(s)
- Ioannis Tsamardinos
- Computer Science Department, University of Crete and Gnosis Data Analysis PC, Heraklion, Greece
- Elissavet Greasidou
- Computer Science Department, University of Crete and Gnosis Data Analysis PC, Heraklion, Greece
- Giorgos Borboudakis
- Computer Science Department, University of Crete and Gnosis Data Analysis PC, Heraklion, Greece
6
Meisner A, Parikh CR, Kerr KF. Using ordinal outcomes to construct and select biomarker combinations for single-level prediction. Diagn Progn Res 2018; 2:8. [PMID: 31093558 PMCID: PMC6460803 DOI: 10.1186/s41512-018-0028-3]
Abstract
BACKGROUND Biomarker studies may involve an ordinal outcome, such as no, mild, or severe disease. There is often interest in predicting one particular level of the outcome due to its clinical significance. METHODS A simple approach to constructing biomarker combinations in this context involves dichotomizing the outcome and using a binary logistic regression model. We assessed whether more sophisticated methods offer advantages over this simple approach. It is often necessary to select among several candidate biomarker combinations. One strategy involves selecting a combination based on its ability to predict the outcome level of interest. We propose an algorithm that leverages the ordinal outcome to inform combination selection. We apply this algorithm to data from a study of acute kidney injury after cardiac surgery, where kidney injury may be absent, mild, or severe. RESULTS Using more sophisticated modeling approaches to construct combinations provided gains over the simple binary logistic regression approach in specific settings. In the examples considered, the proposed algorithm for combination selection tended to reduce the impact of bias due to selection and to provide combinations with improved performance. CONCLUSIONS Methods that utilize the ordinal nature of the outcome in the construction and/or selection of biomarker combinations have the potential to yield better combinations.
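A sketch of the contrast the abstract draws, assuming statsmodels (version 0.13 or later) for the cumulative-logit model; the data and cutoffs are invented for illustration.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.miscmodels.ordinal_model import OrderedModel

rng = np.random.default_rng(0)
n = 300
x1, x2 = rng.normal(size=n), rng.normal(size=n)      # two candidate biomarkers
latent = 1.0 * x1 + 0.5 * x2 + rng.logistic(size=n)
y = np.digitize(latent, [-0.5, 1.5])                 # 0 = none, 1 = mild, 2 = severe

# Simple approach: dichotomize at the clinically important level (severe vs. not)
# and fit a binary logistic regression.
X = sm.add_constant(np.column_stack([x1, x2]))
binary_fit = sm.Logit((y == 2).astype(int), X).fit(disp=0)

# Ordinal alternative: a cumulative-logit (proportional odds) model that uses
# all three outcome levels to estimate the biomarker combination.
ordinal_fit = OrderedModel(y, np.column_stack([x1, x2]),
                           distr="logit").fit(method="bfgs", disp=0)
print("binary combination: ", binary_fit.params[1:])   # biomarker coefficients
print("ordinal combination:", ordinal_fit.params[:2])  # biomarker coefficients
```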
Affiliation(s)
- Allison Meisner
- Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA
- Chirag R. Parikh
- Program of Applied Translational Research, Department of Medicine, Yale School of Medicine, New Haven, CT, USA
- Department of Internal Medicine, Veterans Affairs Medical Center, West Haven, CT USA
- Kathleen F. Kerr
- Department of Biostatistics, University of Washington, Seattle, WA, USA
7
Methodological issues in current practice may lead to bias in the development of biomarker combinations for predicting acute kidney injury. Kidney Int 2017; 89:429-38. [PMID: 26398494 PMCID: PMC4805513 DOI: 10.1038/ki.2015.283]
Abstract
Individual biomarkers of renal injury are only modestly predictive of acute kidney injury (AKI). Using multiple biomarkers has the potential to improve predictive capacity. In this systematic review, we assessed the statistical methods of articles developing biomarker combinations to predict acute kidney injury. We identified and described three potential sources of bias (resubstitution bias, model selection bias and bias due to center differences) that may compromise the development of biomarker combinations. Fifteen studies reported developing kidney injury biomarker combinations for the prediction of AKI after cardiac surgery (8 articles), in the intensive care unit (4 articles) or in other settings (3 articles). All studies were susceptible to at least one source of bias and did not account for or acknowledge the bias. Inadequate reporting often hindered our assessment of the articles. We then evaluated, when possible (7 articles), the performance of published biomarker combinations in the TRIBE-AKI cardiac surgery cohort. Predictive performance was markedly attenuated in six out of seven cases. Such deficiencies in analysis and reporting are avoidable, and care should be taken to provide accurate estimates of risk prediction model performance. Rigorous design, analysis and reporting of biomarker combination studies are essential to realizing the promise of biomarkers in clinical practice.
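Resubstitution bias, the first of the three sources, is easy to demonstrate; a sketch with scikit-learn and pure-noise biomarkers (illustrative, not data from the review):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
n, p = 80, 30                                   # few patients, many candidate biomarkers
X = rng.normal(size=(n, p))                     # pure noise: the true AUC is 0.5
y = rng.integers(0, 2, size=n)

model = LogisticRegression(max_iter=1000).fit(X, y)
apparent = roc_auc_score(y, model.decision_function(X))   # resubstitution estimate

oos = cross_val_predict(LogisticRegression(max_iter=1000), X, y,
                        cv=5, method="decision_function")
honest = roc_auc_score(y, oos)                  # out-of-sample estimate

print(f"apparent (resubstitution) AUC: {apparent:.2f}")   # optimistic, well above 0.5
print(f"cross-validated AUC:           {honest:.2f}")     # near 0.5, as it should be
```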
8
Jong VL, Ahout IML, van den Ham HJ, Jans J, Zaaraoui-Boutahar F, Zomer A, Simonetti E, Bijl MA, Brand HK, van IJcken WFJ, de Jonge MI, Fraaij PL, de Groot R, Osterhaus ADME, Eijkemans MJ, Ferwerda G, Andeweg AC. Transcriptome assists prognosis of disease severity in respiratory syncytial virus infected infants. Sci Rep 2016; 6:36603. [PMID: 27833115 PMCID: PMC5105123 DOI: 10.1038/srep36603]
Abstract
Respiratory syncytial virus (RSV) causes infections that range from the common cold to severe lower respiratory tract infection requiring high-level medical care. Predicting the course of disease in individual patients remains challenging at the first visit to the pediatric ward, and RSV infections may rapidly progress to severe disease. In this study we investigate whether there exists a genomic signature that can accurately predict the course of RSV infection. We used early blood microarray transcriptome profiles from 39 hospitalized infants who were followed until recovery and whose level of disease severity was determined retrospectively. Applying support vector machine learning to transcriptomic data standardized by age and sex, an 84-gene signature was identified that discriminated hospitalized infants with eventually less severe RSV infection from infants who suffered the most severe RSV disease. This signature yielded an area under the receiver operating characteristic curve (AUC) of 0.966 using leave-one-out cross-validation on the experimental data and an AUC of 0.858 on an independent validation cohort consisting of 53 infants. A combination of the gene signature with age and sex yielded an AUC of 0.971. Thus, the presented signature may serve as the basis to develop a prognostic test to support clinical management of RSV patients.
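A sketch of the evaluation scheme (scikit-learn assumed; simulated stand-in data, and the paper's gene-selection step is omitted):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score

# 39 "infants", high-dimensional expression profiles, binary severity outcome.
X, y = make_classification(n_samples=39, n_features=500, n_informative=20, random_state=0)

clf = make_pipeline(StandardScaler(), SVC(kernel="linear"))
# Leave-one-out: each infant's score comes from a model trained on the other 38.
scores = cross_val_predict(clf, X, y, cv=LeaveOneOut(), method="decision_function")
print(f"LOOCV AUC: {roc_auc_score(y, scores):.3f}")
```

Note that any gene selection must also happen inside each training fold; selecting a signature on all 39 infants first would inflate the LOOCV AUC (compare the incomplete-CV entry below).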
Affiliation(s)
- Victor L. Jong
- Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, Utrecht, The Netherlands
- Department of Viroscience, Erasmus Medical Center, Rotterdam, The Netherlands
- Inge M. L. Ahout
- Department of Pediatrics, Laboratory of Pediatric Infectious Diseases, Radboud Institute for Molecular Life Sciences, Radboud University Medical Center, Nijmegen, The Netherlands
- Jop Jans
- Department of Pediatrics, Laboratory of Pediatric Infectious Diseases, Radboud Institute for Molecular Life Sciences, Radboud University Medical Center, Nijmegen, The Netherlands
- Aldert Zomer
- Department of Pediatrics, Laboratory of Pediatric Infectious Diseases, Radboud Institute for Molecular Life Sciences, Radboud University Medical Center, Nijmegen, The Netherlands
- Elles Simonetti
- Department of Pediatrics, Laboratory of Pediatric Infectious Diseases, Radboud Institute for Molecular Life Sciences, Radboud University Medical Center, Nijmegen, The Netherlands
- Maarten A. Bijl
- Department of Viroscience, Erasmus Medical Center, Rotterdam, The Netherlands
- H. Kim Brand
- Department of Pediatrics, Laboratory of Pediatric Infectious Diseases, Radboud Institute for Molecular Life Sciences, Radboud University Medical Center, Nijmegen, The Netherlands
- Marien I. de Jonge
- Department of Pediatrics, Laboratory of Pediatric Infectious Diseases, Radboud Institute for Molecular Life Sciences, Radboud University Medical Center, Nijmegen, The Netherlands
- Pieter L. Fraaij
- Department of Viroscience, Erasmus Medical Center, Rotterdam, The Netherlands
- Department of Pediatrics, Erasmus Medical Center, Rotterdam, The Netherlands
- Ronald de Groot
- Department of Pediatrics, Laboratory of Pediatric Infectious Diseases, Radboud Institute for Molecular Life Sciences, Radboud University Medical Center, Nijmegen, The Netherlands
- Albert D. M. E. Osterhaus
- Department of Viroscience, Erasmus Medical Center, Rotterdam, The Netherlands
- Research Institute for Infectious Diseases and Zoonoses, Veterinary University Hannover, Germany
- Marinus J. Eijkemans
- Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, Utrecht, The Netherlands
- Gerben Ferwerda
- Department of Pediatrics, Laboratory of Pediatric Infectious Diseases, Radboud Institute for Molecular Life Sciences, Radboud University Medical Center, Nijmegen, The Netherlands
- Arno C. Andeweg
- Department of Viroscience, Erasmus Medical Center, Rotterdam, The Netherlands
9
Jong VL, Novianti PW, Roes KCB, Eijkemans MJC. Selecting a classification function for class prediction with gene expression data. Bioinformatics 2016; 32:1814-22. [PMID: 26873933 DOI: 10.1093/bioinformatics/btw034]
Abstract
MOTIVATION Class prediction with gene expression data is widely used to generate diagnostic and/or prognostic models. The literature reveals that classification functions perform differently across gene expression datasets. The question of which classification function should be used for a given dataset remains to be answered. In this study, a predictive model for choosing an optimal function for class prediction on a given dataset was devised. RESULTS To achieve this, gene expression data were simulated for different values of gene-pair correlations, sample size, gene variances, numbers of differentially expressed genes and fold changes. For each simulated dataset, ten classifiers were built and evaluated using ten classification functions. The resulting accuracies from 1152 different simulation scenarios by ten classification functions were then modeled using a linear mixed effects regression on the studied data characteristics, yielding a model that predicts the accuracy of the functions on given data. An application of our model to eight real-life datasets showed positive correlations (0.33-0.82) between the predicted and expected accuracies. CONCLUSION The predictive model presented here might serve as a guide for choosing an optimal classification function, among the ten studied functions, for any given gene expression dataset. AVAILABILITY AND IMPLEMENTATION The R source code for the analysis and the R package 'SPreFuGED' are available at Bioinformatics online. CONTACT v.l.jong@umcutecht.nl SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
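A sketch of the modeling step, with an invented results table (all variable names are illustrative) and statsmodels for the linear mixed effects regression; the paper's actual fixed- and random-effects structure is an assumption here.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical simulation results: accuracy of several classification functions
# under varying data characteristics.
rng = np.random.default_rng(0)
rows = []
for func in ["lda", "svm", "rf", "knn"]:
    for scenario in range(50):
        n, p = rng.integers(20, 200), rng.integers(100, 5000)
        fold_change = rng.uniform(0.5, 3.0)
        acc = (0.6 + 0.05 * np.log(n) - 0.01 * np.log(p)
               + 0.03 * fold_change + rng.normal(scale=0.05))
        rows.append(dict(func=func, n=n, p=p, fold_change=fold_change, accuracy=acc))
df = pd.DataFrame(rows)

# Mixed model: data characteristics as fixed effects, a random intercept per
# classification function.
fit = smf.mixedlm("accuracy ~ np.log(n) + np.log(p) + fold_change",
                  df, groups=df["func"]).fit()
print(fit.summary())   # fitted effects predict accuracy for new data characteristics
```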
Affiliation(s)
- Victor L Jong
- Biostatistics & Research Support, Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, 3508 GA, Utrecht, The Netherlands; Viroscience Lab, Erasmus Medical Center Rotterdam, Rotterdam, CE 3015, The Netherlands
- Putri W Novianti
- Biostatistics & Research Support, Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, 3508 GA, Utrecht, The Netherlands; Epidemiology & Biostatistics Department, Vrije University Medical Center Amsterdam, HV Amsterdam 1081, The Netherlands
- Kit C B Roes
- Biostatistics & Research Support, Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, 3508 GA, Utrecht, The Netherlands
- Marinus J C Eijkemans
- Biostatistics & Research Support, Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, 3508 GA, Utrecht, The Netherlands
10
Hornung R, Bernau C, Truntzer C, Wilson R, Stadler T, Boulesteix AL. A measure of the impact of CV incompleteness on prediction error estimation with application to PCA and normalization. BMC Med Res Methodol 2015; 15:95. [PMID: 26537575 PMCID: PMC4634762 DOI: 10.1186/s12874-015-0088-9]
Abstract
Background In applications of supervised statistical learning in the biomedical field it is necessary to assess the prediction error of the respective prediction rules. Often, data preparation steps are performed on the dataset—in its entirety—before training/test set based prediction error estimation by cross-validation (CV)—an approach referred to as “incomplete CV”. Whether incomplete CV can result in an optimistically biased error estimate depends on the data preparation step under consideration. Several empirical studies have investigated the extent of bias induced by performing preliminary supervised variable selection before CV. To our knowledge, however, the potential bias induced by other data preparation steps has not yet been examined in the literature. In this paper we investigate this bias for two common data preparation steps: normalization and principal component analysis for dimension reduction of the covariate space (PCA). Furthermore we obtain preliminary results for the following steps: optimization of tuning parameters, variable filtering by variance and imputation of missing values. Methods We devise the easily interpretable and general measure CVIIM (“CV Incompleteness Impact Measure”) to quantify the extent of bias induced by incomplete CV with respect to a data preparation step of interest. This measure can be used to determine whether a specific data preparation step should, as a general rule, be performed in each CV iteration or whether an incomplete CV procedure would be acceptable in practice. We apply CVIIM to large collections of microarray datasets to answer this question for normalization and PCA. Results Performing normalization on the entire dataset before CV did not result in a noteworthy optimistic bias in any of the investigated cases. In contrast, when performing PCA before CV, medium to strong underestimates of the prediction error were observed in multiple settings. Conclusions While the investigated forms of normalization can be safely performed before CV, PCA has to be performed anew in each CV split to protect against optimistic bias.
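The PCA result is easy to reproduce in a sketch (scikit-learn assumed): fitting PCA on the entire dataset before CV, versus refitting it inside every split.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=50, n_features=1000, n_informative=10, random_state=0)

# Incomplete CV: PCA is fitted on the ENTIRE dataset before cross-validation,
# so every training fold has already "seen" its test fold through the components.
Z = PCA(n_components=10).fit_transform(X)
incomplete = cross_val_score(LogisticRegression(max_iter=1000), Z, y, cv=5)

# Full CV: PCA is refitted inside each CV split, as the paper recommends.
pipe = make_pipeline(PCA(n_components=10), LogisticRegression(max_iter=1000))
full = cross_val_score(pipe, X, y, cv=5)

e_inc, e_full = 1 - incomplete.mean(), 1 - full.mean()
print(f"incomplete CV error: {e_inc:.3f}")   # typically optimistic (too small)
print(f"full CV error:       {e_full:.3f}")
# A CVIIM-style global ratio; the paper's exact definition is an assumption here.
print(f"CVIIM-like measure:  {max(0.0, 1.0 - e_inc / e_full):.3f}")
```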
Affiliation(s)
- Roman Hornung
- Department of Medical Informatics, Biometry and Epidemiology, University of Munich, Marchioninistr. 15, Munich, D-81377, Germany.
- Christoph Bernau
- Department of Medical Informatics, Biometry and Epidemiology, University of Munich, Marchioninistr. 15, Munich, D-81377, Germany; Leibniz Supercomputing Center, Boltzmannstr. 1, Garching, D-85748, Germany
- Caroline Truntzer
- Clinical and Innovation Proteomic Platform, Pôle de Recherche Université de Bourgogne, 15 Bd Maréchal de Lattre de Tassigny, Dijon, F-21000, France
- Rory Wilson
- Department of Medical Informatics, Biometry and Epidemiology, University of Munich, Marchioninistr. 15, Munich, D-81377, Germany
- Thomas Stadler
- Department of Urology, University of Munich, Marchioninistr. 15, Munich, D-81377, Germany
- Anne-Laure Boulesteix
- Department of Medical Informatics, Biometry and Epidemiology, University of Munich, Marchioninistr. 15, Munich, D-81377, Germany
11
Safonov I, Gartseev I, Pikhletsky M, Tishutin O, Bailey MJA. An approach for model assessment for activity recognition. Pattern Recognition and Image Analysis 2015. [DOI: 10.1134/s1054661815020224]
12
Jung SH, Chen Y, Ahn H. Type I error control for tree classification. Cancer Inform 2014; 13:11-8. [PMID: 25452689 PMCID: PMC4237155 DOI: 10.4137/cin.s16342]
Abstract
Binary tree classification is useful for classifying a population based on the levels of an outcome variable associated with chosen predictors. Often we start a classification with a large number of candidate predictors, and each predictor takes a number of different cutoff values. Because of these types of multiplicity, the binary tree classification method is subject to a severely inflated type I error probability. Nonetheless, there have not been many publications addressing this issue. In this paper, we propose a binary tree classification method that controls the probability of accepting a predictor below a certain level, say 5%.
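The multiplicity problem can be made concrete with a permutation-based sketch; this illustrates controlling the type I error of split selection, not the authors' exact procedure.

```python
import numpy as np

def max_chi2(x, y):
    """Best chi-square statistic over all cutoffs of a single predictor."""
    p = y.mean()
    best = 0.0
    for c in np.unique(x)[:-1]:
        mask = x <= c
        n1, n2 = mask.sum(), (~mask).sum()
        p1, p2 = y[mask].mean(), y[~mask].mean()
        stat = (p1 - p2) ** 2 * (n1 * n2 / (n1 + n2)) / (p * (1 - p))
        best = max(best, stat)
    return best

def split_pvalue(x, y, n_perm=500, rng=None):
    """Permutation p-value for the best cutoff, adjusted for the multiplicity
    of cutoff choices: the null distribution is that of the MAXIMAL statistic."""
    rng = np.random.default_rng(rng)
    observed = max_chi2(x, y)
    null = np.array([max_chi2(x, rng.permutation(y)) for _ in range(n_perm)])
    return float(np.mean(null >= observed))

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=100).astype(float)
x_noise = rng.normal(size=100)               # predictor unrelated to the outcome
x_signal = rng.normal(size=100) + 0.8 * y    # predictor related to the outcome
print(f"adjusted p, noise predictor:  {split_pvalue(x_noise, y, rng=1):.3f}")
print(f"adjusted p, signal predictor: {split_pvalue(x_signal, y, rng=1):.3f}")
# A predictor would be accepted for splitting only if its adjusted p-value < 0.05.
```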
Affiliation(s)
- Sin-Ho Jung
- Department of Biostatistics and Bioinformatics, Duke University, Durham, NC 27710, USA
- Yong Chen
- Department of Applied Mathematics and Statistics, Stony Brook University, Stony Brook, NY 11794-3600, USA
- Hongshik Ahn
- Department of Applied Mathematics and Statistics, Stony Brook University, Stony Brook, NY 11794-3600, USA
13
Ding Y, Tang S, Liao SG, Jia J, Oesterreich S, Lin Y, Tseng GC. Bias correction for selecting the minimal-error classifier from many machine learning models. Bioinformatics 2014; 30:3152-8. [PMID: 25086004 DOI: 10.1093/bioinformatics/btu520]
Abstract
MOTIVATION Supervised machine learning is commonly applied in genomic research to construct a classifier from the training data that is generalizable to predict independent testing data. When test datasets are not available, cross-validation is commonly used to estimate the error rate. Many machine learning methods are available, and it is well known that no universally best method exists in general. It has been a common practice to apply many machine learning methods and report the method that produces the smallest cross-validation error rate. Theoretically, such a procedure produces a selection bias. Consequently, many clinical studies with moderate sample sizes (e.g. n = 30-60) risk reporting a falsely small cross-validation error rate that could not be validated later in independent cohorts. RESULTS In this article, we illustrated the probabilistic framework of the problem and explored the statistical and asymptotic properties. We proposed a new bias correction method based on learning curve fitting by the inverse power law (IPL) and compared it with three existing methods: nested cross-validation, weighted mean correction and the Tibshirani-Tibshirani procedure. All methods were compared on simulated datasets, five moderate-size real datasets and two large breast cancer datasets. The results showed that IPL outperforms the other methods in bias correction, with smaller variance, and it has the additional advantage of extrapolating error estimates to larger sample sizes, a practical feature for recommending whether more samples should be recruited to improve the classifier and its accuracy. An R package 'MLbias' and all source files are publicly available. AVAILABILITY AND IMPLEMENTATION tsenglab.biostat.pitt.edu/software.htm. CONTACT ctseng@pitt.edu SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
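The learning-curve idea can be sketched with scipy; the exact IPL parameterization used in the paper is an assumption here, and the error rates are invented for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit

# Cross-validated error rates measured at increasing training-set sizes.
n = np.array([20, 30, 40, 50, 60])
err = np.array([0.38, 0.33, 0.30, 0.285, 0.275])

def inverse_power_law(n, a, b, c):
    # err(n) = a + b * n^(-c): 'a' plays the role of the asymptotic error.
    return a + b * n ** (-c)

(a, b, c), _ = curve_fit(inverse_power_law, n, err, p0=(0.2, 1.0, 0.5), maxfev=10000)
print(f"asymptotic error estimate a = {a:.3f}")
print(f"extrapolated error at n=200: {inverse_power_law(200, a, b, c):.3f}")
```

The fitted curve supports exactly the practical decision the abstract names: whether recruiting more samples is likely to improve the classifier.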
Affiliation(s)
- Ying Ding
- Joint Carnegie Mellon University-University of Pittsburgh Ph.D. Program in Computational Biology, Department of Biostatistics, Graduate School of Public Health, University of Pittsburgh, Pittsburgh, PA 15261, USA and Magee-Womens Research Institute, Pittsburgh, PA 15213, USA
- Shaowu Tang
- Joint Carnegie Mellon University-University of Pittsburgh Ph.D. Program in Computational Biology, Department of Biostatistics, Graduate School of Public Health, University of Pittsburgh, Pittsburgh, PA 15261, USA and Magee-Womens Research Institute, Pittsburgh, PA 15213, USA
- Serena G Liao
- Joint Carnegie Mellon University-University of Pittsburgh Ph.D. Program in Computational Biology, Department of Biostatistics, Graduate School of Public Health, University of Pittsburgh, Pittsburgh, PA 15261, USA and Magee-Womens Research Institute, Pittsburgh, PA 15213, USA
- Jia Jia
- Joint Carnegie Mellon University-University of Pittsburgh Ph.D. Program in Computational Biology, Department of Biostatistics, Graduate School of Public Health, University of Pittsburgh, Pittsburgh, PA 15261, USA and Magee-Womens Research Institute, Pittsburgh, PA 15213, USA
- Steffi Oesterreich
- Joint Carnegie Mellon University-University of Pittsburgh Ph.D. Program in Computational Biology, Department of Biostatistics, Graduate School of Public Health, University of Pittsburgh, Pittsburgh, PA 15261, USA and Magee-Womens Research Institute, Pittsburgh, PA 15213, USA
- Yan Lin
- Joint Carnegie Mellon University-University of Pittsburgh Ph.D. Program in Computational Biology, Department of Biostatistics, Graduate School of Public Health, University of Pittsburgh, Pittsburgh, PA 15261, USA and Magee-Womens Research Institute, Pittsburgh, PA 15213, USA
- George C Tseng
- Joint Carnegie Mellon University-University of Pittsburgh Ph.D. Program in Computational Biology, Department of Biostatistics, Graduate School of Public Health, University of Pittsburgh, Pittsburgh, PA 15261, USA and Magee-Womens Research Institute, Pittsburgh, PA 15213, USA
14
Krstajic D, Buturovic LJ, Leahy DE, Thomas S. Cross-validation pitfalls when selecting and assessing regression and classification models. J Cheminform 2014; 6:10. [PMID: 24678909 PMCID: PMC3994246 DOI: 10.1186/1758-2946-6-10]
Abstract
BACKGROUND We address the problem of selecting and assessing classification and regression models using cross-validation. Current state-of-the-art methods can yield models with high variance, rendering them unsuitable for a number of practical applications including QSAR. In this paper we describe and evaluate best practices which improve reliability and increase confidence in selected models. A key operational component of the proposed methods is cloud computing which enables routine use of previously infeasible approaches. METHODS We describe in detail an algorithm for repeated grid-search V-fold cross-validation for parameter tuning in classification and regression, and we define a repeated nested cross-validation algorithm for model assessment. As regards variable selection and parameter tuning we define two algorithms (repeated grid-search cross-validation and double cross-validation), and provide arguments for using the repeated grid-search in the general case. RESULTS We show results of our algorithms on seven QSAR datasets. The variation of the prediction performance, which is the result of choosing different splits of the dataset in V-fold cross-validation, needs to be taken into account when selecting and assessing classification and regression models. CONCLUSIONS We demonstrate the importance of repeating cross-validation when selecting an optimal model, as well as the importance of repeating nested cross-validation when assessing a prediction error.
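A sketch of repeated nested cross-validation with scikit-learn (an illustrative stand-in for the paper's algorithms): grid-search CV in the inner loop selects the hyper-parameters, and a repeated outer CV assesses the whole selection procedure, averaging out the variation caused by any particular choice of fold splits.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import (GridSearchCV, RepeatedStratifiedKFold,
                                     cross_val_score)
from sklearn.svm import SVC

X, y = make_classification(n_samples=120, n_features=50, random_state=0)

# Inner loop: grid-search CV tunes the hyper-parameters.
inner = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": ["scale", 0.01]}, cv=5)

# Outer loop: repeated CV assesses the entire tuning-plus-fitting procedure.
outer = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=1)
scores = cross_val_score(inner, X, y, cv=outer)
print(f"repeated nested CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```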
Affiliation(s)
- Damjan Krstajic
- Research Centre for Cheminformatics, Jasenova 7, 11030, Beograd, Serbia; Laboratory for Molecular Biomedicine, Institute of Molecular Genetics and Genetic Engineering, University of Belgrade, Vojvode Stepe 444a, 11010, Beograd, Serbia; Clinical Persona Inc, 932 Mouton Circle, East Palo Alto, CA, 94303, USA
- David E Leahy
- Molplex Pharmaceuticals, Alderly Park, Macclesfield, SK10 4TF, UK
- Simon Thomas
- Cyprotex Discovery Ltd, 15 Beech Lane, Macclesfield, SK10 2DR, UK
15
Boulesteix AL, Schmid M. Machine learning versus statistical modeling. Biom J 2014; 56:588-93. [DOI: 10.1002/bimj.201300226]
Affiliation(s)
- Anne-Laure Boulesteix
- Department of Medical Informatics, Biometry and Epidemiology, University of Munich, Germany