1
Tabata K, Uraoka N, Benhamida J, Hanna MG, Sirintrapun SJ, Gallas BD, Gong Q, Aly RG, Emoto K, Matsuda KM, Hameed MR, Klimstra DS, Yagi Y. Validation of mitotic cell quantification via microscopy and multiple whole-slide scanners. Diagn Pathol 2019; 14:65. [PMID: 31238983 PMCID: PMC6593538 DOI: 10.1186/s13000-019-0839-8] [Citation(s) in RCA: 21] [Impact Index Per Article: 4.2] [Received: 09/26/2018] [Accepted: 06/11/2019] [Indexed: 01/03/2023] Open
Abstract
BACKGROUND The establishment of whole-slide imaging (WSI) as a medical diagnostic device allows that pathologists may evaluate mitotic activity with this new technology. Furthermore, the image digitalization provides an opportunity to develop algorithms for automatic quantifications, ideally leading to improved reproducibility as compared to the naked eye examination by pathologists. In order to implement them effectively, accuracy of mitotic figure detection using WSI should be investigated. In this study, we aimed to measure pathologist performance in detecting mitotic figures (MFs) using multiple platforms (multiple scanners) and compare the results with those obtained using a brightfield microscope. METHODS Four slides of canine oral melanoma were prepared and digitized using 4 WSI scanners. In these slides, 40 regions of interest (ROIs) were demarcated, and five observers identified the MFs using different viewing modes: microscopy and WSI. We evaluated the inter- and intra-observer agreements between modes with Cohen's Kappa and determined "true" MFs with a consensus panel. We then assessed the accuracy (agreement with truth) using the average of sensitivity and specificity. RESULTS In the 40 ROIs, 155 candidate MFs were detected by five pathologists; 74 of them were determined to be true MFs. Inter- and intra-observer agreement was mostly "substantial" or greater (Kappa = 0.594-0.939). Accuracy was between 0.632 and 0.843 across all readers and modes. After averaging over readers for each modality, we found that mitosis detection accuracy for 3 of the 4 WSI scanners was significantly less than that of the microscope (p = 0.002, 0.012, and 0.001). CONCLUSIONS This study is the first to compare WSIs and microscopy in detecting MFs at the level of individual cells. Our results suggest that WSI can be used for mitotic cell detection and offers similar reproducibility to the microscope, with slightly less accuracy.
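The two statistics named in the methods above (Cohen's kappa for agreement between viewing modes, and accuracy as the average of sensitivity and specificity) can be illustrated with a minimal Python sketch. It assumes binary per-candidate labels (1 = called a mitotic figure, 0 = not). The label vectors and the function names `cohens_kappa` and `balanced_accuracy` are hypothetical and are not taken from the study, which does not describe its software.

```python
from typing import Sequence


def cohens_kappa(a: Sequence[int], b: Sequence[int]) -> float:
    """Cohen's kappa for two equal-length sequences of binary (0/1) calls."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    p_a1 = sum(a) / n  # fraction of positive calls in the first sequence
    p_b1 = sum(b) / n  # fraction of positive calls in the second sequence
    # Chance agreement: both say 1, or both say 0, if calls were independent.
    expected = p_a1 * p_b1 + (1 - p_a1) * (1 - p_b1)
    # Undefined (division by zero) when chance agreement is exactly 1.
    return (observed - expected) / (1 - expected)


def balanced_accuracy(truth: Sequence[int], calls: Sequence[int]) -> float:
    """Average of sensitivity and specificity against reference labels."""
    tp = sum(t == 1 and c == 1 for t, c in zip(truth, calls))
    tn = sum(t == 0 and c == 0 for t, c in zip(truth, calls))
    positives = sum(truth)
    negatives = len(truth) - positives
    return 0.5 * (tp / positives + tn / negatives)


# Hypothetical data: 8 candidate cells, consensus truth, one reader in two modes.
truth = [1, 1, 1, 1, 0, 0, 0, 0]
microscope = [1, 1, 1, 0, 0, 0, 0, 1]
wsi = [1, 1, 0, 0, 0, 0, 0, 1]

print(cohens_kappa(microscope, wsi))         # 0.75
print(balanced_accuracy(truth, microscope))  # 0.75
print(balanced_accuracy(truth, wsi))         # 0.625
```

Averaging sensitivity and specificity weights true and false candidates equally, regardless of how many of each the candidate pool contains.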
Affiliation(s)
- Kazuhiro Tabata
- Department of Pathology, Memorial Sloan Kettering Cancer Center, 1275 York Avenue, New York, NY 10065 USA
- Department of Pathology, Nagasaki University Hospital, 1-7-1 Sakamoto, Nagasaki, Nagasaki 8528501 Japan
- Naohiro Uraoka
- Department of Pathology, Memorial Sloan Kettering Cancer Center, 1275 York Avenue, New York, NY 10065 USA
- Jamal Benhamida
- Department of Pathology, Memorial Sloan Kettering Cancer Center, 1275 York Avenue, New York, NY 10065 USA
- Matthew G. Hanna
- Department of Pathology, Memorial Sloan Kettering Cancer Center, 1275 York Avenue, New York, NY 10065 USA
- Brandon D. Gallas
- Center for Devices and Radiological Health, Office of Science and Engineering Laboratories, U.S. Food and Drug Administration, 10903 New Hampshire Avenue, Silver Spring, MD 20993 USA
- Qi Gong
- Center for Devices and Radiological Health, Office of Science and Engineering Laboratories, U.S. Food and Drug Administration, 10903 New Hampshire Avenue, Silver Spring, MD 20993 USA
- Rania G. Aly
- Department of Pathology, Memorial Sloan Kettering Cancer Center, 1275 York Avenue, New York, NY 10065 USA
- Department of Pathology, Faculty of Medicine, Alexandria University, 22 El-Guish Road, El-Shatby, Alexandria, 21526 Egypt
- Katsura Emoto
- Department of Pathology, Memorial Sloan Kettering Cancer Center, 1275 York Avenue, New York, NY 10065 USA
- Thoracic Service, Department of Surgery, Memorial Sloan Kettering Cancer Center, 1275 York Avenue, New York, NY 10065 USA
- Kant M. Matsuda
- Department of Pathology, Memorial Sloan Kettering Cancer Center, 1275 York Avenue, New York, NY 10065 USA
- Meera R. Hameed
- Department of Pathology, Memorial Sloan Kettering Cancer Center, 1275 York Avenue, New York, NY 10065 USA
- David S. Klimstra
- Department of Pathology, Memorial Sloan Kettering Cancer Center, 1275 York Avenue, New York, NY 10065 USA
- Yukako Yagi
- Department of Pathology, Memorial Sloan Kettering Cancer Center, 1275 York Avenue, New York, NY 10065 USA

2
The Reproducibility of Changes in Diagnostic Figures of Merit Across Laboratory and Clinical Imaging Reader Studies. Acad Radiol 2017; 24:1436-1446. [PMID: 28666723 DOI: 10.1016/j.acra.2017.05.007] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 01/09/2017] [Revised: 04/28/2017] [Accepted: 05/01/2017] [Indexed: 11/23/2022]
Abstract
RATIONALE AND OBJECTIVES In this paper we examine which comparisons of reading performance between diagnostic imaging systems made in controlled retrospective laboratory studies may be representative of what we observe in later clinical studies. The change in a meaningful diagnostic figure of merit between two diagnostic modalities should be qualitatively or quantitatively comparable across all kinds of studies. MATERIALS AND METHODS In this meta-study we examine the reproducibility of relative measures of sensitivity, false positive fraction (FPF), area under the receiver operating characteristic (ROC) curve, and expected utility across laboratory and observational clinical studies for several different breast imaging modalities, including screen film mammography, digital mammography, breast tomosynthesis, and ultrasound. RESULTS Across studies of all types, the changes in the FPFs yielded very small probabilities of having a common mean value. The probabilities of relative sensitivity being the same across ultrasound and tomosynthesis studies were low. No evidence was found for different mean values of relative area under the ROC curve or relative expected utility within any of the study sets. CONCLUSION The comparison demonstrates that the ratios of areas under the ROC curve and expected utilities are reproducible across laboratory and clinical studies, whereas sensitivity and FPF are not.

3
Harvey S, Gallagher AM, Nolan M, Hughes CM. Listening to Women: Expectations and Experiences in Breast Imaging. J Womens Health (Larchmt) 2016; 24:777-83. [PMID: 26390380 DOI: 10.1089/jwh.2015.29001.swh] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.0] [Indexed: 01/23/2023] Open
Affiliation(s)
- Susan Harvey
- Director of Breast Imaging, The Russell H. Morgan Department of Radiology and Radiological Science, Johns Hopkins Medical Institutions, Baltimore, Maryland
- Martha Nolan
- Society for Women's Health Research, Washington, DC

4
Value of gadolinium-enhanced MRI in detection of acute appendicitis in children and adolescents. AJR Am J Roentgenol 2015; 203:W543-8. [PMID: 25341169 DOI: 10.2214/ajr.13.12093] [Citation(s) in RCA: 38] [Impact Index Per Article: 4.2] [Indexed: 11/18/2022]
Abstract
OBJECTIVE The aim of this study was to determine both the value of gadolinium-enhanced MRI in children with suspected acute appendicitis and the best sequences for detecting acute appendicitis, to thereby decrease imaging time. MATERIALS AND METHODS This was a retrospective review of pediatric patients with suspected appendicitis who had undergone MRI at our institution between 2010 and 2011 after an indeterminate ultrasound examination. MRI examinations included T1-weighted unenhanced and contrast-enhanced, T2-weighted, and balanced steady-state free precession (SSFP) sequences in axial and coronal planes. Sequences were reviewed together and individually by five radiologists who were blinded to the final diagnosis. Radiologists were asked to score their confidence of appendicitis diagnosis using a 5-point scale. The diagnostic performance of each MR sequence was obtained by comparing the mean area under the curve (AUC) using receiver operating characteristic (ROC) analysis. RESULTS A total of 49 patients with clinically suspected appendicitis were included, of whom 16 received a diagnosis of appendicitis. The mean AUCs for reviewing all sequences together, contrast-enhanced sequences alone, T2-weighted sequences alone, and balanced SSFP alone were 0.984, 0.979, 0.944, and 0.910, respectively. No significant difference was observed between reviewing all sequences together versus contrast-enhanced sequences alone (p = 0.90) and T2-weighted sequences alone (p = 0.23). A significant difference was observed between contrast-enhanced sequences and balanced SSFP (p < 0.03). CONCLUSION Gadolinium-enhanced images and T2-weighted images are most helpful in the assessment of acute appendicitis in the pediatric population. These findings have led to protocol modifications that have reduced imaging time.

5
Estimating the receiver operating characteristic curve in studies that match controls to cases on covariates. Acad Radiol 2013; 20:863-73. [PMID: 23601953 DOI: 10.1016/j.acra.2013.03.004] [Citation(s) in RCA: 36] [Impact Index Per Article: 3.3] [Received: 01/30/2013] [Revised: 03/07/2013] [Accepted: 03/08/2013] [Indexed: 11/23/2022]
Abstract
RATIONALE AND OBJECTIVES Studies evaluating a new diagnostic imaging test may select control subjects without disease who are similar to case subjects with disease in regard to factors potentially related to the imaging result. Selecting one or more controls that are matched to each case on factors such as age, comorbidities, or study site improves study validity by eliminating potential biases due to differential characteristics of readings for cases versus controls. However, it is not widely appreciated that valid analysis requires that the receiver operating characteristic (ROC) curve be adjusted for covariates. We propose a new computationally simple method for estimating the covariate-adjusted ROC curve that is appropriate in matched case-control studies. MATERIALS AND METHODS We provide theoretical arguments for the validity of the estimator and demonstrate its application to data. We compare the statistical properties of the estimator with those of a previously proposed estimator of the covariate-adjusted ROC curve. We demonstrate an application of the estimator to data derived from a study of emergency medical services encounters where the goal is to diagnose critical illness in nontrauma, non-cardiac arrest patients. A novel bootstrap method is proposed for calculating confidence intervals. RESULTS The new estimator is computationally very simple, yet we show it yields values that approximate the existing, more complicated estimator in simulated data sets. We found that the new estimator has excellent statistical properties, with bias and efficiency comparable with the existing method. CONCLUSIONS In matched case-control studies, the ROC curve should be adjusted for matching covariates and can be estimated with the new computationally simple approach.

6
Samuelson FW. Inference based on diagnostic measures from studies of new imaging devices. Acad Radiol 2013; 20:816-24. [PMID: 23643364 DOI: 10.1016/j.acra.2013.03.002] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 01/07/2013] [Revised: 03/01/2013] [Accepted: 03/07/2013] [Indexed: 10/26/2022]
Abstract
RATIONALE AND OBJECTIVES Before using a new diagnostic imaging device regularly in a clinic, it should be studied using patients and radiologists. Often such studies report diagnostic performance in terms of sensitivity, specificity, area under the receiver operating characteristic curve (AUC), or differences thereof. In this report we look at how these studies differ from actual future clinical practice and how those differences may affect reported performance measures. MATERIALS AND METHODS We review signal detection (receiver operating characteristic) theory and decision theory. We compare diagnostic measures from several published studies in medical imaging and examine how they relate to theory and each other. RESULTS We see that clinical decisions can be modeled using signal detection and decision theories. Sensitivity and specificity are inextricably linked with clinical factors, such as prevalence and costs. Imaging devices are used in many different ways in clinical practice, so that sensitivities, specificities, and AUCs measured in studies of new diagnostic imaging devices will differ from those in actual future clinical use. CONCLUSIONS Measured sensitivities, specificities, and the directions of changes thereof are not necessarily consistent or reproducible across studies of new diagnostic devices. A change in the AUC, which should be independent of clinical costs or prevalence, is a consistent measure across similar studies, and a positive change in AUC is indicative of additional diagnostic information that will be available to radiologists in a future clinical environment.

7
Nishikawa RM, Pesce LL. Estimating sensitivity and specificity for technology assessment based on observer studies. Acad Radiol 2013; 20:825-30. [PMID: 23660073 DOI: 10.1016/j.acra.2013.03.008] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 01/23/2013] [Revised: 03/23/2013] [Accepted: 03/26/2013] [Indexed: 11/17/2022]
Abstract
RATIONALE AND OBJECTIVES The goal of this study was to determine the accuracy and precision of using scores from a receiver operating characteristic rating scale to estimate sensitivity and specificity. MATERIALS AND METHODS We used data collected in a previous study that measured the improvements in radiologists' ability to classify mammographic microcalcification clusters as benign or malignant with and without the use of a computer-aided diagnosis scheme. Sensitivity and specificity were estimated from the rating data from a question that directly asked the radiologists their biopsy recommendations, which was used as the "truth," because it is the actual recall decision, thus it is their subjective truth. By thresholding the rating data, sensitivity and specificity were estimated for different threshold values. RESULTS Because of interreader and intrareader variability, estimated sensitivity and specificity values for individual readers could be as much as 100% in error when using rating data compared to using the biopsy recommendation data. When pooled together, the estimates using thresholding the rating data were in good agreement with sensitivity and specificity estimated from the recommendation data. However, the statistical power of the rating data estimates was lower. CONCLUSIONS By simply asking the observer his or her explicit recommendation (eg, biopsy or no biopsy), sensitivity and specificity can be measured directly, giving a more accurate description of empirical variability and the power of the study can be maximized.
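The thresholding step described above can be sketched as follows: a case is called positive when its rating meets or exceeds a cutoff, and the calls are compared with reference labels. This is only an illustration; the 0-100 rating scale, the example data, and the function name `sens_spec_at_threshold` are assumptions, and the reference labels stand in for whatever the analyst designates as truth (in the study above, the reader's own biopsy recommendation).

```python
from typing import Sequence, Tuple


def sens_spec_at_threshold(
    ratings: Sequence[float],
    is_positive: Sequence[bool],
    threshold: float,
) -> Tuple[float, float]:
    """Sensitivity and specificity when rating >= threshold counts as a positive call."""
    tp = fn = tn = fp = 0
    for rating, positive in zip(ratings, is_positive):
        called = rating >= threshold
        if positive:
            if called:
                tp += 1
            else:
                fn += 1
        else:
            if called:
                fp += 1
            else:
                tn += 1
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical ratings for 8 cases (4 reference-positive, 4 reference-negative).
ratings = [10, 35, 60, 80, 90, 20, 55, 70]
is_positive = [False, False, True, True, True, False, False, True]

print(sens_spec_at_threshold(ratings, is_positive, 50))  # (1.0, 0.75)
print(sens_spec_at_threshold(ratings, is_positive, 65))  # (0.75, 1.0)
```

Moving the cutoff trades sensitivity against specificity, which is why a single reader's thresholded estimate can differ sharply from one based on an explicit recommendation.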
Affiliation(s)
- Robert M Nishikawa
- Department of Radiology, The University of Chicago, 5841 South Maryland Avenue, MC-2026, Chicago, IL 60637, USA.

8
Abbey CK, Eckstein MP, Boone JM. Estimating the relative utility of screening mammography. Med Decis Making 2013; 33:510-20. [PMID: 23295543 DOI: 10.1177/0272989x12470756] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.2] [Indexed: 11/17/2022]
Abstract
BACKGROUND The concept of diagnostic utility is a fundamental component of signal detection theory, going back to some of its earliest works. Attaching utility values to the various possible outcomes of a diagnostic test should, in principle, lead to meaningful approaches to evaluating and comparing such systems. However, in many areas of medical imaging, utility is not used because it is presumed to be unknown. METHODS In this work, we estimate relative utility (the utility benefit of a detection relative to that of a correct rejection) for screening mammography using its known relation to the slope of a receiver operating characteristic (ROC) curve at the optimal operating point. The approach assumes that the clinical operating point is optimal for the goal of maximizing expected utility and therefore the slope at this point implies a value of relative utility for the diagnostic task, for known disease prevalence. We examine utility estimation in the context of screening mammography using the Digital Mammographic Imaging Screening Trials (DMIST) data. RESULTS We show how various conditions can influence the estimated relative utility, including characteristics of the rating scale, verification time, probability model, and scope of the ROC curve fit. Relative utility estimates range from 66 to 227. CONCLUSIONS We argue for one particular set of conditions that results in a relative utility estimate of 162 (±14%). This is broadly consistent with values in screening mammography determined previously by other means. At the disease prevalence found in the DMIST study (0.59% at 365-day verification), optimal ROC slopes are near unity, suggesting that utility-based assessments of screening mammography will be similar to those found using Youden's index.
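The relation between ROC slope, prevalence, and relative utility mentioned above can be checked numerically. The sketch below assumes the standard signal-detection result that, at the expected-utility-maximizing operating point, the ROC slope equals (1 - prevalence) / (prevalence × relative utility), with relative utility defined as in the abstract. The function names are hypothetical, and the abstract itself does not state the formula.

```python
def optimal_roc_slope(prevalence: float, relative_utility: float) -> float:
    """ROC slope at the operating point that maximizes expected utility.

    Assumes relative_utility is the utility benefit of a detection divided by
    the utility benefit of a correct rejection.
    """
    return (1.0 - prevalence) / (prevalence * relative_utility)


def relative_utility_from_slope(prevalence: float, slope: float) -> float:
    """Invert the same relation: infer relative utility from an observed slope."""
    return (1.0 - prevalence) / (prevalence * slope)


# Values quoted in the abstract: prevalence 0.59%, relative utility 162.
slope = optimal_roc_slope(0.0059, 162.0)
print(round(slope, 2))  # 1.04, i.e. near unity as the abstract reports

# Round trip back to the relative utility.
print(round(relative_utility_from_slope(0.0059, slope), 1))  # 162.0
```

Under this assumed relation the quoted numbers are mutually consistent: a relative utility of 162 at 0.59% prevalence implies an optimal slope of about 1.04.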
Affiliation(s)
- Craig K Abbey
- Department of Psychology, University of California, Santa Barbara, CA (CKA, ME)
- Department of Radiology, UC Davis Medical Center, Sacramento, CA (CKA, JMB)
- Miguel P Eckstein
- Department of Psychology, University of California, Santa Barbara, CA (CKA, ME)
- John M Boone
- Department of Radiology, UC Davis Medical Center, Sacramento, CA (CKA, JMB)

9
Eng J. Teaching receiver operating characteristic analysis: an interactive laboratory exercise. Acad Radiol 2012; 19:1452-6. [PMID: 23040502 DOI: 10.1016/j.acra.2012.09.003] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Received: 09/01/2012] [Revised: 09/09/2012] [Accepted: 09/10/2012] [Indexed: 11/16/2022]
Abstract
RATIONALE AND OBJECTIVES Despite its fundamental importance in the evaluation of diagnostic tests, receiver operating characteristic (ROC) analysis is not easily understood. The purpose of this project was to create a learning experience that resulted in an intuitive understanding of the basic principles of ROC analysis. MATERIALS AND METHODS An interactive laboratory exercise was developed for a class about radiology testing taught within a clinical epidemiology course between 2000 and 2009. The physician students in the course were clinical fellows from various medical specialties who were enrolled in a graduate degree program in clinical investigation. For the exercise, the class was divided into six groups. Each group interpreted radiographs from a set of 50 exams of the peripheral skeleton to determine the presence or absence of an acute fracture. Data from the class were pooled and given to each student. Students calculated the area under the ROC curve (AUC) corresponding to overall class performance. A binormal ROC curve was also fitted to the data from each class year. RESULTS The laboratory exercise was conducted for 8 years with approximately 20-30 students per year. The mean AUC over the eight laboratory classes was 0.72 with a standard deviation of 0.08 (range, 0.60-0.85). CONCLUSION With some simplifications in design, an observer study can be conducted in a laboratory classroom setting. Participatory data collection promotes the intuitive understanding of ROC analysis principles.
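The AUC calculation that students perform in an exercise like this can be sketched in a few lines of Python. The version below uses the pairwise (Mann-Whitney) form of the empirical AUC, which equals the area under the trapezoidal ROC curve. The 1-5 confidence scale, the example ratings, and the function name `empirical_auc` are assumptions for illustration, not details from the course.

```python
from typing import Sequence


def empirical_auc(pos_scores: Sequence[float], neg_scores: Sequence[float]) -> float:
    """Empirical AUC: probability that a random positive case outscores a
    random negative case, counting ties as one half."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical pooled ratings (1 = definitely no fracture, 5 = definitely fracture).
fracture = [5, 4, 4, 2]
no_fracture = [1, 2, 3, 1]

print(empirical_auc(fracture, no_fracture))  # 0.90625
```

An AUC of 0.5 corresponds to guessing and 1.0 to perfect separation of the two groups.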
Affiliation(s)
- John Eng
- Russell H. Morgan Department of Radiology and Radiological Science, Johns Hopkins University School of Medicine, 600 North Wolfe Street, Baltimore, MD 21287, USA.

10
Rafferty EA, Park JM, Philpotts LE, Poplack SP, Sumkin JH, Halpern EF, Niklason LT. Assessing radiologist performance using combined digital mammography and breast tomosynthesis compared with digital mammography alone: results of a multicenter, multireader trial. Radiology 2012; 266:104-13. [PMID: 23169790 DOI: 10.1148/radiol.12120674] [Citation(s) in RCA: 284] [Impact Index Per Article: 23.7] [Indexed: 11/11/2022]
Abstract
PURPOSE To compare radiologists' diagnostic accuracy and recall rates for breast tomosynthesis combined with digital mammography versus digital mammography alone. MATERIALS AND METHODS Institutional review board approval was obtained at each accruing institution. Participating women gave written informed consent. Mediolateral oblique and craniocaudal digital mammographic and tomosynthesis images of both breasts were obtained from 1192 subjects. Two enriched reader studies were performed to compare digital mammography with tomosynthesis against digital mammography alone. Study 1 comprised 312 cases (48 cancer cases) with images read by 12 radiologists; study 2, 312 cases (51 cancer cases) with 15 radiologists. Study 1 readers recorded only that an abnormality requiring recall was present; study 2 readers had additional training and recorded both lesion type and location. Diagnostic accuracy was compared with receiver operating characteristic analysis. Recall rates of noncancer cases, sensitivity, specificity, and positive and negative predictive values determined by analyzing Breast Imaging Reporting and Data System scores were compared for the two methods. RESULTS Diagnostic accuracy for combined tomosynthesis and digital mammography was superior to that of digital mammography alone. Average difference in area under the curve in study 1 was 7.2% (95% confidence interval [CI]: 3.7%, 10.8%; P < .001) and in study 2 was 6.8% (95% CI: 4.1%, 9.5%; P < .001). All 27 radiologists increased diagnostic accuracy with addition of tomosynthesis. Recall rates for noncancer cases for all readers significantly decreased with addition of tomosynthesis (range, 6%-67%; P < .001 for 25 readers, P < .03 for all readers). Increased sensitivity was largest for invasive cancers: 15% and 22% in studies 1 and 2 versus 3% for in situ cancers in both studies. 
CONCLUSION Addition of tomosynthesis to digital mammography offers the dual benefit of significantly increased diagnostic accuracy and significantly reduced recall rates for noncancer cases. SUPPLEMENTAL MATERIAL http://radiology.rsna.org/lookup/suppl/doi:10.1148/radiol.12120674/-/DC1.
Affiliation(s)
- Elizabeth A Rafferty
- Department of Radiology, Massachusetts General Hospital, 55 Fruit St, Boston, MA 02114, USA.

11
Wunderlich A, Noo F. A nonparametric procedure for comparing the areas under correlated LROC curves. IEEE TRANSACTIONS ON MEDICAL IMAGING 2012; 31:2050-61. [PMID: 22736638 PMCID: PMC3619029 DOI: 10.1109/tmi.2012.2205015] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.9] [Indexed: 05/23/2023]
Abstract
In contrast to the receiver operating characteristic (ROC) assessment paradigm, localization ROC (LROC) analysis provides a means to jointly assess the accuracy of localization and detection in an observer study. In a typical multireader, multicase (MRMC) evaluation, the data sets are paired so that correlations arise in observer performance both between readers and across the imaging conditions (e.g., reconstruction methods or scanning parameters) being compared. Therefore, MRMC evaluations motivate the need for a statistical methodology to compare correlated LROC curves. In this work, we suggest a nonparametric strategy for this purpose. Specifically, we find that seminal work of Sen on U-statistics can be applied to estimate the covariance matrix for a vector of LROC area estimates. The resulting covariance estimator is the LROC analog of the covariance estimator given by DeLong et al. for ROC analysis. Once the covariance matrix is estimated, it can be used to construct confidence intervals and/or confidence regions for purposes of comparing observer performance across imaging conditions. In addition, given the results of a small-scale pilot study, the covariance estimator may be used to estimate the number of images and observers needed to achieve a desired confidence interval size in a full-scale observer study. The utility of our methodology is illustrated with a human-observer LROC evaluation of three image reconstruction strategies for fan-beam x-ray computed tomography (CT).
Affiliation(s)
- Adam Wunderlich
- Department of Radiology, University of Utah, Salt Lake City, UT 84108 USA
- Frédéric Noo
- Department of Radiology, University of Utah, Salt Lake City, UT 84108 USA

12
Chakraborty DP. Measuring agreement between rating interpretations and binary clinical interpretations of images: a simulation study of methods for quantifying the clinical relevance of an observer performance paradigm. Phys Med Biol 2012; 57:2873-904. [PMID: 22516804 DOI: 10.1088/0031-9155/57/10/2873] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 11/11/2022]
Abstract
Laboratory receiver operating characteristic (ROC) studies, that are often used to evaluate medical imaging systems, differ from 'live' clinical interpretations in several respects which could compromise their clinical relevance. The aim was to develop methodology for quantifying the clinical relevance of a laboratory ROC study. A simulator was developed to generate ROC ratings data and binary clinical interpretations classified as correct or incorrect for a common set of images interpreted under clinical and laboratory conditions. The area under the trapezoidal ROC curve (AUC) was used as the laboratory figure-of-merit and the fraction of correct clinical decisions as the clinical figure-of-merit. Conventional agreement measures (Pearson, Spearman, Kendall and kappa) between the bootstrap-induced fluctuations of the two figures of merit were estimated. A jackknife pseudovalue transformation applied to the figures of merit was also investigated as a way to capture agreement existing at the individual image level that could be lost at the figure-of-merit level. It is shown that the pseudovalues define a relevance-ROC curve. The area under this curve (rAUC) measures the ability of the laboratory figure-of-merit-based pseudovalues to correctly classify incorrect versus correct clinical interpretations. Therefore, rAUC is a measure of the clinical relevance of an ROC study. The conventional measures and rAUC were compared under varying simulator conditions. It was found that design details of the ROC study, namely the number of bins, the difficulty level of the images, the ratio of disease-present to disease-absent images and the unavoidable difference between laboratory and clinical performance levels, can lead to serious underestimation of the agreement as indicated by conventional agreement measures, even for perfectly correlated data, while rAUC showed high agreement and was relatively immune to these details. 
At the same time rAUC was sensitive to factors such as intrinsic correlation between the laboratory and clinical decision variables and differences in reporting thresholds that are expected to influence agreement both at the individual image level and at the figure-of-merit level. Suggestions are made for how to conduct relevance-ROC studies aimed at assessing agreement between laboratory and clinical interpretations. The method could be used to evaluate the clinical relevance of alternative scalar figures of merit, such as the sensitivity at a predefined specificity.
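The jackknife pseudovalue transformation mentioned above can be illustrated with a generic sketch: each case is left out in turn, the AUC is recomputed, and the pseudovalue for that case is n × (full AUC) minus (n - 1) × (leave-one-out AUC). This shows only the general jackknife step under assumed data; it is not the paper's simulator or its relevance-ROC analysis, and the function names are hypothetical.

```python
from typing import List, Sequence


def auc(pos: Sequence[float], neg: Sequence[float]) -> float:
    """Empirical (trapezoidal) AUC via pairwise comparison, ties count one half."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def jackknife_pseudovalues(pos: Sequence[float], neg: Sequence[float]) -> List[float]:
    """One pseudovalue per case: n * AUC_full - (n - 1) * AUC_without_case.

    Needs at least two positive and two negative cases so that every
    leave-one-out AUC is defined.
    """
    pos, neg = list(pos), list(neg)
    n_cases = len(pos) + len(neg)
    full = auc(pos, neg)
    pseudo = []
    for i in range(len(pos)):  # leave out each disease-present case
        loo = auc(pos[:i] + pos[i + 1:], neg)
        pseudo.append(n_cases * full - (n_cases - 1) * loo)
    for j in range(len(neg)):  # leave out each disease-absent case
        loo = auc(pos, neg[:j] + neg[j + 1:])
        pseudo.append(n_cases * full - (n_cases - 1) * loo)
    return pseudo


# Hypothetical ratings for four disease-present and four disease-absent images.
pv = jackknife_pseudovalues([5, 4, 4, 2], [1, 2, 3, 1])
print(len(pv))            # 8
print(sum(pv) / len(pv))  # approximately 0.90625, the full-sample AUC
```

The pseudovalues assign a per-image quantity to a figure of merit that is otherwise defined only for the whole image set, which is what allows comparison with per-image clinical correctness.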
Affiliation(s)
- Dev P Chakraborty
- Department of Radiology, University of Pittsburgh, 4771 Presbyterian South Tower, 200 Lothrop St, Pittsburgh, PA 15213, USA.

13
Samuelson F, Gallas BD, Myers KJ, Petrick N, Pinsky P, Sahiner B, Campbell G, Pennello GA. The importance of ROC data. Acad Radiol 2011; 18:257-8; author reply 259-61. [PMID: 21232688 DOI: 10.1016/j.acra.2010.10.016] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.7] [Received: 10/18/2010] [Revised: 10/18/2010] [Accepted: 10/20/2010] [Indexed: 11/19/2022]

14
Gur D, Bandos AI, Rockette HE. Reply. Acad Radiol 2011. [DOI: 10.1016/j.acra.2010.11.003] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/30/2022]