1
Zhao X, Feng GC, Ao SH, Liu PL. Interrater reliability estimators tested against true interrater reliabilities. BMC Med Res Methodol 2022; 22:232. PMID: 36038846; PMCID: PMC9426226; DOI: 10.1186/s12874-022-01707-5.
Abstract
BACKGROUND Interrater reliability, aka intercoder reliability, is defined as true agreement between raters, aka coders, without chance agreement. It is used across many disciplines, including medical and health research, to measure the quality of ratings, coding, diagnoses, or other observations and judgements. While numerous indices of interrater reliability are available, experts disagree on which are legitimate or more appropriate. Almost all agree that percent agreement (ao), the oldest and simplest index, is also the most flawed, because it fails to estimate and remove chance agreement, which is produced by raters' random rating. The experts, however, disagree on which chance estimators are legitimate or better. They also disagree on which of the three factors (rating category, distribution skew, or task difficulty) an index should rely on to estimate chance agreement, or which factors the known indices in fact rely on. The most popular chance-adjusted indices, according to a functionalist view of mathematical statistics, assume that all raters conduct intentional and maximum random rating, while typical raters conduct involuntary and reluctant random rating. The mismatches between the assumed and actual rater behaviors cause the indices to rely on mistaken factors to estimate chance agreement, leading to the numerous paradoxes, abnormalities, and other misbehaviors of the indices identified by prior studies. METHODS We conducted a 4 × 8 × 3 between-subject controlled experiment with 4 subjects per cell. Each subject was a rating session with 100 pairs of ratings by two raters, totaling 384 rating sessions as the experimental subjects. The experiment tested seven best-known indices of interrater reliability against the observed reliabilities and chance agreements. Impacts of the three factors, i.e., rating category, distribution skew, and task difficulty, on the indices were tested.
RESULTS The most criticized index, percent agreement (ao), proved the most accurate predictor of reliability, reporting directional r2 = .84. It was also the third-best approximator, overestimating observed reliability by 13 percentage points on average. The three most acclaimed and most popular indices, Scott's π, Cohen's κ, and Krippendorff's α, underperformed all other indices, reporting directional r2 = .312 and underestimating reliability by 31.4~31.8 points. The newest index, Gwet's AC1, emerged as the second-best predictor and the most accurate approximator. Bennett et al.'s S ranked behind AC1, and Perreault and Leigh's Ir ranked fourth for both prediction and approximation. The reliance on category and skew, and the failure to rely on difficulty, explain why the six chance-adjusted indices often underperformed ao, which they were created to outperform. The evidence corroborated the notion that the chance-adjusted indices assume intentional and maximum random rating, while the raters instead exhibited involuntary and reluctant random rating. CONCLUSION The authors call for more empirical studies, and especially more controlled experiments, to falsify or qualify this study. If the main findings are replicated and the underlying theories supported, new thinking and new indices may be needed. Index designers may need to refrain from assuming intentional and maximum random rating, and instead assume involuntary and reluctant random rating. Accordingly, the new indices may need to rely on task difficulty, rather than distribution skew or rating category, to estimate chance agreement.
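For reference, the chance-adjustment logic the abstract contrasts can be made concrete for two raters and nominal categories. The sketch below (toy data, not the paper's experimental sessions) computes ao together with Scott's π, Cohen's κ, and Gwet's AC1, which differ only in how they estimate chance agreement:

```python
from collections import Counter

def agreement_indices(r1, r2):
    """Percent agreement (ao), Scott's pi, Cohen's kappa, and Gwet's AC1
    for two raters coding the same items into nominal categories."""
    n = len(r1)
    cats = sorted(set(r1) | set(r2))
    ao = sum(a == b for a, b in zip(r1, r2)) / n          # raw percent agreement
    p1, p2 = Counter(r1), Counter(r2)
    # Each index's estimate of chance agreement:
    pe_pi = sum(((p1[c] + p2[c]) / (2 * n)) ** 2 for c in cats)   # Scott: pooled margins
    pe_k = sum((p1[c] / n) * (p2[c] / n) for c in cats)           # Cohen: separate margins
    pbar = [(p1[c] + p2[c]) / (2 * n) for c in cats]
    pe_ac1 = sum(p * (1 - p) for p in pbar) / (len(cats) - 1)     # Gwet: shrinks with skew
    adj = lambda pe: (ao - pe) / (1 - pe)                 # common chance-removal formula
    return {"ao": ao, "pi": adj(pe_pi), "kappa": adj(pe_k), "AC1": adj(pe_ac1)}

r1 = ["yes", "yes", "no", "yes", "no", "yes", "yes", "no"]
r2 = ["yes", "no",  "no", "yes", "no", "yes", "yes", "yes"]
print(agreement_indices(r1, r2))
```

With equal marginals, as in this toy example, π and κ coincide; AC1's chance term shrinks as the distribution grows more skewed, which is why it tracks ao more closely on skewed data.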
Affiliation(s)
- Xinshu Zhao
- Department of Communication, Faculty of Social Sciences, University of Macau, Taipa, Macao.
- Guangchao Charles Feng
- Department of Communication, Faculty of Social Sciences, University of Macau, Taipa, Macao
- Song Harris Ao
- Department of Communication, Faculty of Social Sciences, University of Macau, Taipa, Macao
- Piper Liping Liu
- Department of Communication, Faculty of Social Sciences, University of Macau, Taipa, Macao
2
Scaringella L, Górska A, Calderon D, Benitez J. Should we teach in hybrid mode or fully online? A theory and empirical investigation on the service–profit chain in MBAs. Information & Management 2022. DOI: 10.1016/j.im.2021.103573.
3
Affiliation(s)
- Matthijs J. Warrens
- Groningen Institute for Educational Research, University of Groningen, Grote Rozenstraat 3, 9712 TG Groningen, The Netherlands
4
Beckler DT, Thumser ZC, Schofield JS, Marasco PD. Reliability in evaluator-based tests: using simulation-constructed models to determine contextually relevant agreement thresholds. BMC Med Res Methodol 2018; 18:141. PMID: 30453897; PMCID: PMC6245899; DOI: 10.1186/s12874-018-0606-7.
Abstract
BACKGROUND Indices of inter-evaluator reliability are used in many fields such as computational linguistics, psychology, and medical science; however, the interpretation of resulting values and determination of appropriate thresholds lack context and are often guided only by arbitrary "rules of thumb" or simply not addressed at all. Our goal for this work was to develop a method for determining the relationship between inter-evaluator agreement and error to facilitate meaningful interpretation of values, thresholds, and reliability. METHODS Three expert human evaluators completed a video analysis task, and averaged their results together to create a reference dataset of 300 time measurements. We simulated unique combinations of systematic error and random error onto the reference dataset to generate 4900 new hypothetical evaluators (each with 300 time measurements). The systematic errors and random errors made by the hypothetical evaluator population were approximated as the mean and variance of a normally-distributed error signal. Calculating the error (using percent error) and inter-evaluator agreement (using Krippendorff's alpha) between each hypothetical evaluator and the reference dataset allowed us to establish a mathematical model and value envelope of the worst possible percent error for any given amount of agreement. RESULTS We used the relationship between inter-evaluator agreement and error to make an informed judgment of an acceptable threshold for Krippendorff's alpha within the context of our specific test. To demonstrate the utility of our modeling approach, we calculated the percent error and Krippendorff's alpha between the reference dataset and a new cohort of trained human evaluators and used our contextually-derived Krippendorff's alpha threshold as a gauge of evaluator quality. 
Although all evaluators had relatively high agreement (> 0.9) compared to the rule of thumb (0.8), our agreement threshold permitted evaluators with low error, while rejecting one evaluator with relatively high error. CONCLUSIONS We found that our approach established threshold values of reliability, within the context of our evaluation criteria, that were far less permissive than the typically accepted "rule of thumb" cutoff for Krippendorff's alpha. This procedure provides a less arbitrary method for determining a reliability threshold and can be tailored to work within the context of any reliability index.
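The simulation idea above (a reference dataset plus a normally distributed error signal, then agreement versus error) can be sketched as follows. The reference values, bias/noise parameters, and percent-error definition here are illustrative assumptions, not the paper's data; the alpha function is Krippendorff's interval-data form for two observers with complete data:

```python
import numpy as np

def krippendorff_alpha_interval(x, y):
    """Krippendorff's alpha, interval metric, two observers, no missing data."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    do = np.mean((x - y) ** 2)                      # observed within-unit disagreement
    v = np.concatenate([x, y])
    N = v.size
    de = (2 * N * np.sum(v ** 2) - 2 * np.sum(v) ** 2) / (N * (N - 1))  # expected
    return 1.0 - do / de

rng = np.random.default_rng(0)
reference = rng.uniform(5.0, 15.0, size=300)        # stand-in for the 300 time measurements

def hypothetical_evaluator(bias, sd):
    """Apply systematic error (bias) and random error (sd) to the reference,
    then report agreement with the reference and mean absolute percent error."""
    est = reference + bias + rng.normal(0.0, sd, size=reference.size)
    alpha = krippendorff_alpha_interval(reference, est)
    pct_error = 100 * np.mean(np.abs(est - reference) / reference)
    return alpha, pct_error

# Sweeping bias and sd over a grid traces out the worst-case error envelope
# for any given alpha, analogous to the paper's 4900-evaluator simulation.
```

Plotting the grid of (alpha, percent error) pairs yields the value envelope from which a contextual alpha threshold can be read off.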
Affiliation(s)
- Dylan T Beckler
- Laboratory for Bionic Integration, Department of Biomedical Engineering, ND20, Cleveland Clinic, 9500 Euclid Avenue, Cleveland, OH, 44195, USA
- Zachary C Thumser
- Laboratory for Bionic Integration, Department of Biomedical Engineering, ND20, Cleveland Clinic, 9500 Euclid Avenue, Cleveland, OH, 44195, USA
- Jonathon S Schofield
- Laboratory for Bionic Integration, Department of Biomedical Engineering, ND20, Cleveland Clinic, 9500 Euclid Avenue, Cleveland, OH, 44195, USA
- Paul D Marasco
- Laboratory for Bionic Integration, Department of Biomedical Engineering, ND20, Cleveland Clinic, 9500 Euclid Avenue, Cleveland, OH, 44195, USA.
5
Sgammato A, Donoghue JR. On the Performance of the Marginal Homogeneity Test to Detect Rater Drift. Applied Psychological Measurement 2018; 42:307-320. PMID: 29881127; PMCID: PMC5978607; DOI: 10.1177/0146621617730390.
Abstract
When constructed response items are administered repeatedly, "trend scoring" can be used to test for rater drift. In trend scoring, raters rescore responses from the previous administration. Two simulation studies evaluated the utility of Stuart's Q measure of marginal homogeneity as a way of evaluating rater drift when monitoring trend scoring. In the first study, data were generated based on trend scoring tables obtained from an operational assessment. The second study tightly controlled table margins to disentangle certain features present in the empirical data. In addition to Q, the paired t test was included as a comparison, because of its widespread use in monitoring trend scoring. Sample size, number of score categories, interrater agreement, and symmetry/asymmetry of the margins were manipulated. For identical margins, both statistics had good Type I error control. For a unidirectional shift in margins, both statistics had good power. As expected, when shifts in the margins were balanced across categories, the t test had little power. Q demonstrated good power for all conditions and identified almost all items identified by the t test. Q shows substantial promise for monitoring of trend scoring.
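Stuart's Q can be computed directly from the square table of original-score by rescore frequencies. A minimal sketch (the table is a toy example, not the paper's operational data; assumes NumPy is available):

```python
import numpy as np

def stuart_maxwell_q(table):
    """Stuart's Q statistic for marginal homogeneity of a square k x k table
    (rows: original scores, columns: trend rescores of the same responses).
    Compare Q with a chi-square critical value on k - 1 degrees of freedom."""
    N = np.asarray(table, float)
    k = N.shape[0]
    d = (N.sum(axis=1) - N.sum(axis=0))[:-1]     # margin differences, last category dropped
    S = -(N + N.T)[:-1, :-1]                     # off-diagonal covariance terms
    np.fill_diagonal(S, (N.sum(axis=1) + N.sum(axis=0) - 2 * np.diag(N))[:-1])
    return float(d @ np.linalg.solve(S, d)), k - 1

# Toy 3-category trend-scoring table:
Q, df = stuart_maxwell_q([[20, 5, 3], [10, 30, 4], [2, 6, 20]])
```

For this toy table Q is about 0.8 on 2 degrees of freedom, well below the .05 critical value of 5.99, so these margins show no drift. Because Q aggregates shifts in all margins jointly, a balanced shift that cancels in the mean still moves Q, which is the advantage over the paired t test noted above.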
6
Projections of Future Land Use in Bangladesh under the Background of Baseline, Ecological Protection and Economic Development. Sustainability 2017. DOI: 10.3390/su9040505.
7
Ganjali M, Moradzadeh N, Baghfalaki T. Bayesian testing of agreement criteria under order constraints. J Korean Stat Soc 2017. DOI: 10.1016/j.jkss.2016.06.004.
8
Mielke PW, Berry KJ, Johnston JE. The Exact Variance of Weighted Kappa with Multiple Raters. Psychol Rep 2016; 101:655-60. DOI: 10.2466/pr0.101.2.655-660.
Abstract
Weighted kappa, described by Cohen in 1968, is widely used in psychological research to measure agreement between two independent raters. Everitt subsequently provided the exact variance of weighted kappa for two raters. In this paper, Everitt's exact variance is extended to three or more raters.
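For two raters, the point estimate that Everitt's variance accompanies can be sketched as follows (the frequencies are a toy example; quadratic and linear disagreement weights are the two common choices):

```python
import numpy as np

def weighted_kappa(table, weights="quadratic"):
    """Cohen's weighted kappa for a square observed-frequency table
    (rows: rater 1 categories, columns: rater 2 categories)."""
    O = np.array(table, float)
    O /= O.sum()                                  # joint proportions
    k = O.shape[0]
    i, j = np.indices((k, k))
    if weights == "linear":
        W = np.abs(i - j) / (k - 1)               # disagreement weights in [0, 1]
    else:
        W = ((i - j) / (k - 1)) ** 2
    E = np.outer(O.sum(axis=1), O.sum(axis=0))    # expected under independence
    return 1 - (W * O).sum() / (W * E).sum()
```

For a 2 x 2 table the weighted and unweighted statistics coincide; the weights matter only when partial disagreement across ordered categories should count less than full disagreement.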
9
Sicoly F. Estimating the Accuracy of Decisions Based on Cutting Scores. Journal of Psychoeducational Assessment 2016. DOI: 10.1177/073428299201000102.
Abstract
Classifications based on test results are used routinely in educational research and practice. Although test validity usually is expressed as a correlation coefficient, this does not indicate the expected accuracy of pass-fail or eligible-not eligible decisions based on test scores. This paper describes expectancy tables for converting a validity or test-retest reliability coefficient, r, into measures of classification accuracy for dichotomous categories. Results for a sample of correlation coefficients and cut-off scores are reported using several indicators of accuracy: sensitivity, efficiency, specificity, hit rate, and kappa. It appears that a validity coefficient above .90 is required to achieve a kappa above .70 and to keep false positive and false negative error rates below 25%. This suggests that many tests and measures considered to have adequate validity (between .60 and .90) often will have limited utility in making diagnostic, placement, or treatment decisions.
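The mapping from a validity coefficient r to decision accuracy can also be approximated by simulation rather than by expectancy tables. A sketch under added assumptions (bivariate-normal test and criterion scores, both dichotomized at the median):

```python
import numpy as np

def classification_accuracy(r, cut=0.0, n=200_000, seed=1):
    """Monte Carlo version of the expectancy-table idea: draw (test, criterion)
    pairs from a bivariate normal with correlation r, dichotomize both at
    z = cut, and summarize pass-fail classification accuracy."""
    rng = np.random.default_rng(seed)
    z1 = rng.standard_normal(n)
    z2 = r * z1 + np.sqrt(1 - r * r) * rng.standard_normal(n)  # correlated criterion
    test, crit = z1 >= cut, z2 >= cut
    a = np.mean(test & crit)                     # true positives
    d = np.mean(~test & ~crit)                   # true negatives
    ao = a + d                                   # hit rate
    pe = np.mean(test) * np.mean(crit) + np.mean(~test) * np.mean(~crit)
    return {"hit_rate": ao,
            "sensitivity": a / np.mean(crit),
            "specificity": d / np.mean(~crit),
            "kappa": (ao - pe) / (1 - pe)}
```

With this setup, r = .60 yields a kappa near .41, and kappa does not clear .70 until r approaches the low .90s, consistent with the abstract's conclusion.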
Affiliation(s)
- Fiore Sicoly
- East York Board of Education, 840 Coxwell Avenue, Toronto, Ontario, Canada M4C 2V3
10
McGrath RE, Pogge DL, Stokes JM, Cragnolino A, Zaccario M, Hayman J, Piacentini T, Wayland-Smith D. Field Reliability of Comprehensive System Scoring in an Adolescent Inpatient Sample. Assessment 2016; 12:199-209. PMID: 15914721; DOI: 10.1177/1073191104273384.
Abstract
The extent to which the Comprehensive System for the Rorschach is reliably scored has been a topic of some controversy. Although several studies have concluded that it can be scored reliably in research settings, little is known about its reliability in field settings. This study evaluated the reliability of both response-level codes and protocol-level scores among 84 adolescent psychiatric inpatients in a clinical setting. Rorschachs were originally administered and scored for clinical purposes. Among response codes, 87% demonstrated acceptable reliability (> .60), and most coefficients exceeded .80. Results were similar for protocol-level scores, with only one score demonstrating less than adequate reliability. The findings are consistent with previous evidence indicating that reliable scoring is possible even in field settings.
Affiliation(s)
- Robert E McGrath
- School of Psychology, Fairleigh Dickinson University, Teaneck, NJ 07666, USA.
11
Friedrich J, Fetherstonhaugh D, Casey S, Gallagher D. Argument Integration and Attitude Change: Suppression Effects in the Integration of One-Sided Arguments that Vary in Persuasiveness. Personality and Social Psychology Bulletin 2016. DOI: 10.1177/0146167296222007.
Abstract
Petty and Cacioppo's elaboration likelihood model of persuasion and Chaiken, Liberman, and Eagly's heuristic-systematic model suggest that for highly involved message recipients, adding weaker arguments to strong arguments could suppress or dilute the overall persuasiveness of a message. The few previous studies addressing this prediction, however, have provided conflicting evidence. In the present study, involvement level, number of weak arguments, and number of strong arguments were varied factorially in a message advocating the institution of senior comprehensive exams. Results provided clear support for the predicted weak-argument suppression of attitude change. Analyses of thought-listing data supported the notion that suppression results from the integration of favorable and unfavorable cognitive responses to the communication. Further research questions regarding the processes by which mixed-quality messages exert their persuasive impact are discussed.
12
Amendola LM, Jarvik GP, Leo MC, McLaughlin HM, Akkari Y, Amaral MD, Berg JS, Biswas S, Bowling KM, Conlin LK, Cooper GM, Dorschner MO, Dulik MC, Ghazani AA, Ghosh R, Green RC, Hart R, Horton C, Johnston JJ, Lebo MS, Milosavljevic A, Ou J, Pak CM, Patel RY, Punj S, Richards CS, Salama J, Strande NT, Yang Y, Plon SE, Biesecker LG, Rehm HL. Performance of ACMG-AMP Variant-Interpretation Guidelines among Nine Laboratories in the Clinical Sequencing Exploratory Research Consortium. Am J Hum Genet 2016; 98:1067-1076. PMID: 27181684; DOI: 10.1016/j.ajhg.2016.03.024.
Abstract
Evaluating the pathogenicity of a variant is challenging given the plethora of types of genetic evidence that laboratories consider. Deciding how to weigh each type of evidence is difficult, and standards have been needed. In 2015, the American College of Medical Genetics and Genomics (ACMG) and the Association for Molecular Pathology (AMP) published guidelines for the assessment of variants in genes associated with Mendelian diseases. Nine molecular diagnostic laboratories involved in the Clinical Sequencing Exploratory Research (CSER) consortium piloted these guidelines on 99 variants spanning all categories (pathogenic, likely pathogenic, uncertain significance, likely benign, and benign). Nine variants were distributed to all laboratories, and the remaining 90 were evaluated by three laboratories. The laboratories classified each variant by using both the laboratory's own method and the ACMG-AMP criteria. The agreement between the two methods used within laboratories was high (K-alpha = 0.91) with 79% concordance. However, there was only 34% concordance for either classification system across laboratories. After consensus discussions and detailed review of the ACMG-AMP criteria, concordance increased to 71%. Causes of initial discordance in ACMG-AMP classifications were identified, and recommendations on clarification and increased specification of the ACMG-AMP criteria were made. In summary, although an initial pilot of the ACMG-AMP guidelines did not lead to increased concordance in variant interpretation, comparing variant interpretations to identify differences and having a common framework to facilitate resolution of those differences were beneficial for improving agreement, allowing iterative movement toward increased reporting consistency for variants in genes associated with monogenic disease.
Affiliation(s)
- Laura M Amendola
- Division of Medical Genetics, Department of Medicine, University of Washington, Seattle, WA 98195, USA
- Gail P Jarvik
- Division of Medical Genetics, Department of Medicine, University of Washington, Seattle, WA 98195, USA.
- Michael C Leo
- Center for Health Research, Kaiser Permanente, Portland, OR 97227, USA
- Heather M McLaughlin
- Laboratory for Molecular Medicine, Partners HealthCare Personalized Medicine, Cambridge, MA 02139, USA
- Yassmine Akkari
- Department of Molecular and Medical Genetics, Oregon Health and Science University, Portland, OR 97239, USA
- Jonathan S Berg
- Department of Genetics, University of North Carolina, Chapel Hill, NC 27599, USA
- Sawona Biswas
- Division of Human Genetics, Department of Pediatrics, Children's Hospital of Philadelphia, Philadelphia, PA 19104, USA
- Kevin M Bowling
- HudsonAlpha Institute for Biotechnology, Huntsville, AL 35806, USA
- Laura K Conlin
- Division of Human Genetics, Department of Pediatrics, Children's Hospital of Philadelphia, Philadelphia, PA 19104, USA
- Greg M Cooper
- HudsonAlpha Institute for Biotechnology, Huntsville, AL 35806, USA
- Michael O Dorschner
- Center for Precision Diagnostics, Department of Pathology, University of Washington, Seattle, WA 98195, USA
- Matthew C Dulik
- Department of Pathology and Laboratory Medicine, Children's Hospital of Philadelphia, Philadelphia, PA 19104, USA
- Arezou A Ghazani
- Department of Medical Oncology, Dana-Farber Cancer Institute, Boston, MA 02215, USA
- Robert C Green
- Laboratory for Molecular Medicine, Partners HealthCare Personalized Medicine, Cambridge, MA 02139, USA; Brigham and Women's Hospital and Harvard Medical School, Cambridge, MA 02115, USA; Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- Ragan Hart
- Division of Medical Genetics, Department of Medicine, University of Washington, Seattle, WA 98195, USA
- Carrie Horton
- Clinical Diagnostics, Ambry Genetics, Aliso Viejo, CA 92656, USA
- Jennifer J Johnston
- Intramural Research Program, National Human Genome Research Institute, NIH, Bethesda, MD 20892, USA
- Matthew S Lebo
- Laboratory for Molecular Medicine, Partners HealthCare Personalized Medicine, Cambridge, MA 02139, USA; Brigham and Women's Hospital and Harvard Medical School, Cambridge, MA 02115, USA
- Jeffrey Ou
- Division of Medical Genetics, Department of Medicine, University of Washington, Seattle, WA 98195, USA
- Christine M Pak
- Department of Molecular and Medical Genetics, Oregon Health and Science University, Portland, OR 97239, USA
- Sumit Punj
- Department of Molecular and Medical Genetics, Oregon Health and Science University, Portland, OR 97239, USA
- Carolyn Sue Richards
- Department of Molecular and Medical Genetics, Oregon Health and Science University, Portland, OR 97239, USA
- Joseph Salama
- Division of Medical Genetics, Department of Medicine, University of Washington, Seattle, WA 98195, USA
- Natasha T Strande
- Department of Genetics, University of North Carolina, Chapel Hill, NC 27599, USA
- Yaping Yang
- Baylor College of Medicine, Houston, TX 77030, USA
- Leslie G Biesecker
- Intramural Research Program, National Human Genome Research Institute, NIH, Bethesda, MD 20892, USA
- Heidi L Rehm
- Laboratory for Molecular Medicine, Partners HealthCare Personalized Medicine, Cambridge, MA 02139, USA; Brigham and Women's Hospital and Harvard Medical School, Cambridge, MA 02115, USA; Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA.
13
Kirilenko AP, Stepchenkova S. Inter-Coder Agreement in One-to-Many Classification: Fuzzy Kappa. PLoS One 2016; 11:e0149787. PMID: 26933956; PMCID: PMC4775035; DOI: 10.1371/journal.pone.0149787.
Abstract
Content analysis involves classification of textual, visual, or audio data. Inter-coder agreement is estimated by having two or more coders classify the same data units and then comparing their results. The existing methods of agreement estimation, e.g., Cohen's kappa, require that coders place each unit of content into one and only one category (one-to-one coding) from the pre-established set of categories. However, in certain data domains (e.g., maps, photographs, databases of texts and images), this requirement seems overly restrictive. The restriction could be lifted, provided that there is a measure of inter-coder agreement for the one-to-many protocol. Building on existing approaches to one-to-many coding in geography and biomedicine, such a measure, fuzzy kappa, an extension of Cohen's kappa, is proposed. It is argued that the measure is especially compatible with data from domains where the holistic reasoning of human coders is utilized to describe the data and assess the meaning of communication.
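The one-to-many setting can be made concrete with a small illustration. The sketch below is not the fuzzy kappa defined in the paper; it is a generic chance-corrected overlap measure (mean per-unit Jaccard agreement, with chance estimated by permuting one coder's assignments across units), included only to show the shape of the problem:

```python
import random

def set_agreement(units_a, units_b, n_perm=1000, seed=7):
    """Chance-corrected agreement for one-to-many coding: each coder assigns a
    non-empty SET of categories per unit. Observed agreement is the mean Jaccard
    overlap per unit; chance agreement is a permutation estimate.
    (Illustrative stand-in, not Kirilenko & Stepchenkova's fuzzy kappa.)"""
    jac = lambda a, b: len(a & b) / len(a | b)
    n = len(units_a)
    ao = sum(jac(a, b) for a, b in zip(units_a, units_b)) / n
    rng = random.Random(seed)
    shuffled, pe = list(units_b), 0.0
    for _ in range(n_perm):
        rng.shuffle(shuffled)                 # break unit alignment to estimate chance
        pe += sum(jac(a, b) for a, b in zip(units_a, shuffled)) / n
    pe /= n_perm
    return (ao - pe) / (1 - pe)               # same chance-removal form as kappa
```

Identical codings score 1, and agreement no better than the permuted baseline scores at or below 0, mirroring how kappa-family indices behave in the one-to-one case.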
Affiliation(s)
- Andrei P Kirilenko
- The Department of Tourism, Recreation and Sport Management, University of Florida, P.O. Box 118208, Gainesville, FL, 32611-8208, United States of America
- Svetlana Stepchenkova
- The Department of Tourism, Recreation and Sport Management, University of Florida, P.O. Box 118208, Gainesville, FL, 32611-8208, United States of America
14
Mooney SJ, DiMaggio CJ, Lovasi GS, Neckerman KM, Bader MDM, Teitler JO, Sheehan DM, Jack DW, Rundle AG. Use of Google Street View to Assess Environmental Contributions to Pedestrian Injury. Am J Public Health 2016; 106:462-9. PMID: 26794155; DOI: 10.2105/ajph.2015.302978.
Abstract
OBJECTIVES To demonstrate an information technology-based approach to assess characteristics of streets and intersections associated with injuries that is less costly and time-consuming than location-based studies of pedestrian injury. METHODS We used imagery captured by Google Street View from 2007 to 2011 to assess 9 characteristics of 532 intersections within New York City. We controlled for estimated pedestrian count and estimated the relation between intersections' characteristics and frequency of injurious collisions. RESULTS The count of pedestrian injuries at intersections was associated with the presence of marked crosswalks (80% increase; 95% confidence interval [CI] = 2%, 218%), pedestrian signals (156% increase; 95% CI = 69%, 259%), nearby billboards (42% increase; 95% CI = 7%, 90%), and bus stops (120% increase; 95% CI = 51%, 220%). Injury incidence per pedestrian was lower at intersections with higher estimated pedestrian volumes. CONCLUSIONS Consistent with in-person study observations, the information-technology approach found traffic islands, visual advertising, bus stops, and crosswalk infrastructures to be associated with elevated counts of pedestrian injury in New York City. Virtual site visits for pedestrian injury control studies are a viable and informative methodology.
Affiliation(s)
- Stephen J. Mooney, Charles J. DiMaggio, Gina S. Lovasi, Daniel M. Sheehan, and Andrew G. Rundle are with Department of Epidemiology, Mailman School of Public Health, Columbia University, New York, NY. Kathryn M. Neckerman is with Columbia Population Research Center, Columbia University. Michael D. M. Bader is with Department of Sociology, American University, Washington, DC. Julien O. Teitler is with School of Social Work, Columbia University. Darby W. Jack is with Department of Environmental Health Sciences, Mailman School of Public Health.
15
Moradzadeh N, Ganjali M, Baghfalaki T. Weighted kappa as a function of unweighted kappas. Communications in Statistics - Simulation and Computation 2015. DOI: 10.1080/03610918.2015.1105975.
16
Nussbeck FW, Eid M. Multimethod latent class analysis. Front Psychol 2015; 6:1332. PMID: 26441714; PMCID: PMC4584970; DOI: 10.3389/fpsyg.2015.01332.
Abstract
Correct and, hence, valid classifications of individuals are of high importance in the social sciences as these classifications are the basis for diagnoses and/or the assignment to a treatment. The via regia to inspect the validity of psychological ratings is the multitrait-multimethod (MTMM) approach. First, a latent variable model for the analysis of rater agreement (latent rater agreement model) will be presented that allows for the analysis of convergent validity between different measurement approaches (e.g., raters). Models of rater agreement are transferred to the level of latent variables. Second, the latent rater agreement model will be extended to a more informative MTMM latent class model. This model allows for estimating (i) the convergence of ratings, (ii) method biases in terms of differential latent distributions of raters and differential associations of categorizations within raters (specific rater bias), and (iii) the distinguishability of categories indicating if categories are satisfyingly distinct from each other. Finally, an empirical application is presented to exemplify the interpretation of the MTMM latent class model.
Affiliation(s)
- Michael Eid
- Department of Education and Psychology, Freie Universitaet Berlin, Berlin, Germany

17
Harmon-Walker G, Kaiser DH. The Bird's Nest Drawing: A study of construct validity and interrater reliability. Arts in Psychotherapy 2015. [DOI: 10.1016/j.aip.2014.12.008]
18
A sequential examination of parent-child interactions at anesthetic induction. J Clin Psychol Med Settings 2014; 21:374-85. [PMID: 25352168] [DOI: 10.1007/s10880-014-9413-4]
Abstract
Parental presence is often employed to alleviate distress in children within the context of surgery under general anesthesia. The critical component of this intervention may not be the presence of the parent per se, but more importantly the behaviors in which the parent and child engage when the parent is present. The purpose of the current study was to examine the sequential and reciprocal relationships between parental behaviors and child distress during induction of general anesthesia. Participants were 32 children (3-6 years) receiving dental surgery as a day surgery procedure, and their parents. A modified Child Adult Medical Procedures Interaction Scale-Revised was used to code parent and child behaviors. Initial child distress led to increased parental provision of reassurance and decreased provision of physical comfort. Our findings may inform the development of preoperative preparation programs whereby parents can be appropriately educated about what behaviors will be helpful/unhelpful for their child during induction of general anesthesia.
19
Chang CH, Yang JT, Lee MH. A Novel “Maximizing Kappa” Approach for Assessing the Ability of a Diagnostic Marker and Its Optimal Cutoff Value. J Biopharm Stat 2014; 25:1005-19. [DOI: 10.1080/10543406.2014.920347]
20

21
Karelitz TM, Budescu DV. The Effect of the Raters' Marginal Distributions on Their Matched Agreement: A Rescaling Framework for Interpreting Kappa. Multivariate Behav Res 2013; 48:923-952. [PMID: 26745599] [DOI: 10.1080/00273171.2013.830064]
Abstract
Cohen's κ measures the improvement in classification above chance level and it is the most popular measure of interjudge agreement. Yet, there is considerable confusion about its interpretation. Specifically, researchers often ignore the fact that the observed level of matched agreement is bounded from above and below and the bounds are a function of the particular marginal distributions of the table. We propose that these bounds should be used to rescale the components of κ (observed and expected agreement). Rescaling κ in this manner results in κ', a measure that was originally proposed by Cohen (1960) and was largely ignored in both research and practice. This measure provides a common scale for agreement measures of tables with different marginal distributions. It reaches the maximal value of 1 when the judges show the highest level of agreement possible, given their marginal disagreements. We conclude that κ' should be used to measure the level of matched agreement contingent on a particular set of marginal distributions. The article provides a framework and a set of guidelines that facilitate comparisons between various types of agreement tables. We illustrate our points with simulations and real data from two studies-one involving judges' ratings of baseball players and one involving ratings of essays in high-stakes tests.
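The rescaling described above is easy to make concrete. A minimal sketch (not the authors' code; the agreement table is hypothetical) computes Cohen's κ and the rescaled κ/κ_max, where the maximum attainable observed agreement given the marginals is Σ min(row_i, col_i):

```python
def kappa_and_rescaled(table):
    """Cohen's kappa and the rescaled kappa/kappa_max for a square
    agreement table (rows: rater 1, columns: rater 2)."""
    k = len(table)
    n = float(sum(sum(row) for row in table))
    p = [[cell / n for cell in row] for row in table]
    po = sum(p[i][i] for i in range(k))                        # observed agreement
    rows = [sum(p[i][j] for j in range(k)) for i in range(k)]  # rater-1 marginals
    cols = [sum(p[i][j] for i in range(k)) for j in range(k)]  # rater-2 marginals
    pe = sum(rows[i] * cols[i] for i in range(k))              # chance agreement
    p_max = sum(min(rows[i], cols[i]) for i in range(k))       # upper bound on po
    kappa = (po - pe) / (1 - pe)
    kappa_max = (p_max - pe) / (1 - pe)
    return kappa, kappa / kappa_max

# Hypothetical 2x2 table: two raters agree on 85 of 100 items
kappa, kappa_prime = kappa_and_rescaled([[40, 10], [5, 45]])
print(round(kappa, 3), round(kappa_prime, 3))  # prints "0.7 0.778"
```

Because κ_max ≤ 1 whenever the marginals disagree, κ/κ_max reaches 1 exactly when the raters agree as much as their marginal distributions permit.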
Affiliation(s)
- Tzur M Karelitz
- National Institute for Testing and Evaluation, Jerusalem, Israel

22
Parpia S, Koval JJ, Donner A. Evaluation of confidence intervals for the kappa statistic when the assumption of marginal homogeneity is violated. Comput Stat 2013. [DOI: 10.1007/s00180-013-0424-7]
23

24
Rigor in qualitative supply chain management research. International Journal of Physical Distribution & Logistics Management 2012. [DOI: 10.1108/09600031211269767]
25
Rotondi MA, Donner A. A confidence interval approach to sample size estimation for interobserver agreement studies with multiple raters and outcomes. J Clin Epidemiol 2012; 65:778-84. [DOI: 10.1016/j.jclinepi.2011.10.019]

26
Cohen’s quadratically weighted kappa is higher than linearly weighted kappa for tridiagonal agreement tables. Stat Methodol 2012. [DOI: 10.1016/j.stamet.2011.08.006]

27
Warrens MJ. A family of multi-rater kappas that can always be increased and decreased by combining categories. Stat Methodol 2012. [DOI: 10.1016/j.stamet.2011.08.008]
28

29
Fakhri A, Pakpour AH, Burri A, Morshedi H, Zeidi IM. The Female Sexual Function Index: Translation and Validation of an Iranian Version. J Sex Med 2012; 9:514-23. [DOI: 10.1111/j.1743-6109.2011.02553.x]
30

31

32

33
Jaudi S, Du Montcel ST, Fries N, Nizard J, Desfontaines VH, Dommergues M. Online evaluation of fetal second-trimester four-chamber view images: a comparison of six evaluation methods. Ultrasound Obstet Gynecol 2011; 38:185-190. [PMID: 21308829] [DOI: 10.1002/uog.8941]
Abstract
OBJECTIVE To compare six online evaluation methods for auditing routine second-trimester four-chamber view still images. METHODS We evaluated three different scoring grids (subjective, five-item score and seven-item score), which were applied with or without access to online help, resulting in a total of six evaluation methods. For the subjective scoring grid, images were rated as excellent, good, fair, poor or very poor. For the five-item score, 1 point was allocated for visualization (vs non-visualization or non-evaluable) of each of: heart crux, atria, ventricles, apex and aorta, yielding a score of 0-5. For the seven-item score, 1 point was allocated for clear (vs unclear) visualization of each of: moderator band at the apex, interventricular septum, atrioventricular valves, non-linear insertion of atrioventricular valves (normal offset), septum primum, aorta and pulmonary vein. Each evaluation method was used via the Internet by three randomly selected reviewers, who evaluated the same set of 80 images. Reviewers were experienced in fetal ultrasound, but were not involved in the design of the study. Interrater agreement was the main outcome. RESULTS The five-item scoring grid with online help achieved the best interrater agreement (interrater intraclass correlation coefficient = 0.7). CONCLUSIONS Evaluation of the second-trimester sonographic four-chamber view is apparently best achieved with a simple five-item scoring grid.
Affiliation(s)
- S Jaudi
- Service de Gynécologie Obstétrique, Groupe Hospitalier Pitié-Salpêtrière, APHP, Paris, France

34
Kottner J, Streiner DL. The difference between reliability and agreement. J Clin Epidemiol 2011; 64:701-2; author reply 702. [PMID: 21411278] [DOI: 10.1016/j.jclinepi.2010.12.001]
35
Warrens MJ. Weighted kappa is higher than Cohen’s kappa for tridiagonal agreement tables. Stat Methodol 2011. [DOI: 10.1016/j.stamet.2010.09.004]

36
Lee J, Imanaka Y, Sekimoto M, Nishikawa H, Ikai H, Motohashi T. Validation of a novel method to identify healthcare-associated infections. J Hosp Infect 2011; 77:316-20. [PMID: 21277647] [DOI: 10.1016/j.jhin.2010.11.013]
Abstract
Despite its potential for use in large-scale analyses, previous attempts to utilise administrative data to identify healthcare-associated infections (HAI) have been shown to be unsuccessful. In this study, we validate the accuracy of a novel method of HAI identification based on antibiotic utilisation patterns derived from administrative data. We contemporaneously and independently identified HAIs using both chart review analysis and our method from four Japanese hospitals (N=584). The accuracy of our method was quantified using sensitivity, specificity, positive predictive value (PPV) and negative predictive value (NPV) relative to chart review analysis. We also analysed the inter-rater agreement between both identification methods using Cohen's kappa coefficient. Our method showed a sensitivity of 0.93 (95% CI: 0.87-0.96), specificity of 0.91 (0.89-0.94), PPV of 0.75 (0.68-0.81) and NPV of 0.98 (0.96-0.99). A kappa coefficient of 0.78 indicated a relatively high level of agreement between the two methods. Our results show that our method has sufficient validity for identification of HAIs in large groups of patients, though the relatively lower PPV may imply limited utilisation in the pinpointing of individual infections. Our method may have applications in large-scale HAI identification, risk-adjusted multicentre studies involving cost of illness, or even as the starting point of future cost-effectiveness analyses of HAI control measures.
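Accuracy measures of the kind reported above can all be reproduced from a single 2×2 validation table. A minimal sketch (the counts are hypothetical, not the study's data):

```python
def validation_metrics(tp, fp, fn, tn):
    """Sensitivity, specificity, PPV, NPV, and Cohen's kappa for a 2x2
    table comparing a test method against a reference standard."""
    n = tp + fp + fn + tn
    sens = tp / (tp + fn)                 # true positives / all diseased
    spec = tn / (tn + fp)                 # true negatives / all healthy
    ppv = tp / (tp + fp)                  # positive predictive value
    npv = tn / (tn + fn)                  # negative predictive value
    po = (tp + tn) / n                    # observed agreement
    # chance agreement from the marginals of the 2x2 table
    pe = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
    kappa = (po - pe) / (1 - pe)
    return sens, spec, ppv, npv, kappa

# Hypothetical counts: 90 TP, 10 FP, 10 FN, 90 TN
print(validation_metrics(90, 10, 10, 90))
```

Note that sensitivity and specificity ignore the marginals, while kappa penalizes the agreement expected by chance, which is why the two kinds of measure can rank methods differently.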
Affiliation(s)
- J Lee
- Department of Healthcare Economics and Quality Management, School of Public Health, Graduate School of Medicine, Kyoto University, Japan

37

38
Viehweger E, Pfund LZ, Hélix M, Rohon MA, Jacquemier M, Scavarda D, Jouve JL, Bollini G, Loundou A, Simeoni MC. Influence of clinical and gait analysis experience on reliability of observational gait analysis (Edinburgh Gait Score Reliability). Ann Phys Rehabil Med 2010; 53:535-46. [DOI: 10.1016/j.rehab.2010.09.002]
39
Warrens MJ. Cohen’s kappa can always be increased and decreased by combining categories. Stat Methodol 2010. [DOI: 10.1016/j.stamet.2010.05.003]

40
Cernicchiaro N, Pearl DL, McEwen SA, LeJeune JT. Assessment of diagnostic tools for identifying cattle shedding and super-shedding Escherichia coli O157:H7 in a longitudinal study of naturally infected feedlot steers in Ohio. Foodborne Pathog Dis 2010; 8:239-48. [PMID: 21034264] [DOI: 10.1089/fpd.2010.0666]
Abstract
The objectives of this study were to compare the performance of different diagnostic protocols (rectoanal mucosal swabs and immunomagnetic separation [RAMS-IMS], fecal samples and IMS [fecal-IMS], and direct plating) to determine the prevalence of Escherichia coli O157:H7 and to evaluate the pattern of E. coli O157:H7 shedding and super-shedding (defined as having a direct plating count equal to or >10(4) colony forming units of E. coli O157:H7 per gram of feces) in a longitudinal study of naturally infected feedlot steers. RAMS and fecal grab samples were obtained at 14-day intervals from 168 Angus-cross beef steers over a period of 22 weeks. Fecal samples were assessed by direct plating and IMS, whereas RAMS were tested only by enrichment followed by IMS to recover E. coli O157:H7. The period prevalence for shedding was high (62%) among feedlot steers and super-shedding was higher (23%) than anticipated. Although direct plating was the least sensitive method to detect E. coli O157:H7-positive samples, over 20% of high bacterial load samples were not detected by RAMS-IMS and/or fecal-IMS. The sensitivity of RAMS-IMS, fecal-IMS, and direct plating protocols was estimated using simple and multilevel mixed-effects logistic regression models, in which the dependent variable was the dichotomous results of each test and gold standard (i.e., parallel interpretation of the three protocols)-positive individuals were included as an independent variable along with other factors such as dietary supplements, time of sampling, and being exposed to a super-shedding pen-mate. The associations between these factors and the sensitivity of the diagnostic protocols were not statistically significant. In conclusion, differences in the reported impact of diet and probiotics on the shedding of E. coli O157:H7 in previous studies using RAMS-IMS or fecal-IMS were unlikely due to their impact on test performance.
Affiliation(s)
- Natalia Cernicchiaro
- Department of Population Medicine, Ontario Veterinary College, University of Guelph, Guelph, Canada

41

42
Peirce's i and Cohen's κ for 2×2 Measures of Rater Reliability. Journal of Probability and Statistics 2010. [DOI: 10.1155/2010/480364]
Abstract
This study examined a historical mixture model approach to the evaluation of ratings made in “gold standard” and two-rater 2×2 contingency tables. Peirce's i and the derived i average were discussed in relation to a widely used index of reliability in the behavioral sciences, Cohen's κ. Sample size, population base rate of occurrence, the true “science of the method”, and guessing rates were manipulated across simulations. In “gold standard” situations, Peirce's i tended to recover the true reliability of ratings as well as or better than κ. In two-rater situations, i_ave tended to recover the true reliability as well as or better than κ in most situations. The empirical utility and potential theoretical benefits of mixture model methods in estimating reliability are discussed, as are the associations between the i statistics and other modern mixture model approaches.
43
Mielke PW, Berry KJ, Johnston JE. Resampling Probability Values for Weighted Kappa with Multiple Raters. Psychol Rep 2008; 102:606-13. [DOI: 10.2466/pr0.102.2.606-613]
Abstract
A new procedure to compute weighted kappa with multiple raters is described. A resampling procedure to compute approximate probability values for weighted kappa with multiple raters is presented. Applications of weighted kappa are illustrated with an example analysis of classifications by three independent raters.
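The resampling idea can be illustrated with a simplified two-rater sketch (linear weights; hypothetical data — this is not the multi-rater procedure of the paper): compute the observed weighted kappa, then estimate how often a random re-pairing of one rater's scores does at least as well.

```python
import random

def weighted_kappa(r1, r2, k):
    """Linearly weighted kappa for two raters on an ordinal k-category scale."""
    n = len(r1)
    # linear weights: full credit on the diagonal, partial credit nearby
    w = [[1 - abs(i - j) / (k - 1) for j in range(k)] for i in range(k)]
    po = sum(w[a][b] for a, b in zip(r1, r2)) / n       # weighted agreement
    m1 = [r1.count(c) / n for c in range(k)]            # rater-1 marginals
    m2 = [r2.count(c) / n for c in range(k)]            # rater-2 marginals
    pe = sum(w[i][j] * m1[i] * m2[j] for i in range(k) for j in range(k))
    return (po - pe) / (1 - pe)

def resampled_p_value(r1, r2, k, n_perm=999, seed=0):
    """Approximate P(kappa_w >= observed) under random re-pairing of ratings."""
    rng = random.Random(seed)
    observed = weighted_kappa(r1, r2, k)
    shuffled, hits = list(r2), 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if weighted_kappa(r1, shuffled, k) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```

With perfect agreement, `weighted_kappa` returns 1.0 and the resampled p-value is close to 1/(n_perm + 1), since virtually no random re-pairing matches the observed statistic.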
Affiliation(s)
- Janis E. Johnston
- AAAS Science and Technology Policy Fellow at U.S. EPA Homeland Security Research Center

44
Grayson DA. Latent trait models for validity and reliability with 2×2 tables. Australian Journal of Psychology 2007. [DOI: 10.1080/00049539808258797]
45

46
Kujan O, Oliver RJ, Khattab A, Roberts SA, Thakker N, Sloan P. Evaluation of a new binary system of grading oral epithelial dysplasia for prediction of malignant transformation. Oral Oncol 2006; 42:987-93. [PMID: 16731030] [DOI: 10.1016/j.oraloncology.2005.12.014]
Abstract
The aim of this paper is to assess the reproducibility of a novel binary grading system (high/low risk) of oral epithelial dysplasia and to compare it with the WHO classification 2005. The accuracy of the new system for predicting malignant transformation was also assessed. Ninety-six consecutive oral epithelial dysplasia biopsies with known clinical outcomes were retrieved from the Oral Pathology archives. A pilot study was conducted on 28 cases to determine the process of classification. Four observers then reviewed the same set of H&E stained slides of 68 oral dysplastic lesions using the two grading systems, blinded to the clinical outcomes. The overall inter-observer unweighted and weighted kappa agreements for the WHO grading system were Ks = 0.22 (95% CI: 0.11-0.35) and Kw = 0.63 (95% CI: 0.42-0.78), respectively, versus K = 0.50 (95% CI: 0.35-0.67) for the new binary system. Interestingly, all pathologists showed satisfactory agreement on the distinction of mild dysplasia from severe dysplasia and from carcinoma in situ using the new WHO classification. However, assessment of moderate dysplasia remains problematic. The sensitivity and specificity of the new binary grading system for predicting malignant transformation in oral epithelial dysplasia were 85% and 80%, respectively, and the accuracy was 82%. The new binary grading system complemented the WHO classification 2005 and may have merit in helping clinicians to make critical clinical decisions, particularly for cases of moderate dysplasia. Histological grading of dysplasia using established criteria is a reproducible prognosticator in oral epithelial dysplasia. Furthermore, the present study showed that consensus scoring by a panel on the degree of dysplasia, the assessment of risk, or the presence of each morphological characteristic should be encouraged.
Affiliation(s)
- Omar Kujan
- School of Dentistry, The University of Manchester, Manchester M15 6FH, United Kingdom

47
Malek IA, Machani B, Mevcha AM, Hyder NH. Inter-observer reliability and intra-observer reproducibility of the Weber classification of ankle fractures. J Bone Joint Surg Br 2006; 88:1204-6. [PMID: 16943473] [DOI: 10.1302/0301-620x.88b9.17954]
Abstract
Our aim was to assess the reproducibility and the reliability of the Weber classification system for fractures of the ankle based on anteroposterior and lateral radiographs. Five observers with varying clinical experience reviewed 50 sets of blinded radiographs. The same observers reviewed the same radiographs again after an interval of four weeks. Inter- and intra-observer agreement was assessed based on the proportion of agreement and the values of the kappa coefficient. For inter-observer agreement, the mean kappa value was 0.61 (0.59 to 0.63) and the proportion of agreement was 78% (76% to 79%) and for intra-observer agreement the mean kappa value was 0.74 (0.39 to 0.86) with an 85% (60% to 93%) observed agreement. These results show that the Weber classification of fractures of the ankle based on two radiological views has substantial inter-observer reliability and intra-observer reproducibility.
Affiliation(s)
- I A Malek
- Leighton Hospital, Middlewich Road, Crewe, CW1 4QJ, UK

48
White RE, Thornhill S, Hampson E. Entrepreneurs and evolutionary biology: The relationship between testosterone and new venture creation. Organizational Behavior and Human Decision Processes 2006. [DOI: 10.1016/j.obhdp.2005.11.001]
49

50
Abstract
For comparing the validity of rating methods, the adjusted kappa (S coefficient) and Yule's Y index are better than Cohen's kappa which is affected by marginal probabilities. We consider a validity study in which a subject is assessed as exposed or not-exposed by two competing rating methods and the gold standard. We are interested in one of the methods, which is closer in agreement with the gold standard. We present statistical methods taking correlations into account for comparing the validity of the rating methods using S coefficient and Y index. We show how the S coefficient and Yule's Y index are related to sensitivity and specificity. In comparing the two rating methods, the preference is clear when the inference is the same for both S and Y. If the inference using S differs from that using Y, then it is not obvious how to decide a preference. This may occur when one rating method is better than the other in sensitivity but not in specificity. Numerical examples for comparing asbestos-exposure assessment methods are illustrated.
Affiliation(s)
- Jun-mo Nam
- Biostatistics Branch, Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, Department of Health & Human Services, Rockville, MD 20892-7240, USA.