1
Schauber SK, Olsen AO, Werner EL, Magelssen M. Inconsistencies in rater-based assessments mainly affect borderline candidates: but using simple heuristics might improve pass-fail decisions. Adv Health Sci Educ Theory Pract 2024;29:1749-1767. PMID: 38649529; PMCID: PMC11549209; DOI: 10.1007/s10459-024-10328-0.
Abstract
INTRODUCTION Research in various areas indicates that expert judgment can be highly inconsistent, yet expert judgment is indispensable in many contexts. In medical education, experts often function as examiners in rater-based assessments, where disagreement between examiners can have far-reaching consequences. The literature suggests that inconsistencies in ratings depend on the level of performance a candidate displays, but this possibility has not previously been addressed deliberately and with appropriate statistical methods. Adopting the theoretical lens of ecological rationality, we evaluate whether easily implementable strategies can enhance decision making in real-world assessment contexts. METHODS We address two objectives. First, we investigate how rater consistency depends on performance level: we recorded videos of mock exams, had examiners (N=10) evaluate four students' performances, and compared inconsistencies in performance ratings between examiner pairs using a bootstrapping procedure. Second, we provide an approach that aids decision making by implementing simple heuristics. RESULTS We found that discrepancies were largely a function of the level of performance the candidates showed: lower performances were rated more inconsistently than excellent performances. Furthermore, our analyses indicated that the use of simple heuristics might improve decisions in examiner pairs. DISCUSSION Inconsistencies in performance judgments continue to be a matter of concern, and we provide empirical evidence that they are related to candidate performance. We discuss the implications for research and the advantages of adopting the perspective of ecological rationality, and point to directions both for further research and for the development of assessment practices.
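The abstract does not specify the bootstrap implementation, but the core idea can be sketched. A minimal illustration in Python, assuming entirely hypothetical ratings; the candidate labels, scores, and the mean-absolute-difference disagreement measure are placeholders, not the authors' actual procedure:

```python
# Hedged sketch (hypothetical data): bootstrap confidence intervals for
# pairwise examiner disagreement, computed separately per candidate, to see
# whether disagreement varies with performance level.
import numpy as np

rng = np.random.default_rng(0)
# ratings[candidate] -> scores awarded by the N=10 examiners (hypothetical)
ratings = {
    "weaker candidate":    np.array([4., 7., 5., 9., 6., 8., 5., 7., 6., 9.]),
    "excellent candidate": np.array([17., 18., 17., 18., 18., 17., 18., 17., 18., 18.]),
}

def pairwise_disagreement(scores):
    """Mean absolute score difference over all examiner pairs."""
    diffs = np.abs(scores[:, None] - scores[None, :])
    return diffs[np.triu_indices(len(scores), k=1)].mean()

for name, scores in ratings.items():
    boot = [pairwise_disagreement(rng.choice(scores, size=len(scores), replace=True))
            for _ in range(2000)]
    lo, hi = np.percentile(boot, [2.5, 97.5])
    print(f"{name}: disagreement 95% CI [{lo:.2f}, {hi:.2f}]")
```

Non-overlapping intervals across candidates would point, as in the study, to disagreement concentrating on weaker performances.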
Affiliation(s)
- Stefan K Schauber
- Centre for Health Sciences Education, Faculty of Medicine, University of Oslo, Oslo, Norway
- Centre for Educational Measurement (CEMO), Faculty of Educational Sciences, University of Oslo, Oslo, Norway
- Anne O Olsen
- Department of Community Medicine and Global Health, Institute of Health and Society, University of Oslo, Oslo, Norway
- Erik L Werner
- Department of General Practice, Institute of Health and Society, University of Oslo, Oslo, Norway
- Morten Magelssen
- Centre for Medical Ethics, Institute of Health and Society, University of Oslo, Oslo, Norway
2
Sharp S, Snowden A, Stables I, Paterson R. Ensuring robust OSCE assessments: A reflective account from a Scottish school of nursing. Nurse Educ Pract 2024;78:104021. PMID: 38917560; DOI: 10.1016/j.nepr.2024.104021.
Abstract
AIM This paper reflects on the experience of one Scottish university in conducting a face-to-face Objective Structured Clinical Examination (OSCE) for large cohorts of student nurses, outlining the challenges experienced and the learning gained. Borton's model of reflection frames this work due to its simplicity, ease of application and cyclical nature. BACKGROUND The theoretical framework for the OSCE is critical thinking, enabling students to apply those skills authentically. OSCEs are designed to transfer classroom knowledge to clinical practice and offer an authentic work-based assessment. DESIGN Validity and robustness are key considerations in any assessment, and in OSCEs the number of stations that students encounter is important and debated. We initially used a case-study-based OSCE approach over four stations and, following reflection, changed to one long station with four phases. RESULTS In OSCE examinations, interrater reliability is a necessity, and students expect equity of approach. We identified that despite clear marking criteria, marks were polarised, with students achieving high or low marks and little middle ground. Review of examination papers highlighted that although students' overall performance was good, some had failed in at least one station, suggesting a four-station approach may skew results. On reflection, we hypothesised that a one-station, case-study-based, phased approach enabled the examiner to build up a more holistic picture of student knowledge and skills. It also gave the student the opportunity to develop a rapport with the examiner and standardised patient, thereby putting them more at ease. We argue that this approach is holistic, authentic and student centred. CONCLUSIONS Our experience suggests that a single-station, four-phase OSCE is preferable, enabling students to integrate all aspects of the assessment and providing a holistic view of clinical skills and knowledge.
Affiliation(s)
- Sandra Sharp
- Edinburgh Napier University, School of Health and Social Care, 11 Sighthill Court, Edinburgh EH11 4BN, UK
- Austyn Snowden
- Edinburgh Napier University, School of Health and Social Care, 11 Sighthill Court, Edinburgh EH11 4BN, UK
- Ian Stables
- Edinburgh Napier University, School of Health and Social Care, 11 Sighthill Court, Edinburgh EH11 4BN, UK
- Ruth Paterson
- Edinburgh Napier University, School of Health and Social Care, 11 Sighthill Court, Edinburgh EH11 4BN, UK
3
Wong WYA, Thistlethwaite J, Moni K, Roberts C. Using cultural historical activity theory to reflect on the sociocultural complexities in OSCE examiners' judgements. Adv Health Sci Educ Theory Pract 2023;28:27-46. PMID: 35943605; PMCID: PMC9992227; DOI: 10.1007/s10459-022-10139-1.
Abstract
Examiners' judgements play a critical role in competency-based assessments such as objective structured clinical examinations (OSCEs). The standardised nature of OSCEs and their alignment with regulatory accountability assure their wide use as high-stakes assessments in medical education. Research into examiner behaviours has predominantly explored the desirable psychometric characteristics of OSCEs, or investigated examiners' judgements from a cognitive rather than a sociocultural perspective. This study applies cultural historical activity theory (CHAT) to address this gap by exploring examiners' judgements in a high-stakes OSCE. Based on the idea that OSCE examiners' judgements are socially constructed and mediated by their clinical roles, the objective was to explore the sociocultural factors that influence examiners' judgements of student competence, and to use the findings to inform examiner training and enhance assessment practice. Seventeen semi-structured interviews were conducted with examiners who had assessed medical students' competence to progress to the next stage of training in a large-scale OSCE at one Australian university. The initial thematic analysis provided a basis for applying CHAT iteratively to explore the sociocultural factors and, specifically, the contradictions created by interactions between different elements such as examiners and rules, thus highlighting the factors influencing examiners' judgements. The findings indicated four key factors that influenced examiners' judgements: examiners' contrasting beliefs about the purpose of the OSCE; their varying perceptions of the marking criteria; divergent expectations of student competence; and idiosyncratic judgement practices. These factors were interrelated with the activity systems of the medical school's assessment practices and the examiners' clinical work contexts. Contradictions were identified through the guiding principles of multi-voicedness and historicity. Applying CHAT as an analytical framework facilitated the exploration of the sociocultural factors that may influence the consistency of examiners' judgements. Reflecting on these factors at organisational and system levels generated insights for creating fit-for-purpose examiner training to enhance assessment practice.
Affiliation(s)
- Wai Yee Amy Wong
- School of Education and Faculty of Medicine, The University of Queensland, Brisbane, QLD, 4072, Australia
- School of Nursing and Midwifery, Queen's University Belfast, Belfast, BT9 7BL, UK
- Jill Thistlethwaite
- Faculty of Health, The University of Technology Sydney, Sydney, NSW, 2007, Australia
- Karen Moni
- School of Education, The University of Queensland, Brisbane, QLD, 4072, Australia
- Chris Roberts
- Sydney Medical School, Faculty of Medicine and Health, The University of Sydney, Sydney, NSW, 2006, Australia
4
Ibrahim MS, Naing NN, Abd Aziz A, Makhtar M, Mohamed Yusoff H, Esa NK, A Rahman NI, Thwe Aung MM, Oo SS, Ismail S, Ramli RA. Medical Experts' Agreement on Risk Assessment Based on All Possible Combinations of the COVID-19 Predictors-A Novel Approach for Public Health Screening and Surveillance. Int J Environ Res Public Health 2022;19:16601. PMID: 36554487; PMCID: PMC9779080; DOI: 10.3390/ijerph192416601.
Abstract
During the initial phase of the coronavirus disease 2019 (COVID-19) pandemic, there was a critical need to create a valid and reliable screening and surveillance tool for university staff and students. To that end, 11 medical experts participated in this cross-sectional study, assigning one of three risk categories (low, medium or high) to each of the 1536 possible combinations of 11 key COVID-19 predictors. The experts' independent judgements on each combination were recorded via a novel dashboard-based rating method that presented combinations of these predictors in a dynamic display within Microsoft Excel. The validated instrument also incorporated an innovative algorithm that deduced ratings for related combinations, making the rating task more efficient. The study found an ordinal-weighted agreement coefficient of 0.81 (0.79 to 0.82, p-value < 0.001), reaching the 'substantial' class by inferential benchmarking. Meanwhile, on average, the novel algorithm eliminated 76.0% of rating tasks by deducing risk categories from experts' ratings of prior combinations. This study therefore reports a valid, complete, practical and efficient method for COVID-19 health screening via reliable combinatorial expert judgement. The new approach to risk assessment may also prove applicable in wider fields of practice whenever high-stakes decision-making relies on expert agreement over combinations of important criteria.
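The deduction algorithm itself is not reproduced in the abstract; the following Python sketch only illustrates the general monotonicity idea (a combination with strictly more risk factors cannot be lower risk), using a toy four-predictor space and hypothetical judgements rather than the study's instrument:

```python
# Illustrative sketch: deducing risk categories from expert ratings by
# monotone dominance, shrinking the number of combinations an expert must
# rate by hand. Toy example; the study used 11 predictors and 3 categories.
from itertools import product

PREDICTORS = 4
combos = list(product([0, 1], repeat=PREDICTORS))   # 1 = predictor present

def dominates(a, b):
    """a dominates b when every predictor in a is at least as severe."""
    return all(x >= y for x, y in zip(a, b))

ratings = {}                                        # combo -> category

def rate(combo, category):
    """Record an expert rating, then deduce comparable combinations."""
    ratings[combo] = category
    for other in combos:
        if other in ratings:
            continue
        if category == "high" and dominates(other, combo):
            ratings[other] = "high"                 # worse profile, same or higher risk
        elif category == "low" and dominates(combo, other):
            ratings[other] = "low"                  # milder profile, same or lower risk

rate((0, 0, 0, 1), "low")    # hypothetical judgement
rate((1, 1, 0, 1), "high")   # hypothetical judgement
print(f"{len(ratings)} of {len(combos)} combinations settled after 2 direct ratings")
```

This mirrors, in spirit, how the paper's algorithm could eliminate a large share of rating tasks (76.0% on average in the study).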
Affiliation(s)
- Mohd Salami Ibrahim
- Faculty of Medicine, Medical Campus, Universiti Sultan Zainal Abidin, Kuala Terengganu 20400, Terengganu, Malaysia
- Nyi Nyi Naing
- Faculty of Medicine, Medical Campus, Universiti Sultan Zainal Abidin, Kuala Terengganu 20400, Terengganu, Malaysia
- Aniza Abd Aziz
- Faculty of Medicine, Medical Campus, Universiti Sultan Zainal Abidin, Kuala Terengganu 20400, Terengganu, Malaysia
- Mokhairi Makhtar
- Faculty of Informatics and Computation, Gong Badak Campus, Universiti Sultan Zainal Abidin, Kuala Terengganu 20300, Terengganu, Malaysia
- Harmy Mohamed Yusoff
- Faculty of Medicine, Medical Campus, Universiti Sultan Zainal Abidin, Kuala Terengganu 20400, Terengganu, Malaysia
- Nor Kamaruzaman Esa
- Faculty of Medicine, Medical Campus, Universiti Sultan Zainal Abidin, Kuala Terengganu 20400, Terengganu, Malaysia
- Nor Iza A Rahman
- Faculty of Medicine, Medical Campus, Universiti Sultan Zainal Abidin, Kuala Terengganu 20400, Terengganu, Malaysia
- Myat Moe Thwe Aung
- Faculty of Medicine, Medical Campus, Universiti Sultan Zainal Abidin, Kuala Terengganu 20400, Terengganu, Malaysia
- San San Oo
- Faculty of Medicine, Medical Campus, Universiti Sultan Zainal Abidin, Kuala Terengganu 20400, Terengganu, Malaysia
- Samhani Ismail
- Faculty of Medicine, Medical Campus, Universiti Sultan Zainal Abidin, Kuala Terengganu 20400, Terengganu, Malaysia
- Ras Azira Ramli
- Faculty of Medicine, Medical Campus, Universiti Sultan Zainal Abidin, Kuala Terengganu 20400, Terengganu, Malaysia
5
Yeates P, Maluf A, Kinston R, Cope N, McCray G, Cullen K, O'Neill V, Cole A, Goodfellow R, Vallender R, Chung CW, McKinley RK, Fuller R, Wong G. Enhancing authenticity, diagnosticity and equivalence (AD-Equiv) in multicentre OSCE exams in health professionals education: protocol for a complex intervention study. BMJ Open 2022;12:e064387. PMID: 36600366; PMCID: PMC9730346; DOI: 10.1136/bmjopen-2022-064387.
Abstract
INTRODUCTION Objective structured clinical exams (OSCEs) are a cornerstone of assessing the competence of trainee healthcare professionals, but have been criticised for (1) lacking authenticity, (2) variability in examiners' judgements, which can challenge assessment equivalence, and (3) limited diagnosticity of trainees' focal strengths and weaknesses. In response, this study aims to investigate whether (1) sharing integrated-task OSCE stations across institutions can increase perceived authenticity, while (2) enhancing assessment equivalence by enabling comparison of the standard of examiners' judgements between institutions using a novel methodology (video-based examiner score comparison and adjustment (VESCA)), and (3) exploring the potential to develop more diagnostic signals from data on students' performances. METHODS AND ANALYSIS The study will use a complex intervention design, developing, implementing and sharing an integrated-task (research) OSCE across four UK medical schools. It will use VESCA to compare examiner scoring differences between groups of examiners and different sites, while studying how, why and for whom the shared OSCE and VESCA operate across participating schools. Quantitative analysis will use many-facet Rasch modelling to compare the influence of different examiner groups and sites on students' scores, while the operation of the two interventions (shared integrated-task OSCEs; VESCA) will be studied through the theory-driven method of realist evaluation. Further exploratory analyses will examine diagnostic performance signals within the data. ETHICS AND DISSEMINATION The study will be additional to usual course requirements and all participation will be voluntary. We will uphold the principles of informed consent, the right to withdraw, and confidentiality with pseudonymity and strict data security. The study has received ethical approval from Keele University Research Ethics Committee. Findings will be academically published and will contribute to good practice guidance on (1) the use of VESCA and (2) the sharing and use of integrated-task OSCE stations.
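VESCA links examiner groups through common videos; the full procedure is in the protocol and its companion papers. As a rough conceptual sketch only (assumed data layout and a simple mean-offset adjustment, far cruder than the actual methodology):

```python
# Conceptual sketch of score linking via common videos: examiner groups at
# each site score the same reference videos; the offset of a group's video
# scores from the grand mean estimates its relative stringency, which is
# then subtracted from that group's live scores. Hypothetical numbers.
import statistics

video_scores = {                      # reference videos scored by both sites
    "site_A": [19.0, 17.5, 21.0, 16.0],
    "site_B": [17.0, 15.5, 19.5, 14.5],
}

grand_mean = statistics.mean(s for group in video_scores.values() for s in group)
offset = {site: statistics.mean(scores) - grand_mean
          for site, scores in video_scores.items()}

def adjust(live_score, site):
    """Correct a live candidate score for the site's estimated stringency."""
    return live_score - offset[site]

print(adjust(18.0, "site_A"))   # adjusted down: site A scored the videos leniently
print(adjust(18.0, "site_B"))   # adjusted up: site B scored the videos stringently
```

The study itself plans many-facet Rasch modelling rather than simple mean offsets, which additionally accounts for station difficulty and candidate ability.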
Affiliation(s)
- Peter Yeates
- School of Medicine, Keele University, Keele, Staffordshire, UK
- Adriano Maluf
- School of Medicine, Keele University, Keele, Staffordshire, UK
- Ruth Kinston
- School of Medicine, Keele University, Keele, Staffordshire, UK
- Natalie Cope
- School of Medicine, Keele University, Keele, Staffordshire, UK
- Gareth McCray
- School of Medicine, Keele University, Keele, Staffordshire, UK
- Kathy Cullen
- School of Medicine, Dentistry and Biomedical Sciences, Queen's University Belfast, Belfast, UK
- Vikki O'Neill
- School of Medicine, Dentistry and Biomedical Sciences, Queen's University Belfast, Belfast, UK
- Aidan Cole
- School of Medicine, Dentistry and Biomedical Sciences, Queen's University Belfast, Belfast, UK
- Ching-Wa Chung
- School of Medicine, Medical Sciences and Nutrition, University of Aberdeen, Aberdeen, Scotland, UK
- Richard Fuller
- School of Medicine, University of Liverpool Faculty of Health and Life Sciences, Liverpool, UK
- Geoff Wong
- Nuffield Department of Primary Care Health Sciences, University of Oxford Division of Public Health and Primary Health Care, Oxford, Oxfordshire, UK
6
McGown PJ, Brown CA, Sebastian A, Le R, Amin A, Greenland A, Sam AH. Is the assumption of equal distances between global assessment categories used in borderline regression valid? BMC Med Educ 2022;22:708. PMID: 36199083; PMCID: PMC9536020; DOI: 10.1186/s12909-022-03753-5.
Abstract
BACKGROUND Standard setting for clinical examinations typically uses the borderline regression method to set the pass mark. An assumption made in using this method is that there are equal intervals between global ratings (GRs) (e.g. Fail, Borderline Pass, Clear Pass, Good and Excellent). To the best of our knowledge, however, this assumption has never been tested in the medical literature. We examine whether the assumption of equal intervals between GRs is met, and the potential implications for student outcomes. METHODS Clinical finals examiners were recruited across two institutions to place the typical 'Borderline Pass', 'Clear Pass' and 'Good' candidate on a continuous slider scale between a typical 'Fail' candidate at point 0 and a typical 'Excellent' candidate at point 1. Results were analysed using one-sample t-tests comparing each interval with an equal interval size of 0.25. Secondary data analysis was performed on summative assessment scores for 94 clinical stations and 1191 medical student examination outcomes in the final 2 years of study at a single centre. RESULTS On a scale from 0.00 (Fail) to 1.00 (Excellent), mean examiner GRs for 'Borderline Pass', 'Clear Pass' and 'Good' were 0.33, 0.55 and 0.77 respectively. All four intervals between GRs (Fail-Borderline Pass, Borderline Pass-Clear Pass, Clear Pass-Good, Good-Excellent) differed statistically significantly from the expected value of 0.25 (all p-values < 0.0125). An ordinal linear regression using the mean examiner GR locations was performed for each of the 94 stations to determine pass marks out of 24. This increased the pass marks of all 94 stations compared with the original GR locations (mean increase 0.21), causing one additional fail by overall exam pass mark (out of 1191 students) and 92 additional station fails (out of 11,346 stations). CONCLUSIONS Although the assumption of equal intervals between GRs across the performance spectrum is not met, and an adjusted regression equation increases station pass marks, the effect on overall exam pass/fail outcomes is modest.
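To make the mechanism concrete, here is a small Python sketch of borderline regression under the two spacing assumptions; the station scores are invented and the regression is deliberately minimal, not the authors' analysis:

```python
# Minimal sketch (hypothetical scores): borderline regression fits station
# score against the numeric location of each global rating and reads off the
# pass mark at 'Borderline Pass'. Swapping equal 0.25 spacing for the
# empirical GR locations reported above shifts the resulting cut-score.
import numpy as np

scores = np.array([8., 11., 14., 17., 21.])   # hypothetical station scores /24
grades = np.array([0, 1, 2, 3, 4])            # Fail..Excellent as indices

def pass_mark(gr_locations, borderline_idx=1):
    x = gr_locations[grades]                  # numeric location of each GR
    slope, intercept = np.polyfit(x, scores, 1)
    return intercept + slope * gr_locations[borderline_idx]

equal     = np.array([0.00, 0.25, 0.50, 0.75, 1.00])
empirical = np.array([0.00, 0.33, 0.55, 0.77, 1.00])  # study's mean GR locations

print(f"pass mark, equal spacing:     {pass_mark(equal):.2f}")
print(f"pass mark, empirical spacing: {pass_mark(empirical):.2f}")
```

With these toy numbers the empirical spacing yields a slightly higher cut-score, matching the direction of the paper's finding (mean increase of 0.21 marks per station).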
Affiliation(s)
- Patrick J McGown
- Imperial College School of Medicine, Imperial College London, London, UK
- Celia A Brown
- Warwick Medical School, University of Warwick, Warwick, UK
- Ann Sebastian
- Imperial College School of Medicine, Imperial College London, London, UK
- Ricardo Le
- Warwick Medical School, University of Warwick, Warwick, UK
- Anjali Amin
- Imperial College School of Medicine, Imperial College London, London, UK
- Andrew Greenland
- Imperial College School of Medicine, Imperial College London, London, UK
- Amir H Sam
- Imperial College School of Medicine, Imperial College London, London, UK
7
Homer M. Pass/fail decisions and standards: the impact of differential examiner stringency on OSCE outcomes. Adv Health Sci Educ Theory Pract 2022;27:457-473. PMID: 35230590; PMCID: PMC9117341; DOI: 10.1007/s10459-022-10096-9.
Abstract
Variation in examiner stringency is a recognised problem in many standardised summative assessments of performance such as the OSCE. The stated strength of the OSCE is that such error might largely balance out over the exam as a whole. This study uses linear mixed models to estimate the impact of different factors (examiner, station, candidate and exam) on station-level total domain score and, separately, on a single global grade. The exam data come from 442 separate administrations of an 18-station OSCE for international medical graduates who want to work in the National Health Service in the UK. We find that variation due to examiner is approximately twice as large for domain scores as for grades (16% vs. 8%), with smaller residual variance in the former (67% vs. 76%). Combined estimates of exam-level (relative) reliability across all data are 0.75 and 0.69 for domain scores and grades respectively. The correlation between two separate estimates of stringency for individual examiners (one for grades and one for domain scores) is relatively high (r=0.76), implying that examiners are generally quite consistent in their stringency between these two assessments of performance. Cluster analysis indicates that examiners fall into two broad groups, characterised as hawks or doves on both measures. At the exam level, correcting for examiner stringency produces systematically lower cut-scores under borderline regression standard setting than using the raw marks. In turn, such a correction would produce higher pass rates, although meaningful direct comparisons are challenging to make. As in other studies, this work shows that OSCEs and other standardised performance assessments are subject to substantial variation in examiner stringency, and require sufficient domain sampling to ensure that the quality of pass/fail decision-making is at least adequate. More, perhaps qualitative, work is needed to better understand how examiners might score similarly (or differently) when awarding station-level domain scores and global grades. The potential systematic bias of borderline regression evidenced for the first time here, with sources of error producing cut-scores higher than they should be, also needs further investigation.
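The quoted variance percentages suggest a crossed random-effects decomposition. A hedged sketch in Python with statsmodels, assuming a long-format file with one row per examiner-station-candidate score; the file name, column names, and model specification are a plausible reading of the abstract, not the paper's exact model:

```python
# Rough sketch (assumed data layout): a linear mixed model partitioning
# station-level score variance into examiner, station and candidate
# components. statsmodels fits crossed random effects as variance
# components over a single constant grouping column.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("osce_scores.csv")   # assumed columns: score, examiner, station, candidate
df["all"] = 1                         # one dummy group spanning the data set

model = smf.mixedlm(
    "score ~ 1", df, groups="all", re_formula="0",
    vc_formula={
        "examiner":  "0 + C(examiner)",
        "station":   "0 + C(station)",
        "candidate": "0 + C(candidate)",
    },
)
fit = model.fit()
print(fit.summary())                  # variance component per facet + residual
```

Dividing each estimated component by the total gives variance shares comparable to the 16%/8% examiner figures quoted above.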
Affiliation(s)
- Matt Homer
- School of Medicine, Leeds Institute of Medical Education, University of Leeds, Leeds LS2 9JT, UK
8
Yeates P, Moult A, Cope N, McCray G, Fuller R, McKinley R. Determining influence, interaction and causality of contrast and sequence effects in objective structured clinical exams. Med Educ 2022;56:292-302. PMID: 34893998; PMCID: PMC9304241; DOI: 10.1111/medu.14713.
Abstract
INTRODUCTION Differential rater function over time (DRIFT) and contrast effects (examiners' scores biased away from the standard of preceding performances) both challenge the fairness of scoring in objective structured clinical exams (OSCEs). This is important because, under some circumstances, these effects could alter whether some candidates pass or fail assessments. Benefitting from experimental control, this study investigated the causality, operation and interaction of both effects simultaneously for the first time in an OSCE setting. METHODS We performed a secondary analysis of data from an OSCE in which examiners scored embedded videos of student performances interspersed between live students. Embedded video position varied between examiners (early vs. late) whilst the standard of preceding performances naturally varied (previous high or low). We examined linear relationships suggestive of DRIFT and contrast effects in all within-OSCE data before comparing the influence and interaction of the 'early' versus 'late' and 'previous high' versus 'previous low' conditions on embedded video scores. RESULTS The linear-relationship data did not support the presence of DRIFT or contrast effects. Embedded videos were scored higher when positioned early (19.9 [19.4-20.5]) than late (18.6 [18.1-19.1], p < 0.001), but scores did not differ between the previous-high and previous-low conditions. The interaction term was non-significant. CONCLUSIONS In this instance, the small DRIFT effect we observed on embedded videos can be causally attributed to examiner behaviour. Contrast effects appear less ubiquitous than some prior research suggests. Possible mediators of these findings include the OSCE context, the detail of task specification, examiners' cognitive load and the distribution of learners' ability. As the operation of these effects appears to vary across contexts, further research is needed to determine the prevalence and mechanisms of contrast and DRIFT effects, so that assessments may be designed in ways likely to avoid their occurrence. Quality assurance should monitor for these contextually variable effects in order to ensure OSCE equivalence.
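The 2x2 comparison described above (position x preceding standard, plus their interaction) maps naturally onto a two-way model. A brief sketch, again with assumed file and column names rather than the study's actual data or analysis code:

```python
# Illustrative sketch (assumed data layout): testing position (early/late),
# preceding-performance standard (high/low) and their interaction on
# embedded-video scores, mirroring the comparisons reported above.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("embedded_video_scores.csv")  # assumed columns: score, position, prev_standard

fit = smf.ols("score ~ C(position) * C(prev_standard)", data=df).fit()
print(fit.summary())   # position main effect ~ DRIFT; prev_standard ~ contrast
```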
Affiliation(s)
- Peter Yeates
- School of Medicine, Keele University, Keele, UK
- Fairfield General Hospital, Pennine Acute Hospitals NHS Trust, Bury, UK