1
Hoffman KR, Swanson D, Lane S, Nickson C, Brand P, Ryan AT. The reliability of the College of Intensive Care Medicine of Australia and New Zealand "Hot Case" examination. BMC Medical Education 2024; 24:527. [PMID: 38734603 PMCID: PMC11088756 DOI: 10.1186/s12909-024-05516-w]
Abstract
BACKGROUND High-stakes examinations used to credential trainees for independent specialist practice should be evaluated periodically to ensure defensible decisions are made. This study aims to quantify the reliability coefficient of the College of Intensive Care Medicine of Australia and New Zealand (CICM) Hot Case and to evaluate the contributions to variance from candidates, cases and examiners. METHODS This retrospective, de-identified analysis of CICM examination data used descriptive statistics and generalisability theory to evaluate the reliability of the Hot Case examination component. Decision studies were used to project generalisability coefficients for alternative examination designs. RESULTS Examination results from 2019 to 2022 included 592 Hot Cases, totalling 1184 individual examiner scores. The mean examiner Hot Case score was 5.17 (standard deviation 1.65). The correlation between candidates' two Hot Case scores was low (0.30). The overall reliability coefficient for the Hot Case component, consisting of two cases observed by two separate pairs of examiners, was 0.42. Sources of variance included candidate proficiency (25%), case difficulty and case specificity (63.4%), examiner stringency (3.5%) and other error (8.2%). To achieve a reliability coefficient of > 0.8, a candidate would need to perform 11 Hot Cases, each observed by two examiners. CONCLUSION The reliability coefficient for the Hot Case component of the CICM second part examination is below the generally accepted value for a high-stakes examination. Modifications to case selection and the introduction of a clear scoring rubric may help mitigate the effects of variation in case difficulty. Increasing the number of cases, and with it the overall assessment time, appears to be the best way to increase overall reliability. Further research is required to assess the combined reliability of the Hot Case and viva components.
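The reported figures are internally consistent and can be reproduced with a small decision-study calculation. Below is a minimal sketch; the function name and the nesting assumption are ours (examiner pairs are treated as nested within cases, so examiner and residual variance average over every case-examiner observation), while the variance shares come from the abstract above:

```python
def projected_phi(n_cases: int, n_examiners: int,
                  v_candidate: float = 25.0, v_case: float = 63.4,
                  v_examiner: float = 3.5, v_error: float = 8.2) -> float:
    """Project the dependability coefficient for n_cases Hot Cases,
    each scored by n_examiners examiners, from the variance shares
    reported in the abstract (candidate, case, examiner, residual)."""
    # Case-related variance averages over cases; examiner and residual
    # variance average over all case-examiner observations.
    error = (v_case / n_cases
             + (v_examiner + v_error) / (n_cases * n_examiners))
    return v_candidate / (v_candidate + error)

print(f"{projected_phi(2, 2):.2f}")   # 0.42 -- the reported two-case design
print(f"{projected_phi(11, 2):.2f}")  # 0.80 -- the projected 11-case design
```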
Affiliation(s)
- Kenneth R Hoffman: Intensive Care Unit, The Alfred Hospital, Melbourne, Australia; Department of Epidemiology and Preventative Medicine, School of Public Health, Monash University, Melbourne, Australia
- David Swanson: Department of Medical Education, Melbourne Medical School, University of Melbourne, Melbourne, Australia
- Stuart Lane: Sydney Medical School, The University of Sydney, Sydney, Australia
- Chris Nickson: Intensive Care Unit, The Alfred Hospital, Melbourne, Australia; Department of Epidemiology and Preventative Medicine, School of Public Health, Monash University, Melbourne, Australia
- Paul Brand: College of Intensive Care Medicine of Australia and New Zealand, Melbourne, Australia
- Anna T Ryan: Department of Medical Education, Melbourne Medical School, University of Melbourne, Melbourne, Australia
2
Staudenmann D, Waldner N, Lörwald A, Huwendiek S. Medical specialty certification exams studied according to the Ottawa Quality Criteria: a systematic review. BMC Medical Education 2023; 23:619. [PMID: 37649019 PMCID: PMC10466740 DOI: 10.1186/s12909-023-04600-x]
Abstract
BACKGROUND Medical specialty certification exams are high-stakes summative assessments used to determine which doctors have the necessary skills, knowledge, and attitudes to treat patients independently. Such exams are crucial for patient safety, candidates' career progression and accountability to the public, yet vary significantly among medical specialties and countries. It is therefore of paramount importance that the quality of specialty certification exams is studied in the scientific literature. METHODS In this systematic literature review we used the PICOS framework and searched seven databases for papers concerning medical specialty certification exams published in English between 2000 and 2020, using a diverse set of search term variations. Papers were screened by two researchers independently and scored for their methodological quality and relevance to this review. Finally, they were categorized by country, medical specialty and the following seven Ottawa Criteria of good assessment: validity, reliability, equivalence, feasibility, acceptability, catalytic effect and educational effect. RESULTS After removal of duplicates, 2852 papers were screened for inclusion, of which 66 met all relevant criteria. More than 43 different exams and more than 28 different specialties from 18 jurisdictions were studied. Around 77% of all eligible papers were based in English-speaking countries, with 55% of publications centered on the UK and USA alone. General Practice was the most frequently studied specialty among certification exams, with the UK General Practice exam having been particularly broadly analyzed. Papers received an average of 4.2/6 points on the quality score. Eligible studies analyzed 2.1/7 Ottawa Criteria on average, the most frequently studied criteria being reliability, validity, and acceptability. CONCLUSIONS The present systematic review shows a growing number of studies analyzing medical specialty certification exams over time, encompassing a widening range of medical specialties, countries, and Ottawa Criteria. Because it relies on multiple assessment methods and data points, programmatic assessment suggests a promising way forward in the development of medical specialty certification exams that fulfill all seven Ottawa Criteria. Further research is needed to confirm these results, particularly analyses of examinations held outside the Anglosphere, as well as studies analyzing entire certification exams or comparing multiple examination methods.
Affiliation(s)
- Noemi Waldner: University of Bern, Institute for Medical Education, Bern, Switzerland
- Andrea Lörwald: University of Bern, Institute for Medical Education, Bern, Switzerland
- Sören Huwendiek: University of Bern, Institute for Medical Education, Bern, Switzerland
3
Rivière E, Aubin E, Tremblay SL, Lortie G, Chiniara G. A new tool for assessing short debriefings after immersive simulation: validity of the SHORT scale. BMC Medical Education 2019; 19:82. [PMID: 30871505 PMCID: PMC6419351 DOI: 10.1186/s12909-019-1503-4]
Abstract
BACKGROUND Simulation is being used increasingly in healthcare education worldwide. However, it is costly in terms of both finances and human resources. As a consequence, several institutions have designed programs offering multiple short immersive simulation sessions, each followed by a short debriefing. Although debriefing is recommended, no tool exists to assess the appropriateness of short debriefings after such simulation sessions. We developed the Simulation in Healthcare retrOaction Rating Tool (SHORT) to assess short debriefings, and provide validity evidence for its use. METHODS We designed this scale based on our experience and previously published instruments, and tested it by assessing short debriefings of simulation sessions offered to emergency medicine residents at Laval University (Canada) from 2015 to 2016. Its reliability and validity were analysed using the Standards for Educational and Psychological Testing. Generalizability theory was used to test internal structure evidence for validity. RESULTS Two raters independently assessed 22 filmed short debriefings. Mean debriefing length was 10:35 (min 7:21; max 14:32). The calculated generalizability (reliability) coefficients are φ = 0.80 and φ-λ3 = 0.82. The generalizability coefficient for a single rater assessing three debriefings is φ = 0.84. CONCLUSIONS The G study shows a high generalizability coefficient (φ ≥ 0.80), demonstrating high reliability. The response process evidence for validity indicates that no errors were associated with using the instrument. Further studies should be done to demonstrate the validity of the English version of the instrument and to validate its use by novice raters trained in the use of the SHORT.
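As a point of reference for the φ values above: in a one-facet generalizability design where debriefings are the object of measurement and raters the facet, the dependability coefficient takes the following form (a generic sketch in standard G-theory notation, not an equation taken from the paper):

\[
\varphi = \frac{\sigma^2_{d}}{\sigma^2_{d} + \dfrac{\sigma^2_{r} + \sigma^2_{dr,e}}{n_r}}
\]

where \(\sigma^2_{d}\) is variance from true differences among debriefings, \(\sigma^2_{r}\) is rater stringency variance, \(\sigma^2_{dr,e}\) is the rater-by-debriefing interaction confounded with residual error, and \(n_r\) is the number of raters averaged over. Averaging over more observations shrinks the error term; the D study reported above applies the same logic to project φ for a single rater observing three debriefings.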
Affiliation(s)
- Etienne Rivière: Department of Internal Medicine, Haut-Leveque Hospital, University Hospital Centre of Bordeaux, Pessac, France; Medical Faculty, Bordeaux University, Bordeaux, France; SimBA-S Simulation Centre, University and Hospital of Bordeaux, Bordeaux, France; Apprentiss Centre (simulation centre), Laval University, Quebec, Canada
- Samuel-Lessard Tremblay: Apprentiss Centre (simulation centre), Laval University, Quebec, Canada; University Institute of Cardiology and Pneumology of Quebec, Quebec, Canada
- Gilles Lortie: Apprentiss Centre (simulation centre), Laval University, Quebec, Canada; Emergency Unit, Levis Hotel-Dieu Hospital, University Hospital of Quebec, Lévis, Canada
- Gilles Chiniara: Apprentiss Centre (simulation centre), Laval University, Quebec, Canada; Department of Anesthesiology and Intensive Care, Laval University, Quebec, Canada
4
Guldbrand Nielsen D, Jensen SL, O'Neill L. Clinical assessment of transthoracic echocardiography skills: a generalizability study. BMC Medical Education 2015; 15:9. [PMID: 25638012 PMCID: PMC4334848 DOI: 10.1186/s12909-015-0294-5]
Abstract
BACKGROUND Transthoracic echocardiography (TTE) is a widely used cardiac imaging technique that all cardiologists should be able to perform competently. Traditionally, TTE competence has been assessed by unstructured observation or in test situations separated from daily clinical practice. An instrument for assessing clinical TTE technical proficiency, comprising a global rating score and a checklist score, has previously shown reliability and validity in a standardised setting. As clinical test situations typically have several sources of error giving rise to variance in scores, a more thorough examination of the generalizability of the assessment instrument is needed. METHODS Nine physicians performed a TTE scan on the same three patients. Two raters then rated all 27 TTE scans using the TTE technical assessment instrument in a fully crossed, all-random generalizability study. Estimated variance components were calculated for both the global rating and checklist scores. Finally, dependability (phi) coefficients were calculated for both outcomes in a decision study. RESULTS For global rating scores, 66.6% of score variance can be ascribed to true differences in performance; for checklist scores, this was 88.8%. The difference was primarily due to physician-rater interaction. Four random cases rated by one random rater resulted in a phi value of 0.81 for global ratings, and two random cases rated by one random rater showed a phi value of 0.92 for checklist scores. CONCLUSIONS Using the TTE checklist rather than the TTE global rating score minimised the largest source of error variance in test scores. Two cases rated by one rater using the TTE checklist are sufficiently reliable for high-stakes examinations. As global rating is less time-consuming, performing four global rating assessments in addition to the checklist assessments could be considered, to account for both the reliability and the content validity of the assessment.
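The decision study here uses a fully crossed physician × case × rater design, so the dependability coefficient divides each variance component by the number of conditions it is averaged over. A minimal sketch of that calculation follows; the variance components passed in the demo call are illustrative placeholders, not the paper's estimates:

```python
def phi_crossed(n_c: int, n_r: int, v_p: float, v_c: float, v_r: float,
                v_pc: float, v_pr: float, v_cr: float, v_pcr_e: float) -> float:
    """Dependability (phi) for a fully crossed person x case x rater
    random design: person variance divided by person variance plus
    absolute error, each error component averaged over the n_c cases
    and/or n_r raters in the measurement design."""
    abs_error = (v_c / n_c + v_r / n_r
                 + v_pc / n_c + v_pr / n_r
                 + (v_cr + v_pcr_e) / (n_c * n_r))
    return v_p / (v_p + abs_error)

# Placeholder components summing to 100 (66.6% true person variance, as
# reported for global ratings; the split of the remainder is invented).
print(f"{phi_crossed(n_c=4, n_r=1, v_p=66.6, v_c=5.0, v_r=2.0,
                     v_pc=15.0, v_pr=6.4, v_cr=1.0, v_pcr_e=4.0):.2f}")
```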
Affiliation(s)
- Lotte O'Neill: Center for Medical Education, Aarhus University, Aarhus, Denmark
5
An Item Analysis of Written Multiple-Choice Questions: Kashan University of Medical Sciences. Nurs Midwifery Stud 2012. [DOI: 10.5812/nms.8738]
6
Dijkstra J, Galbraith R, Hodges BD, McAvoy PA, McCrorie P, Southgate LJ, Van der Vleuten CPM, Wass V, Schuwirth LWT. Expert validation of fit-for-purpose guidelines for designing programmes of assessment. BMC Medical Education 2012; 12:20. [PMID: 22510502 PMCID: PMC3676146 DOI: 10.1186/1472-6920-12-20]
Abstract
BACKGROUND An assessment programme, a purposeful mix of assessment activities, is necessary to achieve a complete picture of assessee competence. High-quality assessment programmes exist; however, the design requirements for such programmes are still unclear. We developed design guidelines based on an earlier-developed framework that identified the areas to be covered. A fitness-for-purpose approach to defining quality was adopted to develop and validate the guidelines. METHODS First, ideas were generated in a brainstorming session, followed by structured interviews with nine international assessment experts. The guidelines were then fine-tuned through analysis of the interviews. Finally, validation was based on expert consensus via member checking. RESULTS In total, 72 guidelines were developed; this paper discusses the most salient of them. The guidelines are related and grouped per layer of the framework. Some guidelines are so generic that they apply to any design consideration: the principle of proportionality, the requirement that a rationale underpin each decision, and the requirement of expertise. Logically, many guidelines focus on practical aspects of assessment. Some guidelines were found to be clear and concrete; others were less straightforward and were phrased more as issues for contemplation. CONCLUSIONS The set of guidelines is comprehensive and not bound to a specific context or educational approach. Following the fitness-for-purpose principle, the guidelines are eclectic, requiring expert judgement to apply them appropriately in different contexts. Further validation studies are required to test their practicality.
Affiliation(s)
- Joost Dijkstra: Department of Educational Development and Research, Maastricht University, Maastricht, The Netherlands
- Robert Galbraith: Center for Innovation, National Board of Medical Examiners, Philadelphia, USA
- Brian D Hodges: Wilson Centre for Research in Education, Faculty of Medicine, University of Toronto, Toronto, ON, Canada
- Pauline A McAvoy: Assessment Development, National Clinical Assessment Service (NCAS), London, UK
- Peter McCrorie: Centre for Medical and Healthcare Education, St George’s, University of London, London, UK
- Lesley J Southgate: Centre for Medical and Healthcare Education, St George’s, University of London, London, UK
- Cees PM Van der Vleuten: Department of Educational Development and Research, Maastricht University, Maastricht, The Netherlands
- Val Wass: Keele University, School of Medicine, Staffordshire, UK
- Lambert WT Schuwirth: Department of Educational Development and Research, Maastricht University, Maastricht, The Netherlands; Flinders Innovation in Clinical Education, Flinders University, Bedford Park, SA, Australia
7
Le Roux P, Podgorski C, Rosenberg T, Watson WH, McDaniel S. Developing an outcome-based assessment for family therapy training: the Rochester Objective Structured Clinical Evaluation (ROSCE). Family Process 2011; 50:544-560. [PMID: 22145725 DOI: 10.1111/j.1545-5300.2011.01375.x]
Abstract
This paper addresses a growing need for cost-effective, outcome-based assessment in family therapy training. We describe the ROSCE, a structured, evidence-informed, learner-centered approach to the assessment of clinical skills developed at the University of Rochester Medical Center. The ROSCE emphasizes direct observation of trainees demonstrating clinical competencies. The format integrates both formative and summative assessment methods. It can readily be adapted to a wide variety of educational and training settings.
Affiliation(s)
- Pieter Le Roux: Department of Psychiatry, Institute for the Family, University of Rochester, Rochester, NY 14642, USA
8
Dijkstra J, Van der Vleuten CPM, Schuwirth LWT. A new framework for designing programmes of assessment. Advances in Health Sciences Education: Theory and Practice 2010; 15:379-93. [PMID: 19821042 PMCID: PMC2940030 DOI: 10.1007/s10459-009-9205-z]
Abstract
Research on assessment in medical education has focused strongly on individual measurement instruments and their psychometric quality. Without detracting from the value of this research, such an approach is not sufficient for high-quality assessment of competence as a whole. A programmatic approach is advocated, which presupposes criteria for designing comprehensive assessment programmes and for assuring their quality. The paucity of research relevant to programmatic assessment, and especially its development, prompted us to embark on a research project to develop design principles for programmes of assessment. We conducted focus group interviews to explore the experiences and views of nine assessment experts concerning good practices and new ideas about theoretical and practical issues in programmes of assessment. The discussion was analysed, mapping all aspects relevant to design onto a framework, which was iteratively adjusted to fit the data until saturation was reached. The overarching framework for designing programmes of assessment consists of six dimensions: Goals, Programme in Action, Support, Documenting, Improving and Accounting. The model described in this paper can help to frame programmes of assessment; it provides not only a common language but also a comprehensive picture of the dimensions to be covered when formulating design principles. It helps identify areas of assessment in which ample research and development has been done and, more importantly, helps detect underserved areas. A guiding principle in the design of assessment programmes is fitness for purpose: high-quality assessment can only be defined in terms of its goals.
Affiliation(s)
- J Dijkstra: Department of Educational Development and Research, Maastricht University, The Netherlands