1
Cook DA, Overgaard J, Pankratz VS, Del Fiol G, Aakre CA. Virtual Patients Using Large Language Models: Scalable, Contextualized Simulation of Clinician-Patient Dialogue With Feedback. J Med Internet Res 2025; 27:e68486. PMID: 39854611; PMCID: PMC12008702; DOI: 10.2196/68486.
Abstract
BACKGROUND Virtual patients (VPs) are computer screen-based simulations of patient-clinician encounters. VP use is limited by cost and low scalability.

OBJECTIVE We aimed to show that VPs powered by large language models (LLMs) can generate authentic dialogues, accurately represent patient preferences, and provide personalized feedback on clinical performance. We also explored using LLMs to rate the quality of dialogues and feedback.

METHODS We conducted an intrinsic evaluation study rating 60 VP-clinician conversations. We used carefully engineered prompts to direct OpenAI's generative pretrained transformer (GPT) to emulate a patient and provide feedback. Using 2 outpatient medicine topics (chronic cough diagnosis and diabetes management), each with permutations representing different patient preferences, we created 60 conversations (dialogues plus feedback): 48 with a human clinician and 12 "self-chat" dialogues with GPT role-playing both the VP and clinician. Primary outcomes were dialogue authenticity and feedback quality, rated using novel instruments for which we conducted a validation study collecting evidence of content, internal structure (reproducibility), relations with other variables, and response process. Each conversation was rated by 3 physicians and by GPT. Secondary outcomes included user experience, bias, patient preferences represented in the dialogues, and conversation features that influenced authenticity.

RESULTS The average cost per conversation was US $0.51 for GPT-4.0-Turbo and US $0.02 for GPT-3.5-Turbo. Mean (SD) conversation ratings, maximum 6, were overall dialogue authenticity 4.7 (0.7), overall user experience 4.9 (0.7), and average feedback quality 4.7 (0.6). For dialogues created using GPT-4.0-Turbo, physician ratings of patient preferences aligned with intended preferences in 20 to 47 of 48 dialogues (42%-98%). Subgroup comparisons revealed higher ratings for dialogues using GPT-4.0-Turbo versus GPT-3.5-Turbo and for human-generated versus self-chat dialogues. Ratings of feedback quality were similar whether generated by humans or by GPT, whereas GPT-generated authenticity ratings were lower. We did not perceive bias in any conversation. Dialogue features that detracted from authenticity included that GPT was verbose or used atypical vocabulary (93/180, 51.7% of conversations), was overly agreeable (n=56, 31%), repeated the question as part of the response (n=47, 26%), was easily convinced by clinician suggestions (n=35, 19%), or was not disaffected by poor clinician performance (n=32, 18%). For feedback, detractors included excessively positive feedback (n=42, 23%), failure to mention important weaknesses or strengths (n=41, 23%), or factual inaccuracies (n=39, 22%). Regarding validation of dialogue and feedback scores, items were meticulously developed (content evidence), and we confirmed expected relations with other variables (higher ratings for advanced LLMs and human-generated dialogues). Reproducibility was suboptimal, due largely to variation in LLM performance rather than rater idiosyncrasies.

CONCLUSIONS LLM-powered VPs can simulate patient-clinician dialogues, demonstrably represent patient preferences, and provide personalized performance feedback. This approach is scalable, globally accessible, and inexpensive. LLM-generated ratings of feedback quality are similar to human ratings.
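As an illustrative sketch only (not the authors' prompts or implementation), a prompt-engineered virtual patient of the kind described above could be driven through the OpenAI chat API: one call role-plays the patient under a system prompt encoding the case and the intended patient preferences, and a second call generates feedback on the finished transcript. The model names, prompt wording, and helper functions below are assumptions for illustration.

# Minimal sketch, assuming the `openai` Python package (v1.x) and an
# OPENAI_API_KEY in the environment. Not the study's actual prompts.
from openai import OpenAI

client = OpenAI()

# Hypothetical case description with an embedded patient preference,
# analogous to the "chronic cough" topic permutations described above.
PATIENT_SYSTEM_PROMPT = (
    "You are role-playing a 52-year-old patient with a chronic cough. "
    "Answer in short, natural sentences, volunteer details only when asked, "
    "and express a strong preference to avoid new medications if possible."
)

def patient_reply(dialogue):
    """Return the virtual patient's next turn.

    `dialogue` is the conversation so far: clinician turns as "user"
    messages and prior patient turns as "assistant" messages.
    """
    messages = [{"role": "system", "content": PATIENT_SYSTEM_PROMPT}] + dialogue
    response = client.chat.completions.create(
        model="gpt-4-turbo",  # the study compared GPT-4.0-Turbo with GPT-3.5-Turbo
        messages=messages,
        temperature=0.7,
    )
    return response.choices[0].message.content

def feedback_on_transcript(transcript_text):
    """Second call: critique the clinician's performance on the full transcript."""
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[
            {"role": "system",
             "content": "You are an experienced clinical educator. Give specific, "
                        "balanced feedback on the clinician's history-taking and "
                        "shared decision-making in the following transcript."},
            {"role": "user", "content": transcript_text},
        ],
    )
    return response.choices[0].message.content

# Example turn:
# print(patient_reply([{"role": "user", "content": "What brings you in today?"}]))

A "self-chat" dialogue, as described in the abstract, would simply alternate this patient call with a second, clinician-prompted call so the model plays both roles.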
Affiliation(s)
- David A Cook
- Division of General Internal Medicine, Mayo Clinic College of Medicine and Science, Rochester, MN, United States
- Multidisciplinary Simulation Center, Mayo Clinic College of Medicine and Science, Rochester, MN, United States
- Joshua Overgaard
- Division of General Internal Medicine, Mayo Clinic College of Medicine and Science, Rochester, MN, United States
- V Shane Pankratz
- Health Sciences Center, University of New Mexico, Albuquerque, NM, United States
- Guilherme Del Fiol
- Department of Biomedical Informatics, University of Utah School of Medicine, Salt Lake City, UT, United States
- Chris A Aakre
- Division of General Internal Medicine, Mayo Clinic College of Medicine and Science, Rochester, MN, United States
2
Hansen A, Klute RM, Yadav M, Bansal S, Bond WF. How Do Learners Receive Feedback on Note Writing? A Scoping Review. Acad Med 2024; 99:683-690. PMID: 38306581; DOI: 10.1097/acm.0000000000005653.
Abstract
PURPOSE The literature assessing the process of note-writing based on gathered information is scant. This scoping review investigates methods of providing feedback on learners' note-writing abilities.

METHOD In August 2022, Scopus and Web of Science were searched for studies, in health care and other fields, that investigated feedback on student notes or reviewed notes written from an information- or data-gathering activity. Of 426 articles screened, 23 met the inclusion criteria. Data were extracted from the included articles on the article title, publication year, study location, study aim, study design, number of participants, participant demographics, level of education, type of note written, field of study, form of feedback given, source of the feedback, and student or participant rating of the feedback method. Possible themes were then identified, and a final consensus-based thematic analysis was performed.

RESULTS Themes identified in the 23 included articles were as follows: (1) learners found faculty and peer feedback beneficial; (2) direct written comments and evaluation tools, such as rubrics or checklists, were the most common feedback methods; (3) reports on notes in real clinical settings were limited (simulated clinical scenarios in the preclinical curriculum were the most studied); (4) feedback providers and recipients benefit from prior training on providing and receiving feedback; (5) sequential or iterative feedback was beneficial for learners but can be time intensive for faculty and confounded by maturation effects; and (6) use of technology and validated assessment tools facilitates the feedback process through ease of communication and improved organization.

CONCLUSIONS The factors influencing the impact and perception of feedback include the source, structure, setting, use of technology, and amount of feedback provided. As the utility of note-writing in health care expands, studies are needed to clarify the value of note feedback in learning and the role of innovative technologies in facilitating note feedback.
3
Patino GA, Amiel JM, Brown M, Lypson ML, Chan TM. The Promise and Perils of Artificial Intelligence in Health Professions Education Practice and Scholarship. Acad Med 2024; 99:477-481. PMID: 38266214; DOI: 10.1097/acm.0000000000005636.
Abstract
ABSTRACT Artificial intelligence (AI) methods, especially machine learning and natural language processing, are increasingly affecting health professions education (HPE), including the medical school application and selection processes, assessment, and scholarship production. The rise of large language models over the past 18 months, such as ChatGPT, has raised questions about how best to incorporate these methods into HPE. The lack of training in AI among most HPE faculty and scholars poses an important challenge in facilitating such discussions. In this commentary, the authors provide a primer on the AI methods most often used in the practice and scholarship of HPE, discuss the most pressing challenges and opportunities these tools afford, and underscore that these methods should be understood as part of the larger set of statistical tools available.

Despite their ability to process huge amounts of data and their high performance completing some tasks, AI methods are only as good as the data on which they are trained. Of particular importance is that these models can perpetuate the biases that are present in those training datasets, and they can be applied in a biased manner by human users. A minimum set of expectations for the application of AI methods in HPE practice and scholarship is discussed in this commentary, including the interpretability of the models developed and the transparency needed into the use and characteristics of such methods.

The rise of AI methods is affecting multiple aspects of HPE, including raising questions about how best to incorporate these models into HPE practice and scholarship. In this commentary, we provide a primer on the AI methods most often used in HPE and discuss the most pressing challenges and opportunities these tools afford.
4
Gordon M, Daniel M, Ajiboye A, Uraiby H, Xu NY, Bartlett R, Hanson J, Haas M, Spadafore M, Grafton-Clarke C, Gasiea RY, Michie C, Corral J, Kwan B, Dolmans D, Thammasitboon S. A scoping review of artificial intelligence in medical education: BEME Guide No. 84. Med Teach 2024; 46:446-470. PMID: 38423127; DOI: 10.1080/0142159x.2024.2314198.
Abstract
BACKGROUND Artificial Intelligence (AI) is rapidly transforming healthcare, and there is a critical need for a nuanced understanding of how AI is reshaping teaching, learning, and educational practice in medical education. This review aimed to map the literature regarding AI applications in medical education, core areas of findings, potential candidates for formal systematic review, and gaps for future research.

METHODS This rapid scoping review, conducted over 16 weeks, employed Arksey and O'Malley's framework and adhered to STORIES and BEME guidelines. A systematic and comprehensive search across PubMed/MEDLINE, EMBASE, and MedEdPublish was conducted without date or language restrictions. Publications included in the review spanned undergraduate, graduate, and continuing medical education, encompassing both original studies and perspective pieces. Data were charted by multiple author pairs and synthesized into various thematic maps and charts, ensuring a broad and detailed representation of the current landscape.

RESULTS The review synthesized 278 publications, with a majority (68%) from North American and European regions. The studies covered diverse AI applications in medical education, such as AI for admissions, teaching, assessment, and clinical reasoning. The review highlighted AI's varied roles, from augmenting traditional educational methods to introducing innovative practices, and underscored the urgent need for ethical guidelines on AI's application in medical education.

CONCLUSION The current literature has been charted. The findings underscore the need for ongoing research to explore uncharted areas and address potential risks associated with AI use in medical education. This work serves as a foundational resource for educators, policymakers, and researchers in navigating AI's evolving role in medical education. The FACETS framework is proposed to support high-utility reporting in future work.
Affiliation(s)
- Morris Gordon
- School of Medicine and Dentistry, University of Central Lancashire, Preston, UK
- Blackpool Hospitals NHS Foundation Trust, Blackpool, UK
- Michelle Daniel
- School of Medicine, University of California, San Diego, San Diego, CA, USA
- Aderonke Ajiboye
- School of Medicine and Dentistry, University of Central Lancashire, Preston, UK
- Hussein Uraiby
- Department of Cellular Pathology, University Hospitals of Leicester NHS Trust, Leicester, UK
- Nicole Y Xu
- School of Medicine, University of California, San Diego, San Diego, CA, USA
- Rangana Bartlett
- Department of Cognitive Science, University of California, San Diego, CA, USA
- Janice Hanson
- Department of Medicine and Office of Education, School of Medicine, Washington University in Saint Louis, Saint Louis, MO, USA
- Mary Haas
- Department of Emergency Medicine, University of Michigan Medical School, Ann Arbor, MI, USA
- Maxwell Spadafore
- Department of Emergency Medicine, University of Michigan Medical School, Ann Arbor, MI, USA
- Colin Michie
- School of Medicine and Dentistry, University of Central Lancashire, Preston, UK
- Janet Corral
- Department of Medicine, University of Nevada Reno, School of Medicine, Reno, NV, USA
- Brian Kwan
- School of Medicine, University of California, San Diego, San Diego, CA, USA
- Diana Dolmans
- School of Health Professions Education, Faculty of Health, Maastricht University, Maastricht, the Netherlands
- Satid Thammasitboon
- Center for Research, Innovation and Scholarship in Health Professions Education, Baylor College of Medicine, Houston, TX, USA
5
Artsi Y, Sorin V, Konen E, Glicksberg BS, Nadkarni G, Klang E. Large language models for generating medical examinations: systematic review. BMC Med Educ 2024; 24:354. PMID: 38553693; PMCID: PMC10981304; DOI: 10.1186/s12909-024-05239-y.
Abstract
BACKGROUND Writing multiple choice questions (MCQs) for medical exams is challenging, requiring extensive medical knowledge, time, and effort from medical educators. This systematic review focuses on the application of large language models (LLMs) in generating medical MCQs.

METHODS The authors searched for studies published up to November 2023. Search terms focused on LLM-generated MCQs for medical examinations. Non-English studies, studies outside the year range, and studies not focusing on AI-generated multiple-choice questions were excluded. MEDLINE was used as the search database. Risk of bias was evaluated using a tailored QUADAS-2 tool. The review followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines.

RESULTS Overall, eight studies published between April 2023 and October 2023 were included. Six studies used ChatGPT-3.5, while two employed GPT-4. Five studies showed that LLMs can produce competent questions valid for medical exams. Three studies used LLMs to write medical questions but did not evaluate the validity of the questions. One study conducted a comparative analysis of different models, and one other study compared LLM-generated questions with those written by humans. Two studies were at high risk of bias. All studies presented faulty questions that were deemed inappropriate for medical exams, and some questions required additional modification to qualify.

CONCLUSIONS LLMs can be used to write MCQs for medical examinations, but their limitations cannot be ignored. Further study in this field is essential, and more conclusive evidence is needed. Until then, LLMs may serve as a supplementary tool for writing medical examinations.
Affiliation(s)
- Yaara Artsi
- Azrieli Faculty of Medicine, Bar-Ilan University, Ha'Hadas St. 1, Rishon Le Zion, Zefat, 7550598, Israel.
- Vera Sorin
- Department of Diagnostic Imaging, Chaim Sheba Medical Center, Ramat Gan, Israel
- Tel-Aviv University School of Medicine, Tel Aviv, Israel
- DeepVision Lab, Chaim Sheba Medical Center, Ramat Gan, Israel
- Eli Konen
- Department of Diagnostic Imaging, Chaim Sheba Medical Center, Ramat Gan, Israel
- Tel-Aviv University School of Medicine, Tel Aviv, Israel
- Benjamin S Glicksberg
- Division of Data-Driven and Digital Medicine (D3M), Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Girish Nadkarni
- Division of Data-Driven and Digital Medicine (D3M), Icahn School of Medicine at Mount Sinai, New York, NY, USA
- The Charles Bronfman Institute of Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Eyal Klang
- Division of Data-Driven and Digital Medicine (D3M), Icahn School of Medicine at Mount Sinai, New York, NY, USA
- The Charles Bronfman Institute of Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY, USA
6
Sukhera J, Ölveczky D, Colbert-Getz J, Fernandez A, Ho MJ, Ryan MS, Young ME. Digging Deeper, Zooming Out: Reimagining Legacies in Medical Education. Acad Med 2023; 98:S6-S9. PMID: 37983391; DOI: 10.1097/acm.0000000000005372.
Abstract
Although the wide-scale disruption precipitated by the COVID-19 pandemic has somewhat subsided, there are many questions about the implications of such disruptions for the road ahead. This year's Research in Medical Education (RIME) supplement may provide a window of insight. Now, more than ever, researchers are poised to question long-held assumptions while reimagining long-established legacies. Themes regarding the boundaries of professional identity, approaches to difficult conversations, challenges of power and hierarchy, intricacies of selection processes, and complexities of learning climates appear to be the most salient and critical to understand. In this commentary, the authors use the relationship between legacies and assumptions as a framework to gain a deeper understanding about the past, present, and future of RIME.
Affiliation(s)
- Javeed Sukhera
- J. Sukhera is chair/chief of psychiatry, Hartford Hospital and the Institute of Living, and associate clinical professor of psychiatry, Yale School of Medicine, Hartford, Connecticut; ORCID: https://orcid.org/0000-0001-8146-4947
- Daniele Ölveczky
- D. Ölveczky is assistant professor of medicine and codirector, Health Equity and Anti-Racism Theme, Harvard Medical School, and physician director, Office of Diversity, Equity and Inclusion, Beth Israel Deaconess Medical Center, Boston, Massachusetts; ORCID: https://orcid.org/0000-0001-8972-4483
- Jorie Colbert-Getz
- J. Colbert-Getz is assistant dean of education quality improvement and associate professor, Department of Internal Medicine, Spencer Fox Eccles School of Medicine at the University of Utah, Salt Lake City, Utah; ORCID: https://orcid.org/0000-0001-7419-7588
- Andres Fernandez
- A. Fernandez is assistant professor, Department of Neurology, Thomas Jefferson University, Philadelphia, Pennsylvania, and a PhD student, School of Health Professions Education, Maastricht University, Maastricht, the Netherlands; ORCID: https://orcid.org/0000-0001-5389-6232
- Ming-Jung Ho
- M.-J. Ho is professor of family medicine, associate director, Center for Innovation and Leadership in Education, and director of education research, MedStar Health, Georgetown University, Washington, DC; ORCID: https://orcid.org/0000-0003-1415-8282
- Michael S Ryan
- M.S. Ryan is associate dean for assessment, evaluation, research and scholarly innovation, and professor, Department of Pediatrics, University of Virginia School of Medicine, Charlottesville, Virginia, and a PhD student, School of Health Professions Education, Maastricht University, Maastricht, the Netherlands; ORCID: https://orcid.org/0000-0003-3266-9289
- Meredith E Young
- M.E. Young is associate professor, Institute of Health Sciences Education, McGill University, Montreal, Quebec, Canada; ORCID: https://orcid.org/0000-0002-2036-2119