1
Keshavarz P, Bagherieh S, Nabipoorashrafi SA, Chalian H, Rahsepar AA, Kim GHJ, Hassani C, Raman SS, Bedayat A. ChatGPT in radiology: A systematic review of performance, pitfalls, and future perspectives. Diagn Interv Imaging 2024:S2211-5684(24)00105-0. [PMID: 38679540] [DOI: 10.1016/j.diii.2024.04.003]
Abstract
PURPOSE The purpose of this study was to systematically review the reported performance of ChatGPT, identify potential limitations, and explore future directions for its integration, optimization, and ethical considerations in radiology applications. MATERIALS AND METHODS After a comprehensive review of the PubMed, Web of Science, Embase, and Google Scholar databases, published studies utilizing ChatGPT for clinical radiology applications were identified up to January 1, 2024. RESULTS Of the 861 studies retrieved, 44 evaluated the performance of ChatGPT; among these, 37 (37/44; 84.1%) demonstrated high performance, and seven (7/44; 15.9%) reported lower performance in providing information on diagnosis and clinical decision support (6/44; 13.6%) and patient communication and educational content (1/44; 2.3%). Twenty-four (24/44; 54.5%) studies reported ChatGPT's performance as a proportion. Among these, 19 (19/24; 79.2%) studies recorded a median accuracy of 70.5%, and five (5/24; 20.8%) studies recorded a median agreement of 83.6% between ChatGPT outcomes and reference standards (radiologists' decisions or guidelines), generally confirming ChatGPT's high accuracy in these studies. Eleven studies compared two recent ChatGPT versions, and in ten (10/11; 90.9%), ChatGPT-4 outperformed version 3.5, showing notable enhancements in addressing higher-order thinking questions, better comprehension of radiology terms, and improved accuracy in describing images. Risks and concerns about using ChatGPT included biased responses, limited originality, and the potential for inaccurate information leading to misinformation, hallucinations, improper citations and fake references, cybersecurity vulnerabilities, and patient privacy risks. CONCLUSION Although ChatGPT's effectiveness has been shown in 84.1% of radiology studies, multiple pitfalls and limitations remain to be addressed. It is too soon to confirm its complete proficiency and accuracy, and more extensive multicenter studies utilizing diverse datasets and pre-training techniques are required to verify ChatGPT's role in radiology.
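The review's headline numbers are simple aggregates over study-level results. A minimal sketch of that arithmetic, with hypothetical study values standing in for the 24 studies that reported quantitative results (only the tally-then-median structure mirrors the abstract):

```python
# Minimal sketch of the review's summary arithmetic. Study entries are
# hypothetical placeholders; only the structure (group results by metric
# type, then take the median) follows what the abstract describes.
from statistics import median

# (study_id, metric, value_percent) -- hypothetical study-level results
results = [
    ("S01", "accuracy", 70.5),
    ("S02", "accuracy", 64.0),
    ("S03", "accuracy", 81.0),
    ("S04", "agreement", 83.6),
    ("S05", "agreement", 88.0),
]

by_metric: dict[str, list[float]] = {}
for _, metric, value in results:
    by_metric.setdefault(metric, []).append(value)

for metric, values in sorted(by_metric.items()):
    print(f"median {metric}: {median(values):.1f}% across {len(values)} studies")
```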
Affiliation(s)
- Pedram Keshavarz
- Department of Radiological Sciences, David Geffen School of Medicine, University of California, Los Angeles (UCLA), Los Angeles, CA 90095, USA; School of Science and Technology, The University of Georgia, Tbilisi 0171, Georgia
- Sara Bagherieh
- Independent Clinical Radiology Researcher, Los Angeles, CA 90024, USA
- Hamid Chalian
- Department of Radiology, Cardiothoracic Imaging, University of Washington, Seattle, WA 98195, USA
- Amir Ali Rahsepar
- Department of Radiological Sciences, David Geffen School of Medicine, University of California, Los Angeles (UCLA), Los Angeles, CA 90095, USA
- Grace Hyun J Kim
- Department of Radiological Sciences, David Geffen School of Medicine, University of California, Los Angeles (UCLA), Los Angeles, CA 90095, USA; Department of Radiological Sciences, Center for Computer Vision and Imaging Biomarkers, University of California, Los Angeles (UCLA), Los Angeles, CA 90095, USA
- Cameron Hassani
- Department of Radiological Sciences, David Geffen School of Medicine, University of California, Los Angeles (UCLA), Los Angeles, CA 90095, USA
- Steven S Raman
- Department of Radiological Sciences, David Geffen School of Medicine, University of California, Los Angeles (UCLA), Los Angeles, CA 90095, USA
- Arash Bedayat
- Department of Radiological Sciences, David Geffen School of Medicine, University of California, Los Angeles (UCLA), Los Angeles, CA 90095, USA.
2
Shieh A, Tran B, He G, Kumar M, Freed JA, Majety P. Assessing ChatGPT 4.0's test performance and clinical diagnostic accuracy on USMLE STEP 2 CK and clinical case reports. Sci Rep 2024; 14:9330. [PMID: 38654011] [DOI: 10.1038/s41598-024-58760-x]
Abstract
While there are data assessing the test performance of artificial intelligence (AI) chatbots, including the Generative Pre-trained Transformer 4.0 (GPT-4) chatbot (ChatGPT 4.0), data on its diagnostic accuracy in clinical cases are scarce. We assessed the large language model (LLM) ChatGPT 4.0 on its ability to answer questions from the United States Medical Licensing Exam (USMLE) Step 2, as well as its ability to generate a differential diagnosis from corresponding clinical vignettes in published case reports. A total of 109 Step 2 Clinical Knowledge (CK) practice questions were inputted into both ChatGPT 3.5 and ChatGPT 4.0, asking ChatGPT to pick the correct answer. Compared with its previous version, ChatGPT 3.5, ChatGPT 4.0 showed improved accuracy when answering these questions, from 47.7% to 87.2% (p = 0.035). Utilizing the topics tested on Step 2 CK questions, we additionally found 63 corresponding published case report vignettes and asked ChatGPT 4.0 to produce its top three differential diagnoses. ChatGPT 4.0 included the correct diagnosis in its shortlist in 47 of the 63 case reports (74.6%). We analyzed ChatGPT 4.0's confidence in its diagnoses by asking it to rank its top three differentials from most to least likely. Of the 47 correct diagnoses, 33 were first (70.2%) on the differential diagnosis list, 11 were second (23.4%), and three were third (6.4%). Our study shows the continued iterative improvement in ChatGPT's ability to answer standardized USMLE questions accurately and provides insights into ChatGPT's clinical diagnostic accuracy.
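The abstract gives neither the test behind p = 0.035 nor the per-question pairing. For paired right/wrong outcomes on the same 109 items, McNemar's test is one conventional choice; the sketch below assumes it, with a discordant-pair split invented only to match the reported marginals, followed by the top-3 differential tally the abstract does report:

```python
# Hedged sketch: the abstract reports 47.7% vs 87.2% accuracy (p = 0.035) on
# the same 109 questions but not the test used or the per-question pairing.
# McNemar's test is one standard paired choice; the 2x2 split below is
# hypothetical, chosen only to match the reported marginals (52 and 95 correct).
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

#            GPT-4 correct  GPT-4 wrong
table = np.array([[50,  2],   # GPT-3.5 correct (52 total)
                  [45, 12]])  # GPT-3.5 wrong   (57 total)
print(f"McNemar exact p = {mcnemar(table, exact=True).pvalue:.4f}")

# Top-3 differential tally, using the counts the abstract does report:
# 33 correct at rank 1, 11 at rank 2, 3 at rank 3, 16 missed (63 vignettes).
ranks = [1] * 33 + [2] * 11 + [3] * 3 + [None] * 16
hits = [r for r in ranks if r is not None]
print(f"top-3 accuracy: {len(hits) / len(ranks):.1%}")              # 74.6%
print(f"rank-1 share among hits: {hits.count(1) / len(hits):.1%}")  # 70.2%
```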
Affiliation(s)
- Allen Shieh
- Virginia Commonwealth University School of Medicine, Richmond, VA, USA
- Brandon Tran
- Virginia Commonwealth University School of Medicine, Richmond, VA, USA
- Gene He
- Virginia Commonwealth University School of Medicine, Richmond, VA, USA
- Mudit Kumar
- Division of Child and Adolescent Psychiatry, Department of Psychiatry, Virginia Commonwealth University, Richmond, VA, USA
- Jason A Freed
- Division of Hematology and Hematologic Malignancies, Department of Internal Medicine, Beth Israel Deaconess Medical Center, Boston, MA, USA
- Priyanka Majety
- Division of Endocrinology, Diabetes and Metabolism, Department of Internal Medicine, Virginia Commonwealth University, Richmond, VA, USA
3
Beaulieu-Jones BR, Berrigan MT, Shah S, Marwaha JS, Lai SL, Brat GA. Evaluating capabilities of large language models: Performance of GPT-4 on surgical knowledge assessments. Surgery 2024; 175:936-942. [PMID: 38246839] [PMCID: PMC10947829] [DOI: 10.1016/j.surg.2023.12.014]
Abstract
BACKGROUND Artificial intelligence has the potential to dramatically alter health care by enhancing how we diagnose and treat disease. One promising artificial intelligence model is ChatGPT, a general-purpose large language model trained by OpenAI. ChatGPT has shown human-level performance on several professional and academic benchmarks. We sought to evaluate its performance on surgical knowledge questions and assess the stability of this performance on repeat queries. METHODS We evaluated the performance of ChatGPT-4 on questions from the Surgical Council on Resident Education question bank and a second commonly used surgical knowledge assessment, referred to as Data-B. Questions were entered in 2 formats: open-ended and multiple-choice. ChatGPT outputs were assessed for accuracy and insights by surgeon evaluators. We categorized reasons for model errors and the stability of performance on repeat queries. RESULTS A total of 167 Surgical Council on Resident Education and 112 Data-B questions were presented to the ChatGPT interface. ChatGPT correctly answered 71.3% and 67.9% of multiple-choice questions and 47.9% and 66.1% of open-ended questions for Surgical Council on Resident Education and Data-B, respectively. For both open-ended and multiple-choice questions, approximately two-thirds of ChatGPT responses contained nonobvious insights. Common reasons for incorrect responses included inaccurate information in a complex question (n = 16, 36.4%), inaccurate information in a fact-based question (n = 11, 25.0%), and accurate information with a circumstantial discrepancy (n = 6, 13.6%). Upon repeat query, the answer selected by ChatGPT varied for 36.4% of questions answered incorrectly on the first query; the response accuracy changed for 6 of 16 (37.5%) questions. CONCLUSION Consistent with findings in other academic and professional domains, we demonstrate near or above human-level performance of ChatGPT on surgical knowledge questions from 2 widely used question banks. ChatGPT performed better on multiple-choice than open-ended questions, prompting questions about its potential for clinical application. Unique to this study, we demonstrate inconsistency in ChatGPT responses on repeat queries. This finding warrants future consideration, including efforts to train large language models to provide the safe and consistent responses required for clinical application. Despite near or above human-level performance on question banks, and given these observations, it remains unclear whether large language models such as ChatGPT can safely assist clinicians in providing care.
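The repeat-query stability check is straightforward to operationalize. A minimal sketch, with a simulated `ask_model` standing in for a real chat-completion client (the authors' querying code is not given in this abstract):

```python
# Minimal sketch of a repeat-query stability check like the one described.
# `ask_model` is a hypothetical stand-in that simulates a model which usually
# picks one option but occasionally wavers; swap in a real API client to use.
import random
from collections import Counter

def ask_model(question: str) -> str:
    # Simulated answer distribution over options A-E (placeholder only).
    return random.choices("ABCDE", weights=[5, 80, 5, 5, 5])[0]

def modal_answer_and_stable(question: str, n_repeats: int = 3) -> tuple[str, bool]:
    """Ask the same item n_repeats times; return modal answer and agreement flag."""
    answers = [ask_model(question) for _ in range(n_repeats)]
    modal, _ = Counter(answers).most_common(1)[0]
    return modal, len(set(answers)) == 1

questions = [f"question bank item {i}" for i in range(1, 23)]  # hypothetical items
changed = sum(not modal_answer_and_stable(q)[1] for q in questions)
print(f"{changed}/{len(questions)} items changed answers on repeat query")
```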
Affiliation(s)
- Brendin R Beaulieu-Jones
- Department of Surgery, Beth Israel Deaconess Medical Center, Boston, MA; Department of Biomedical Informatics, Harvard Medical School, Boston, MA
- Sahaj Shah
- Geisinger Commonwealth School of Medicine, Scranton, PA
- Jayson S Marwaha
- Division of Colorectal Surgery, National Taiwan University Hospital, Taipei, Taiwan
- Shuo-Lun Lai
- Division of Colorectal Surgery, National Taiwan University Hospital, Taipei, Taiwan
- Gabriel A Brat
- Department of Surgery, Beth Israel Deaconess Medical Center, Boston, MA; Department of Biomedical Informatics, Harvard Medical School, Boston, MA.
4
Temperley HC, O'Sullivan NJ, Mac Curtain BM, Corr A, Meaney JF, Kelly ME, Brennan I. Current applications and future potential of ChatGPT in radiology: A systematic review. J Med Imaging Radiat Oncol 2024; 68:257-264. [PMID: 38243605] [DOI: 10.1111/1754-9485.13621]
Abstract
This study aimed to comprehensively evaluate the current utilization and future potential of ChatGPT, an AI-based chat model, in the field of radiology. The primary focus is on its role in enhancing decision-making processes, optimizing workflow efficiency, and fostering interdisciplinary collaboration and teaching within healthcare. A systematic search was conducted in the PubMed, EMBASE, and Web of Science databases. Key aspects, such as its impact on complex decision-making, workflow enhancement, and collaboration, were assessed. Limitations and challenges associated with ChatGPT implementation were also examined. Overall, six studies met the inclusion criteria and were included in our analysis. All studies were prospective in nature. A total of 551 ChatGPT (versions 3.0 to 4.0) assessment events were included in our analysis. In the generation of academic papers, ChatGPT was found to output data inaccuracies 80% of the time. When ChatGPT was asked questions regarding common interventional radiology procedures, its responses contained entirely incorrect information 45% of the time. ChatGPT was seen to better answer US board-style questions when lower-order thinking was required (P = 0.002). Improvements were seen between ChatGPT 3.5 and 4.0 with regard to imaging questions, with accuracy rates of 61% versus 85% (P = 0.009). ChatGPT was observed to have an average translational ability score of 4.27/5 on a Likert scale regarding CT and MRI findings. ChatGPT demonstrates substantial potential to augment decision-making and optimize workflow. While ChatGPT's promise is evident, thorough evaluation and validation are imperative before widespread adoption in the field of radiology.
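The 61% versus 85% imaging-question comparison is reported without denominators or the test used. Assuming roughly 100 questions per version (hypothetical), a two-proportion comparison can be sketched with Fisher's exact test:

```python
# Hedged sketch: counts are hypothetical (the review reports only the 61% vs
# 85% rates and P = 0.009, not denominators or the test used). Fisher's exact
# test is one reasonable unpaired two-proportion comparison.
from scipy.stats import fisher_exact

n = 100  # hypothetical number of imaging questions per ChatGPT version
correct = {"3.5": 61, "4.0": 85}
table = [[correct["3.5"], n - correct["3.5"]],
         [correct["4.0"], n - correct["4.0"]]]
odds_ratio, p_value = fisher_exact(table)
print(f"odds ratio = {odds_ratio:.2f}, p = {p_value:.4g}")
```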
Affiliation(s)
- Hugo C Temperley
- Department of Radiology, St. James's Hospital, Dublin, Ireland
- Department of Surgery, St. James's Hospital, Dublin, Ireland
- Alison Corr
- Department of Radiology, St. James's Hospital, Dublin, Ireland
- James F Meaney
- Department of Radiology, St. James's Hospital, Dublin, Ireland
- Michael E Kelly
- Department of Surgery, St. James's Hospital, Dublin, Ireland
- Ian Brennan
- Department of Radiology, St. James's Hospital, Dublin, Ireland
5
Ismail A, Javan R. Reply to "New Horizons: The Potential Role of OpenAI's ChatGPT in Clinical Radiology". J Am Coll Radiol 2024; 21:547-548. [PMID: 38052353] [DOI: 10.1016/j.jacr.2023.11.025]
Affiliation(s)
- Ahmed Ismail
- George Washington University School of Medicine and Health Sciences, Washington, DC
- Ramin Javan
- Director of Advanced Brain Imaging and 3D Innovations Lab and Medical Student Radiology Clerkship Director.
6
Bera K, O'Connor G, Jiang S, Tirumani SH, Ramaiya N. Analysis of ChatGPT publications in radiology: Literature so far. Curr Probl Diagn Radiol 2024; 53:215-225. [PMID: 37891083] [DOI: 10.1067/j.cpradiol.2023.10.013]
Abstract
OBJECTIVE To perform a detailed qualitative and quantitative analysis of the published literature on ChatGPT and radiology in the nine months since its public release, detailing the scope of the work in this short timeframe. METHODS A systematic literature search of the MEDLINE and EMBASE databases was carried out through August 15, 2023 for articles focused on ChatGPT and imaging/radiology. Articles were classified into original research and reviews/perspectives. Quantitative analysis was carried out by two experienced radiologists using objective scoring systems for evaluating original and non-original research. RESULTS 51 articles involving ChatGPT and radiology/imaging were published between January 26, 2023 and August 14, 2023. 23 articles were original research, while the rest were reviews/perspectives or brief communications. For the quantitative analysis scored by two readers, we included 23 original research and 17 non-original research articles (after excluding 11 letters written in response to previous articles). The mean score for original research was 3.20 out of 5 (across five questions), while the mean score for non-original research was 1.17 out of 2 (across six questions). The mean score grading the performance of ChatGPT in original research was 3.20 out of 5 (across two questions). DISCUSSION While it is early days for ChatGPT and its impact in radiology, there has already been a plethora of articles addressing the multifaceted nature of the tool and how it can affect every aspect of radiology, from patient education, pre-authorization, protocol selection, and generating differentials to structuring radiology reports. Most articles show impressive performance of ChatGPT, which can only improve with more research and improvements in the tool itself. Several articles have also highlighted the limitations of ChatGPT in its current iteration, which will allow radiologists and researchers to improve these areas.
Affiliation(s)
- Kaustav Bera
- Department of Radiology, University Hospitals Cleveland Medical Center, 11000 Euclid Avenue, Cleveland, OH, 44106, USA.
- Gregory O'Connor
- Department of Radiology, University Hospitals Cleveland Medical Center, 11000 Euclid Avenue, Cleveland, OH, 44106, USA
- Sirui Jiang
- Department of Radiology, University Hospitals Cleveland Medical Center, 11000 Euclid Avenue, Cleveland, OH, 44106, USA
- Sree Harsha Tirumani
- Department of Radiology, University Hospitals Cleveland Medical Center, 11000 Euclid Avenue, Cleveland, OH, 44106, USA
- Nikhil Ramaiya
- Department of Radiology, University Hospitals Cleveland Medical Center, 11000 Euclid Avenue, Cleveland, OH, 44106, USA
7
Haver HL, Gupta AK, Ambinder EB, Bahl M, Oluyemi ET, Jeudy J, Yi PH. Evaluating the Use of ChatGPT to Accurately Simplify Patient-centered Information about Breast Cancer Prevention and Screening. Radiol Imaging Cancer 2024; 6:e230086. [PMID: 38305716] [PMCID: PMC10988327] [DOI: 10.1148/rycan.230086]
Abstract
Purpose To evaluate the use of ChatGPT as a tool to simplify answers to common questions about breast cancer prevention and screening. Materials and Methods In this retrospective, exploratory study, ChatGPT was asked to simplify responses to 25 questions about breast cancer to a sixth-grade reading level in March and August 2023. Simplified responses were evaluated for clinical appropriateness. All original and simplified responses were assessed for reading ease on the Flesch Reading Ease Index and for readability on five scales: Flesch-Kincaid Grade Level, Gunning Fog Index, Coleman-Liau Index, Automated Readability Index, and the Simple Measure of Gobbledygook (SMOG) Index. Mean reading ease, readability, and word count were compared between original and simplified responses using paired t tests. The McNemar test was used to compare the proportion of responses with adequate reading ease (score of 60 or greater) and readability (sixth-grade level). Results ChatGPT improved mean reading ease (original responses, 46 vs simplified responses, 70; P < .001) and readability (original, grade 13 vs simplified, grade 8.9; P < .001) and decreased word count (original, 193 vs simplified, 173; P < .001). Ninety-two percent (23 of 25) of simplified responses were considered clinically appropriate. All 25 (100%) simplified responses met criteria for adequate reading ease, compared with only two of 25 original responses (P < .001). Two of the 25 simplified responses (8%) met criteria for adequate readability. Conclusion ChatGPT simplified answers to common breast cancer screening and prevention questions, improving readability by four grade levels, though the potential to produce incorrect information necessitates physician oversight when using this tool. Keywords: Mammography, Screening, Informatics, Breast, Education, Health Policy and Practice, Oncology, Technology Assessment
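All six readability measures used in this study are available in the open-source textstat package, so the scoring side of the pipeline is straightforward to reproduce. A minimal sketch (not the authors' code; the response pair below is a hypothetical placeholder):

```python
# Minimal sketch of the readability scoring described in the study, using the
# textstat package; the response pair below is a hypothetical placeholder.
import textstat

pair = (  # (original answer, ChatGPT-simplified answer) -- hypothetical
    "Screening mammography confers a statistically significant reduction in "
    "breast cancer mortality among average-risk women aged 40 and older.",
    "Mammograms help find breast cancer early. Finding cancer early means "
    "treatment works better and fewer women die.",
)

metrics = {
    "Flesch Reading Ease": textstat.flesch_reading_ease,
    "Flesch-Kincaid Grade": textstat.flesch_kincaid_grade,
    "Gunning Fog": textstat.gunning_fog,
    "Coleman-Liau": textstat.coleman_liau_index,
    "Automated Readability": textstat.automated_readability_index,
    "SMOG": textstat.smog_index,
}

for name, score in metrics.items():
    for label, text in zip(("original", "simplified"), pair):
        print(f"{name} ({label}): {score(text):.1f}")

# With all 25 response pairs, the paired comparisons in the abstract map to
# scipy.stats.ttest_rel (scores) and statsmodels' mcnemar (adequacy counts).
```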
Affiliation(s)
- Hana L. Haver
- Anuj K. Gupta
- Emily B. Ambinder
- Manisha Bahl
- Eniola T. Oluyemi
- Jean Jeudy
- Paul H. Yi
- From the University of Maryland Medical Intelligent Imaging (UM2ii) Center, Department of Diagnostic Radiology and Nuclear Medicine, University of Maryland School of Medicine, 670 W Baltimore St, First Floor, Rm 1172, Baltimore, MD 21201 (H.L.H., A.K.G., J.J., P.H.Y.); The Russell H. Morgan Department of Radiology and Radiological Science, Johns Hopkins University School of Medicine, Baltimore, Md (E.B.A., E.T.O.); Department of Radiology, Division of Breast Imaging, Massachusetts General Hospital, Boston, Mass (M.B.); Malone Center for Engineering in Healthcare, Whiting School of Engineering, Johns Hopkins University, Baltimore, Md (P.H.Y.); and Fischell Department of Bioengineering, A. James Clark School of Engineering, University of Maryland–College Park, College Park, Md (P.H.Y.)
8
Ray PP, Majumder P. Evaluating the Limitations of ChatGPT in Generating Competent Radiology Reports for Distal Radius Fractures. Curr Probl Diagn Radiol 2024; 53:166-167. [PMID: 37925239] [DOI: 10.1067/j.cpradiol.2023.10.010]
9
Kim W. Reply to, "New Horizons: The Potential Role of OpenAI's ChatGPT in Clinical Radiology". J Am Coll Radiol 2024; 21:3-4. [PMID: 37944878] [DOI: 10.1016/j.jacr.2023.10.020]
10
Kuang YR, Zou MX, Niu HQ, Zheng BY, Zhang TL, Zheng BW. ChatGPT encounters multiple opportunities and challenges in neurosurgery. Int J Surg 2023; 109:2886-2891. [PMID: 37352529] [PMCID: PMC10583932] [DOI: 10.1097/js9.0000000000000571]
Abstract
BACKGROUND ChatGPT, powered by the GPT model and Transformer architecture, has demonstrated remarkable performance in the domains of medicine and healthcare, providing customized and informative responses. In our study, we investigated the potential of ChatGPT in the field of neurosurgery, focusing on its applications at the patient, neurosurgery student/resident, and neurosurgeon levels. METHODS The authors conducted inquiries with ChatGPT from the viewpoints of patients, neurosurgery students/residents, and neurosurgeons, covering a range of topics such as disease diagnosis, treatment options, prognosis, rehabilitation, and patient care. The authors also explored concepts related to neurosurgery, including fundamental principles and clinical aspects, as well as tools and techniques to enhance the skills of neurosurgery students/residents. Additionally, the authors examined disease-specific medical interventions and the decision-making processes involved in clinical practice. RESULTS The authors received individual responses from ChatGPT, but these tended to be shallow and repetitive, lacking depth and personalization. Furthermore, ChatGPT may struggle to discern a patient's emotional state, hindering the establishment of rapport and the delivery of appropriate care. The language used in the medical field is influenced by technical and cultural factors, and biases in the training data can result in skewed or inaccurate responses. Additionally, ChatGPT's limitations include the inability to conduct physical examinations or interpret diagnostic images, potentially overlooking complex details and individual nuances in each patient's case. Moreover, its absence from the surgical setting limits its practical utility. CONCLUSION Although ChatGPT is a powerful language model, it cannot substitute for the expertise and experience of trained medical professionals. It lacks the capability to perform physical examinations, make diagnoses, administer treatments, establish trust, provide emotional support, and assist in the recovery process. Moreover, the implementation of artificial intelligence in healthcare necessitates careful consideration of legal and ethical concerns. While recognizing the potential of ChatGPT, additional training with comprehensive data is necessary to fully maximize its capabilities.
Affiliation(s)
- Yi-Rui Kuang
- Department of Spine Surgery, The First Affiliated Hospital, Hengyang Medical School, University of South China, Hengyang, China
- Department of Neurosurgery, Xiangya Hospital, Central South University, Changsha, China
- National Clinical Research Center for Geriatric Disorders, Xiangya Hospital, Central South University, Changsha, China
- Ming-Xiang Zou
- Department of Spine Surgery, The First Affiliated Hospital, Hengyang Medical School, University of South China, Hengyang, China
- Hua-Qing Niu
- Department of Ophthalmology, The Second Affiliated Hospital of Zhengzhou University, Zhengzhou, China
- Bo-Yv Zheng
- Department of Orthopedics Surgery, General Hospital of the Central Theater Command, Wuhan, China
- Tao-Lan Zhang
- Department of Spine Surgery, The First Affiliated Hospital, Hengyang Medical School, University of South China, Hengyang, China
- Department of Pharmacy, The First Affiliated Hospital, Hengyang Medical School, University of South China, Hengyang, China
- Bo-Wen Zheng
- Department of Musculoskeletal Tumor Center, People’s Hospital, Peking University, Beijing Key Laboratory of Musculoskeletal Tumor, Beijing, China
11
Perera Molligoda Arachchige AS. New Horizons: The Potential Role of OpenAI's ChatGPT in Clinical Radiology. J Am Coll Radiol 2023; 20:943. [PMID: 37517771] [DOI: 10.1016/j.jacr.2023.06.028]
12
Ray PP, Majumder P. ChatGPT in Radiology: Transforming Patient Care With Artificial Intelligence Chatbots. J Am Coll Radiol 2023; 20:943-944. [PMID: 37517769] [DOI: 10.1016/j.jacr.2023.06.029]
Affiliation(s)
- Poulami Majumder
- Maulana Abul Kalam Azad University of Technology, Kolkata, India
13
Abstract
Academic integrity in both higher education and scientific writing has been challenged by developments in artificial intelligence. The limitations associated with earlier algorithms have been largely overcome by the recently released ChatGPT, a chatbot powered by GPT-3.5 that can produce accurate, human-like responses to questions in real time. Despite the potential benefits, ChatGPT faces significant limitations to its usefulness in nuclear medicine and radiology. Most notably, ChatGPT is prone to errors and the fabrication of information, which poses a risk to professionalism, ethics, and integrity. These limitations simultaneously undermine the value of ChatGPT to the user by not producing outcomes at the expected standard. Nonetheless, there are a number of exciting applications of ChatGPT in nuclear medicine across the education, clinical, and research sectors. Assimilation of ChatGPT into practice requires redefining norms and re-engineering information expectations.
Affiliation(s)
- Geoffrey M Currie
- Charles Sturt University, Wagga Wagga, NSW, Australia; Baylor College of Medicine, Houston, TX.
14
Beaulieu-Jones BR, Shah S, Berrigan MT, Marwaha JS, Lai SL, Brat GA. Evaluating Capabilities of Large Language Models: Performance of GPT4 on Surgical Knowledge Assessments. medRxiv 2023:2023.07.16.23292743. [PMID: 37502981] [PMCID: PMC10371188] [DOI: 10.1101/2023.07.16.23292743]
Abstract
Background Artificial intelligence (AI) has the potential to dramatically alter healthcare by enhancing how we diagnose and treat disease. One promising AI model is ChatGPT, a large general-purpose language model trained by OpenAI. The chat interface has shown robust, human-level performance on several professional and academic benchmarks. We sought to probe its performance and stability over time on surgical case questions. Methods We evaluated the performance of ChatGPT-4 on two surgical knowledge assessments: the Surgical Council on Resident Education (SCORE) and a second commonly used knowledge assessment, referred to as Data-B. Questions were entered in two formats: open-ended and multiple choice. ChatGPT outputs were assessed for accuracy and insights by surgeon evaluators. We categorized reasons for model errors and the stability of performance on repeat encounters. Results A total of 167 SCORE and 112 Data-B questions were presented to the ChatGPT interface. ChatGPT correctly answered 71% and 68% of multiple-choice SCORE and Data-B questions, respectively. For both open-ended and multiple-choice questions, approximately two-thirds of ChatGPT responses contained non-obvious insights. Common reasons for inaccurate responses included: inaccurate information in a complex question (n=16, 36.4%); inaccurate information in a fact-based question (n=11, 25.0%); and accurate information with a circumstantial discrepancy (n=6, 13.6%). Upon repeat query, the answer selected by ChatGPT varied for 36.4% of questions answered inaccurately; the response accuracy changed for 6 of 16 questions. Conclusion Consistent with prior findings, we demonstrate robust near or above human-level performance of ChatGPT within the surgical domain. Unique to this study, we demonstrate a substantial inconsistency in ChatGPT responses with repeat query. This finding warrants future consideration and presents an opportunity to further train these models to provide safe and consistent responses. Without mental and/or conceptual models, it is unclear whether language models such as ChatGPT would be able to safely assist clinicians in providing care.
Affiliation(s)
- Brendin R Beaulieu-Jones
- Department of Surgery, Beth Israel Deaconess Medical Center, Boston, MA
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA
- Sahaj Shah
- Geisinger Commonwealth School of Medicine, Scranton, PA
- Jayson S Marwaha
- Division of Colorectal Surgery, National Taiwan University Hospital, Taipei, Taiwan
- Shuo-Lun Lai
- Division of Colorectal Surgery, National Taiwan University Hospital, Taipei, Taiwan
- Gabriel A Brat
- Department of Surgery, Beth Israel Deaconess Medical Center, Boston, MA
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA
15
Javan R, Kim T, Mostaghni N, Sarin S. ChatGPT's Potential Role in Interventional Radiology. Cardiovasc Intervent Radiol 2023. [PMID: 37127733] [DOI: 10.1007/s00270-023-03448-4]
Affiliation(s)
- Ramin Javan
- Department of Radiology, George Washington University Hospital, 900 23rd St NW, Suite G2092, Washington, DC, 20037, USA.
- Theodore Kim
- George Washington University School of Medicine and Health Sciences, Washington, DC, 20037, USA
- Navid Mostaghni
- School of Medicine, California University of Science and Medicine, Colton, CA, 92324, USA
- Shawn Sarin
- Department of Interventional Radiology, George Washington University Hospital, Washington, DC, 20037, USA