1
Keshavarz P, Bagherieh S, Nabipoorashrafi SA, Chalian H, Rahsepar AA, Kim GHJ, Hassani C, Raman SS, Bedayat A. ChatGPT in radiology: A systematic review of performance, pitfalls, and future perspectives. Diagn Interv Imaging 2024:S2211-5684(24)00105-0. [PMID: 38679540] [DOI: 10.1016/j.diii.2024.04.003]
Abstract
PURPOSE The purpose of this study was to systematically review the reported performance of ChatGPT, identify potential limitations, and explore future directions for its integration, optimization, and ethical considerations in radiology applications. MATERIALS AND METHODS After a comprehensive review of the PubMed, Web of Science, Embase, and Google Scholar databases, published studies utilizing ChatGPT for clinical radiology applications were identified up to January 1, 2024. RESULTS Of the 861 studies retrieved, 44 evaluated the performance of ChatGPT; among these, 37 (37/44; 84.1%) demonstrated high performance, and seven (7/44; 15.9%) reported lower performance in providing information on diagnosis and clinical decision support (6/44; 13.6%) and patient communication and educational content (1/44; 2.3%). Twenty-four (24/44; 54.5%) studies reported ChatGPT's performance as a proportion. Among these, 19 (19/24; 79.2%) studies recorded a median accuracy of 70.5%, and five (5/24; 20.8%) studies recorded a median agreement of 83.6% between ChatGPT outcomes and reference standards (radiologists' decisions or guidelines), generally confirming ChatGPT's high accuracy in these studies. Eleven studies compared two recent ChatGPT versions, and in ten (10/11; 90.9%), ChatGPT-4 outperformed version 3.5, showing notable enhancements in addressing higher-order thinking questions, better comprehension of radiology terms, and improved accuracy in describing images. Risks and concerns about using ChatGPT included biased responses, limited originality, and the potential for inaccurate information leading to misinformation, hallucinations, improper citations and fake references, cybersecurity vulnerabilities, and patient privacy risks. CONCLUSION Although ChatGPT's effectiveness has been shown in 84.1% of radiology studies, multiple pitfalls and limitations remain to be addressed. It is too soon to confirm its complete proficiency and accuracy, and more extensive multicenter studies utilizing diverse datasets and pre-training techniques are required to verify ChatGPT's role in radiology.
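The review's headline numbers are simple aggregates over study-level results. A minimal sketch of that arithmetic, with hypothetical study values standing in for the 24 studies that reported quantitative results (only the tally-then-median structure mirrors the abstract):

```python
# Minimal sketch of the review's summary arithmetic. Study entries are
# hypothetical placeholders; only the structure (group results by metric
# type, then take the median) follows what the abstract describes.
from statistics import median

# (study_id, metric, value_percent) -- hypothetical study-level results
results = [
    ("S01", "accuracy", 70.5),
    ("S02", "accuracy", 64.0),
    ("S03", "accuracy", 81.0),
    ("S04", "agreement", 83.6),
    ("S05", "agreement", 88.0),
]

by_metric: dict[str, list[float]] = {}
for _, metric, value in results:
    by_metric.setdefault(metric, []).append(value)

for metric, values in sorted(by_metric.items()):
    print(f"median {metric}: {median(values):.1f}% across {len(values)} studies")
```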
Affiliation(s)
- Pedram Keshavarz
- Department of Radiological Sciences, David Geffen School of Medicine, University of California, Los Angeles (UCLA), Los Angeles, CA 90095, USA; School of Science and Technology, The University of Georgia, Tbilisi 0171, Georgia
- Sara Bagherieh
- Independent Clinical Radiology Researcher, Los Angeles, CA 90024, USA
- Hamid Chalian
- Department of Radiology, Cardiothoracic Imaging, University of Washington, Seattle, WA 98195, USA
- Amir Ali Rahsepar
- Department of Radiological Sciences, David Geffen School of Medicine, University of California, Los Angeles (UCLA), Los Angeles, CA 90095, USA
- Grace Hyun J Kim
- Department of Radiological Sciences, David Geffen School of Medicine, University of California, Los Angeles (UCLA), Los Angeles, CA 90095, USA; Department of Radiological Sciences, Center for Computer Vision and Imaging Biomarkers, University of California, Los Angeles (UCLA), Los Angeles, CA 90095, USA
- Cameron Hassani
- Department of Radiological Sciences, David Geffen School of Medicine, University of California, Los Angeles (UCLA), Los Angeles, CA 90095, USA
- Steven S Raman
- Department of Radiological Sciences, David Geffen School of Medicine, University of California, Los Angeles (UCLA), Los Angeles, CA 90095, USA
- Arash Bedayat
- Department of Radiological Sciences, David Geffen School of Medicine, University of California, Los Angeles (UCLA), Los Angeles, CA 90095, USA.
2
Shieh A, Tran B, He G, Kumar M, Freed JA, Majety P. Assessing ChatGPT 4.0's test performance and clinical diagnostic accuracy on USMLE STEP 2 CK and clinical case reports. Sci Rep 2024; 14:9330. [PMID: 38654011] [DOI: 10.1038/s41598-024-58760-x]
Abstract
While there are data assessing the test performance of artificial intelligence (AI) chatbots, including the Generative Pre-trained Transformer 4.0 (GPT-4) chatbot (ChatGPT 4.0), data on its diagnostic accuracy in clinical cases are scarce. We assessed the large language model (LLM) ChatGPT 4.0 on its ability to answer questions from the United States Medical Licensing Exam (USMLE) Step 2, as well as its ability to generate a differential diagnosis from corresponding clinical vignettes in published case reports. A total of 109 Step 2 Clinical Knowledge (CK) practice questions were inputted into both ChatGPT 3.5 and ChatGPT 4.0, asking ChatGPT to pick the correct answer. Compared with its previous version, ChatGPT 3.5, ChatGPT 4.0 showed improved accuracy when answering these questions, from 47.7% to 87.2% (p = 0.035). Utilizing the topics tested on Step 2 CK questions, we additionally found 63 corresponding published case report vignettes and asked ChatGPT 4.0 to produce its top three differential diagnoses. ChatGPT 4.0 included the correct diagnosis in its shortlist in 47 of the 63 case reports (74.6%). We analyzed ChatGPT 4.0's confidence in its diagnoses by asking it to rank its top three differentials from most to least likely. Of the 47 correct diagnoses, 33 were first (70.2%) on the differential diagnosis list, 11 were second (23.4%), and three were third (6.4%). Our study shows the continued iterative improvement in ChatGPT's ability to answer standardized USMLE questions accurately and provides insights into ChatGPT's clinical diagnostic accuracy.
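The abstract gives neither the test behind p = 0.035 nor the per-question pairing. For paired right/wrong outcomes on the same 109 items, McNemar's test is one conventional choice; the sketch below assumes it, with a discordant-pair split invented only to match the reported marginals, followed by the top-3 differential tally the abstract does report:

```python
# Hedged sketch: the abstract reports 47.7% vs 87.2% accuracy (p = 0.035) on
# the same 109 questions but not the test used or the per-question pairing.
# McNemar's test is one standard paired choice; the 2x2 split below is
# hypothetical, chosen only to match the reported marginals (52 and 95 correct).
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

#            GPT-4 correct  GPT-4 wrong
table = np.array([[50,  2],   # GPT-3.5 correct (52 total)
                  [45, 12]])  # GPT-3.5 wrong   (57 total)
print(f"McNemar exact p = {mcnemar(table, exact=True).pvalue:.4f}")

# Top-3 differential tally, using the counts the abstract does report:
# 33 correct at rank 1, 11 at rank 2, 3 at rank 3, 16 missed (63 vignettes).
ranks = [1] * 33 + [2] * 11 + [3] * 3 + [None] * 16
hits = [r for r in ranks if r is not None]
print(f"top-3 accuracy: {len(hits) / len(ranks):.1%}")              # 74.6%
print(f"rank-1 share among hits: {hits.count(1) / len(hits):.1%}")  # 70.2%
```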
Affiliation(s)
- Allen Shieh
- Virginia Commonwealth University School of Medicine, Richmond, VA, USA
- Brandon Tran
- Virginia Commonwealth University School of Medicine, Richmond, VA, USA
- Gene He
- Virginia Commonwealth University School of Medicine, Richmond, VA, USA
- Mudit Kumar
- Division of Child and Adolescent Psychiatry, Department of Psychiatry, Virginia Commonwealth University, Richmond, VA, USA
- Jason A Freed
- Division of Hematology and Hematologic Malignancies, Department of Internal Medicine, Beth Israel Deaconess Medical Center, Boston, MA, USA
- Priyanka Majety
- Division of Endocrinology, Diabetes and Metabolism, Department of Internal Medicine, Virginia Commonwealth University, Richmond, VA, USA
3
Beaulieu-Jones BR, Berrigan MT, Shah S, Marwaha JS, Lai SL, Brat GA. Evaluating capabilities of large language models: Performance of GPT-4 on surgical knowledge assessments. Surgery 2024; 175:936-942. [PMID: 38246839] [PMCID: PMC10947829] [DOI: 10.1016/j.surg.2023.12.014]
Abstract
BACKGROUND Artificial intelligence has the potential to dramatically alter health care by enhancing how we diagnose and treat disease. One promising artificial intelligence model is ChatGPT, a general-purpose large language model trained by OpenAI. ChatGPT has shown human-level performance on several professional and academic benchmarks. We sought to evaluate its performance on surgical knowledge questions and assess the stability of this performance on repeat queries. METHODS We evaluated the performance of ChatGPT-4 on questions from the Surgical Council on Resident Education question bank and a second commonly used surgical knowledge assessment, referred to as Data-B. Questions were entered in 2 formats: open-ended and multiple-choice. ChatGPT outputs were assessed for accuracy and insights by surgeon evaluators. We categorized reasons for model errors and the stability of performance on repeat queries. RESULTS A total of 167 Surgical Council on Resident Education and 112 Data-B questions were presented to the ChatGPT interface. ChatGPT correctly answered 71.3% and 67.9% of multiple-choice questions and 47.9% and 66.1% of open-ended questions for Surgical Council on Resident Education and Data-B, respectively. For both open-ended and multiple-choice questions, approximately two-thirds of ChatGPT responses contained nonobvious insights. Common reasons for incorrect responses included inaccurate information in a complex question (n = 16, 36.4%), inaccurate information in a fact-based question (n = 11, 25.0%), and accurate information with a circumstantial discrepancy (n = 6, 13.6%). Upon repeat query, the answer selected by ChatGPT varied for 36.4% of questions answered incorrectly on the first query; the response accuracy changed for 6 of 16 (37.5%) questions. CONCLUSION Consistent with findings in other academic and professional domains, we demonstrate near or above human-level performance of ChatGPT on surgical knowledge questions from 2 widely used question banks. ChatGPT performed better on multiple-choice than open-ended questions, prompting questions about its potential for clinical application. Unique to this study, we demonstrate inconsistency in ChatGPT responses on repeat queries. This finding warrants future consideration, including efforts to train large language models to provide the safe and consistent responses required for clinical application. Despite near or above human-level performance on question banks, and given these observations, it remains unclear whether large language models such as ChatGPT can safely assist clinicians in providing care.
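The repeat-query stability check is straightforward to operationalize. A minimal sketch, with a simulated `ask_model` standing in for a real chat-completion client (the authors' querying code is not given in this abstract):

```python
# Minimal sketch of a repeat-query stability check like the one described.
# `ask_model` is a hypothetical stand-in that simulates a model which usually
# picks one option but occasionally wavers; swap in a real API client to use.
import random
from collections import Counter

def ask_model(question: str) -> str:
    # Simulated answer distribution over options A-E (placeholder only).
    return random.choices("ABCDE", weights=[5, 80, 5, 5, 5])[0]

def modal_answer_and_stable(question: str, n_repeats: int = 3) -> tuple[str, bool]:
    """Ask the same item n_repeats times; return modal answer and agreement flag."""
    answers = [ask_model(question) for _ in range(n_repeats)]
    modal, _ = Counter(answers).most_common(1)[0]
    return modal, len(set(answers)) == 1

questions = [f"question bank item {i}" for i in range(1, 23)]  # hypothetical items
changed = sum(not modal_answer_and_stable(q)[1] for q in questions)
print(f"{changed}/{len(questions)} items changed answers on repeat query")
```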
Affiliation(s)
- Brendin R Beaulieu-Jones
- Department of Surgery, Beth Israel Deaconess Medical Center, Boston, MA; Department of Biomedical Informatics, Harvard Medical School, Boston, MA
- Sahaj Shah
- Geisinger Commonwealth School of Medicine, Scranton, PA
- Jayson S Marwaha
- Division of Colorectal Surgery, National Taiwan University Hospital, Taipei, Taiwan
- Shuo-Lun Lai
- Division of Colorectal Surgery, National Taiwan University Hospital, Taipei, Taiwan
- Gabriel A Brat
- Department of Surgery, Beth Israel Deaconess Medical Center, Boston, MA; Department of Biomedical Informatics, Harvard Medical School, Boston, MA.
4
Temperley HC, O'Sullivan NJ, Mac Curtain BM, Corr A, Meaney JF, Kelly ME, Brennan I. Current applications and future potential of ChatGPT in radiology: A systematic review. J Med Imaging Radiat Oncol 2024; 68:257-264. [PMID: 38243605] [DOI: 10.1111/1754-9485.13621]
Abstract
This study aimed to comprehensively evaluate the current utilization and future potential of ChatGPT, an AI-based chat model, in the field of radiology. The primary focus is on its role in enhancing decision-making processes, optimizing workflow efficiency, and fostering interdisciplinary collaboration and teaching within healthcare. A systematic search was conducted in the PubMed, EMBASE, and Web of Science databases. Key aspects, such as its impact on complex decision-making, workflow enhancement, and collaboration, were assessed. Limitations and challenges associated with ChatGPT implementation were also examined. Overall, six studies met the inclusion criteria and were included in our analysis. All studies were prospective in nature. A total of 551 ChatGPT (versions 3.0 to 4.0) assessment events were included in our analysis. In the generation of academic papers, ChatGPT was found to output data inaccuracies 80% of the time. When ChatGPT was asked questions regarding common interventional radiology procedures, its responses contained entirely incorrect information 45% of the time. ChatGPT was seen to better answer US board-style questions when lower-order thinking was required (P = 0.002). Improvements were seen between ChatGPT 3.5 and 4.0 with regard to imaging questions, with accuracy rates of 61% versus 85% (P = 0.009). ChatGPT was observed to have an average translational ability score of 4.27/5 on a Likert scale regarding CT and MRI findings. ChatGPT demonstrates substantial potential to augment decision-making and optimize workflow. While ChatGPT's promise is evident, thorough evaluation and validation are imperative before widespread adoption in the field of radiology.
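The 61% versus 85% imaging-question comparison is reported without denominators or the test used. Assuming roughly 100 questions per version (hypothetical), a two-proportion comparison can be sketched with Fisher's exact test:

```python
# Hedged sketch: counts are hypothetical (the review reports only the 61% vs
# 85% rates and P = 0.009, not denominators or the test used). Fisher's exact
# test is one reasonable unpaired two-proportion comparison.
from scipy.stats import fisher_exact

n = 100  # hypothetical number of imaging questions per ChatGPT version
correct = {"3.5": 61, "4.0": 85}
table = [[correct["3.5"], n - correct["3.5"]],
         [correct["4.0"], n - correct["4.0"]]]
odds_ratio, p_value = fisher_exact(table)
print(f"odds ratio = {odds_ratio:.2f}, p = {p_value:.4g}")
```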
Affiliation(s)
- Hugo C Temperley
- Department of Radiology, St. James's Hospital, Dublin, Ireland
- Department of Surgery, St. James's Hospital, Dublin, Ireland
- Alison Corr
- Department of Radiology, St. James's Hospital, Dublin, Ireland
- James F Meaney
- Department of Radiology, St. James's Hospital, Dublin, Ireland
- Michael E Kelly
- Department of Surgery, St. James's Hospital, Dublin, Ireland
- Ian Brennan
- Department of Radiology, St. James's Hospital, Dublin, Ireland
5
Ismail A, Javan R. Reply to "New Horizons: The Potential Role of OpenAI's ChatGPT in Clinical Radiology". J Am Coll Radiol 2024; 21:547-548. [PMID: 38052353] [DOI: 10.1016/j.jacr.2023.11.025]
Affiliation(s)
- Ahmed Ismail
- George Washington University School of Medicine and Health Sciences, Washington, DC
- Ramin Javan
- Director of Advanced Brain Imaging and 3D Innovations Lab and Medical Student Radiology Clerkship Director.
6
Bera K, O'Connor G, Jiang S, Tirumani SH, Ramaiya N. Analysis of ChatGPT publications in radiology: Literature so far. Curr Probl Diagn Radiol 2024; 53:215-225. [PMID: 37891083] [DOI: 10.1067/j.cpradiol.2023.10.013]
Abstract
OBJECTIVE To perform a detailed qualitative and quantitative analysis of the published literature on ChatGPT and radiology in the nine months since its public release, detailing the scope of the work in this short timeframe. METHODS A systematic literature search of the MEDLINE and EMBASE databases was carried out through August 15, 2023 for articles focused on ChatGPT and imaging/radiology. Articles were classified into original research and reviews/perspectives. Quantitative analysis was carried out by two experienced radiologists using objective scoring systems for evaluating original and non-original research. RESULTS 51 articles involving ChatGPT and radiology/imaging were published between January 26, 2023 and August 14, 2023. 23 articles were original research, while the rest were reviews/perspectives or brief communications. For the quantitative analysis scored by two readers, we included 23 original research and 17 non-original research articles (after excluding 11 letters written in response to previous articles). The mean score for original research was 3.20 out of 5 (across five questions), while the mean score for non-original research was 1.17 out of 2 (across six questions). The mean score grading the performance of ChatGPT in original research was 3.20 out of 5 (across two questions). DISCUSSION While it is early days for ChatGPT and its impact in radiology, there has already been a plethora of articles addressing the multifaceted nature of the tool and how it can affect every aspect of radiology, from patient education, pre-authorization, protocol selection, and generating differentials to structuring radiology reports. Most articles show impressive performance of ChatGPT, which can only improve with more research and improvements in the tool itself. Several articles have also highlighted the limitations of ChatGPT in its current iteration, which will allow radiologists and researchers to improve these areas.
Affiliation(s)
- Kaustav Bera
- Department of Radiology, University Hospitals Cleveland Medical Center, 11000 Euclid Avenue, Cleveland, OH, 44106, USA.
- Gregory O'Connor
- Department of Radiology, University Hospitals Cleveland Medical Center, 11000 Euclid Avenue, Cleveland, OH, 44106, USA
- Sirui Jiang
- Department of Radiology, University Hospitals Cleveland Medical Center, 11000 Euclid Avenue, Cleveland, OH, 44106, USA
- Sree Harsha Tirumani
- Department of Radiology, University Hospitals Cleveland Medical Center, 11000 Euclid Avenue, Cleveland, OH, 44106, USA
- Nikhil Ramaiya
- Department of Radiology, University Hospitals Cleveland Medical Center, 11000 Euclid Avenue, Cleveland, OH, 44106, USA
7
Haver HL, Gupta AK, Ambinder EB, Bahl M, Oluyemi ET, Jeudy J, Yi PH. Evaluating the Use of ChatGPT to Accurately Simplify Patient-centered Information about Breast Cancer Prevention and Screening. Radiol Imaging Cancer 2024; 6:e230086. [PMID: 38305716] [PMCID: PMC10988327] [DOI: 10.1148/rycan.230086]
Abstract
Purpose To evaluate the use of ChatGPT as a tool to simplify answers to common questions about breast cancer prevention and screening. Materials and Methods In this retrospective, exploratory study, ChatGPT was asked to simplify responses to 25 questions about breast cancer to a sixth-grade reading level in March and August 2023. Simplified responses were evaluated for clinical appropriateness. All original and simplified responses were assessed for reading ease on the Flesch Reading Ease Index and for readability on five scales: Flesch-Kincaid Grade Level, Gunning Fog Index, Coleman-Liau Index, Automated Readability Index, and the Simple Measure of Gobbledygook (SMOG) Index. Mean reading ease, readability, and word count were compared between original and simplified responses using paired t tests. The McNemar test was used to compare the proportion of responses with adequate reading ease (score of 60 or greater) and readability (sixth-grade level). Results ChatGPT improved mean reading ease (original responses, 46 vs simplified responses, 70; P < .001) and readability (original, grade 13 vs simplified, grade 8.9; P < .001) and decreased word count (original, 193 vs simplified, 173; P < .001). Ninety-two percent (23 of 25) of simplified responses were considered clinically appropriate. All 25 (100%) simplified responses met criteria for adequate reading ease, compared with only two of 25 original responses (P < .001). Two of the 25 simplified responses (8%) met criteria for adequate readability. Conclusion ChatGPT simplified answers to common breast cancer screening and prevention questions, improving readability by four grade levels, though the potential to produce incorrect information necessitates physician oversight when using this tool. Keywords: Mammography, Screening, Informatics, Breast, Education, Health Policy and Practice, Oncology, Technology Assessment
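All six readability measures used in this study are available in the open-source textstat package, so the scoring side of the pipeline is straightforward to reproduce. A minimal sketch (not the authors' code; the response pair below is a hypothetical placeholder):

```python
# Minimal sketch of the readability scoring described in the study, using the
# textstat package; the response pair below is a hypothetical placeholder.
import textstat

pair = (  # (original answer, ChatGPT-simplified answer) -- hypothetical
    "Screening mammography confers a statistically significant reduction in "
    "breast cancer mortality among average-risk women aged 40 and older.",
    "Mammograms help find breast cancer early. Finding cancer early means "
    "treatment works better and fewer women die.",
)

metrics = {
    "Flesch Reading Ease": textstat.flesch_reading_ease,
    "Flesch-Kincaid Grade": textstat.flesch_kincaid_grade,
    "Gunning Fog": textstat.gunning_fog,
    "Coleman-Liau": textstat.coleman_liau_index,
    "Automated Readability": textstat.automated_readability_index,
    "SMOG": textstat.smog_index,
}

for name, score in metrics.items():
    for label, text in zip(("original", "simplified"), pair):
        print(f"{name} ({label}): {score(text):.1f}")

# With all 25 response pairs, the paired comparisons in the abstract map to
# scipy.stats.ttest_rel (scores) and statsmodels' mcnemar (adequacy counts).
```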
Affiliation(s)
- Hana L. Haver
- Anuj K. Gupta
- Emily B. Ambinder
- Manisha Bahl
- Eniola T. Oluyemi
- Jean Jeudy
- Paul H. Yi
- From the University of Maryland Medical Intelligent Imaging (UM2ii) Center, Department of Diagnostic Radiology and Nuclear Medicine, University of Maryland School of Medicine, 670 W Baltimore St, First Floor, Rm 1172, Baltimore, MD 21201 (H.L.H., A.K.G., J.J., P.H.Y.); The Russell H. Morgan Department of Radiology and Radiological Science, Johns Hopkins University School of Medicine, Baltimore, Md (E.B.A., E.T.O.); Department of Radiology, Division of Breast Imaging, Massachusetts General Hospital, Boston, Mass (M.B.); Malone Center for Engineering in Healthcare, Whiting School of Engineering, Johns Hopkins University, Baltimore, Md (P.H.Y.); and Fischell Department of Bioengineering, A. James Clark School of Engineering, University of Maryland–College Park, College Park, Md (P.H.Y.)
8
Ray PP, Majumder P. Evaluating the Limitations of ChatGPT in Generating Competent Radiology Reports for Distal Radius Fractures. Curr Probl Diagn Radiol 2024; 53:166-167. [PMID: 37925239] [DOI: 10.1067/j.cpradiol.2023.10.010]
9
Kim W. Reply to, "New Horizons: The Potential Role of OpenAI's ChatGPT in Clinical Radiology". J Am Coll Radiol 2024; 21:3-4. [PMID: 37944878] [DOI: 10.1016/j.jacr.2023.10.020]
10
Kuang YR, Zou MX, Niu HQ, Zheng BY, Zhang TL, Zheng BW. ChatGPT encounters multiple opportunities and challenges in neurosurgery. Int J Surg 2023; 109:2886-2891. [PMID: 37352529] [PMCID: PMC10583932] [DOI: 10.1097/js9.0000000000000571]
Abstract
BACKGROUND ChatGPT, powered by the GPT model and Transformer architecture, has demonstrated remarkable performance in the domains of medicine and healthcare, providing customized and informative responses. In our study, we investigated the potential of ChatGPT in the field of neurosurgery, focusing on its applications at the patient, neurosurgery student/resident, and neurosurgeon levels. METHODS The authors conducted inquiries with ChatGPT from the viewpoints of patients, neurosurgery students/residents, and neurosurgeons, covering a range of topics such as disease diagnosis, treatment options, prognosis, rehabilitation, and patient care. The authors also explored concepts related to neurosurgery, including fundamental principles and clinical aspects, as well as tools and techniques to enhance the skills of neurosurgery students/residents. Additionally, the authors examined disease-specific medical interventions and the decision-making processes involved in clinical practice. RESULTS The authors received individual responses from ChatGPT, but these tended to be shallow and repetitive, lacking depth and personalization. Furthermore, ChatGPT may struggle to discern a patient's emotional state, hindering the establishment of rapport and the delivery of appropriate care. The language used in the medical field is influenced by technical and cultural factors, and biases in the training data can result in skewed or inaccurate responses. Additionally, ChatGPT's limitations include the inability to conduct physical examinations or interpret diagnostic images, potentially overlooking complex details and individual nuances in each patient's case. Moreover, its absence from the surgical setting limits its practical utility. CONCLUSION Although ChatGPT is a powerful language model, it cannot substitute for the expertise and experience of trained medical professionals. It lacks the capability to perform physical examinations, make diagnoses, administer treatments, establish trust, provide emotional support, and assist in the recovery process. Moreover, the implementation of artificial intelligence in healthcare necessitates careful consideration of legal and ethical concerns. While recognizing the potential of ChatGPT, additional training with comprehensive data is necessary to fully maximize its capabilities.
Affiliation(s)
- Yi-Rui Kuang
- Department of Spine Surgery, The First Affiliated Hospital, Hengyang Medical School, University of South China, Hengyang, China
- Department of Neurosurgery, Xiangya Hospital, Central South University, Changsha, China
- National Clinical Research Center for Geriatric Disorders, Xiangya Hospital, Central South University, Changsha, China
- Ming-Xiang Zou
- Department of Spine Surgery, The First Affiliated Hospital, Hengyang Medical School, University of South China, Hengyang, China
- Hua-Qing Niu
- Department of Ophthalmology, The Second Affiliated Hospital of Zhengzhou University, Zhengzhou, China
- Bo-Yv Zheng
- Department of Orthopedics Surgery, General Hospital of the Central Theater Command, Wuhan, China
- Tao-Lan Zhang
- Department of Spine Surgery, The First Affiliated Hospital, Hengyang Medical School, University of South China, Hengyang, China
- Department of Pharmacy, The First Affiliated Hospital, Hengyang Medical School, University of South China, Hengyang, China
- Bo-Wen Zheng
- Department of Musculoskeletal Tumor Center, People’s Hospital, Peking University, Beijing Key Laboratory of Musculoskeletal Tumor, Beijing, China
11
Perera Molligoda Arachchige AS. New Horizons: The Potential Role of OpenAI's ChatGPT in Clinical Radiology. J Am Coll Radiol 2023; 20:943. [PMID: 37517771] [DOI: 10.1016/j.jacr.2023.06.028]
12
Ray PP, Majumder P. ChatGPT in Radiology: Transforming Patient Care With Artificial Intelligence Chatbots. J Am Coll Radiol 2023; 20:943-944. [PMID: 37517769] [DOI: 10.1016/j.jacr.2023.06.029]
Affiliation(s)
- Poulami Majumder
- Maulana Abul Kalam Azad University of Technology, Kolkata, India
13
Abstract
Academic integrity in both higher education and scientific writing has been challenged by developments in artificial intelligence. The limitations associated with earlier algorithms have been largely overcome by the recently released ChatGPT, a chatbot powered by GPT-3.5 that can produce accurate, human-like responses to questions in real time. Despite the potential benefits, ChatGPT faces significant limitations to its usefulness in nuclear medicine and radiology. Most notably, ChatGPT is prone to errors and the fabrication of information, which poses a risk to professionalism, ethics, and integrity. These limitations simultaneously undermine the value of ChatGPT to the user by not producing outcomes at the expected standard. Nonetheless, there are a number of exciting applications of ChatGPT in nuclear medicine across the education, clinical, and research sectors. Assimilation of ChatGPT into practice requires redefining norms and re-engineering information expectations.
Affiliation(s)
- Geoffrey M Currie
- Charles Sturt University, Wagga Wagga, NSW, Australia; Baylor College of Medicine, Houston, TX.
14
Beaulieu-Jones BR, Shah S, Berrigan MT, Marwaha JS, Lai SL, Brat GA. Evaluating Capabilities of Large Language Models: Performance of GPT4 on Surgical Knowledge Assessments. medRxiv 2023:2023.07.16.23292743. [PMID: 37502981] [PMCID: PMC10371188] [DOI: 10.1101/2023.07.16.23292743]
Abstract
Background Artificial intelligence (AI) has the potential to dramatically alter healthcare by enhancing how we diagnose and treat disease. One promising AI model is ChatGPT, a large general-purpose language model trained by OpenAI. The chat interface has shown robust, human-level performance on several professional and academic benchmarks. We sought to probe its performance and stability over time on surgical case questions. Methods We evaluated the performance of ChatGPT-4 on two surgical knowledge assessments: the Surgical Council on Resident Education (SCORE) and a second commonly used knowledge assessment, referred to as Data-B. Questions were entered in two formats: open-ended and multiple choice. ChatGPT outputs were assessed for accuracy and insights by surgeon evaluators. We categorized reasons for model errors and the stability of performance on repeat encounters. Results A total of 167 SCORE and 112 Data-B questions were presented to the ChatGPT interface. ChatGPT correctly answered 71% and 68% of multiple-choice SCORE and Data-B questions, respectively. For both open-ended and multiple-choice questions, approximately two-thirds of ChatGPT responses contained non-obvious insights. Common reasons for inaccurate responses included: inaccurate information in a complex question (n=16, 36.4%); inaccurate information in a fact-based question (n=11, 25.0%); and accurate information with a circumstantial discrepancy (n=6, 13.6%). Upon repeat query, the answer selected by ChatGPT varied for 36.4% of questions answered inaccurately; the response accuracy changed for 6 of 16 questions. Conclusion Consistent with prior findings, we demonstrate robust near or above human-level performance of ChatGPT within the surgical domain. Unique to this study, we demonstrate a substantial inconsistency in ChatGPT responses with repeat query. This finding warrants future consideration and presents an opportunity to further train these models to provide safe and consistent responses. Without mental and/or conceptual models, it is unclear whether language models such as ChatGPT would be able to safely assist clinicians in providing care.
Affiliation(s)
- Brendin R Beaulieu-Jones
- Department of Surgery, Beth Israel Deaconess Medical Center, Boston, MA
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA
- Sahaj Shah
- Geisinger Commonwealth School of Medicine, Scranton, PA
- Jayson S Marwaha
- Division of Colorectal Surgery, National Taiwan University Hospital, Taipei, Taiwan
- Shuo-Lun Lai
- Division of Colorectal Surgery, National Taiwan University Hospital, Taipei, Taiwan
- Gabriel A Brat
- Department of Surgery, Beth Israel Deaconess Medical Center, Boston, MA
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA
15
Javan R, Kim T, Mostaghni N, Sarin S. ChatGPT's Potential Role in Interventional Radiology. Cardiovasc Intervent Radiol 2023. [PMID: 37127733] [DOI: 10.1007/s00270-023-03448-4]
Affiliation(s)
- Ramin Javan
- Department of Radiology, George Washington University Hospital, 900 23rd St NW, Suite G2092, Washington, DC, 20037, USA.
- Theodore Kim
- George Washington University School of Medicine and Health Sciences, Washington, DC, 20037, USA
- Navid Mostaghni
- School of Medicine, California University of Science and Medicine, Colton, CA, 92324, USA
- Shawn Sarin
- Department of Interventional Radiology, George Washington University Hospital, Washington, DC, 20037, USA